Skip to main content
17-Peridot
May 9, 2019
Solved

Query Performance : Data Table VS Stream

  • May 9, 2019
  • 1 reply
  • 3796 views

Hi, 

I have question about the query performance for Data Table and Stream. 

 

In PostgreSQL, the Data Table and Stream has the following table structure: 

  DataTable Stream
Columns entry_id|entity_id|entity_key|source_id|source_type|tags|time|field_values|full_text|location entry_id|entity_id|source_id|source_type|tags|time|field_values|full_text|location
Index entity_key
field_values
(entity_id, time)
entity_id
Constraints entry_id
(entity_id, entity_key)
(entity_id,source_id,time,source_type)

 

According to the  comparing above , the questions are : 

  1. According to the offical document , for the Data Table,  "Anything greater than 100,000 rows will gradually begin to cause queries to perform slower".   Is it same for Stream ?   
  2. The Data Table has index on field_values while Stream doesn't ,  Does it means the query performance on the field of Data Shape for data table should be much faster than Stream using JDBC query as long as not querying on time field ? Does QueryDataTableEntries has the better performance than QueryStreamEntriesWithData as well in this case ? 
  3. If always query on the index field , will query Data Table still become slow after storing more than 100,000 records ? 
  4. Why not add field_values to index for Stream ?    to avoid the increased data volume ?

 

Regards,

Sean

Best answer by Tudor

Thank you for your question.  First, a quick discussion about various use cases for these storage methods:

Data tables are generally meant for static data that changes infrequently (like serial numbers).

 

Streams and value streams are meant for ingesting extremely quickly updating data (e.g. raw device data).  Because Streams and Value Streams are optimized for ingestion, they have been designed with fewer indexes in place.  Updating indexes significantly slows down ingestion, and the design for these Stream tables is meant to absorb tens of thousands of data points per second.  In our testing we have found that adding additional indexes (even if it theoretically helps with fetching data) significantly slows down table writes and potentially leads to index collisions when multiple transactions attempt writes at the same time.

 

In terms of your question for Data Table query performance, there are a couple of notes:

1. Speaking of indexes, we do find that there is an index missing on Data Tables that does significantly boost performance. Please see: https://www.ptc.com/en/support/article?n=CS261063

2. With the index above in place, we do have customers with 200-300k entries in their Data Tables querying this data successfully (note: this is mostly static data)

3. Data Tables (as of current releases of ThingWorx) are not thread safe for writing/updating rows.  Again, these were designed for slower operations and if we perform DT updates during ingestion, we can easily cause threading/update issues.

4. Streams were designed with multi-threaded inserts/updates in mind

5. If we always query by Key or add the index in #2, we can remain performant with 2-3x as much data as documented.  However, if the desire is to maintain 0.5 million rows, then we'll likely be better off designing our own storage structure (you can configure JDBC connections to easily manipulate your own storage with all the needed indexes).

 

Please let me know if I can clarify any of the points mentioned above.

1 reply

Tudor14-AlexandriteAnswer
14-Alexandrite
May 13, 2019

Thank you for your question.  First, a quick discussion about various use cases for these storage methods:

Data tables are generally meant for static data that changes infrequently (like serial numbers).

 

Streams and value streams are meant for ingesting extremely quickly updating data (e.g. raw device data).  Because Streams and Value Streams are optimized for ingestion, they have been designed with fewer indexes in place.  Updating indexes significantly slows down ingestion, and the design for these Stream tables is meant to absorb tens of thousands of data points per second.  In our testing we have found that adding additional indexes (even if it theoretically helps with fetching data) significantly slows down table writes and potentially leads to index collisions when multiple transactions attempt writes at the same time.

 

In terms of your question for Data Table query performance, there are a couple of notes:

1. Speaking of indexes, we do find that there is an index missing on Data Tables that does significantly boost performance. Please see: https://www.ptc.com/en/support/article?n=CS261063

2. With the index above in place, we do have customers with 200-300k entries in their Data Tables querying this data successfully (note: this is mostly static data)

3. Data Tables (as of current releases of ThingWorx) are not thread safe for writing/updating rows.  Again, these were designed for slower operations and if we perform DT updates during ingestion, we can easily cause threading/update issues.

4. Streams were designed with multi-threaded inserts/updates in mind

5. If we always query by Key or add the index in #2, we can remain performant with 2-3x as much data as documented.  However, if the desire is to maintain 0.5 million rows, then we'll likely be better off designing our own storage structure (you can configure JDBC connections to easily manipulate your own storage with all the needed indexes).

 

Please let me know if I can clarify any of the points mentioned above.

5-Regular Member
May 14, 2019

Adding some information to Tudor's post:

 

The Stream and ValueStream are time-series data, so they are better sorted than Datatable, the search could be faster. Especially in 8.4 we introduced InfluxDB as the additional Persistence Package for Time-series data storage, and it's proved that searching with InfluxDB will be a few times faster than PostgreSQL, so for sure it will be faster than using Datatable.

14-Alexandrite
May 14, 2019

Agreed: Stream/Value Stream queries are faster if you need to find time series data, but I don't believe this is the use case here.  If you're looking up relational data (time insensitive), the lack of indexing will degrade performance compared to data tables.