cancel
Showing results for 
Search instead for 
Did you mean: 
cancel
Showing results for 
Search instead for 
Did you mean: 

Community Tip - Did you get an answer that solved your problem? Please mark it as an Accepted Solution so others with the same problem can find the answer easily. X

Query Performance : Data Table VS Stream

seanccc
17-Peridot

Query Performance : Data Table VS Stream

Hi, 

I have question about the query performance for Data Table and Stream. 

 

In PostgreSQL, the Data Table and Stream has the following table structure: 

  DataTable Stream
Columns entry_id|entity_id|entity_key|source_id|source_type|tags|time|field_values|full_text|location entry_id|entity_id|source_id|source_type|tags|time|field_values|full_text|location
Index entity_key
field_values
(entity_id, time)
entity_id
Constraints entry_id
(entity_id, entity_key)
(entity_id,source_id,time,source_type)

 

According to the  comparing above , the questions are : 

  1. According to the offical document , for the Data Table,  "Anything greater than 100,000 rows will gradually begin to cause queries to perform slower".   Is it same for Stream ?   
  2. The Data Table has index on field_values while Stream doesn't ,  Does it means the query performance on the field of Data Shape for data table should be much faster than Stream using JDBC query as long as not querying on time field ? Does QueryDataTableEntries has the better performance than QueryStreamEntriesWithData as well in this case ? 
  3. If always query on the index field , will query Data Table still become slow after storing more than 100,000 records ? 
  4. Why not add field_values to index for Stream ?    to avoid the increased data volume ?

 

Regards,

Sean

1 ACCEPTED SOLUTION

Accepted Solutions
Tudor
12-Amethyst
(To:seanccc)

Thank you for your question.  First, a quick discussion about various use cases for these storage methods:

Data tables are generally meant for static data that changes infrequently (like serial numbers).

 

Streams and value streams are meant for ingesting extremely quickly updating data (e.g. raw device data).  Because Streams and Value Streams are optimized for ingestion, they have been designed with fewer indexes in place.  Updating indexes significantly slows down ingestion, and the design for these Stream tables is meant to absorb tens of thousands of data points per second.  In our testing we have found that adding additional indexes (even if it theoretically helps with fetching data) significantly slows down table writes and potentially leads to index collisions when multiple transactions attempt writes at the same time.

 

In terms of your question for Data Table query performance, there are a couple of notes:

1. Speaking of indexes, we do find that there is an index missing on Data Tables that does significantly boost performance. Please see: https://www.ptc.com/en/support/article?n=CS261063

2. With the index above in place, we do have customers with 200-300k entries in their Data Tables querying this data successfully (note: this is mostly static data)

3. Data Tables (as of current releases of ThingWorx) are not thread safe for writing/updating rows.  Again, these were designed for slower operations and if we perform DT updates during ingestion, we can easily cause threading/update issues.

4. Streams were designed with multi-threaded inserts/updates in mind

5. If we always query by Key or add the index in #2, we can remain performant with 2-3x as much data as documented.  However, if the desire is to maintain 0.5 million rows, then we'll likely be better off designing our own storage structure (you can configure JDBC connections to easily manipulate your own storage with all the needed indexes).

 

Please let me know if I can clarify any of the points mentioned above.

View solution in original post

7 REPLIES 7
Tudor
12-Amethyst
(To:seanccc)

Thank you for your question.  First, a quick discussion about various use cases for these storage methods:

Data tables are generally meant for static data that changes infrequently (like serial numbers).

 

Streams and value streams are meant for ingesting extremely quickly updating data (e.g. raw device data).  Because Streams and Value Streams are optimized for ingestion, they have been designed with fewer indexes in place.  Updating indexes significantly slows down ingestion, and the design for these Stream tables is meant to absorb tens of thousands of data points per second.  In our testing we have found that adding additional indexes (even if it theoretically helps with fetching data) significantly slows down table writes and potentially leads to index collisions when multiple transactions attempt writes at the same time.

 

In terms of your question for Data Table query performance, there are a couple of notes:

1. Speaking of indexes, we do find that there is an index missing on Data Tables that does significantly boost performance. Please see: https://www.ptc.com/en/support/article?n=CS261063

2. With the index above in place, we do have customers with 200-300k entries in their Data Tables querying this data successfully (note: this is mostly static data)

3. Data Tables (as of current releases of ThingWorx) are not thread safe for writing/updating rows.  Again, these were designed for slower operations and if we perform DT updates during ingestion, we can easily cause threading/update issues.

4. Streams were designed with multi-threaded inserts/updates in mind

5. If we always query by Key or add the index in #2, we can remain performant with 2-3x as much data as documented.  However, if the desire is to maintain 0.5 million rows, then we'll likely be better off designing our own storage structure (you can configure JDBC connections to easily manipulate your own storage with all the needed indexes).

 

Please let me know if I can clarify any of the points mentioned above.

zyuan1
18-Opal
(To:Tudor)

Adding some information to Tudor's post:

 

The Stream and ValueStream are time-series data, so they are better sorted than Datatable, the search could be faster. Especially in 8.4 we introduced InfluxDB as the additional Persistence Package for Time-series data storage, and it's proved that searching with InfluxDB will be a few times faster than PostgreSQL, so for sure it will be faster than using Datatable.

Tudor
12-Amethyst
(To:zyuan1)

Agreed: Stream/Value Stream queries are faster if you need to find time series data, but I don't believe this is the use case here.  If you're looking up relational data (time insensitive), the lack of indexing will degrade performance compared to data tables.

seanccc
17-Peridot
(To:Tudor)

@Tudor ,

 

So,

for Stream , the query performance is good if set query condition on time ONLY, but should avoid to query on the other properties as they're not indexed . 

for DataTable, after fixing https://www.ptc.com/en/support/article?n=CS261063,  the query performance is good to query any property as the JSONB field field_values is indexed (less than 300,000 records), but should avoid to write to DataTable in high concurrency. 

is it Right ? 

Tudor
12-Amethyst
(To:seanccc)

@seanccc  -- your understanding is correct.  The more we can leverage indexes already in the DB, the better our performance.

seanccc
17-Peridot
(To:Tudor)

@Tudor @zyuan1 ,

 

Thank you for your replies , it's clear for me to the use case of the data table and stream.  

 

Regards,

Sean

Another thing to speedup DataTable Queries it's to use "values" parameter instead of "query" parameter, actually (I'm not 100% sure if it's with all TW realeases) .

 

You can add custom indexes on TW DataTables through DataTable > Configuration > Index Settings, then in order to use this indexes you MUST use "values" parameter instead of "query" parameter. This has the limitation that you can only query for exactly the values on the index but it speeds up a lot queries.

 

Sample of querying by index fields (two fields on the same index):

 

// -- you defined myIndexedField1,myIndexedField2 as compound index on myDataTable 

var values = Resources["InfoTableFunctions"].CreateInfoTable();;
values.AddField({ name: "myIndexedField1", baseType:"STRING" });
values.AddField({ name: "myIndexedField2", baseType:"STRING" });
values.AddRow({
   myIndexedField1: valueToQueryIndexedField1,
  myIndexedField2: valueToQueryIndexedField2,
});

myDataTable.QueryDataTableEntries({
    values: values,
    source: something or undefined,
    query: something* or undefined
});

* on this case first it will filter by values and source (both indexed) and then with the results it will apply the query parameter
Top Tags