Solved: Re: Query Performance : Data Table VS Stream

seanccc · ‎May 09, 2019

Hi,

I have question about the query performance for Data Table and Stream.

In PostgreSQL, the Data Table and Stream has the following table structure:

	DataTable	Stream
Columns	entry_id\|entity_id\|entity_key\|source_id\|source_type\|tags\|time\|field_values\|full_text\|location	entry_id\|entity_id\|source_id\|source_type\|tags\|time\|field_values\|full_text\|location
Index	entity_key field_values	(entity_id, time) entity_id
Constraints	entry_id (entity_id, entity_key)	(entity_id,source_id,time,source_type)

According to the comparing above , the questions are :

According to the offical document , for the Data Table, "Anything greater than 100,000 rows will gradually begin to cause queries to perform slower". Is it same for Stream ?
The Data Table has index on field_values while Stream doesn't , Does it means the query performance on the field of Data Shape for data table should be much faster than Stream using JDBC query as long as not querying on time field ? Does QueryDataTableEntries has the better performance than QueryStreamEntriesWithData as well in this case ?
If always query on the index field , will query Data Table still become slow after storing more than 100,000 records ?
Why not add field_values to index for Stream ? to avoid the increased data volume ?

Regards,

Sean

Tudor · ‎May 13, 2019

Thank you for your question. First, a quick discussion about various use cases for these storage methods:

Data tables are generally meant for static data that changes infrequently (like serial numbers).

Streams and value streams are meant for ingesting extremely quickly updating data (e.g. raw device data). Because Streams and Value Streams are optimized for ingestion, they have been designed with fewer indexes in place. Updating indexes significantly slows down ingestion, and the design for these Stream tables is meant to absorb tens of thousands of data points per second. In our testing we have found that adding additional indexes (even if it theoretically helps with fetching data) significantly slows down table writes and potentially leads to index collisions when multiple transactions attempt writes at the same time.

In terms of your question for Data Table query performance, there are a couple of notes:

1. Speaking of indexes, we do find that there is an index missing on Data Tables that does significantly boost performance. Please see: https://www.ptc.com/en/support/article?n=CS261063

2. With the index above in place, we do have customers with 200-300k entries in their Data Tables querying this data successfully (note: this is mostly static data)

3. Data Tables (as of current releases of ThingWorx) are not thread safe for writing/updating rows. Again, these were designed for slower operations and if we perform DT updates during ingestion, we can easily cause threading/update issues.

4. Streams were designed with multi-threaded inserts/updates in mind

5. If we always query by Key or add the index in #2, we can remain performant with 2-3x as much data as documented. However, if the desire is to maintain 0.5 million rows, then we'll likely be better off designing our own storage structure (you can configure JDBC connections to easily manipulate your own storage with all the needed indexes).

Please let me know if I can clarify any of the points mentioned above.

View solution in original post

Tudor · ‎May 13, 2019