Stream entries and query speed

iguerra · ‎Oct 09, 2019

I have a stream containing "alarms"

the source field is the name of the equipment who generated the alarm, like MACHINE1 MACHINE2 ....

I need to query entries for all records with two or more source names (not just one)

I Use QueryStreamEntriesWithData, keeping source parameter as undefined and using a query with the "IN" syntax, like {source IN array_of_elements} (where each element is a machine name).

This query works ... but it is slow if I have millions of rows and searching for many months (I also use time range obviously, and limit result to 2000 records).

is the source field indexed ? is this query executed at the database or thingworx layer ?

is there a faster way ?

smainente · ‎Oct 09, 2019

It is important to use (some of) the following parameters : source, startDate, endDate, sourceTags or Tags when querying large streams with QueryStreamEntriesWithData.

The query filter is always applied in-memory on the platform side.

For example :

var query = {
 "filters": {
   "type": "EQ",
   "fieldName": "name",
   "value": "aa"
 }
};
var result =  Things["myStream"].QueryStreamEntriesWithData({query: query});

The code above is fetching the entire stream from the DB, the filter is then applied on the platform in-memory.

Impacted services : QueryDataTableEntries, QueryStreamData, QueryStreamEntries, QueryStreamEntriesWithData

Note that maxItems is also applied in-memory when used in conjunction with the query filter :

var query = {...};
var result =  Things["myStream"].QueryStreamEntriesWithData({maxItems: 1, query: query});

This code is also fetching the entire stream from the DB

zyuan1 · ‎Oct 09, 2019

In appendix to this answer, I have 2 suggestions that might help.

1. Put your Stream data in InfluxDB (the new persistence provider after 8.4)

The InfluxDB is designed for timeseries data processing, so for Streams, the InfluxDB query speed is about 3 times faster than pgSQL. While since your Index filtering is done in TWX platform side, you'll just get a small improve on the overall process.

2. Give up using Stream and put the data in External JDBC database instead.

In this way, you can store more data, and use StoredProcedure to speed up the search.

The database would operate much faster than TWX platform in Filtering, and when you have a large size of Data, it's recommanded not to save in your primary Persistence provider (it will slow your platform speed, and oversized Stream table may cause a index overflow, it would require a Vacuum process and TWX stop working by the time.)

Constantine · ‎Oct 10, 2019

Hello,

The source field is indexed, but this index is only used when you use the corresponding parameter. Anything that is specified as query is applied in ThingWorx after the result set is fetched from the database.

For querying by multiple sources a series of queries (one per filtered source) + merging results might still work faster than using query parameter (depends on the number of queries of course).

/ Constantine

iguerra · ‎Oct 10, 2019

Thank you all !

so a summary on where filtering is done may be this ?

var params = {
	oldestFirst: undefined /* BOOLEAN */,	// ??? sort done at DB level or platform ?
	maxItems: undefined /* NUMBER */,   	// done ad DB if query is not used, differentry at platform level 
	sourceTags: undefined /* TAGS */,   	// ??? dont know ...
	endDate: undefined /* DATETIME */, 		// INDEXED, filtering done on DB 
	query: undefined /* QUERY */,			// WARNING: done ad platform level (slower)
	source: undefined /* STRING */,   		// INDEXED, filtering done on DB
	startDate: undefined /* DATETIME */,  	// INDEXED, filtering done on DB
	tags: undefined /* TAGS */       		// ??? dont know ...
};
var result = me.QueryStreamEntriesWithData(params);

I'll try what suggested by Costantine.

smainente · ‎Oct 10, 2019

Note that maxItems is "performed" on the DB if you don't use the query parameter.

iguerra · ‎Oct 10, 2019

Yes thanks, I updated the post

smainente · ‎Oct 10, 2019

It may depend on the persistence provider, but in 8.4.5 + Postgresql :

tags is used to query the DB (but tags are not indexed)
oldestFirst is used to query the DB (timestamp is generally indexed)
sourceTags are applied in-memory on the platform