Not as simple a question as it sounds. There are more options than some might think, and choosing the right one can be the difference between a well-performing application and one that struggles as it scales. There are options both internal and external to the Thingworx platform, and each has its own use cases and cost considerations.
Internal to Thingworx there are currently two options for the storage provider: PostgreSQL and DataStax Enterprise (DSE) Cassandra (Neo4j will only be supported for legacy customers, and the H2 version is not yet released). PostgreSQL is required for storing the Thingworx model structure and is an open source technology, meaning no additional cost. It performs well up to an estimated 500 GB of data storage (a rough estimate, dependent on use case). For very high volume data, DSE is the choice; it performs well for large data sets.
External to Thingworx you can use virtually any data storage technology that provides a JDBC connector, or even one with a driver that can be used to create a Thingworx Extension via our SDK or edge SDKs. The platform knows how to use JDBC drivers, so this can easily be used to connect to relational data storage like Oracle or SQL Server.
The first real question to ask when choosing where to store data is: what does my data look like? Many systems are adapted or migrated from legacy systems, which may include relational data; others simply have this structure by necessity. If retrieving the data will require complex SQL (joins, LIKE clauses, cursors, temp tables, etc.), then store it in a true relational database. If it is simple historical data, time series data, or data that does not require compounding or recursive calculation to be useful, then keep it in platform data storage.
The second question to ask is: how much data will I be storing? This adds a bit of complexity to where data is best stored. There is no hard limit on the number of records in any data structure; however, Thingworx Platform storage is optimized to store and retrieve time series data using the ValueStream and Stream types built into the Platform. This is the most common IoT data structure, and in this case you can refer back to the previous information when choosing PostgreSQL or DSE as the backend storage. DataTables can be kept in Platform storage when they contain small data sets (around 100,000 records or less), as they are intended for largely static data structures. Retrieval will begin to slow quickly once a DataTable grows larger than this. This is because Thingworx currently does a full scan of the data in this type of structure when querying: all of the logic for the query or filter is executed on the platform, not on the database (this will likely change in a future version). Small amounts of data, on the other hand, can be quickly loaded and parsed in memory.
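The full-scan behavior described above can be sketched in plain server-side JavaScript. This is a simplified model of how an in-memory DataTable query behaves, not the actual Thingworx implementation; the function and field names are hypothetical.

```javascript
// Sketch (assumed model, not platform source): a DataTable query loads every
// row and scans it, so cost grows linearly with the number of rows. This is
// why large DataTables slow down quickly.
function queryDataTableEntries(rows, predicate) {
  const result = [];
  for (const row of rows) {
    if (predicate(row)) result.push(row); // full scan, row by row
  }
  return result;
}

// Example: a 100,000-row table queried on a non-indexed field.
const rows = Array.from({ length: 100000 }, (_, i) => ({
  id: i,
  status: i % 2 === 0 ? "active" : "retired",
}));

const active = queryDataTableEntries(rows, r => r.status === "active");
console.log(active.length); // 50000
```

The point is simply that every query pays for every row, which is why the ~100,000-record guideline above matters.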
NOTE (Neo4j specific): In DataTables, if you add an index to a column, these indexes are used when calling "FindDataTableEntries" but not when using "QueryDataTableEntries".
Streams and ValueStreams, however, are optimized for time series data. In these structures Thingworx has built-in datetime filters that allow very fast retrieval of data based on a date range. When the number of records returned after the date range is applied is still very large (100,000-200,000), you may see a drop in query performance at that point. Just as before, all records matching the date filter are returned to the Platform, and further querying and filtering are done in memory.
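The two-stage behavior above can be sketched the same way. This is an assumed model of Stream querying (the function name and entry shape are hypothetical): the datetime range is applied first against the optimized store, and any additional filter runs in memory over whatever the range returned.

```javascript
// Sketch (assumed behavior, not platform source): date range first, then any
// extra filter is applied in memory on the platform side.
function queryStreamEntries(entries, startDate, endDate, filter) {
  // Fast path: the backing store retrieves by date range efficiently.
  const inRange = entries.filter(
    e => e.timestamp >= startDate && e.timestamp <= endDate
  );
  // Slow path: if the range still matches 100,000+ rows, this in-memory
  // pass is what you pay for, regardless of how selective the filter is.
  return filter ? inRange.filter(filter) : inRange;
}

// One entry per minute for a day, alternating between two sources.
const dayStart = new Date("2017-01-01T00:00:00Z").getTime();
const entries = Array.from({ length: 1440 }, (_, i) => ({
  timestamp: dayStart + i * 60000,
  source: i % 2 === 0 ? "sensorA" : "sensorB",
  value: i,
}));

const morning = queryStreamEntries(
  entries,
  dayStart,
  dayStart + 6 * 3600000,     // first six hours
  e => e.source === "sensorA" // applied in memory after the range
);
console.log(morning.length);
```

The practical takeaway: make the date range do as much of the narrowing as possible, because everything after it is an in-memory scan.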
The querying/retrieval of data is commonly where the greatest performance issues are seen. Using a JDBC connector to send the query to the database (even if it is DSE or PostgreSQL) can help, or, if the historical data is not queried regularly, you can move it to a separate Thingworx data store (another DataTable or Stream).
That leaves large data sets of non-time series data as the outlier. This scenario could perform equally well (or poorly) depending primarily on how the data will be retrieved. If there are loose relationships within the data that need to be exercised, then a relational system that allows those operations to be executed on the database server is preferred. Sequential data that does not need this type of processing could be stored in DSE.
This is a basic outline of the considerations when designing data storage for your application. Most use cases are unique and may have additional considerations around process and cost.
You say that:
"NOTE: In datatables if you add a index to a column, these indexes are used when calling "FindDataTableEntries" but not when using "QueryDataTableEntries"."
Then the only comparison available for querying a DataTable through indexes is equality, not greater than, less than, or an AND/OR condition? Since FindDataTableEntries expects a "JSON" object of values to match as its parameter, that is a really limited use of indexes?
So we don't have a fast querying mechanism for any kind of standard Thingworx data (ValueStream, Stream, DataTable)?
As you mentioned, we can connect to Oracle DB or any other database based on the connectors. Are there any challenges with this approach?
Is there any case study or success story where multiple persistence providers were used effectively?
The primary challenge with this approach is that these are not built-in persistence providers on the platform, so you must create/script all of the queries and writes to that database independently of any platform function (like writing property updates). The other main issue we have seen is that if very large amounts of data move in either direction across this single connection, you can see some latency. It is possible to create multiple JDBC connection Things to improve this, but keep in mind that JDBC was not designed for very high volume data transfer.
We have several customers that have used JDBC as their primary data storage, however I do not know of any published case studies on this.
Adam, great post! It was very helpful.
Siddharth/Carles, the main point of Adam's post was to further expound on the old IT answer: "it depends". Thingworx is not a magic bullet that answers all data caching and big data questions. How you persist data depends on how you plan to use it (read), the velocity at which it changes (write), and many other factors. Just like many other data caching and NoSQL platforms, Thingworx is not a replacement for relational databases in every case. And relational databases have performance limitations that NoSQL/data caching platforms like the one Thingworx sits on do not have.
The key (to the point I think Adam was making) is to design your overall data strategy around the needs and requirements of your overall application and the pros/cons of each technology. The answer is not either/or (either Thingworx or an RDBMS); it's both, implemented to maximize the pros of each and reduce the impact of the cons.
One approach is to access the RDBMS asynchronously and parse/restructure that data into smaller, more digestible structures in Thingworx data stores. This allows the pros of the RDBMS (complex SQL processing, etc.) to shine, and by restructuring the data into faster-access shards in Thingworx, your application can benefit from the faster data access the NoSQL backend of Thingworx can provide. Just a thought.
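That asynchronous restructuring idea can be sketched roughly as follows. All names here are hypothetical (none are Thingworx APIs): the heavy SQL is a stand-in function, and the "shard" is a plain Map standing in for a small Thingworx DataTable.

```javascript
// Stand-in for a complex query (joins/aggregation) the RDBMS runs well.
function runComplexSql(orders) {
  const totals = {};
  for (const o of orders) {
    totals[o.region] = (totals[o.region] || 0) + o.amount;
  }
  return Object.entries(totals).map(([region, total]) => ({ region, total }));
}

// Stand-in for a small, fast Thingworx-side data store (the "shard").
const fastCache = new Map();

// Run on a timer rather than per request, so application reads never
// wait on the RDBMS round trip.
function refreshCache(orders) {
  for (const row of runComplexSql(orders)) {
    fastCache.set(row.region, row.total);
  }
}

// Example data and a cheap read after one refresh cycle.
const orders = [
  { region: "EMEA", amount: 100 },
  { region: "APAC", amount: 250 },
  { region: "EMEA", amount: 50 },
];
refreshCache(orders);
console.log(fastCache.get("EMEA")); // 150
```

The design choice is simply moving the expensive query off the request path: the RDBMS does what it is good at in the background, and the application only ever reads the small pre-digested result.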
Could you update the information in this article to summarize the data storage options available now (PostgreSQL, MSSQL, SAP HANA)? H2 has been released, and maybe it would be a good idea to remove any mention of Neo4j completely?