Those who have been working with ThingWorx for many years will have noticed the work done around ingress stress testing and performance optimization. Adding InfluxDB as a time-series data persistence provider really helped level up these capabilities while simultaneously decreasing the overall resources required by the infrastructure. However with this ease comes a hidden challenge: query and data processing performance to work it into something useful.
Often It's Too Much Data
In general most customers that I work with want to collect far too much data -- without knowing what it will be used for, or what processing will be required in order to make it usable and useful. This is a trap in general with how many people envision IoT projects, being told by infrastructure providers that cloud storage and compute resources are abundant and cheap and that they should get as much data as possible. This buildup of data means that more effort needs to be spent working it into something useful (data engineering/feature extraction) and addressing common data issues (quality, gaps, precision, etc.). This might be fine for mature companies with large data analytics teams; however this is a makeup that I've only seen in the largest of our customers. Some advice - figure out what you need and how you'll use it, and then collect that. Work on extracting value today rather than hoping that extra data collected now will provide some insights years from now.
Example - Problem Statement
You got your Thing Model designed, and edge devices connected. Now you've got data flowing in and being stored every 5 seconds in InfluxDB. Great progress! Now on to building the applications which cover the various use cases. The raw data is most likely going to need to be processed and potentially even significantly transformed into other information in order to make it useful. Turning a "powered on and running" BOOLEAN to an "hour meter" INTEGER is a simple example. Then you may need to provide a report showing equipment run time hours by day over a month. The maintenance team may also have asked to look for usage patterns which lead to breakdowns, requiring extracting other data points from the initial one like number of daily starts, average daily run time, average time between restarts. The problem here is that unless you have prepared these new data points and stored them as well (say in a Stream), you are going to have to build these data sets on the fly, and that can be time and resource intensive and not give you the response time expected. As you can imagine, repeatedly querying and processing large volumes of unchanging raw data is going to have resource and time implications - so this is why data collection and data use need to be thought about separately.
In the above examples, the key is actually creating new data points which are calculated progressively throughout normal operation. This not only makes the information that you want available when you need it - in the right format - but it also significantly reduces resource requirements by constantly reprocessing raw data. It also helps managing data purging, because as you create and store usable insights, you can eventually just archive away your old raw data streams.
Direct Database Queries vs. Thingworx Data Services
Despite the above being a rule of thumb, sometimes a simple well structured database query can get you exactly what you need and do so quite quickly. This is especially true for InfluxDB when working with extremely large time-series datasets. The challenge here is that ThingWorx persistence providers abstract away the complexity of writing ones own database queries, so we can't easily get at the databases raw power and are forced to query back more data than needed and work it into a usable format in memory (which is not fast).
Leveraging the InfluxDB API using the ContentLoader Technique
As InfluxDBs API is 100% REST, we can access it using in-built ThingWorx Content Loader services. Check out this demonstration and explanation video where I talk about how to interact directly with InfluxDB in order to crush massive time-series data and get back much more usable and manageable data sets. It is important to note here that you should use a read-only database user here, as you should never modify the ThingWorx databases to avoid untested scenarios which may lead to data corruption.
An interesting post, @geva , thanks for sharing.
Really like how you explain somewhat complex concepts / topics in simple words and vivid examples.
Regarding the "too much data" point...
I usually find myself promoting the "store all the data you have" approach, but I have data science background and have seen some interesting insights from seemingly unrelated and surprising sensors / data.
When talking about industrial equipment and predicting its failures, which are rare events, decision not to store some data or storing with lower granularity can backfire.
Totally agree that, yes, it introduces some overhead - both technical and organizational, which can be costly and is not always justifiable and feasible.
Thanks a lot @DmitryTsarev . I've grown up mostly in an IT and OT context where hardwired networks and continuous power supply don't really limit what is collected and stored so this has been normal for me too. In the telematics/M2M/IoT world the constraints of power, network, bandwidth and network effects when dealing with large quantities of connected devices has gotten me thinking differently to ensure scale and performance. Our analytics team finished the job of convincing me to frame things as an analytical problem which addresses specific business problems.
Expanding on the topic - there are different ways of thinking of that learning/detailed/R&D data. You can capture, collect, analyse and learn from higher fidelity/volumes of data in an alternate channel (like how your equipment operates) than the connected asset status one. We have customers who collect these sorts of datasets on USB memory when they do maintenance visits.