5 Common Mistakes to Developing Scalable IoT Applications
by Tori Firewind and the IoT EDC Team
To build scalable applications, it’s necessary to identify common mistakes and avoid them at the early stages of development. In an expert session this past month, the PTC Enterprise Deployment Team elaborated on why scalability is important and how to avoid the common development pitfalls in IoT. That video presentation has been adapted here for visual consumption of the content as well.
What is Scalability and Why Does it Matter
Enterprise ready applications can scale and easily be maintained, which is important even from day 1 because scalability concerns are the largest cause for delays to Go Lives.
Applications balance many competing requirements, and performance testing is crucial to ensure an application is ready for Go Live. However, don't just test how many remote assets can connect at once, but also any metrics that are expected to increase in time, like the number of remote properties per thing, the frequency of reporting from those properties, or the number of users accessing the system at once. Also consider how connecting more assets will affect the user experience and business logic, and not just the ability to ingest data.
Common Mistake 1: Edge Property Updates
Because ThingWorx is always listening for updates pushed from the Edge and those resources are always in use, pulling updates from the Foundation side wastes resources.
Fetch from remote every read is essentially a round trip, so it's slower and more memory intensive, but there are reasons to do it, like if the quality tag is needed since the cache doesn't store it. Say a property is pushed at 11:01, and then there's a network issue at 11:02. If the property is pulled from the cache, it will pull the value sent at 11:01 without any indication of there being a more recent value on the Edge device. Most people will use the default options here: read from server cache, which relies on the Edge to push updates, and the VALUE push type, and configuring a threshold is a good idea as well. This way, only those property updates which are truly necessary are sent to the Foundation server. Details on property aspects can be found in KCS Article 252792.
This is well documented in another PTC Community post. This approach is necessary and considered a best practice if there is event logic which depends on multiple properties at once. Sending all of the necessary properties to determine if an event should fire in one Infotable ensures there is no need to query the database each time a property update comes in from the Edge, which ensures independent business logic and reduces the load on the database to improve ingestion performance.
This is a very broad topic and future articles will address it more specifically. The When Disconnected property aspect is a good way to configure what happens with Edge property values in a mass disconnect scenario. If revenue depends on uptime, consider losing any data that changes while a device is disconnected. All of the updates can be folded into a single value if the changes themselves aren't needed but an updated value is needed to populate remote properties upon reconnect. Many customers will want to keep all of their data, even when a device is offline and use data stores. In this case, consider how much data each Edge device can store (due to memory limitations on the devices themselves), and therefore how long an outage can last before data is lost anyway. Also consider if Foundation can handle massive spikes in activity when this data comes streaming in. Usually, a Connection Server isn't enough. Remember that the more data needs to be kept, the greater the potential for a thundering herd scenario.
Handling a thundering herd scenario goes beyond sizing considerations. It is absolutely crucial to randomize the delay each device will wait before attempting to reconnect. It should be considered a requirement to have the devices connect slowly and "ramp up" over time for multiple reasons. One is that too much data coming in too fast could overwhelm the ingestion queue and result in data loss. Another is that the business logic could demand so many system resources, that the Foundation server crashes again and again and cannot be recovered. Turning off the business logic it isn't possible if the downtime is unexpected, so definitely rely instead on randomized reconnection times for Edge devices.
Common Mistake 2: Overlooking Differences in HA
To accommodate a shared thing model across many servers, changes had to be made in how the thing model is stored and the model tree is walked by the Foundation servers. Model information is no longer cached at the Thing level, and the model tree is therefore walked every time model information is needed, so the number of times a Thing is directly referenced within each service should be limited (see the Help Center for details).
It's best to store whatever information is needed from a Thing in an Infotable, making the Things[thingName] reference a single time, outside of any loops. Storing the property definitions outside of the loop prevents the repetitious Thing references within the service, which otherwise would have occurred twice for each property (for both the name and the description), and then again for every single property on the Thing, a runtime nightmare.
Certain states previously held in memory are now shared across the cluster, like property values, Thing states, and connection statuses. Improvements have been made to minimize the effects of latency on queries, like how they now only return property values on associated Thing Shapes or Thing Templates. Filtering for properties on implementing Things is still possible, but now there is a specific service to do it, called GetThingPropertyValues (covered in detail in the Help Center).
In the script shown above, the first step is a query to get the names of all implementing things of a particular Thing Shape. This is done outside of any loops or queries, so once per service call. Then, an Infotable is built to store what would have been a direct reference to each thing in a traditional loop. This is a very quick loop that doesn't add much by way of runtime since it is all in memory, with no references to the thing model or the database, instead using the results of the first query to build the Infotable. Finally, this thing reference Infotable is passed into the new service GetThingPropertyValues to retrieve all of the property info for all of these things at once, thereby only walking the thing model once. The easiest mistake people would make here is to do a direct thing reference inside of a loop, using code like Things[thingName].Get() over and over again, thereby traversing the thing model repeatedly and adding a lot of runtime.
QueryImplementingThingsOptimized is another new service with new parameters for advanced configuration. Searches can now be done on particular networks or to particular depths, and there's an offset parameter that allows for a maximum number of items to be returned starting at any place in the list of Things, where previously if you needed the Things at the end of the list, you had to return all of the Things. All of these options are detailed in the Help Center, as well as the restrictions listed in the image above.
Common Mistake 3: Async Service Misuse
Async services are sometimes required, say if a user has to trigger many updates on many remote things at once by the click of a button on a mashup that should not be locked up waiting for service completion. Too many async service calls, though, result in spikes in activity and competition for resources.
To avoid this mistake, do not use async unless strictly necessary, and avoid launching too many async threads in parallel. A thread dump will show how many threads there are and what they are doing.
Common Mistake 4: Thread Pool Overload
Adding more threads to the pool may be beneficial in certain circumstances, like if the threads are waiting on other resources to complete their tasks, look stuff up in the database (I/O), or unlock data that can only be accessed one thread at a time (property writes). In this case, threads are waiting on other resources, and not the CPU, so adding more threads to the pool can improve performance. However, too many threads and performance degradation will occur due to increased contention, wasted CPU cycles, and context switches.
To check if there are too many or not enough threads in the pool, take thread dumps and time the completion of requests in the system. Also watch the subsystem memory usage, and note that the side of the queue should never approach the max. Also consider monitoring the overall performance of the system (CPU and Memory) with a tool like Grafana, and remember that a good performance test properly exercises all of the business logic and induces threads in a similar way to real world expectations.
Common Mistake 5: Stream Etiquette
Upserts, or updates to database tables, are expensive operations that can interfere with ingestion if they are performed on the wrong tables. This is why Value Stream and Stream data should never be updated by end users of the application.
As described in the DGIS document on best practices, aggregation is the key to unlocking optimal performance because it reduces the size of database tables that require upserts. Each data structure shown here has an optimal use in a well-designed ThingWorx application.
Data Tables are great for storing overview information on all of the Things in one view, and queries on this data source are the fastest. Update this data source as often as possible (by timer), allowing enough time for updates to be gathered and any necessary calculations made. Data Tables can also be updated by end users directly because each row locks one at a time during updates. Data Tables should be kept as small as possible to improve performance on mashups, so for instance, consider using one to show all Things per region if there are millions of Things. Roll up information is best stored here to avoid calculations upon mashup load, and while a real-time view of many thousands of things at once is practically impossible, this option allows for a frequently updates overview of many things, which can also drill down to other mashup views that are real-time for one Thing at a time.
Value Streams are best used for data ingestion, and queries to these should be kept to a minimum, largely performed by the roll up logic that populates the Data Tables mentioned above. Queries that chart all of the data coming in are best utilized on individual Thing views so that only a handful of users are querying the same data sources at a time. Also be sure to use start and end dates and make use of the "source" field to improve query performance and create a better user experience. Due to the massive size of the corresponding database tables, it's best to avoid updating Value Streams outside of the data ingestion process altogether.
Streams are similar, but better for storing aggregated, historical data. Usually once per day or per week (outside of business hours if possible), Value Stream data will be smoothed or reduced into less data points and then stored into Streams. This allows for data to be stored for longer periods of time on the server without using up as much memory or hurting query performance. Then the high volume ingested data sources can be purged frequently, as discussed below.
Infotables are the most memory intensive, and are really designed to hold only a small number of rows at a time, usually to facilitate the business logic. Sometimes they will be stored in Streams or Data Tables if they aren't expected to grow larger (see the DGIS Coffee Machine App for an example). Infotables should never be logged; if they are used to transmit Edge property updates (like in the Property Set Approach), they should be processed into other logged (usually local) properties.
Referring to the properties themselves is how to get real-time information on a mashup, say by using the GetProperties service and its auto-update option, which relies on internal websockets. This should be done on individual Thing views only, and sizing considerations need to be made if there will be many of these websockets open at once, say if there are many end users all viewing real-time data at a time.
In the newer versions of ThingWorx, these cannot be updated directly, so find the system object called ThingWorxPersistenceProvider and use the service UpdateStreamDataProcessingSettings. ThingWorx Foundation processes data received from remote devices in batches in order to manage the data flow and reduce database churn. All of these settings configure how large those batches are and how frequently they are flushed to the database (detailed in full in KCS Article 240607). This is very advanced configuration that heavily depends on use case and infrastructure, but some info applies to most people: adjusting the scan rate is usually not beneficial; a healthy queue should never approach the max limit; and defaults differ by database because they function differently.
InfluxDB generally works better when there are less processing threads and higher numbers of things per thread, while PostgresDB can have a lot of threads, preferably with less things per thread. That's why the default values shown here are given as the same number of threads (and this can be changed), but Influx has a larger block size and size threshold because it can handle more items per thread.
Value Streams ingest all data into the Foundation server, and so the database tables that correspond with these data sources grow very large, very quickly and need to be purged often and outside of business hours, usually once a day or once per week. That's why it's important to reduce the data down to less points and push them into Streams for historical reference. For a span of years, consider a single point a day might be enough, for a span of hours, consider a data point a minute. Push aggregated data into Streams and then purge the rest as soon as it is no longer needed.