Durable Queues Are Here by Tori Firewind, Principal Cloud Architect Introduction Well folks, the durable queue is here, and it is… durable! We tried everything in our dev ops arsenal to bring it down, but no matter what we threw at it, Event Hub stayed up. No data was lost in any test scenario. ThingWorx 10.0 is a remarkably more mature and stable offering than ever before with its new use of Kafka to prevent data loss, as well as internal queue management and queue diagnostics features. As we announced last quarter, new diagnostics features allow us to record diagnostic data from the moment a problem starts, ensuring RCA can begin immediately, without further time spent waiting for issues to occur again. These are highly configurable, and PTC is ready to support customers opting-in to acceleration-based diagnostics! Coming out now is the new internal throttling mechanism within ThingWorx, which ensures that even when queues max out, regardless of what those queues are doing, ThingWorx remains up and capable of other activity. In some of our failed scale test scenarios, the event queue was maxed out for many hours, without any subsequent out of memory crash of the Platform. It was remarkably durable! Even better if the durable queue is opted-in, because then those events also happen faster and more reliably. The durable events fire immediately within the Platform when a durable property is updated. Both of these go to Event Hub simultaneously. The load within Event Hub is balanced independently and processed more quickly than by ThingWorx, improving overall performance of both property updates and events, while still leaning heavily on ThingWorx for the web access and data redirect and storage. When the lag is well controlled, the subject of most of the rest of this article, the property values go in, they come out within milliseconds, and the latency is not significant in spite of the added component. And of course, if something happens to the Platform in the meantime, never fear, for the data truly is preserved and accessible within Event Hub. Data loss is a thing of the past with ThingWorx 10.0! Configuration Situation The one sacrifice of durability is scale, which can be challenging with Event Hub. There are some key considerations when optimizing ThingWorx for throughput, which should be considered necessary, as well as when sizing Event Hub. ThingWorx Throughput Optimization Within ThingWorx, go to the system object and edit it (may require admin permissions). Modify the Configuration to reduce the overall number of threads, which in turn reduces the distribution of the Event Hub load, allowing each to be processed more quickly. Also lower the buffer scan rate for persistent properties so they flush more often. Especially lower the max number of items before flushing, the buffer that usually delays writes to the database so as not to overwhelm it. That is less of a factor here as Event Hub has an internal load balancer. Event Hub is better for throughput than a database would be, and these are the settings to put on all opt-in queues for optimal performance. Event Hub Sizing and Partition Optimization Within Event Hub, there are several types of processing units. We will focus on the lower two tiers, as the highest tier is very expensive and less common to use, and the concept is the same as the Premium (mid) tier. The Standard Tier uses TUs, a.k.a. “Throughput Units”, and these are less performant but also less resource intensive, and so much less expensive. There is a maximum of 40 TUs overall, and 32 partitions per Event Hub in Standard. There is one Event Hub each for Logged Properties, Persistent Properties, and Unordered Events in both Standard and Premium. Premium Tier instead uses PUs, a,k.a. “Processing Units” and these are more performant and more resource intensive, with lower commit request latency, meaning the commits within Event Hub are happening faster. The data is received faster, and the cost for this is greater, but the stability is also greater and the risk of runaway lag or eventual data loss much lower. The risks are much more mild than before, and recovery is discussed below. In Premium, there is a maximum of 100 partitions per Event Hub, with 200 total per PU. There is a maximum of 16 PUs, and these go only in increments of 2. There are diminishing returns with more resources, however, directly proportional to the number of things. More things overall will reduce the write capabilities within Event Hub, as more CPU resources have to be spent on the network communication portion of the data exchange. Low Partitions Medium Partitions High Partitions It is better to use more partitions than less, and a higher number of partitions will result in less latency and lower mean lag. There is always some lag, however, as it is calculated from the number of items queued versus completed. Both of these queues are very active, and healthy lag is usually between 60 – 80% of the total property wps, with peaks that do not increase over time. Sometimes the lag can be spikey, which must be considered in the alerting infrastructure. The mean load should be significantly less than 16 PUs and load tested, so that there is room to scale up and recover any lag that accrues from the unpredictable nature of production systems. Always leave room for spikes! Recovery of the Event Hub The short version: do not modify the partitions to resolve runaway lag. If more partitions are added while the lag is falling behind, then instead of helping to catch up, they significantly delay the recovery. Anything currently in Event Hub will not be distributed across the new partitions, only new things that are added later, but all of the partitions will still be polled for data, including the new ones, which will slow things down even more. The right way to deal with runaway lag is to increase the TUs or PUs to a decently higher setting temporarily, let the lag catch up, and then increase the number of partitions, and then wait and see how the server responds before finally downsizing once again. It is important to consider that there is a maximum size for processing data, and a maximum number of partitions per Event Hub, creating a hard upper limit for performance and scale. Make sure any Event Hub instance is sized small enough to allow for upsize in the case of runaway lag. Edge load is not guaranteed to remain perfectly steady, generally speaking, there can be surprise disconnections, reconnections, and spikes in utilization. There really is no other way to ensure no data loss occurs to runaway lag, especially since there usually is no way to turn off the Edge load at will in Production. Lag grew to the hundreds of thousands quickly, was surely beyond recovery at this size. The partitions were increased at 11:45 to demonstrate the poor distribution of data processing within Event Hub. Around 15 hours to recover. Here it is up close, see how every partition is doing a tiny amount of work, and it takes quite a long time? Too much lag, and the data will be lost in one of two ways: not being added to an already full queue within Event Hub, or by erroring out as Event Hub tries to pass it back with a variety of errors. If Event Hub backs up too much to be recovered by upsizing, or it cannot be upsized enough, it can be deleted, and only the type of data affected will see loss, and with no downtime for ThingWorx. A Healthy Example An XL ThingWorx deployment was used to ensure that the Platform was not the limiting factor. The required TUs and PUs are the figures calculated by the Grafana dashboard, coming from the Kafka metrics. The average latency for subscriptions is calculated by having a start datetime property (not logged or persisted) update when the rest of the property updates fire, and then an end datetime property update when the subscriptions to the persistent properties run; the timespan is then calculated and written to the script log. This example was an XL sized application, 80k things, each thing with 20 properties total, 10 Logged and 10 Persistent properties, that write to Event Hub twice in a minute. There were 5 events as well to measure the latency, but due to the design of the test (property updates fired from a timer subscription), opting in to Durable Events causes performance issues that affect the test results. That is why events show up in the Event Queue, which does not happen in opt-in tests. : These are calculated by the Kafka Metrics dashboard: Required TUs: 115 Required PUs: 2 These were what was configured for this test: TUs Configured: N/A PUs Configured: 16 Partitions (respectively): 100, 100, 0 Average Latency for Subscriptions: < 100ms The test begins at 11 am. Lag is steady and the spikes are not increasing over time (though they come close). The property write rate includes the 20 properties that go to Event Hub, plus the 10 date time properties for measuring latency, and one additional info table property for a more realistic load. This looks the same as it usually does, there is no change to performance. This is high because of the design of the test, all of the things update on thing template level timer subscriptions. This is much lower with opt-in for durable events. How durable!
View full tip