Cloud Design Patterns for Resiliency, Scale, Performance, ...
Recently I have been accompanying an integration partner and end customer around an issue experienced with ThingWorx resource exhaustion. Early on it seemed like this was an issue with the ThingWorx Azure IoT Hub Connector as it would freeze up and become unresponsive. Following a root cause analysis it became clear that it was actually caused by a lack of a number of standard cloud design patterns, which if used would have automatically adapted operation of the overall solution to be far more resilient as well as resource optimized.
The way that the logic was structured, it prioritized job execution on entities with the oldest last success time and would continue to retry these executions (IoT Direct Methods) every few seconds until successful. There were a number of problems here, but I'll unpack a few in order to tie the problem to the solution via design patterns.
1) No exception handling
When the direct method execution failed/timed out or the system reported being unable to execute the remote service, this response was not used to adapt the solutions behavior.
2) No backoff retry mechanism
As exceptions were not caught, an adaptive retry mechanism with incremental or exponential backoff could not be leveraged to limit the impact of the build up of the failing retries.
3) No exception tracking
Tracking that exceptions were occurring and counting them would allow powering an exponential backoff retry algorithm (with jitter), a Cancel or Circuit Breaker pattern (stop doing something which is just broken), as well as provided alerting to address specific areas of the distributed solution experiencing issues.
4) Conflicting priorities
It was interesting to see the manifestation of the conflicting interests of wanting to ensure checks and balances (had all needed data) and system resiliency. Retries and resource usage built up exponentially due to the transient error instead of backing them off. Trying so hard to get the needed data from failing sensors meant that operational sensors were deprioritized and their data was not received either - spreading the localized issue to the whole system.
What's interesting for us ThingWorx people is that this framework from Microsoft is about well-architected cloud solutions and does not specifically reference the Azure stack, and as such many of these approaches and design practices can be employed in your ThingWorx applications. What are you waiting for? Go check them out!