
Cloud Design Patterns for Resiliency, Scale, Performance, ...


Recently I have been working with an integration partner and end customer on an issue of ThingWorx resource exhaustion.  Early on it looked like a problem with the ThingWorx Azure IoT Hub Connector, as it would freeze up and become unresponsive.  A root cause analysis made clear that the real cause was the absence of a number of standard cloud design patterns which, if used, would have automatically adapted the operation of the overall solution to be far more resilient as well as resource optimized.

 

The logic was structured to prioritize job execution on the entities with the oldest last success time and to keep retrying those executions (IoT Direct Methods) every few seconds until they succeeded.  There were a number of problems here, but I'll unpack a few in order to tie each problem to a solution via design patterns.
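
To make the failure mode concrete, here is a minimal Python sketch of the kind of scheduling loop described above. The names (Entity, invoke_direct_method, the 2-second interval) are my own illustration, not the partner's actual code:

```python
import time
from dataclasses import dataclass


@dataclass
class Entity:
    name: str
    last_success_time: float = 0.0


def invoke_direct_method(entity: Entity) -> bool:
    """Placeholder for the real IoT Hub direct method call.
    Pretend the device never responds, i.e. the call always fails."""
    return False


def run_scheduler(entities: list[Entity]) -> None:
    """Naive loop: always retry the entity with the oldest success time."""
    while True:
        entities.sort(key=lambda e: e.last_success_time)  # oldest first
        target = entities[0]
        if invoke_direct_method(target):                  # failure result is ignored
            target.last_success_time = time.time()
        time.sleep(2)                                     # fixed, aggressive retry interval
```

Because a failing entity keeps the oldest success time, it stays at the front of the queue and is retried every couple of seconds forever, while everything else waits.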

 

1) No exception handling

When the direct method execution failed or timed out, or the system reported that it was unable to execute the remote service, this response was not used to adapt the solution's behavior.
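
Catching the failure and recording it is the prerequisite for every adaptive behavior discussed below. A small illustration (the invoke_direct_method call and the failure_count bookkeeping are hypothetical, not the connector's actual API):

```python
import logging

log = logging.getLogger("direct-methods")


def try_invoke(entity, invoke_direct_method) -> bool:
    """Invoke the direct method, turning failures into information the
    scheduler can act on instead of silently retrying."""
    try:
        invoke_direct_method(entity)
    except Exception as exc:
        entity.failure_count = getattr(entity, "failure_count", 0) + 1
        log.warning("Direct method failed for %s (failure #%d): %s",
                    entity.name, entity.failure_count, exc)
        return False
    entity.failure_count = 0  # a success resets the failure history
    return True
```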

2) No backoff retry mechanism

As exceptions were not caught, an adaptive retry mechanism with incremental or exponential backoff could not be leveraged to limit the impact of the build-up of failing retries.
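
A minimal sketch of exponential backoff with "full jitter", using the failure count tracked above (parameter names and default values are my own illustration):

```python
import random


def backoff_delay(failure_count: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Full-jitter exponential backoff: wait a random time between 0 and
    min(cap, base * 2**failures) seconds before the next retry."""
    ceiling = min(cap, base * (2 ** failure_count))
    return random.uniform(0, ceiling)


# Example: after 5 consecutive failures the next retry lands somewhere
# between 0 and 64 seconds away, instead of a fixed 2 seconds later.
print(backoff_delay(5))
```

The jitter matters as much as the exponential growth: it spreads retries from many failing entities over time instead of letting them all fire at once.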

3) No exception tracking

Tracking and counting exceptions as they occur would make it possible to power an exponential backoff retry algorithm (with jitter) or a Cancel or Circuit Breaker pattern (stop doing something which is simply broken), and would also provide alerting to address the specific areas of the distributed solution experiencing issues.
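
A stripped-down sketch of the Circuit Breaker idea (thresholds and method names are illustrative, not taken from any particular library):

```python
import time


class CircuitBreaker:
    """Simplified Circuit Breaker: after too many consecutive failures the
    circuit opens and calls are skipped until a cool-down period elapses."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True                                    # circuit closed: call normally
        if time.time() - self.opened_at >= self.reset_timeout:
            self.opened_at = None                          # half-open: allow one trial call
            return True
        return False                                       # still open: skip the call

    def record_success(self) -> None:
        self.failure_count = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.opened_at = time.time()                   # open the circuit
```

The same failure counters also make good inputs for alerting, so a human is told which part of the solution keeps tripping the breaker.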

4) Conflicting priorities

It was interesting to see how the conflicting interests of ensuring checks and balances (having all needed data) and system resiliency played out.  Retries and resource usage built up exponentially because of transient errors instead of being backed off.  Trying so hard to get the needed data from failing sensors meant that operational sensors were deprioritized and their data was not received either, spreading a localized issue to the whole system.
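
One way to reconcile the two goals, sketched below with the same hypothetical entity fields as above, is to keep the "oldest first" ordering but skip any entity that is still inside its backoff window, so a few broken devices cannot starve the healthy ones:

```python
import time


def next_entity(entities):
    """Pick the entity that has waited longest, skipping any entity whose
    backoff window has not yet elapsed (next_retry_at is hypothetical)."""
    now = time.time()
    ready = [e for e in entities if getattr(e, "next_retry_at", 0.0) <= now]
    if not ready:
        return None                                        # nothing is due yet
    return min(ready, key=lambda e: e.last_success_time)
```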

 

Around the time that I shared my recommendations and some examples of how to make the solution more resilient, one of my technical colleagues at Microsoft shared some extremely interesting and relevant design patterns documented by Microsoft as part of the "Microsoft Azure Well-Architected Framework".  This framework, with its included design patterns for specific cloud application goals, allows applying well-known, industry-standard approaches to the challenges of large-scale distributed enterprise systems (reliability, performance, cost optimization).

 

She later shared this blog post, which describes exactly the exponential backoff retry with jitter pattern that we had together recommended to the systems integrator.

 

What's interesting for us ThingWorx people is that this framework from Microsoft is about well-architected cloud solutions and does not specifically reference the Azure stack, and as such many of these approaches and design practices can be employed in your ThingWorx applications.  What are you waiting for?  Go check them out!
