Anomaly Detection: concepts, techniques, and Thingworx support
Anomaly Detection (also known as Outlier Detection) is a set of techniques that identify unusual occurrences in data. The premise is that such occurrences may be early indicators of future negative events (e.g. failure of assets or production lines). Data Science algorithms for Anomaly Detection include both Supervised and Unsupervised methods. In Unsupervised Anomaly Detection, the algorithms make the assumption that most of the data points are "normal" (e.g. normal operation of the asset) and are looking for data points that are most dissimilar to the remainder of the dataset. Supervised Anomaly Detection requires a labeled set of Anomalies, in which case predictive algorithms can be applied directly on this data.
Thingworx employs a number of algorithms in support of Anomaly Detection:
Simple threshold alerts. These are easy to setup on Thing properties but require a domain expert to provide such thresholds. Then, the alert will automatically fire when the value of the monitored property goes outside the predefined range of values, often seen as the "bad" side of the threshold.
Statistical Process Control (SPC). This can be implemented using Thingworx Analytics Property Transforms. Most companies use a subset of SPC charts and rules to monitor production processes. Examples include the X Bar and R charts, as well as the Western Electric rules (e.g. one point outside the average +/- three sigma range). Explainable and widely accepted, SPC can also provide an earlier warning system compared to simple threshold alerts, in that it captures more complex patterns.
Clustering. Using Thingworx Analytics, one can build an optimal clustering for the available data points. Under the assumption that data is representative of mostly normal operation and that there is not a significant pre-defined pattern of anomalies that form their own cluster, one can identify outliers by looking at the distance between points and their corresponding cluster centers. Points that are very far from their corresponding cluster center can be labeled as anomalies.
Semi-supervised Anomaly Alerting (formerly known as ThingWatcher). This functionality identifies single property time series behavior that is statistically different than what was seen in a finite window of “known normal operation”. As such, it does not identify a “bad” event, or even a precursor to a “bad” event. Rather, it points the end user to further investigate a situation which may lead to a “bad” event. Anomaly Alerts can be easily setup like any other Alert on a Thing property. Multiple Anomaly Alerts can be setup on the same or different properties of a Thing. Behind the scenes, the platform builds a time series neural network model for the known normal operation data, which is then applied to incoming data, and, if the errors are significantly different than those on known normal operation over a period of time, then an Anomaly Alert is produced.
The techniques mentioned above are either unsupervised or semi-supervised. If the dataset contains labeled anomalies (e.g. asset faults, or suspicious patterns) then supervised predictive techniques (such as regression, decision trees, neural nets, or ensemble methods) are available to model the relationships between such anomalies (dependent variables) and various variables of interest (independent variables). These models can then be employed to monitor assets or production lines for upcoming anomalies. In many real-world use cases, anomalies are relatively rare; care needs to be taken when building such predictive models. Techniques such as up-sampling can prove beneficial in such situations.
What constitutes an Anomaly depends on the observed data and the current context. If only few data points are initially available, then it is possible that a lot of future data is predicted as an Anomaly, despite being normal operation. Also, in terms of context, if an Anomaly Detection is trained on a connected product in the Winter, it is likely to say all Summer operation is anomalous. This can be tackled by having multiple anomaly detection alerts implemented, one for each different context of operation (e.g. season, recipe being manufactured, operation done by a robot).
Another consideration is lead time vs explainability. For example, when a threshold alert fires, it is obvious why, but it may not be early enough to take action. As more advanced methods are employed, more complex patterns can be captured, hence more lead time, but typically at the expense of explainability. For example, semi supervised Anomaly Alerting (formerly known as ThingWatcher) uses time windows, aggregations, and derivatives of up to the third order, resulting in significantly less explainability when an Anomaly is presented to the end user.
Choosing the appropriate Anomaly Detection technique is use case dependent, balancing the desired lead time and explainability. If historical data on failures/anomalies is not available, a good place to start is Statistical Process Control, as it provides a balanced approach between the two dimensions, in addition to being already in use across many manufacturing companies.