Improvements in Monitoring and Diagnostics
The Outcome of the Diagnostics Forum of Yesterday
By Tori Firewind, Principal Cloud Architect

Introduction

Many moons ago, a forum of dedicated diagnostic enthusiasts gathered together across several round-table discussions to produce a list of diagnostic daydreams, groom and refine that list, and then use it to produce realistic, yet awesome diagnostics requirements. Three different organizations and eight teams across PTC were involved, and dozens of real-world customer experiences were considered throughout these conferences. In the end, a large diagnostics feature was designed and passed, already prioritized, from the front lines directly into the backlog in the first-ever collaborative effort to do DevOps as a process at PTC.

The fruits of this labor are now available in the latest version, ThingWorx 9.7! These features include the much-anticipated acceleration-based monitoring, custom persistence provider monitoring, and the ability to turn off metrics as needed, as well as additional OpenTelemetry integrations and capabilities. This article will spotlight some of these new features and direct you to the 9.7 Help Center pages for more information. The Platform is smarter and DevOps easier than ever before, with the latest and greatest in monitoring found right here in ThingWorx 9.7.

Acceleration-Based Monitoring

This is an extremely cool new feature in ThingWorx where the Platform monitors itself for runaway queues, those which indicate a system malfunction in the event or value stream queues, for instance. If the acceleration is high enough for a given queue, the Platform will automatically generate stack traces. This ensures the diagnostic data is already on disk if and when the issue progresses to an outage. No more do diagnostic SMEs have to sit and wait for a problem to occur again before they start investigating! Now they can simply download the stack traces that the Platform already stored on disk.

Of course, it's really important to get this feature configured right. If thread dumps are taken too often, those associated with the root cause may be purged before a tech can look more closely; only 10 thread dumps at a time are stored on disk to prevent the footprint from growing out of proportion. On the other hand, if thread dumps are not taken often enough, or not in response to the right kinds of acceleration readings, then the root cause may not be captured at all.

For this reason, there are several parameters used to configure this feature. One is the acceleration increase percentage, the increase considered significant enough to be a potential sign of trouble within the queue. This value is a flat percentage calculated at the time of measurement: queue count divided by the total size of the queue. This number is checked frequently, as configured by another parameter, the acceleration calculation frequency. If the queue size exceeds the queue capacity occupied percentage at the same time that the increase percentage is above its threshold, then the Platform records a stack trace on disk. This ensures the diagnostic data will be present even after a restart, from early on when the problem began.

Another key parameter is the number of acceleration occurrences to wait before turning the thread captures on, which allows for greater specificity in when to collect the diagnostic data. Even if the acceleration threshold is met once, that may mean nothing; perhaps it is normal for the queue to accelerate quickly at times.
However, let's say it happens 5 times within 30 seconds or a minute, or the acceleration stays elevated for some time and we are approaching a data loss scenario. A rule built around the first scenario will capture data close to the root cause, while one built around the second might serve to collect which events didn't make it into the queue before the restart. Once the thread dumps are taken, there is a configurable cool-off period in which no new stack traces will be recorded. This allows stack traces to be taken repeatedly at set intervals for persistent issues, and also reduces the overall activity of recording stack traces to ensure only the most useful ones remain on disk.

One of five queues can be monitored in this way:
Persistent Property Queue
Event Processing Queue
Stream Processing Queue
Value Stream Processing Queue
Connections Pool Processing Queue

An example of setting all of these values can be found, along with more information, in the Help Center.
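Below is a minimal, purely illustrative sketch of the kind of decision logic these parameters describe. It is not the Platform's actual implementation; the class name, field names, and thresholds (QueueAccelerationMonitor and so on) are hypothetical, and serve only to show how queue occupancy, the acceleration increase, the occurrence count, and the cool-off period might combine before a thread dump is captured.

// Hypothetical sketch of acceleration-based monitoring logic; not ThingWorx source code.
public class QueueAccelerationMonitor {
    private final double increaseThresholdPercent;   // acceleration increase percentage, e.g. 30.0
    private final double occupancyThresholdPercent;  // queue capacity occupied percentage, e.g. 50.0
    private final int occurrencesBeforeCapture;      // occurrences to wait before capturing, e.g. 5
    private final long coolOffMillis;                // cool-off period after a capture, e.g. 300000

    private double lastOccupancyPercent = 0.0;
    private int occurrenceCount = 0;
    private long lastCaptureTime = 0;

    public QueueAccelerationMonitor(double increaseThresholdPercent, double occupancyThresholdPercent,
                                    int occurrencesBeforeCapture, long coolOffMillis) {
        this.increaseThresholdPercent = increaseThresholdPercent;
        this.occupancyThresholdPercent = occupancyThresholdPercent;
        this.occurrencesBeforeCapture = occurrencesBeforeCapture;
        this.coolOffMillis = coolOffMillis;
    }

    // Called at each acceleration calculation interval with the queue's current count and capacity.
    public boolean shouldCaptureThreadDump(long queueCount, long queueCapacity, long now) {
        double occupancyPercent = 100.0 * queueCount / queueCapacity;      // queue count / total queue size
        double increasePercent = occupancyPercent - lastOccupancyPercent;  // growth since the last check
        lastOccupancyPercent = occupancyPercent;

        // Both conditions must hold: the queue is filling up AND it is filling up quickly.
        if (occupancyPercent >= occupancyThresholdPercent && increasePercent >= increaseThresholdPercent) {
            occurrenceCount++;
        } else {
            occurrenceCount = 0; // reset when the queue looks healthy again
        }

        boolean coolingOff = (now - lastCaptureTime) < coolOffMillis;
        if (occurrenceCount >= occurrencesBeforeCapture && !coolingOff) {
            lastCaptureTime = now;
            occurrenceCount = 0;
            return true; // the Platform would write a stack trace to disk at this point
        }
        return false;
    }
}

In this sketch, a dump is triggered only when the queue is both heavily occupied and growing quickly for several consecutive checks, and never more often than the cool-off period allows, mirroring the trade-off described above between capturing the root cause and keeping only the most useful traces on disk.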
Custom Persistence Provider Monitoring

There is now the capacity to see what the Platform is doing when interacting with your own custom persistence providers, or with those used by the various ThingWorx applications like Navigate or DPM. At the monitoring endpoint, all persistence providers will now be listed at /Metrics, tagged by their name and database type:

# HELP thingworx_ThingworxPersistenceProvider_ConnectionPool_BusyConnections Current count of busy connections to the underlying database
# TYPE thingworx_ThingworxPersistenceProvider_ConnectionPool_BusyConnections gauge
thingworx_ThingworxPersistenceProvider_ConnectionPool_BusyConnections{category="DPMpersistenceProvider.ConnectionPool",databaseType="Microsoft SQL Server",otel_scope_name="com.thingworx",persistenceProviderEntityName="DPMpersistenceProvider",platformid="",prefix="Platform.Core.PersistenceProvider"} 0.0
thingworx_ThingworxPersistenceProvider_ConnectionPool_BusyConnections{category="ThingworxPersistenceProvider.ConnectionPool",databaseType="PostgreSQL",otel_scope_name="com.thingworx",persistenceProviderEntityName="ThingworxPersistenceProvider",platformid="",prefix="Platform.Core.PersistenceProvider"} 0.0 For more information about which parts of the persistence providers are monitored, see the Help Center. Disable Metrics Now there is the potential to turn off some metrics if they become problematic or threaten to destabilize the entire environment. For example, the Audit Subsystem has a history of causing such issues for many, since the database tables grow very large and counting the rows can begin to take time. Turning these metrics off is now something easily done in the next maintenance window. Simply add some code to your platform-settings.json file and restart the server, and whatever metrics specified will no longer be captured or appear at the metrics endpoints: "MetricsSettings": {
"DisabledMetricsList": [
"<metrics name 1>",
"<metrics name 2>",
"<metrics name 3>"
]
}

Please note that there are several caveats and a warning for those who would turn off default monitoring features: ensure critical metrics are not unintentionally turned off, and remember that this feature is intended for administrators who know the system well and need the ability to fine-tune its monitoring to ensure performance. Read more about this feature and its caveats in the Help Center.

OpenTelemetry

OpenTelemetry support was introduced in 9.6 and expanded in ThingWorx 9.7 to facilitate the recording of high-volume monitoring metrics. It handles large-scale metrics and provides more robust observability for both diagnostic and predictive analysis. Find the new endpoint at /MetricsHC, with the old metrics still available as before. You can also use this metrics library to create your own custom metrics in a ThingWorx Extension, really expanding the metrics capability of the Platform; a sketch of what that can look like follows below. More information on how to make use of OpenTelemetry for monitoring can be found in the Help Center.
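As a rough illustration of that last point, the snippet below uses the standard OpenTelemetry Java metrics API to register a custom counter. It is a generic sketch, not ThingWorx-specific code: the meter name, the metric name, and the assumption that the extension can obtain the globally registered OpenTelemetry instance are all assumptions, and the actual wiring inside a ThingWorx Extension may differ; consult the Help Center for the supported approach.

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class CustomExtensionMetrics {
    // Obtain a Meter from the globally registered OpenTelemetry instance
    // (assumes the Platform has already configured and registered one).
    private static final Meter METER =
            GlobalOpenTelemetry.getMeter("com.example.myextension");

    // A custom counter that will appear alongside the Platform's own metrics.
    private static final LongCounter ORDERS_PROCESSED =
            METER.counterBuilder("myextension_orders_processed")
                 .setDescription("Number of orders processed by the custom extension")
                 .setUnit("1")
                 .build();

    // Call this from your extension's service code whenever an order is handled.
    public static void recordOrderProcessed(String site) {
        ORDERS_PROCESSED.add(1, Attributes.of(AttributeKey.stringKey("site"), site));
    }
}

A counter registered this way would then be exported through whatever OpenTelemetry pipeline the Platform has configured, alongside the built-in metrics shown earlier.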
Conclusion

ThingWorx now does DevOps better than ever, with features like these coming straight from real-world experiences and going right into the development workflow. Already, four additional monitoring features have been thought up and added to the mix, and improvements are coming soon! If you have more feedback on how to better do DevOps in ThingWorx, feel free to reach out.