IoT & Connectivity Tips

May 14, 2024

Tune in to The Lean Manufacturer podcast where expert guests bring their outside view of the IIoT and discuss various aspects of manufacturing. Over the course of the series, we’ll cover some of the most important ways the IIoT can maximize manufacturing efficiency and bring value to your organization, including the need for reducing planned and unplanned downtime, enabling operational efficiency, ensuring digital continuous improvement, and so much more.

Apr 9, 2024

Apr 8, 2024

Apr 1, 2024

Note: The following tutorial are based on a Thingworx/CWC 9.5. Steps and names may differ in another version. Context As a human reaction, the tracked time displayed may be misperceived by the Operator. It can lead to a reject of the solution. CWC doesn’t have (yet?) the capability to configure the visibility to hide the timer. The purpose of this tutorial is to create a quick and straight to the point customization to hide the timer in the execution screen. All other features, services and interfaces are left untouched. As a big picture, here are the 6 modifications you will need to do: Modify the 4 mashups Modify 2 values in 2 tables of the MSSQL database Status The mashup containing the timer is PTC.FSU.CWC.Execution.Overview_MU. It is easy to duplicate it and hide the timer widget (switch the visible property to false). But now, how to set it in the standard interface? In order to do it, you need to duplicate the mashups linked to the Execution.Overview_MU mashup. PTC.FSU.CWC.Execution.Overview_MU is directly referenced by the following entities: PTC.FSU.CWC.Authoring.Preview_MU PTC.FSU.CWC.Execution.WorkInstructionStart_MU PTC.FSU.CWC.GIobalUI.ApplicationSpecificHeader_HD PTC.FSU.CWC.WorkDefinitionExecution.StationSelectionContainer_EP Customization Duplicate all those mashups except Authoring.Preview_MU because we will focus only on the authoring side of CWC. Hereafter it will be called the same as the original + _DUPLICATE. Perform the following modifications. Open PTC.FSU.CWC.GlobalUI.ApplicationSpecificHeader_HD_DUPLICATE, then in Functions: open the expression named NavigateToStationSelection. Change the name of the mashup to the relevant one, example: PTC.FSU.CWC.WorkDefinitionExecution.StationSelectionContainer_EP_DUPLICATE open the validator named ShowRaiseHand. Change the name of the 2 mashups to the relevant ones, example: PTC.FSU.CWC.Execution.WorkInstructionStart_MU_DUPLICATE and PTC.FSU.CWC.Execution.Overview_MU_DUPLICATE open the validator named ShowStationSelection. Change the name of the 2 mashups to the relevant ones, example: PTC.FSU.CWC.Execution.WorkInstructionStart_MU_DUPLICATE Open PTC.FSU.CWC.Execution.Overview_DUPLICATE, then in Functions: open the expression named SetMashupToWorkInstructionStart. Change the name of the mashup to the relevant one, example: PTC.FSU.CWC.Execution.WorkInstructionStart_MU_DUPLICATE open the expression named SetMashupToStationSelection. Change the name of the mashup to the relevant one, example: PTC.FSU.CWC.WorkDefinitionExecution.StationSelectionContainer_EP_DUPLICATE Open PTC.FSU.CWC.Execution.WorkInstructionStart_MU_DUPLICATE, then in Functions: open the expression named NavigateToStart. Change the name of the mashup to the relevant one, example: PTC.FSU.CWC.Execution.Overview_MU_DUPLICATE open the expression named NavigateToStationSelection. Change the name of the mashup to the relevant one, example: PTC.FSU.CWC.WorkDefinitionExecution.StationSelectionContainer_EP_DUPLICATE Open PTC.FSU.CWC.WorkDefinitionExecution.StationSelectionContainer_EP_DUPLICATE, in functions: open the expression named NavigateToStart. Change the name of the mashup to the relevant one, example: PTC.FSU.CWC.Execution.Overview_MU_DUPLICATE open the expression named NavigateToWorkInstructionStart. Change the name of the mashup to the relevant one, example: PTC.FSU.CWC.Execution.WorkInstructionStart_MU_DUPLICATE Now, let’s change the database value. In MSSQL, navigate to the thingworxapps database, and edit the dbo.menu table. Look the line for AssemblyExecution (by default line 22) and look the value in column targetmashuplink. Switch the original value PTC.FSU.CWC.WorkDefinitionExecution.StationSelectionContainer_EP to the name of the duplication of this mashup. Lastly, edit the dbo.menucontext table. Look the line related to CWC (application UID = 5) and look the value in column targetmashuplink. Switch the original value PTC.FSU.CWC.GlobalUI.ApplicationSpecificHeader_HD to the name of the duplication of this mashup. Result After this modification, you can start and check an operation. You should see the following result:

Mar 1, 2024

As of May 24, 2023, ThingWorx 9.4.0 is available for download! Here are some of the highlights from the recent ThingWorx release. What’s new in ThingWorx 9.4? Composer and Mashup Builder New Combo chart and Pie chart based on modern Web Components architecture along with several other widget enhancements such as Grid toolbar improvements for custom actions, highlighting newly added rows for Grid widget and others Ability to style the focus on each widget when they are selected to get specific styling for different components Absolute positioning option with a beta flag to use by default is available now Several other Mashup improvements, such as Export function support is made available; improved migration dialogue (backported to 9.3 as well) to allow customers to move to new widgets; and legacy widgets language changes Foundation Heavily subscribed events could now be distributed across the ThingWorx HA cluster nodes to be scaled horizontally for better performance and processing of ThingWorx subscriptions. Azure Database for PostgreSQL Flex Server GA support with ThingWorx 9.4.1 for customers deploying ThingWorx Foundation on their own Azure Cloud infrastructure Improved ThingWorx APIs for InfluxDB to avoid Data loss at a high scale with additional monitoring metrics in place for guardrails Several security fixes and key third-party stack updates Remote Access and Control (RAC) Remote Service Edge Extension (RSEE) for C-SDK, currently under Preview and planned to be Generally Available (GA) by Sept 2023, would allow ThingWorx admins to use Remote Access and Control with the C-SDK based connected devices to use Global Access Server (GAS) for a stable, secure, and large number of remote sessions With 9.4, a self-signed certificate or certificate generated by the trusted certificate authority can be used for RAC with the ThingWorx platform Ability to use parameter substitution to update Auto Launch commands dynamically based on data from the ThingWorx platform is now available Software Content Management Support for 100k+ assets with redesigned and improved performance on Asset Search Better control over package deployment with redesigned package tracking, filtering, and management Improved overall SCM stability, performance, and scalability and a more user-friendly and intuitive experience Analytics Improvements to the scalability of property transforms to enable stream processing of the larger number of properties within a ThingWorx installation Refactoring of Analytics Builder UI to use updated widgets and align with the PTC design system Updates to underlying libraries to enable the creation and scoring of predictive models in the latest version of PMML (v4.4) End of Support for Analytics Manager .NET Agent SDK View release notes here and be sure to upgrade to 9.4! Cheers, Ayush Tiwari Director, Product Management, ThingWorx

May 24, 2023

Mar 30, 2023

Dev Ops is a crucial process that exists in any software setting, whether you plan on it or not. Chaos in the dev ops process, say because less time is spent here than on the shiny new features that are easy to sell, results in bottlenecks in the dev ops process. Bottlenecks reduce efficiency, and leave you open to vulnerabilities as well. The faster you can get a change properly tested and safely into production, the safer and more stable the system is all along. Issues will arise, they always arise. Are you ready for them? Watch this video, see some of these additional links, and think about your dev ops process now, before the fires start! Useful Links: ThingWorx Monitoring and Alerting Using Prometheus and Grafana, Part 1 ThingWorx Monitoring and Alerting Using Prometheus and Grafana, Part 2 Overview of Monitoring Tools and Diagnostics The System Health Timer

Jan 31, 2023

Architecting Reason Code Trees in DPM Tori Firewind, IoT EDC What are Machine Codes? Factory hardware devices communicate status changes to their human operators and other machines (IoT) via machine codes. The manufacturers often determine the machine codes for different types of factory hardware, so those are often pre-determined. However, how the reason trees map these machine codes to corresponding business logic in ThingWorx is entirely customizable. Knowing the best way to design your reason trees for this purpose can be challenging, so this guide is here to help with your conceptual knowledge. Using the UI to create, edit, and configure reason codes in technical detail can be found in the Help Center. The Tree Trunk At the highest level of the reason tree, the trunk, there are really 3 categories: Availability (A), Performance or Productivity (P), and Quality (Q). These should look familiar; they are the three dimensions of OEE (Overall Equipment Effectiveness). Fg 1. Calculation of OEE Availability refers to long stops, events that stop planned production long enough that it makes sense to track a reason for being down (typically several minutes, but the threshold between a long stop and a short stop can vary depending on the ideal rate of production of materials). Availability = Run Time / Planned Production Time Productivity/Performance really refers to short stops, things that cause the machine to run at less-than- optimal speeds. This can include stops caused by running out of materials for production, doing minor maintenance like switching out a single, easily-changed part, or even frequent breaks due to ill health of an operator. User error can be a cause as well, say if the machine needs a certain heat to produce parts, and the heat keeps fluctuating (requiring the machine to take the time to calibrate for this before starting on production) because operators are smoking out a back door or adjusting thermostat temperatures. Fg 2. Levels of Runnable Time Operator influence often is a factor when it comes to the conditions that permit optimal performance from machinery, and every factory may face different challenges. Stops like these are not really outages; the amount of downtime isn’t enough to consider the production block entirely unproductive. Production was continuing and ongoing throughout most of the block despite the issues; the rate was just slower than ideal. Performance/Productivity = (Total Count / Run Time) / Ideal Run Rate Quality refers mostly to the number of items that are considered scrap or rework, and it can be split into two categories: start up scrap (that which is expected because the machine is in the process of warming up or being fine-tuned by the operator) and production scrap (things which come out wrong and must be tossed or reworked because the conditions under which they were produced weren’t ideal; this is called first-pass yield only, meaning it's only a "good" product if it passes inspection the first time). Quality = Good Count / Total Count The Branches and the Leaves of the Tree The “leaves” are the reason codes which directly map to machine codes , and the “branches” are the method of categorization that connects them to the trunk. Both the leaves and the trees, the children and the parent nodes of the tree, are split into two states: planned versus unplanned downtime. Changeovers, maintenance, and even scrap, can be broken down into this dichotomy. For scrap, there are startup rejects (planned, because the machines have ramp up periods) and production rejects (unplanned, because the conditions weren’t ideal). For maintenance there is planned and unplanned, small changes that occur on the fly that result in productivity loss, and maybe also reduce availability in the long run. Small, unplanned changes can occasionally shift into the availability loss category if a simple, quick repair winds up being complex and time-consuming. A good reason tree can differentiate easily between short and longer stops in order to respond to each in a deliberate way. To start off in the process of architecting your reason tree, try writing the three categories on a board in a common room in an average factory (or several as a survey). Ask operators to stop in over the course of a few days and write various machine codes that they see often and find useful under one of these categories, or more than one if the machine code pops up under different circumstances and can mean different things. Have them write a 10 word justification, if the association isn’t obvious. Gather all of the “leaves” in this way, and then begin to associate them with the “trunk”, forming the “branches”. An example tree can be seen in figure 3 here, with leaves like “Changeover” and “Maintenance” being semi-ambiguous; they could just as easily be seen as unplanned stops. Therefore, there may be multiple reason codes mapping up to the top of the tree in more than one branch, and these can have different categories, which controls how the business logic responds to the different codes. The Help Center has more details about how the events are mapped to types, and each type contains multiple categories, as configured by you when you set-up the DPM model. Fg 3: Different types of changeovers may have different codes, and can map up as either planned or unplanned, but all planned and unplanned stops (long stops) are under the Availability category of the trunk. Similarly, small stops can involve idling, like if there are not enough materials, reduced speed if the conditions are not ideal, or other small stops, usually caused by human error or unforeseeable circumstances. Quality loss then refers to the products which fail quality checks, either because the machine still has the wrong paint in the applicator and needs a few rounds to be ready for the next production item, or because the conditions are again, not ideal, and items wind up scrapped. Example Reason Tree Fg 5 example tree with more specific tags (there may be dozens or hundreds in a full reason tree, though the fewer are needed to capture the events we care about, the better). Theory of Constraints Fg 6 theory of constraint wheel: an industry process for gradual OEE improvement in factories that has been adapted into the PTC methodology as well. While architecting your reason tree, always remember the key purpose: gathering only as much data as necessary to analyze the efficiency of a factory and to identify the bottleneck, or the most limiting factor. The important point is to identify not just the bottleneck that seems the most troublesome, but the one that actually results in the greatest impact to OEE across the entire factory. Without software like DPM, and a properly designed reason code tree, the process of improving a factory can be very challenging, involving a lot of guesswork, and sometimes solving one problem at the cost of another. The issue is that these machines produce a LOT of raw data, and humans are not the best tool available to gather and aggregate this data in a consumable way. A good reason tree ensures a smart application that can quickly prioritize the machine (bottleneck) that most impacts production, and not just the machine that functions in the least optimal way. So, the theory of constraints is really a process for identifying small, incremental changes, which together can make a big difference, and fast, in factory OEE. The rate at which this cycle can be completed varies, however. The slower the process of identifying constraints and the less information that is gathered, the slower and less precise the first two steps of this process. Alternatively, in a traditional constraint identification process, too much information can be a problem as well, due to human limitation, as discussed above. So, DPM is a great benefit in this regard, because it aggregates the data into a consumable, comparable way every 5 minutes, freeing up your human analytics for problem solving and prioritization, and not data gathering and sorting. Other Key Tips Also remember that a good tree treats the trunk like a whole unit, with each category occupying a percentage of the overall OEE. Afterall, look back up at the 3 dimensions of OEE in the equation above. For example, the more you see issues with availability, the less you will see issues with scrap, for the machine simply doesn’t have as much time to produce scrap if it is constantly down. The more you see issues with quality loss, the less you should see of productivity loss, because these are simply inversely proportional modes; to say it differently, if a machine is running quickly and seeing few minor maintenance stops, then it is likely to produce more scrap (as well as more good product as well). Another thing to remember is that even DPM is limited in its capacity to interpret raw data. Even while many magnitudes more efficient than any human gathering and analysis could ever be, there is an upper limit to how much raw data DPM can ingest and analyze before the system gets very expensive. For this reason, you want to ensure your reason trees use only as many reason codes as are required to capture the OEE of a factory site. This will mean using different codes for different types of things, most likely, which is easy to do maintainably across many sites using thing shapes. Keeping things tightly defined and organized is the easiest way to ensure a clean, efficient system for gathering and storing data. Also remember that data will not need to persist very long once DPM is fully operational and adopted by your factories. DPM ensures that the changes made to the production line to improve efficiency are the highest impact, and the least difficult to implement, meaning that there will be a very rapid return on investment, and a process to ensure future issues are identified and resolved quickly. Data from past issues in the factory won’t be as relevant, and historical data stores can be kept smaller than one might think. It is the power of ingesting data directly into the processing and aggregation process, the automatic reduction of data down into presentable, consumable webpages, that makes DPM and ThingWorx such a great factory solution for optimizing OEE.

Jan 24, 2023

ThingWorx Monitoring and Alerting, Part 2 Using Prometheus and Grafana By Tori Firewind, IoT EDC Building Dashboards To add a panel which monitors some component of the ThingWorx application to a dashboard in Grafana, click to add a new panel. Under “Metrics” in the box at the bottom of the screen, select what ThingWorx metrics you wish to monitor (type “thingworx” in the search box to see them all). For example, select the Platform Subsystem memory in use: Label filters aren’t necessary, though you may want to sort by instance if you are monitoring multiple ones with the same dashboard. You may also want to take some time to format the Y axis, which by default will show in bytes. Go to the formatting panel on the right side and scroll down to the section called “Standard options”. For the Unit dropdown, start typing “data” and then select “bytes (SI)”. This will automatically determine if the bytes you’ve provided should really print as MB or GB based on how large the numbers are. Rename the panel, modify it in any other way desired, and then click Apply (last 5 minutes): Once you add the panel, you can watch the memory usage as it is scraped by selecting the refresh option (10s or 30s, whatever makes sense based on your scrape interval). The viewing window is stored in the URL, so that you can generate a report for a specific interval (like when a test was occurring), and then store that result or share it in a more compact way: http://localhost:3000/d/nleucPv4k/thingworx-monitoring?orgId=1&from=1668528038732&to=1668536503953 (absolute timestamps): Dashboards are just collections of panels which report on all of the various metrics of performance and stability that exist for single components of a system. This is because there can be quite a few metrics worth watching for each individual component. Most of the third-party tools come with their own dashboards, but the ThingWorx component is one which for now, requires some thought and creativity. Consider your use case carefully and look over the various subsystems contained within ThingWorx. Each part of an application is localized to specific subsystems, and some are more business critical than others. What will go at the top of your dashboard? Add rows, add panels per row, and see what the many choices are for watching your system. Don’t forget that with Telegraf running, VM or machine usage metrics are also available for display on a dashboard. Things like overall CPU and Memory usage are critical to determining the health of a system, as we have demonstrated in our own reasoning in past benchmarks and scale tests. You can create a panel to monitor the mem_used versus mem_total, like so: Another metric from Telegraf worth adding is the CPU usage, which should be given “percentage” for the units and which needs a label filter of cpu = cpu-total. If we do some resizing and drag-and-dropping, then we now have the first row of a dashboard: See how the Platform usage climbs steadily and is purged in a cycle? That is the Java Garbage Collection mechanism, and it’s important to remember to leave room for spikes on top of those peaks. Data can also be calculated or processed in some way to make it more useful for determining system health and stability. The data in the picture below uses the formula submitted = completed + number queued + number failed. It shows the current queue on the left Y-axis and the max queue on the right (since the two numbers usually are drastically different). It looks pretty, but it doesn’t really tell us much about the system in this format, so let’s do some math and find a representation that is a bit more helpful. Performing a “non-negative derivative” calculation over the submitted and the completed queue counts over time allows for us to look at the status of the queue as a velocity. When the “complete” speed appears behind the “submitted” speed for too long at a time, then that means the queue is filling up and will eventually result in data loss. If we take this one step further and calculate the average of the submitted minus the completed over time, then we can actually predict approximately when the queue will fill up. This can then be displayed on a dashboard in Grafana, or used as the basis for an alert. What to Monitor In addition to monitoring the system which ThingWorx runs upon, ThingWorx itself can easily be monitored down to the subsystems level by Prometheus due to the Metrics endpoint. Many applications have support built into the way they format the data for scraping, including the JVM (which exposes Prometheus-formatted metrics with the JMX Exporter) and the OS (which can use the Node Exporter or Telegraf for the same purpose). For these more generic components, there are popular community dashboards which can be downloaded and used in Grafana for data analysis and review. For ThingWorx, there’s different kinds of data to track: subsystem data (see the list on the right) and non-subsystem data. There’s queue based versus non-queue data. These different metrics can collectively characterize the overall health of the application, depending on the use case. For instance, if this is a system with very many connected devices, one metric which may be important to track is the number of total devices defined on the Foundation server vs. the number of devices which are currently connected. If there are relays involved, then many devices suddenly going offline can mean a relay has failed. Another example is if the system sizing depends on an assumption that there will only ever be a fraction of the total number of devices connected at a time. Use cases like these could be monitored easily by keeping track of the total vs. the number of connected devices. Other common indicators of a healthy ThingWorx application might include the value stream and stream queues. These queues should fluctuate over time as the data is ingested and processed, but they should never be growing in size. If the stream queues are growing, then that means the data is writing to the queues faster than the queues can write to the database. Eventually, when the system runs out of resources to keep track of the queues, data will be lost. Having the stream information displayed in a chart can make it very easy to spot an upward trend in resource usage early on, which can catch a blockage or bottleneck that needs attention before it starts to affect the larger system in catastrophic ways later. Memory usage information from the various subsystems might be something worth tracking, as well as the event queue. These can indicate that the business logic is functioning with room to handle spikes, and that the server has enough memory to service all three dimensions of an IoT application: the ingestion, the business logic and thing-based alerting, and the user experience and UI. If file transfers are a key part of the use case, then the number of concurrent transfers, the average speed of them, the size of the files, all of this kind of stuff can be tracked and charted in Grafana by making use of the ThingWorx metrics which automatically show up there once you import the Prometheus data source. A mature dashboard used for a production environment might look a little like this: For further reading about subsystem monitoring, check out the Help Center. How to Alert The alerting mechanism built-in to Prometheus is incredibly easy to configure, so it might be tempting to generate tons of alert rules. However, remember that the more noise a system makes, the harder it is for those monitoring that system to know when action is really required. Playbooks which document how to respond to alerts, who to contact, how to act, and all the information necessary to handle an alert, should be created as an ongoing part of the DevOps process. Alerts should fire with the right severity in the subject line, as well as all of the information about the issue that is currently known, presented in a concise way, so that whoever receives the alert starts thinking about the root cause sooner and recovers the system faster. Those who receive the alert should have the ability to facilitate its resolution, and know who is expected to react to any alerts which come in. In the ThingWorx monitoring stack, Prometheus handles the alert rules and the generation of alerts, but alert filtering and delivery is managed in an external alerts manager. Generally, you want your alerts to follow a curve. If the current queue size exceeds 50% of the maximum, perhaps that isn’t a huge deal, if the application catches up quickly. How long are spikes in queue processing expected to last? Perhaps if the queue size is over half-full for 10 seconds, 30 seconds, then that means the queue is falling behind and not catching up. Ok, so this might be a warning level alert. When does this become an error? Well, let’s say the queue exceeds 90% of the max queue size. This might want to alert the moment it hits the mark. Now, farther along the curve, it may not take as long before data gets lost. As the severity of the situation increases, the threshold for alerting should increase as well. That way when errors do alert, it is a sure thing that they require a response immediately. The alerts are then pushed into the “Alerts Manager” for delivery based on your management rules. The Alerts Manager may decide to withhold warnings altogether, or send them to a much smaller mailing list, whatever filtering helps to ensure the right people receive the right alerts, right when they need them. In Conclusion, A Healthy Application... Has stable memory usage that fluctuates predictably and doesn’t grow over time. In a system experiencing mild issues, the memory starts to trend upward: If left unattended, systems like this may eventually experience outages. Finding the issue this early means there is even time to do some digging, debugging, taking of stack traces, and other such troubleshooting steps before the system must be restarted or recovered. That can really make the difference in identifying and resolving before there are real problems. One metric which makes for good alerting is the total number of failed stream entries, which can indicate there’s an issue writing to the database even before the queue has started to fill up. Other alerts may include warnings and errors based around percentages of memory used or queues filled, which depend on how long the queues take to fill up and how long the state has been at its increased usage. Prometheus has all of the tools necessary to make this possible across a variety of infrastructures and use cases. Set it up on a local machine and poke around at what ThingWorx metrics are available to meet your monitoring needs.

Dec 23, 2022

ThingWorx Monitoring and Alerting, Part 1 Using Prometheus and Grafana By Tori Firewind, IoT EDC Introduction and Getting Started As ThingWorx has become a more mature product during the lifetime of the IoT EDC, so too have our dev ops recommendations. As we’ve stated throughout many posts now, testing is a key part of ensuring enterprise readiness, and it occurs at every stage of the process: from unit testing to preserve individual service logic, to integration tests which preserve the functionality of the application as a whole, to user and edge load testing and user experience testing, which ensure enterprise readiness. So testing is a critical component, but the process of dev ops never stops. In order to effectively test the system, a comprehensive monitoring solution is also required. Once the application is tested and the changes pushed into production, there is no knowing with certainty that everything will run smoothly indefinitely. Random spikes in usage, server bandwidth or availability, any unforeseeable factors like these can come along and cause issues for a system. If these issues aren’t detected and addressed early, then they can very rapidly morph into much larger problems: outages, data loss, inflated data tables which are hard to revert due to their size. It is critical to detect performance issues on a system as early as possible, to have as much information as is necessary to figure out where the problem is heading, and what may have started it. Monitoring is key to a healthy system. CI/CD stands for “Continuous Integration/Continuous Deployment”, a never-ending cycle of improvement. Testing just once before the initial go live isn’t enough. Each system should have automated tests that run continuously, as well as monitors and alerts which reveal problems sooner. Diagnostic tools play a role as well, being the bridge from the end of the dev ops process cycle back to the beginning (monitoring into planning). A good CI/CD dev ops process will ensure that problems are found earlier, fixed more rapidly, and fixed for everyone using the system. In a fully mature dev ops pipeline, issues are anticipated, discovered and researched before they become production outages or critical issues. These investigations or testing follow-ups produce development tasks (usually bugs, but also features at times) which then start the dev ops cycle all over again. This is why a good, efficient dev ops pipeline is needed, one which allows changes to quickly and safely go from development to production. This is also why diagnostic tools play a role in the monitoring piece of the dev ops process. They are the bridge between monitoring and planning. Tools like Dynatrace can be configured to provide call stacks and take thread dumps when issues start to occur, before the system is performing so poorly it needs a restart, which happens automatically in a cluster and can clear out any trace of the issue. Thread dumps are often necessary to diagnosing the root cause of the issue (to permanently fix it), and doing so quickly ensures application stability and availability. That is, after all, the purpose of the dev ops process. Diagnostics is therefore an equally important piece of the dev ops Figure-8-shaped pie, and one which deserves its own spotlight in an article to come. Every piece of the dev ops process must be viewed as equally important in its own way, lest the dev ops cycle get hung up on bottlenecks of its own. A safe and stable system is not one which never experiences issues, it is one which has a good, efficient plan in place to handle recovery and prevention of repetition. A wholesome dev ops process is a happy dev ops process. The Monitoring Stack There are many monitoring options available, but in our experience one of the easiest and most effective monitoring stacks to use with ThingWorx is Prometheus for metrics gathering with Grafana for metrics analysis and review. In a mature monitoring stack, Telegraf is also commonly installed on each VM/host to gather the system metrics (like CPU and Memory usage, things we’ve stated are good metrics of system performance and stability in past articles on scale and size testing) and output them in Prometheus format. Prometheus is a highly scalable open-source monitoring framework that contains out of the box monitoring and alert capabilities for Kubernetes-based deployments (not covered in this article). Using Prometheus is very simple because the ThingWorx application exposes a metrics endpoint which is formatted directly for use by Prometheus. There is also built-in alerting in Prometheus, but not the ability to form dashboards for reviewing data or screenshotting it for documentation purposes. That’s where Grafana comes into play. Grafana has a preconfigured Prometheus-type data source and many preconfigured dashboard templates for various applications and services. Telegraf is also easily imported into Grafana, as is shown in the section below. The Prometheus targets in the larger diagram are expanded out on the left. For each target, some tool exports the data in a syntax which Prometheus can scrape. For VMs, this can be Telegraf, for Kubernetes, the Node Exporter. JVM has a JMX Exporter, and other tools like CX Server use Graphite. Many apps already have a Prometheus endpoint built-in, like ThingWorx and Zookeeper. Telegraf is not strictly necessary; the node exporter can also be used on VMs, but Telegraf is the more common choice since it is a more mature dev ops tool. Once Prometheus is scraping the targets, alerting on them can be done with OOTB Prometheus functionality, and dashboards for monitoring can be made easily in Grafana (with built-in support as well). This stack does not include the diagnostics piece, something which triggers thread dumps or the like when issues do occur. There are too many ways to conduct a successful diagnostic piece to cover here. How to Get Started Getting started monitoring a ThingWorx application is incredibly easy in the latest versions. Simply open up a browser, and type in the ThingWorx URL, followed by “/Metrics”. At this endpoint, there is a specially formatted response that can automatically be read by the Prometheus monitoring software which contains subsystem and service data. In addition to the application metrics, Prometheus can be configured to collect metrics from a node exporter at the (virtualized) operating system or container (Kubernetes) level as well. If you haven’t already, install Grafana, install Telegraf as a service, and install Docker Desktop. These are the tools required (in addition to ThingWorx of course) to set-up a simple sandbox system for familiarization with the monitoring stack recommended by PTC. The easiest way to try Prometheus on a local Windows instance is to use Docker. The command for that will be found below, but first open up Docker Desktop to set contextual parameters that the command line will need. Then, modify the configuration file for Telegraf or create one (called telegraf.conf in the same folder as the exe file), and put the following into the file (or uncomment it; the default config file has thousands of lines, so just search for “prometheus”): Output plugin [[outputs.prometheus_client]] listen = "0.0.0.0:9125" Alternatively, install the Prometheus Node Exporter tool, which will likely require some additions to the Prometheus config file (not covered here) which we are about to create. Then, create a configuration file (called prom_config_localhost_scraper.yml in the command to come), add the following (assuming a standard localhost installation of ThingWorx): # my global config global: scrape_interval: 45s evaluation_interval: 30s scrape_timeout: 30s # scrape_timeout is set to the global default (10s). rule_files: - prom_config_rules.yml scrape_configs: - job_name: thingworx static_configs: - targets: ['host.docker.internal:8080'] basic_auth: username: "Administrator" password: "admin!123456789" metrics_path: /Thingworx/Metrics scheme: http params: x-thingworx-session: - "false" - job_name: prometheus static_configs: - targets: ['localhost:9090'] - job_name: Telegraf # If telegraf is installed, grab stats about the local # machine by default. static_configs: - targets: ['host.docker.internal:9125'] This example script file uses the host.docker.internal instead of localhost for the server target for ThingWorx because it is running outside of the Docker container which contains Prometheus. This yml file configures Prometheus to monitor both ThingWorx and itself, as well as the server metrics coming from Telegraf (as long as they are configured to push). It’s a sandbox-only configuration, really, as you wouldn’t want to use the Administrator user, or have the password printed in plain text in the config file in a real system. Also note the need for the x-thingworx-session parameter, as runaway sessions which spawn every 30s or so (whatever the scrape interval is) will result in memory issues over time (so we don’t want to use sessions here). The rules file given here (prom_config_rules.yml) needs to be created separately. This is where all of the alert rules will be defined. This will determine if an alert state is happening, but without configuring the alert manager, there won’t be any notification. That isn’t covered here but is covered extensively in the Grafana docs. Here is an alert example: groups: - name: alert.rules rules: # Alert for any instance that is unreachable for >5 minutes. - alert: HighMemory expr: mem_used > 14000000 for: 1s labels: severity: page annotations: summary: "High Memory" description: "Localhost Memory Usage is High" Now, save these files and use Powershell to run the Docker container: docker run -p 9090:9090 -v C:\<path_to_document>\prom_config_localhost_scraper.yml:/etc/prometheus/prometheus.yml prom/prometheus It should download Prometheus and install it in that container (if this is the first time), allowing you to very rapidly deploy it to an endpoint of localhost:9090 by default. If there is an error like the one shown below, this means that you forgot to start Docker Desktop (the application) before opening Powershell. Docker Desktop sets system parameters required for containers to run in a command line (in Linux, it should work if Docker is installed for use by the command line, simple as that). The localhost endpoints are accessible in a browser. ThingWorx defaults to localhost:8080 endpoint. Prometheus defaults to localhost:9090. Telegraf is on port 9125. Open any of these in a browser tab to see the full monitoring stack. You can see easily if Prometheus is working by clicking “Status” > “Targets” at localhost:9090: If all of the targets appear as blue and say “last scrape” and a time stamp, then they’re working as expected. If they don’t, ensure you have the right ports, that there aren’t any firewall issues (if things aren’t all on localhost), and that everything is running without errors. The last step in the process here is to install a dashboard tool like Grafana. Once this is installed and running on localhost:3000 (by default), you can display the data from Prometheus with a few configuration steps the Grafana UI. Highlight over the settings icon in the bottom left of the screen, and then click on “Data sources”. Select the “Add data source” button, and then click on Prometheus. You have to type the URL again (localhost:9090), but most of the defaults will be ok here, and all you have to do is click “Save and test”. Now both targets should appear within Grafana, with their metrics showing up throughout the Grafana UI. This data source is what allows for the building of monitoring dashboards.

Dec 23, 2022

This video begins Module 7: Predictive & Prescriptive Scoring of the ThingWorx Analytics Training videos. It describes how a trained machine learning model takes inputs and makes predictions of different kinds, depending on the use case. It shows how scoring works in production, taking inputs from various sources and producing a score to help users make informed decisions. It also covers the concept of field importance in an individual score.

Dec 23, 2022

IoT & Connectivity Tips

IoT Implementation Guide - 3 Top Tips in Detail

Tune in to The Lean Manufacturer - Episode 1 now airing!

Today, April 9 is #WorldIoTDay - Let´s Celebrate

ThingWorx is a Market Leader By SPARK Matrix

CWC guide - Hide the Timer for Operator

ThingWorx v9.4 is available now!

Using the DBConnection Building Block to create and access tables

Dev Ops: The High Level Overview

Architecting Reason Code Trees in DPM

ThingWorx Monitoring and Alerting: Using Prometheus and Grafana, Part 2

ThingWorx Monitoring and Alerting: Using Prometheus and Grafana, Part 1

ThingWorx Analytics Training: Module 11 Part 1

ThingWorx Analytics Training: Module 10 Part 1

ThingWorx Analytics Training: Module 9 Part 3

ThingWorx Analytics Training: Module 9 Part 2

ThingWorx Analytics Training: Module 9 Part 1

ThingWorx Analytics Training: Module 8 Part 3

ThingWorx Analytics Training: Module 8 part 2

ThingWorx Analytics Training: Module 8 Part 1

ThingWorx Analytics Training: Module 7 Part 2

ThingWorx Analytics Training: Module 7 Part 1

ThingWorx Learning Paths

Transaction Handling in ThingWorx

Getting Started on the ThingWorx Platform Learning Path