IoT Tips

Jan 21, 2018

KEPServerEX requires the 32-bit version of Java if you are using the IoT Gateway Plug-in. If you do not have the 32-bit version installed and attempt to connect the IoT Gateway, the KEPServerEX Event Log will report the following error: “IoT Gateway failed to start, 32-bit JRE required." Some of the Manufacturing Applications training content relies on this Plug-in, as well. As a best practice, it is recommended that both the 32-bit and 64-bit versions of Java be installed. This install is available for download from the Oracle website, here: Java SE Runtime Environment 8 - Downloads

Jun 29, 2017

Distributed Timer and Scheduler Execution in a ThingWorx High Availability (HA) Cluster Written by Desheng Xu and edited by Mike Jasperson Overview Starting with the 9.0 release, ThingWorx supports an “active-active” high availability (or HA) configuration, with multiple nodes providing redundancy in the event of hardware failures as well as horizontal scalability for workloads that can be distributed across the cluster. In this architecture, one of the ThingWorx nodes is elected as the “singleton” (or lead) node of the cluster. This node is responsible for managing the execution of all events triggered by timers or schedulers – they are not distributed across the cluster. This design has proved challenging for some implementations as it presents a potential for a ThingWorx application to generate imbalanced workload if complex timers and schedulers are needed. However, your ThingWorx applications can overcome this limitation, and still use timers and schedulers to trigger workloads that will distribute across the cluster. This article will demonstrate both how to reproduce this imbalanced workload scenario, and the approach you can take to overcome it. Demonstration Setup For purposes of this demonstration, a two-node ThingWorx cluster was used, similar to the deployment diagram below: Demonstrating Event Workload on the Singleton Node Imagine this simple scenario: You have a list of vendors, and you need to process some logic for one of them at random every few seconds. First, we will create a timer in ThingWorx to trigger an event – in this example, every 5 seconds. Next, we will create a helper utility that has a task that will randomly select one of the vendors and process some logic for it – in this case, we will simply log the selected vendor in the ThingWorx ScriptLog. Finally, we will subscribe to the timer event, and call the helper utility: Now with that code in place, let's check where these services are being executed in the ScriptLog. Look at the PlatformID column in the log… notice that that the Timer and the helper utility are always running on the same node – in this case Platform2, which is the current singleton node in the cluster. As the complexity of your helper utility increases, you can imagine how workload will become unbalanced, with the singleton node handling the bulk of this timer-driven workload in addition to the other workloads being spread across the cluster. This workload can be distributed across multiple cluster nodes, but a little more effort is needed to make it happen. Timers that Distribute Tasks Across Multiple ThingWorx HA Cluster Nodes This time let’s update our subscription code – using the PostJSON service from the ContentLoader entity to send the service requests to the cluster entry point instead of running them locally. const headers = { "Content-Type": "application/json", "Accept": "application/json", "appKey": "INSERT-YOUR-APPKEY-HERE" }; const url = "https://testcluster.edc.ptc.io/Thingworx/Things/DistributeTaskDemo_HelperThing/services/TimerBackend_Service"; let result = Resources["ContentLoaderFunctions"].PostJSON({ proxyScheme: undefined /* STRING */, headers: headers /* JSON */, ignoreSSLErrors: undefined /* BOOLEAN */, useNTLM: undefined /* BOOLEAN */, workstation: undefined /* STRING */, useProxy: undefined /* BOOLEAN */, withCookies: undefined /* BOOLEAN */, proxyHost: undefined /* STRING */, url: url /* STRING */, content: {} /* JSON */, timeout: undefined /* NUMBER */, proxyPort: undefined /* INTEGER */, password: undefined /* STRING */, domain: undefined /* STRING */, username: undefined /* STRING */ }); Note that the URL used in this example - https://testcluster.edc.ptc.io/Thingworx - is the entry point of the ThingWorx cluster. Replace this value to match with your cluster’s entry point if you want to duplicate this in your own cluster. Now, let's check the result again. Notice that the helper utility TimerBackend_Service is now running on both cluster nodes, Platform1 and Platform2. Is this Magic? No! What is Happening Here? The timer or scheduler itself is still being executed on the singleton node, but now instead of the triggering the helper utility locally, the PostJSON service call from the subscription is being routed back to the cluster entry point – the load balancer. As a result, the request is routed (usually round-robin) to any available cluster nodes that are behind the load balancer and reporting as healthy. Usually, the load balancer will be configured to have a cookie-based affinity - the load balancer will route the request to the node that has the same cookie value as the request. Since this PostJSON service call is a RESTful call, any cookie value associated with the response will not be attached to the next request. As a result, the cookie-based affinity will not impact the round-robin routing in this case. Considerations to Use this Approach Authentication: As illustrated in the demo, make sure to use an Application Key with an appropriate user assigned in the header. You could alternatively use username/password or a token to authenticate the request, but this could be less ideal from a security perspective. App Deployment: The hostname in the URL must match the hostname of the cluster entry point. As the URL of your implementation is now part of your code, if deploy this code from one ThingWorx instance to another, you would need to modify the hostname/port/protocol in the URL. Consider creating a variable in the helper utility which holds the hostname/port/protocol value, making it easier to modify during deployment. Firewall Rules: If your load balancer has firewall rules which limit the traffic to specific known IP addresses, you will need to determine which IP addresses will be used when a service is invoked from each of the ThingWorx cluster nodes, and then configure the load balancer to allow the traffic from each of these public IP address. Alternatively, you could configure an internal IP address endpoint for the load balancer and use the local /etc/hosts name resolution of each ThingWorx node to point to the internal load balancer IP, or register this internal IP in an internal DNS as the cluster entry point.

Dec 1, 2021

Race to the finish line of IoT app development with ThingWorx Edge! Hi everyone, Today’s post comes fully packed with an intro to our ThingWorx Edge strategy. Learn how to connect your machine data to ThingWorx, how our SDKs work, why you’d want to use them and how you can deploy agents. To learn more about the Edge with ThingWorx, I spoke with Shravan. When Shravan’s not following Formula 1, learning about the latest trends in AI or playing cricket, he’s leading our development teams as the PM for the Edge. Check out what he had to say: Kaya: How can I connect my machine data to ThingWorx through a protocol that the platform does not natively support? Shravan: If you’re looking to connect your machine data to ThingWorx through a proprietary protocol and you require a 1:1 connection with the platform, we have two paths you can take. You can either: Utilize PTC’s Edge MicroServer (EMS) out-of-the-box solution. This is a stand-alone application that you can configure to send property updates, files, etc. to your instance of the ThingWorx Platform. It’s almost like a little mailman. Or, if you want to build connectivity into your own Edge devices, you can take advantage of PTC’s Edge Software Development Kits (SDKs). We offer three SDKs—C, .NET and Java—to enable you to build connectivity into your own custom application. Kaya: And how can I connect my machine data to ThingWorx if my protocol is known? Shravan: Alternatively, if you’re a customer that communicates over a known protocol, then we’d recommend using our ThingWorx Industrial Connectivity product, also known as KEPServerEX or ThingWorx Kepware Edge. For more background, I’ll go into a little more depth: KEPServerEX works in a Windows environment. It leverages OPC and IT-centric communication protocols (SNMP, ODBC and web services). It has more than 150+ drivers that help to establish stable and secure communication channels to devices, PLCs, Gateways, etc. ThingWorx Kepware Edge, launched during LiveWorx 2019, provides the most valuable features of KEPServerEX to be deployed in Linux-based gateways, starting with three key drivers: Modbus Ethernet, Allen-Bradley ControlLogix Ethernet and Siemens TCP/IP. Kaya: Okay, great. Let’s cover SDKs first. In one sentence, can you explain how our SDKs work? Shravan: The SDKs provide services for establishing a secure WebSocket connection (AlwaysON) to the ThingWorx Platform so that you can perform functions such as property updates, file transfers, tunneling, software content management (SCM), etc. Kaya: At a high level, why do SDKs leverage AlwaysON? Shravan: Typically, transferring data to a server in IoT would require leveraging either REST web services, which has high connection overhead, works over HTTP or using MQTT, which requires a server and additional open ports. So, an ideal connection should have three key features as a minimum: Stays continuously on. Is always ready to receive data and execute commands. Should also use existing open ports on firewall. ThingWorx Edge SDKs provide this ideal connection though “Always On” protocol, which is based on WebSockets. Kaya: What are the top three things a developer can do with the SDKs? Shravan: Here are my top three. Thing property updates: Events can subscribe to changes in property values and in aspects of properties. File transfer: Browse remote directories and files on an instance of ThingWorx platform and enable bidirectional file transfer between an edge device and an instance of the platform. Tunneling: Allows users to establish secure, firewall-friendly application tunnels for applications that use TCP, such as VNC or SSH. Kaya: How can you deploy SDK-based agents to devices? Shravan: You can deploy the agent in two main ways. One where the agent lives adjacent to the control and monitoring application of the machine and the other where the agent is directly integrated to the control and monitoring app of the machine. Kaya: Anything else you want to mention? Shravan: The Edge SDKs support a framework for adding additional functionality to the SDKs. This is known as the Edge Extension Framework. This allows additional functionality to be provided as installable services, which allows them to be extended in a manageable way while keeping the core SDK as compact and efficient as possible. Software Content Management (SCM) was the first commercially available Edge Extension. - - - This concludes the first installment of our series on Edge. Be on the lookout for deeper dives into KEPServerEX, SDKs, ThingWorx Kepware Edge and AlwaysOn. Reach out if you have any questions. Stay connected, Kaya P.S. For everyone who’s followed me for the past year, I’d just like to extend a huge thanks, as “Ask Kaya” has just turned one! It’s been a great first year and I’m excited for year two!

Aug 16, 2019

Business logic and actions in a ThingWorx application are driven by events. Events are interesting or critical property states that a Thing publishes to subscribers. Events are defined at the thing, thing template, or thing shape level, and can be as simple as a new data value from a device, to complex events from many data points. An event that is created on a thing shape or thing template level is inherited down into the entities that implement the shape or template that defines the event. Types of Events Built-in Custom Built-in Events Every entity in ThingWorx can use several built-in events. These events are automatically triggered when a prerequisite condition is met. In some cases, you must provide information that determines the specifics of that prerequisite. Common built-in events include: Data Change Alert Timer In ThingWorx, there are standard events and related data packets (defined by Data Shapes). The most common type of event is a Data Change related to a Thing property. The Data Change fires when a property value changes. The specifics of this event can be configured from the Data Change Info section of the Property Creation page. When you define a property, there are many configuration aspects. Example: Using DataChangeEvent, you have a few options for setting an event to fire when there is new data for a property: only if the data has changed only if the data evaluates true or false only if the new value changed beyond a defined threshold The Alert event is also attached to a property and is set up from the Property Definition page. An Alert event fires when the value of the property reaches a certain threshold or value. Depending on the base type, you may specify the condition of this event from the Manage Alerts button. Timer events can be used to run jobs or fire events on a regular basis. Things themselves can fire specific events such as Thing Start or “special” things, like a Timer type thing that contains special event(s). How to Set up and Configure Timers Creating a Custom Event To create a new custom event: In ThingWorx Composer, open up the Thing, Thing Template, or Thing Shape for edit where the new Event will be added From the sidebar menu under Entity Information, select Events Click Add My Event Provide the event with a name and a data shape that describes the data being sent to a subscriber Click Done and Save Once created, the event appears on the Subscriptions page for subscribing. Since the event is custom built, you must specify when it triggers. This is often accomplished from within a service. Best Practices There should be a subscriber to the event in the model. The subscriber is sent a data packet and the subscription is initiated. If no one is subscribed to the event (no one is listening), nothing happens. DataChange events hit the event processing subsystem (see Event Processing Subsystem below), which has fewer threads allocated to it than some other subsystems: If queries within subscriptions to data change events take too long, this will cause a back-up in the event processing subsystem Too much activity in this subsystem may result in poor system performance, and even server crashes Be cautious what types of calculations are performed in data change event subscriptions Refer to the following knowledge base article regarding tricky data change event use cases, the implementation of rules for alerts, and conditional operations: Complicated Event Calculations in ThingWorx Load Test all applications in Sandbox before committing changes to Production to ensure mashups can handle all design choices on large scales Event Processing Subsystem The Event Processing Subsystem manages event processing for external subscriptions (Things subscribing to other Things) throughout ThingWorx. Event Queue Processing Settings Base Type Default Notes Min Threads Allocated to Event Processing Pool NUMBER 16 Max Threads Allocated to Event Processing Pool NUMBER 500 Max Queue Entries Before Adding New Working Thread NUMBER 200000 Maximum number of entries to queue before adding a thread to the pool Additional Information PTC ThingWorx Help Center - Thing Events Event Types Creating and Triggering Custom Events in ThingWorx

Mar 12, 2018

May 14, 2024

Tune in to The Lean Manufacturer podcast where expert guests bring their outside view of the IIoT and discuss various aspects of manufacturing. Over the course of the series, we’ll cover some of the most important ways the IIoT can maximize manufacturing efficiency and bring value to your organization, including the need for reducing planned and unplanned downtime, enabling operational efficiency, ensuring digital continuous improvement, and so much more.

Apr 9, 2024

Apr 8, 2024

Apr 1, 2024

Dev Ops is a crucial process that exists in any software setting, whether you plan on it or not. Chaos in the dev ops process, say because less time is spent here than on the shiny new features that are easy to sell, results in bottlenecks in the dev ops process. Bottlenecks reduce efficiency, and leave you open to vulnerabilities as well. The faster you can get a change properly tested and safely into production, the safer and more stable the system is all along. Issues will arise, they always arise. Are you ready for them? Watch this video, see some of these additional links, and think about your dev ops process now, before the fires start! Useful Links: ThingWorx Monitoring and Alerting Using Prometheus and Grafana, Part 1 ThingWorx Monitoring and Alerting Using Prometheus and Grafana, Part 2 Overview of Monitoring Tools and Diagnostics The System Health Timer

Jan 31, 2023

Architecting Reason Code Trees in DPM Tori Firewind, IoT EDC What are Machine Codes? Factory hardware devices communicate status changes to their human operators and other machines (IoT) via machine codes. The manufacturers often determine the machine codes for different types of factory hardware, so those are often pre-determined. However, how the reason trees map these machine codes to corresponding business logic in ThingWorx is entirely customizable. Knowing the best way to design your reason trees for this purpose can be challenging, so this guide is here to help with your conceptual knowledge. Using the UI to create, edit, and configure reason codes in technical detail can be found in the Help Center. The Tree Trunk At the highest level of the reason tree, the trunk, there are really 3 categories: Availability (A), Performance or Productivity (P), and Quality (Q). These should look familiar; they are the three dimensions of OEE (Overall Equipment Effectiveness). Fg 1. Calculation of OEE Availability refers to long stops, events that stop planned production long enough that it makes sense to track a reason for being down (typically several minutes, but the threshold between a long stop and a short stop can vary depending on the ideal rate of production of materials). Availability = Run Time / Planned Production Time Productivity/Performance really refers to short stops, things that cause the machine to run at less-than- optimal speeds. This can include stops caused by running out of materials for production, doing minor maintenance like switching out a single, easily-changed part, or even frequent breaks due to ill health of an operator. User error can be a cause as well, say if the machine needs a certain heat to produce parts, and the heat keeps fluctuating (requiring the machine to take the time to calibrate for this before starting on production) because operators are smoking out a back door or adjusting thermostat temperatures. Fg 2. Levels of Runnable Time Operator influence often is a factor when it comes to the conditions that permit optimal performance from machinery, and every factory may face different challenges. Stops like these are not really outages; the amount of downtime isn’t enough to consider the production block entirely unproductive. Production was continuing and ongoing throughout most of the block despite the issues; the rate was just slower than ideal. Performance/Productivity = (Total Count / Run Time) / Ideal Run Rate Quality refers mostly to the number of items that are considered scrap or rework, and it can be split into two categories: start up scrap (that which is expected because the machine is in the process of warming up or being fine-tuned by the operator) and production scrap (things which come out wrong and must be tossed or reworked because the conditions under which they were produced weren’t ideal; this is called first-pass yield only, meaning it's only a "good" product if it passes inspection the first time). Quality = Good Count / Total Count The Branches and the Leaves of the Tree The “leaves” are the reason codes which directly map to machine codes , and the “branches” are the method of categorization that connects them to the trunk. Both the leaves and the trees, the children and the parent nodes of the tree, are split into two states: planned versus unplanned downtime. Changeovers, maintenance, and even scrap, can be broken down into this dichotomy. For scrap, there are startup rejects (planned, because the machines have ramp up periods) and production rejects (unplanned, because the conditions weren’t ideal). For maintenance there is planned and unplanned, small changes that occur on the fly that result in productivity loss, and maybe also reduce availability in the long run. Small, unplanned changes can occasionally shift into the availability loss category if a simple, quick repair winds up being complex and time-consuming. A good reason tree can differentiate easily between short and longer stops in order to respond to each in a deliberate way. To start off in the process of architecting your reason tree, try writing the three categories on a board in a common room in an average factory (or several as a survey). Ask operators to stop in over the course of a few days and write various machine codes that they see often and find useful under one of these categories, or more than one if the machine code pops up under different circumstances and can mean different things. Have them write a 10 word justification, if the association isn’t obvious. Gather all of the “leaves” in this way, and then begin to associate them with the “trunk”, forming the “branches”. An example tree can be seen in figure 3 here, with leaves like “Changeover” and “Maintenance” being semi-ambiguous; they could just as easily be seen as unplanned stops. Therefore, there may be multiple reason codes mapping up to the top of the tree in more than one branch, and these can have different categories, which controls how the business logic responds to the different codes. The Help Center has more details about how the events are mapped to types, and each type contains multiple categories, as configured by you when you set-up the DPM model. Fg 3: Different types of changeovers may have different codes, and can map up as either planned or unplanned, but all planned and unplanned stops (long stops) are under the Availability category of the trunk. Similarly, small stops can involve idling, like if there are not enough materials, reduced speed if the conditions are not ideal, or other small stops, usually caused by human error or unforeseeable circumstances. Quality loss then refers to the products which fail quality checks, either because the machine still has the wrong paint in the applicator and needs a few rounds to be ready for the next production item, or because the conditions are again, not ideal, and items wind up scrapped. Example Reason Tree Fg 5 example tree with more specific tags (there may be dozens or hundreds in a full reason tree, though the fewer are needed to capture the events we care about, the better). Theory of Constraints Fg 6 theory of constraint wheel: an industry process for gradual OEE improvement in factories that has been adapted into the PTC methodology as well. While architecting your reason tree, always remember the key purpose: gathering only as much data as necessary to analyze the efficiency of a factory and to identify the bottleneck, or the most limiting factor. The important point is to identify not just the bottleneck that seems the most troublesome, but the one that actually results in the greatest impact to OEE across the entire factory. Without software like DPM, and a properly designed reason code tree, the process of improving a factory can be very challenging, involving a lot of guesswork, and sometimes solving one problem at the cost of another. The issue is that these machines produce a LOT of raw data, and humans are not the best tool available to gather and aggregate this data in a consumable way. A good reason tree ensures a smart application that can quickly prioritize the machine (bottleneck) that most impacts production, and not just the machine that functions in the least optimal way. So, the theory of constraints is really a process for identifying small, incremental changes, which together can make a big difference, and fast, in factory OEE. The rate at which this cycle can be completed varies, however. The slower the process of identifying constraints and the less information that is gathered, the slower and less precise the first two steps of this process. Alternatively, in a traditional constraint identification process, too much information can be a problem as well, due to human limitation, as discussed above. So, DPM is a great benefit in this regard, because it aggregates the data into a consumable, comparable way every 5 minutes, freeing up your human analytics for problem solving and prioritization, and not data gathering and sorting. Other Key Tips Also remember that a good tree treats the trunk like a whole unit, with each category occupying a percentage of the overall OEE. Afterall, look back up at the 3 dimensions of OEE in the equation above. For example, the more you see issues with availability, the less you will see issues with scrap, for the machine simply doesn’t have as much time to produce scrap if it is constantly down. The more you see issues with quality loss, the less you should see of productivity loss, because these are simply inversely proportional modes; to say it differently, if a machine is running quickly and seeing few minor maintenance stops, then it is likely to produce more scrap (as well as more good product as well). Another thing to remember is that even DPM is limited in its capacity to interpret raw data. Even while many magnitudes more efficient than any human gathering and analysis could ever be, there is an upper limit to how much raw data DPM can ingest and analyze before the system gets very expensive. For this reason, you want to ensure your reason trees use only as many reason codes as are required to capture the OEE of a factory site. This will mean using different codes for different types of things, most likely, which is easy to do maintainably across many sites using thing shapes. Keeping things tightly defined and organized is the easiest way to ensure a clean, efficient system for gathering and storing data. Also remember that data will not need to persist very long once DPM is fully operational and adopted by your factories. DPM ensures that the changes made to the production line to improve efficiency are the highest impact, and the least difficult to implement, meaning that there will be a very rapid return on investment, and a process to ensure future issues are identified and resolved quickly. Data from past issues in the factory won’t be as relevant, and historical data stores can be kept smaller than one might think. It is the power of ingesting data directly into the processing and aggregation process, the automatic reduction of data down into presentable, consumable webpages, that makes DPM and ThingWorx such a great factory solution for optimizing OEE.

Jan 24, 2023

ThingWorx Monitoring and Alerting, Part 2 Using Prometheus and Grafana By Tori Firewind, IoT EDC Building Dashboards To add a panel which monitors some component of the ThingWorx application to a dashboard in Grafana, click to add a new panel. Under “Metrics” in the box at the bottom of the screen, select what ThingWorx metrics you wish to monitor (type “thingworx” in the search box to see them all). For example, select the Platform Subsystem memory in use: Label filters aren’t necessary, though you may want to sort by instance if you are monitoring multiple ones with the same dashboard. You may also want to take some time to format the Y axis, which by default will show in bytes. Go to the formatting panel on the right side and scroll down to the section called “Standard options”. For the Unit dropdown, start typing “data” and then select “bytes (SI)”. This will automatically determine if the bytes you’ve provided should really print as MB or GB based on how large the numbers are. Rename the panel, modify it in any other way desired, and then click Apply (last 5 minutes): Once you add the panel, you can watch the memory usage as it is scraped by selecting the refresh option (10s or 30s, whatever makes sense based on your scrape interval). The viewing window is stored in the URL, so that you can generate a report for a specific interval (like when a test was occurring), and then store that result or share it in a more compact way: http://localhost:3000/d/nleucPv4k/thingworx-monitoring?orgId=1&from=1668528038732&to=1668536503953 (absolute timestamps): Dashboards are just collections of panels which report on all of the various metrics of performance and stability that exist for single components of a system. This is because there can be quite a few metrics worth watching for each individual component. Most of the third-party tools come with their own dashboards, but the ThingWorx component is one which for now, requires some thought and creativity. Consider your use case carefully and look over the various subsystems contained within ThingWorx. Each part of an application is localized to specific subsystems, and some are more business critical than others. What will go at the top of your dashboard? Add rows, add panels per row, and see what the many choices are for watching your system. Don’t forget that with Telegraf running, VM or machine usage metrics are also available for display on a dashboard. Things like overall CPU and Memory usage are critical to determining the health of a system, as we have demonstrated in our own reasoning in past benchmarks and scale tests. You can create a panel to monitor the mem_used versus mem_total, like so: Another metric from Telegraf worth adding is the CPU usage, which should be given “percentage” for the units and which needs a label filter of cpu = cpu-total. If we do some resizing and drag-and-dropping, then we now have the first row of a dashboard: See how the Platform usage climbs steadily and is purged in a cycle? That is the Java Garbage Collection mechanism, and it’s important to remember to leave room for spikes on top of those peaks. Data can also be calculated or processed in some way to make it more useful for determining system health and stability. The data in the picture below uses the formula submitted = completed + number queued + number failed. It shows the current queue on the left Y-axis and the max queue on the right (since the two numbers usually are drastically different). It looks pretty, but it doesn’t really tell us much about the system in this format, so let’s do some math and find a representation that is a bit more helpful. Performing a “non-negative derivative” calculation over the submitted and the completed queue counts over time allows for us to look at the status of the queue as a velocity. When the “complete” speed appears behind the “submitted” speed for too long at a time, then that means the queue is filling up and will eventually result in data loss. If we take this one step further and calculate the average of the submitted minus the completed over time, then we can actually predict approximately when the queue will fill up. This can then be displayed on a dashboard in Grafana, or used as the basis for an alert. What to Monitor In addition to monitoring the system which ThingWorx runs upon, ThingWorx itself can easily be monitored down to the subsystems level by Prometheus due to the Metrics endpoint. Many applications have support built into the way they format the data for scraping, including the JVM (which exposes Prometheus-formatted metrics with the JMX Exporter) and the OS (which can use the Node Exporter or Telegraf for the same purpose). For these more generic components, there are popular community dashboards which can be downloaded and used in Grafana for data analysis and review. For ThingWorx, there’s different kinds of data to track: subsystem data (see the list on the right) and non-subsystem data. There’s queue based versus non-queue data. These different metrics can collectively characterize the overall health of the application, depending on the use case. For instance, if this is a system with very many connected devices, one metric which may be important to track is the number of total devices defined on the Foundation server vs. the number of devices which are currently connected. If there are relays involved, then many devices suddenly going offline can mean a relay has failed. Another example is if the system sizing depends on an assumption that there will only ever be a fraction of the total number of devices connected at a time. Use cases like these could be monitored easily by keeping track of the total vs. the number of connected devices. Other common indicators of a healthy ThingWorx application might include the value stream and stream queues. These queues should fluctuate over time as the data is ingested and processed, but they should never be growing in size. If the stream queues are growing, then that means the data is writing to the queues faster than the queues can write to the database. Eventually, when the system runs out of resources to keep track of the queues, data will be lost. Having the stream information displayed in a chart can make it very easy to spot an upward trend in resource usage early on, which can catch a blockage or bottleneck that needs attention before it starts to affect the larger system in catastrophic ways later. Memory usage information from the various subsystems might be something worth tracking, as well as the event queue. These can indicate that the business logic is functioning with room to handle spikes, and that the server has enough memory to service all three dimensions of an IoT application: the ingestion, the business logic and thing-based alerting, and the user experience and UI. If file transfers are a key part of the use case, then the number of concurrent transfers, the average speed of them, the size of the files, all of this kind of stuff can be tracked and charted in Grafana by making use of the ThingWorx metrics which automatically show up there once you import the Prometheus data source. A mature dashboard used for a production environment might look a little like this: For further reading about subsystem monitoring, check out the Help Center. How to Alert The alerting mechanism built-in to Prometheus is incredibly easy to configure, so it might be tempting to generate tons of alert rules. However, remember that the more noise a system makes, the harder it is for those monitoring that system to know when action is really required. Playbooks which document how to respond to alerts, who to contact, how to act, and all the information necessary to handle an alert, should be created as an ongoing part of the DevOps process. Alerts should fire with the right severity in the subject line, as well as all of the information about the issue that is currently known, presented in a concise way, so that whoever receives the alert starts thinking about the root cause sooner and recovers the system faster. Those who receive the alert should have the ability to facilitate its resolution, and know who is expected to react to any alerts which come in. In the ThingWorx monitoring stack, Prometheus handles the alert rules and the generation of alerts, but alert filtering and delivery is managed in an external alerts manager. Generally, you want your alerts to follow a curve. If the current queue size exceeds 50% of the maximum, perhaps that isn’t a huge deal, if the application catches up quickly. How long are spikes in queue processing expected to last? Perhaps if the queue size is over half-full for 10 seconds, 30 seconds, then that means the queue is falling behind and not catching up. Ok, so this might be a warning level alert. When does this become an error? Well, let’s say the queue exceeds 90% of the max queue size. This might want to alert the moment it hits the mark. Now, farther along the curve, it may not take as long before data gets lost. As the severity of the situation increases, the threshold for alerting should increase as well. That way when errors do alert, it is a sure thing that they require a response immediately. The alerts are then pushed into the “Alerts Manager” for delivery based on your management rules. The Alerts Manager may decide to withhold warnings altogether, or send them to a much smaller mailing list, whatever filtering helps to ensure the right people receive the right alerts, right when they need them. In Conclusion, A Healthy Application... Has stable memory usage that fluctuates predictably and doesn’t grow over time. In a system experiencing mild issues, the memory starts to trend upward: If left unattended, systems like this may eventually experience outages. Finding the issue this early means there is even time to do some digging, debugging, taking of stack traces, and other such troubleshooting steps before the system must be restarted or recovered. That can really make the difference in identifying and resolving before there are real problems. One metric which makes for good alerting is the total number of failed stream entries, which can indicate there’s an issue writing to the database even before the queue has started to fill up. Other alerts may include warnings and errors based around percentages of memory used or queues filled, which depend on how long the queues take to fill up and how long the state has been at its increased usage. Prometheus has all of the tools necessary to make this possible across a variety of infrastructures and use cases. Set it up on a local machine and poke around at what ThingWorx metrics are available to meet your monitoring needs.

Dec 23, 2022

This video begins Module 7: Predictive & Prescriptive Scoring of the ThingWorx Analytics Training videos. It describes how a trained machine learning model takes inputs and makes predictions of different kinds, depending on the use case. It shows how scoring works in production, taking inputs from various sources and producing a score to help users make informed decisions. It also covers the concept of field importance in an individual score.

Dec 23, 2022

IoT Tips

ThingWorx Tutorials: Introduction to Streams

Why do I need to install 32-bit java even though I am using a 64-bit operating system?

Distributed Timer Execution in a HA Cluster

Q&A: What should I know about ThingWorx Edge features?

ThingWorx Events Explained

IoT Implementation Guide - 3 Top Tips in Detail

Tune in to The Lean Manufacturer - Episode 1 now airing!

Today, April 9 is #WorldIoTDay - Let´s Celebrate

ThingWorx is a Market Leader By SPARK Matrix

Dev Ops: The High Level Overview

Architecting Reason Code Trees in DPM

ThingWorx Monitoring and Alerting: Using Prometheus and Grafana, Part 2

ThingWorx Analytics Training: Module 10 Part 1

ThingWorx Analytics Training: Module 9 Part 2

ThingWorx Analytics Training: Module 9 Part 1

ThingWorx Analytics Training: Module 8 Part 3

ThingWorx Analytics Training: Module 8 part 2

ThingWorx Analytics Training: Module 8 Part 1

ThingWorx Analytics Training: Module 7 Part 2

ThingWorx Analytics Training: Module 7 Part 1

ThingWorx Analytics Training: Module 6 Part 5

ThingWorx Learning Paths

Getting Started on the ThingWorx Platform Learning Path