IoT Tips

Distributed Testing with JMeter

Overview

Running JMeter at the scale required by most customers demands considerations beyond those discussed in the previous two articles. At scale, a test may need to simulate thousands of users, which requires more than one JMeter client, set up on one or many hosts. This third JMeter article shows how to do that in a tutorial on distributed testing.

Distributed Testing

Figure: Remote testing configuration in which the main JMeter client is located at one IP address, controlling the rest as they step through their own copies of the JMeter tests, based on their own unique data files as necessary, to simulate a user load across a network, a series of regions, or simply across many machines if limited by the size of the physical hardware (JMeter link for this image in the text body below).

One key aspect of a proper JMeter load test is distributed or remote testing, i.e. making use of more than one JMeter client at a time to simulate the user load on the Application server. There are many reasons to use a network of clients like this: mimicking cross-region user access to the Foundation server, simulating different levels of latency for different users, and increasing the overall number of users which can contribute to the load test, while minimizing the performance cost of hosting that many threads on any single server.

A single JMeter client has a practical limit of 150-250 threads across all groups and requires about 1 CPU and 8 GB of RAM. Beyond this point, the amount of garbage collection and other processing each client has to do becomes substantial. Because the client processes its own data and sends requests to the Application server at the same time, there are diminishing returns, and the responses begin to take longer (or errors start occurring) simply because of resource starvation within the client process rather than on the Application server.
Therefore, distributed testing is required for most customers doing larger load tests using JMeter. Many applications will have more than a few hundred users and/or will have users accessing the system from a variety of regions and networks, each of which could have significantly different network latency. So, in order to work within the limitations of the JMeter executable and address regional concerns, distributed or remote testing is typically required for almost all of PTC’s customers who scale test with JMeter.

With a simple (monolithic) distributed test, all of the JMeter clients are located on the same host and share an IP address, but each must be configured with a unique RMI port to connect to the controlling process. If these are located on a VM, then the resource specifications can simply be increased and the VM sized larger as necessary to ensure the network of JMeter clients runs as expected. Each JMeter client requires around 8 GB for its heap size and 1 CPU (with some additional resources for the host operating system).

Multi-hosted testing becomes the required option when limited by physical hardware (or a relatively small VM hardware host). If there are only 4-core, 32-GB machines, then plan for one machine per 3 JMeter clients. If simulating thousands of users, this could mean half a dozen machines or more are required, which can still sometimes work out to be more cost efficient than one large 256-GB VM hosted in the cloud. Using many hosts in different physical locations can also simulate regions with different network characteristics.

A tutorial for distributed testing on one host is shown here. For more information, see the Apache web articles on each topic: Remote Testing and Distributed Testing Step by Step.

Tutorial: Set Up Distributed Test on One Host

Copy the source directory for the whole JMeter project and rename it as many times as required.
Here there are 22 JMeter clients side-by-side on a single, 256-GB VM (3000+ users):

Each directory (shown above) is identical, except that the “jmeter.properties” files (found in the bin directory in each project) have unique settings, namely the RMI port:

Each JMeter client must contain a copy of the same test scripts found on the main server:

In the “jmeter.properties” file for the main server, specify the IPs and ports for each remote/distributed client (under remote_hosts), as shown:

In this image, the IPs are all the same, with just the port differing from client to client. Here only 4 clients are in use, with the rest commented out for future tests. This is how to scale up and test incrementally more users each time. Just add another client to add another 150-250 users, until eventually the target number of users is reached, or the Application server is saturated. These IPs will differ if doing a true remote test, with each being the server location of the JMeter client within the same network. Each combination of IP address and port still needs to be unique, and communication between the overall JMeter controller and the clients over the RMI ports needs to be allowed by the network/firewalls.

Note that the number of users is set using the parameter under “Test Plan” which was set up last time. This value represents the number of users by specifying the number of threads per thread group, and it can remain the same for every client or vary from client to client, for instance if one region is smaller than another. The “Test Plan” parameters are shown here:

To optionally start all of the clients at once in preparation for test execution, create a basic batch or shell script which goes to the bin directory of each agent and calls the start command: “jmeter-server”. In this image from a Windows JMeter host, only the first few agents are in use, but removing the “rem” to uncomment the other start command lines in this file would add more servers to be started.
Note how the Java parameter java.rmi.server.hostname must match the main JMeter client network configuration here for them to connect (see the Apache links above for more information). This will start each of them in its own CMD window which, once closed, will terminate that JMeter client process.

Parameters like rampUp time within the main test script apply per client, so the overall load scales with the number of client processes. For example, 100 users and a 300-second rampUp with 4 clients results in 400 overall user threads that are all logged in after 300 seconds.

Once all clients are running, click Remote Start All to start the test across every server from the GUI (usually for debugging), or execute the test from the command line:

    jmeter -n -r -t <test.jmx> -l <results.jtl>

The main server sends the actions to the remote clients to run, so all the clients need is input parameters. For instance, a CSV file may exist in each directory which has different data from client to client, to create pseudo-random user loads and represent different kinds of user activity. The file shown in this image is different, and unique, in each of the client directories:

Conclusion

Here, we learned how to horizontally scale the load test, setting up more JMeter clients to facilitate larger, more complete user loads. We also discussed the difference between distributed and remote testing, and how the former is easier to set up and use, especially on VMs, but the latter might be better for simulating regional differences and the impact of network latency. The latter will likely also be required if there are hardware constraints to consider, since each JMeter client needs about 8 GB for its heap, and roughly another 8 GB of memory and a spare core are needed for every 3 JMeter clients to cover the communication and processing of data. Stay tuned for the next article on generating and reviewing the results of the load tests.
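
The sizing arithmetic in this article can be sketched in a few lines of JavaScript. This is only a rough planning aid, not part of JMeter: the helper names (planDistributedTest, buildRemoteHosts, totalUsers), the IP address, and the port numbers are all hypothetical, and the per-client figures are simply the estimates quoted above.

```javascript
// Assumptions taken from the article: roughly 150-250 threads and 8 GB of heap
// per JMeter client, and 3 clients per 4-core, 32-GB host.
var HEAP_GB_PER_CLIENT = 8;

// Hypothetical helper: how many clients and hosts a target user count needs.
function planDistributedTest(targetUsers, threadsPerClient, clientsPerHost) {
    var clients = Math.ceil(targetUsers / threadsPerClient);
    var hosts = Math.ceil(clients / clientsPerHost);
    return { clients: clients, hosts: hosts, heapGb: clients * HEAP_GB_PER_CLIENT };
}

// Hypothetical helper: builds the remote_hosts line for clients that share one
// IP address and differ only by port (the port numbers are illustrative).
function buildRemoteHosts(ip, firstPort, clientCount) {
    var entries = [];
    for (var i = 0; i < clientCount; i++) {
        entries.push(ip + ":" + (firstPort + i));
    }
    return "remote_hosts=" + entries.join(",");
}

// Each client runs its own copy of the thread groups, so user counts multiply
// across clients while the ramp-up time stays the same.
function totalUsers(usersPerClient, clientCount) {
    return usersPerClient * clientCount;
}

console.log(planDistributedTest(3000, 150, 3));
console.log(buildRemoteHosts("127.0.0.1", 1099, 4));
console.log(totalUsers(100, 4));
```

Planning with the low end of the 150-250 thread range leaves headroom on each client; the real limit depends on the test script and should be confirmed by watching client-side resource usage during a trial run.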

Aug 26, 2020

Smoothing Large Data Sets

Purpose

In this post, learn how to smooth large data sources down into what can be rendered and processed more easily on Mashups. Note that the Time Series Chart widget is limited to loading 8,000 points (hard-coded). This is because rendering more points than this is almost never necessary or beneficial, given that the human eye can only discern so many points and the average monitor can only render so many pixels. Reducing large data sources through smoothing is a recommended best practice for ThingWorx, and for data analysis in general.

To show how this is done, there are sample entities provided which can be downloaded and imported into ThingWorx. These demonstrate the capacity of ThingWorx to reduce tens of thousands of data points based on a "smooth factor" live on Mashups, without much added load time required. The tutorial below steps through setting these entities up, including the code used to generate the dummy data.

Smoothing the Data on Mashups

Create a Value Stream for storing the historical data.

Create a Data Shape for use in the queries. The fields should be:
    TestProperty - NUMBER
    timestamp - DATETIME

Create a Thing (TestChartCapacityThing) for simulating property updates and therefore Value Stream updates. There is one property:
    TestProperty - NUMBER - not persistent - logged

The custom query service on this Thing (QueryNamedPropertyHistory) will have the logic for smoothing the data. Essentially, many points are averaged into one point, reducing the overall size, before the data is returned to the mashup. Unfortunately, there is no built-in (OOTB) service to do this.
The code is here (input parameters are to - DATETIME; from - DATETIME; SmoothFactor - INTEGER):

    // This is just for passing the property name into the query
    var infotable = Resources["InfoTableFunctions"].CreateInfoTable({infotableName: "NamedProperties"});
    infotable.AddField({name: "name", baseType: "STRING"});
    infotable.AddRow({name: "TestProperty"});

    var queryResults = me.QueryNamedPropertyHistory({
        maxItems: 9999999,
        endDate: to,
        propertyNames: infotable,
        startDate: from
    });

    // This will be filled in below, based on the smoothing calculation
    var result = Resources["InfoTableFunctions"].CreateInfoTable({infotableName: "SmoothedQueryResults"});
    result.AddField({name: "TestProperty", baseType: "NUMBER"});
    result.AddField({name: "timestamp", baseType: "DATETIME"});

    // If there is no smooth factor, then just return everything
    if (SmoothFactor === 0 || SmoothFactor === undefined || SmoothFactor === "") {
        result = queryResults;
    } else {
        // Increment by smooth factor
        for (var i = 0; i < queryResults.rows.length; i += SmoothFactor) {
            var sum = 0;
            var count = 0;
            // Increment by one to average all points in this interval
            for (var j = i; j < (i + SmoothFactor); j++) {
                if (j < queryResults.rows.length) {
                    sum += queryResults.getRow(j).TestProperty;
                    count++;
                }
            }
            // Use count because the last interval may be shorter than the smooth factor
            var average = sum / count;
            result.AddRow({TestProperty: average, timestamp: queryResults.getRow(i).timestamp});
        }
    }

Create a Timer for updating the property values on the Thing.
The Timer should subscribe to itself, containing this code (ensure it is enabled as well):

    var now = new Date();
    if (now.getMilliseconds() % 3 === 0) {
        // Randomly reset the number to simulate outliers
        Things["TestChartCapacityThing"].TestProperty = Math.random() * 100;
    } else if (Things["TestChartCapacityThing"].TestProperty > 100) {
        Things["TestChartCapacityThing"].TestProperty -= Math.random() * 10;
    } else {
        Things["TestChartCapacityThing"].TestProperty += Math.random() * 10;
    }

Don't forget to set the runAsUser in the Timer configuration. To generate many property values, set the updateRate to a small value, like 10 milliseconds. Disable the Timer after many thousands of property values are logged in the Value Stream.

Create a Mashup for displaying the property data and the capacity of the query to smooth the data. The Mashup should run the custom query service created above (QueryNamedPropertyHistory) on load. The service input comes from widgets on the mashup:

Bindings:

Place a Time Series Chart widget in the bottom of the Mashup layout. Bind the data from the query to the chart.

View the Mashup. Note the difference in the data...

All points in one minute:

And a smooth factor of 10 in one minute:

Note that the outliers still appear, and the peaks are much easier to see. With fewer points, trends become easier to spot and data is easier to understand. For monitoring the specific nature of the outliers, utilize alerts and other types of displays.

Alternative forms of data reduction could involve using the median of each interval (given by the smoothing factor) instead of the mean, or the min or max, as needed for the specific use case. Display multiple types of these options for an even more detailed view. Remember, though, that the more data needs to be processed, the slower the Mashup will load. As usual, ensure all mashups are load tested and that the number of end users per Mashup is considered during application design.
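
To experiment with the smoothing logic outside of ThingWorx, the same bucketing idea can be sketched against plain JavaScript arrays. This is a simplified stand-in rather than the service itself: the function names (smooth, mean, max, min) are hypothetical, rows are ordinary objects instead of Info Table rows, and the reducer argument is an assumed way to swap the average for another form of data reduction.

```javascript
// Reducers over an array of numbers; any of them can be passed to smooth().
function mean(values) {
    var sum = 0;
    for (var i = 0; i < values.length; i++) {
        sum += values[i];
    }
    return sum / values.length;
}

function max(values) {
    return Math.max.apply(null, values);
}

function min(values) {
    return Math.min.apply(null, values);
}

// Collapses every smoothFactor consecutive rows into one row. Each output row
// keeps the timestamp of the first row in its interval, as the service does.
function smooth(rows, smoothFactor, reducer) {
    if (!smoothFactor || smoothFactor < 1) {
        return rows.slice(); // no smoothing requested: return everything
    }
    var reduce = reducer || mean;
    var result = [];
    for (var i = 0; i < rows.length; i += smoothFactor) {
        // slice() stops at the end of the array, so the last interval may be shorter
        var interval = rows.slice(i, i + smoothFactor);
        var values = interval.map(function (row) { return row.TestProperty; });
        result.push({ TestProperty: reduce(values), timestamp: interval[0].timestamp });
    }
    return result;
}

// Example with five rows of dummy data and a smooth factor of 2
var rows = [1, 2, 3, 4, 5].map(function (value, index) {
    return { TestProperty: value, timestamp: index };
});
console.log(smooth(rows, 2));
console.log(smooth(rows, 2, max));
```

Passing max or min as the reducer keeps the outliers at full height, which may suit monitoring use cases better than the average.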

Sep 3, 2019

ThingWorx 8.5 Sizing Guide Sizing is a very important part of the application design process, answering such questions as: how much hardware is required? What specifications does this hardware need to handle the expected load? And therefore, what is the overall cost of setting up and maintaining the ThingWorx environment? Properly sizing the environment before development begins ensures that there are no unexpected costs or limitations to application functionality later on down the road. "Hardware sizing is driven by many factors - some more easily calculated than others", as stated in the new ThingWorx 8.5 Sizing Guide. "Measures like data streaming frequency (the data ingestion component) and HTTP request volume (the data visualization component) are easily calculated... However, sizing considerations for the data processing component of the application can depend largely on business use cases and application design." Enterprise-Ready applications have the capacity to handle all aspects of an IoT application, from data ingestion and processing to data visualization, as detailed in the friendly infographic above, which many will recognize from LiveWorx. Inside the ThingWorx 8.5 Sizing Guide, there are formulas designed to help size the more analytical aspects of the application, as well as descriptions of other factors and how they (conceptually) play a role in sizing. There are also two application design examples which step the reader through the calculations, the comparisons, and the selections of hardware for each use case. New in this version, these examples have been simulated in the real world to prove that the theory behind these calculations is sound, and to demonstrate the full process of designing, sizing, and testing an application. 
One of the examples (shown here) sizes a Connected Product Solution, something which has many, many remote things in the field, each writing to the Platform at a slower rate, for consumption by a large number of general users, who don't access the same mashups many times nor refresh their view very often. The second example is much more complex, modeling an industrial use-case, where there are many different kinds of users each accessing the mashups many times, fewer things, and more variations in the types of properties each thing possesses. These examples are designed to help anyone with any use case step through the sizing of their application properly. Please check out the new ThingWorx 8.5 Sizing Guide, especially because each version of ThingWorx is different and must be sized accordingly. Comments and questions about the guide are very welcome right here on this thread!

Jan 30, 2020

The Property Set Approach

This article details an approach developed by Prachi Rath and Roy Clarke, refined by the EDC team in the December 2019 Remote Monitoring of Assets Reference Benchmark, and used to handle multi-property business rules in an Enterprise ThingWorx application.

Introduction

If there are logic rules which depend upon multiple properties, and each property receives its updates one at a time, then each property will need to have an identical subscription, because there is no way for any one subscription to know the most up-to-date values for the other properties. This inefficient approach would create redundancy and sizing constraints, reducing the capacity of the application to scale up to the Enterprise level. The Property Set Approach resolves this issue by sending in all property updates as one Info Table or JSON property (called the “property set”), which can then have a single subscription. The property set is assembled on the Edge when an update needs to be sent, and then the Platform dissects, processes, and stores the data within this property set as required by the business logic.

This approach also involves caching the last property value into a runtime variable so that it can be referenced within the business logic subscription without having to be retrieved from the database. This can significantly improve the runtime of the subscription, reducing the resources required to sustain the business logic and ensuring that any alerts or events resulting from the business logic occur as soon as possible. It also reduces the load on the database, ensuring that data ingestion can complete unhindered.

So, while there are many benefits to this approach, it is also more complicated. It tightly couples the development of the Edge and Platform code and increases the application complexity, making the application somewhat harder to maintain in the long run.
The property set also requires a little more bandwidth and a more stable internet connection between Edge devices and the Platform, since there is more metadata in an Info Table property, and therefore every update is slightly larger than it would be otherwise. So this approach is only recommended when multi-property rules are a requirement of the application and a stable internet connection exists between the Edge and Platform.

Platform Implementation

I. Create an Info Table (or JSON) Property

This tutorial uses the out-of-the-box Data Shape called NamedVTQ for the Info Table property, which is defined on a Thing Template as a remote property. It is important that this is not marked as persistent or logged, as the purpose is to reduce the number of database writes and reads required by the Platform. The Info Table property has the following property definition:

    <PropertyDefinition aspect.dataChangeType="ALWAYS" aspect.dataShape="NamedVTQ" aspect.isPersistent="false" baseType="INFOTABLE" isLocalOnly="false" name="numberPropertySetAsInfotable"/>

II. Create a Data Change Event Subscription for the Info Table Property

The subscription has three parts:

1. Cache the last value for the property in a runtime variable
2. Start off the business rules processing, sending in the whole Info Table
3. Send the Info Table to be logged as individual local properties in the database

    // First step caches the last value; refer to the next step (III. Set-Up Caching)

    // Second step sets off the business rules processing with the Info Table
    me.ScaleTestBusinessRuleForPropertySetAsInfotable({
        PropertySetAsInfotable: eventData.newValue.value
    });

    // Third step sends the Info Table as one property into a service which parses it into the
    // individual properties, updating both the runtime properties on the remote thing and the database
    me.UpdatePropertyValues({
        values: eventData.newValue.value /* INFOTABLE */
    });

III. Set-Up Caching

Each property which needs to be cached should be created on the Thing Template level and named in a similar way, say by placing the word “Last” at the end, such as “Property1” => “Property1Last”, “Property2” => “Property2Last”, etc. This property should NOT be logged or persistent, as the point of this is to store the most recent value in memory, removing any superfluous dependency on database queries in the process. Note that while storing the property in runtime memory makes it much more accessible, it also means that the property needs to be rewritten manually upon Platform restart. Additional code (not provided here) must be written to populate these properties from the database upon application start-up.

The following code should be placed in the data change event subscription (option 1 in the case where only a few properties need caching, or option 2 if every property value needs to be cached):

Option 1: Some but Not All Properties Need Caching

    // Names of properties for which you want to cache the last value
    var propertyNames = ['number1', 'number2'];

    // Loop through the properties and cache their last value if they are found in the property set
    propertyNames.forEach(assignLast);

    // This function can be split into two functions for Age and Last separately if need be
    function assignLast(propertyName) {
        logger.debug("Looping for property -> " + propertyName);
        var searchprop = new Object();
        searchprop.name = propertyName;
        var property = eventData.newValue.value.Find(searchprop);
        if (property) {
            logger.debug("Found Row. Name= " + property.name);
            var lastPropertyName = propertyName + "Last";
            if (property.value) {
                // Set the cache property on me, this entity, to the current property value
                me[lastPropertyName] = me[propertyName];
            }
        } else {
            logger.debug("Property Not Found in property set -> " + propertyName);
        }
    }

Option 2: All Properties Need Caching

    var rowCount = eventData.newValue.value.getRowCount();
    for (var i = 0; i < rowCount; i++) {
        logger.warn("property name->" + eventData.newValue.value[i].name + "----- property new value->" + eventData.newValue.value[i].value.value);
        var propertyName = eventData.newValue.value[i].name;
        var lastPropertyName = propertyName + "Last";
        me[lastPropertyName] = me[propertyName];
        logger.warn("done last subscription, last property value for " + lastPropertyName + " -> " + me[lastPropertyName]);
    }

Useful Platform Code Snippets

I. Age Calculation

    var date1 = new Date();
    var date2 = me.GetPropertyTime({
        propertyName: propertyName /* STRING */
    });
    var result = millisToMinutesAndSeconds(dateDifference(date1, date2));

    // This function converts from an unintelligibly large number in milliseconds
    // to something formatted in minutes and seconds
    function millisToMinutesAndSeconds(millis) {
        var minutes = Math.floor(millis / 60000);
        var seconds = ((millis % 60000) / 1000).toFixed(0);
        return (seconds == 60 ? (minutes + 1) + ":00" : minutes + ":" + (seconds < 10 ? "0" : "") + seconds);
    }

II. Sort the Info Table by Time

    var params = {
        sortColumn: "time" /* STRING */,
        t: me.propertySet /* INFOTABLE */,
        ascending: ascending /* BOOLEAN */
    };
    var result = Resources["InfoTableFunctions"].Sort(params);

III. Search the Info Table for a Property

    var searchprop = new Object();
    searchprop.name = propertyName;
    var property = PropertySetAsInfotable.Find(searchprop);
    if (!property) {
        logger.info("Property Not Found -> " + propertyName);
    } else {
        logger.info("Found Row. Name= [" + property.name + "], value= " + property.value.value);
    }

Edge Implementation

This example implementation uses the .NET Edge SDK to build a property set Info Table at the Edge.

I. Define the Data Shape

A standard Data Shape is used (NamedVTQ), but because this Data Shape is not exposed in the Edge SDK code, it has to be created manually.

    // Data Shape definition for NamedVTQ
    FieldDefinitionCollection namedVTQFields = new FieldDefinitionCollection();
    namedVTQFields.addFieldDefinition(new FieldDefinition(CommonPropertyNames.PROP_NAME, BaseTypes.STRING));
    namedVTQFields.addFieldDefinition(new FieldDefinition(CommonPropertyNames.PROP_VALUE, BaseTypes.VARIANT));
    namedVTQFields.addFieldDefinition(new FieldDefinition(CommonPropertyNames.PROP_TIME, BaseTypes.DATETIME));
    namedVTQFields.addFieldDefinition(new FieldDefinition(CommonPropertyNames.PROP_QUALITY, BaseTypes.STRING));
    base.defineDataShapeDefinition("NamedVTQ", namedVTQFields);

II. Define the Info Table Property

The property defined should NOT be logged or persistent, and it can be read-only, since data is always pushed from the Edge and read from the server cache when accessed on the Platform. Note that the push type of the Info Table property MUST be set to "ALWAYS" (if set to "VALUE", the data change event will only fire if the number of rows changes).

    // Property Set Definitions
    [ThingworxPropertyDefinition(
        name = "DevicePropertySet",
        description = "Alternative representation of properties as an Info Table for rules processing",
        baseType = "INFOTABLE",
        category = "Status",
        aspects = new string[] {
            "isReadOnly:true",
            "isPersistent:false",
            "isLogged:false",
            "dataShape:NamedVTQ",
            "cacheTime:0",
            "pushType:ALWAYS"
        }
    )]

III. Define a Property to Store the GOOD Quality Status

    private static String QUALITY_STATUS_GOOD = QualityStatus.GOOD.name();

IV. Define Functions to Populate the Value Collections

An Info Table is really just made up of many Value Collections, where each Value Collection is considered a row. These functions take in the name and value of a property and return a Value Collection object which can be added to the property set Info Table.

    public ValueCollection createNumberValueCollection(String name, double value)
    {
        ValueCollection vc = new ValueCollection();
        // Add quality and time entries to the Value Collection
        vc.SetStringValue(CommonPropertyNames.PROP_QUALITY, QUALITY_STATUS_GOOD);
        vc.SetDateTimeValue(CommonPropertyNames.PROP_TIME, new DatetimePrimitive(DateTime.UtcNow));
        // Add the name and value of the property itself
        vc.SetStringValue(CommonPropertyNames.PROP_NAME, name);
        vc.SetNumberValue(CommonPropertyNames.PROP_VALUE, value);
        return vc;
    }

    public ValueCollection createBooleanValueCollection(String name, Boolean value)
    {
        ValueCollection vc = new ValueCollection();
        // Add quality and time entries to the Value Collection
        vc.SetStringValue(CommonPropertyNames.PROP_QUALITY, QUALITY_STATUS_GOOD);
        vc.SetDateTimeValue(CommonPropertyNames.PROP_TIME, new DatetimePrimitive(DateTime.UtcNow));
        // Add the name and value of the property itself
        vc.SetStringValue(CommonPropertyNames.PROP_NAME, name);
        vc.SetBooleanValue(CommonPropertyNames.PROP_VALUE, value);
        return vc;
    }

V. Build the Property Set

Call this code from the processScanRequest method to build the property set.

    // Create an instance of a new Info Table using the standard "NamedVTQ" Data Shape
    InfoTable propertySet = new InfoTable(getDataShapeDefinition("NamedVTQ"));

    // Set name/value for Temperature using convenience function
    propertySet.addRow(createNumberValueCollection("Temperature", temperature));
    // Set name/value for Pressure using convenience function
    propertySet.addRow(createNumberValueCollection("Pressure", pressure));
    // Set name/value for TotalFlow using convenience function
    propertySet.addRow(createNumberValueCollection("TotalFlow", this._totalFlow));
    // Set name/value for InletValve using convenience function
    propertySet.addRow(createBooleanValueCollection("InletValve", inletValveStatus));
    // Set name/value for FaultStatus using convenience function
    propertySet.addRow(createBooleanValueCollection("FaultStatus", faultStatus));

    // Set the property set Info Table property
    base.setProperty("DevicePropertySet", propertySet);

VI. Update the Subscribed Properties

These two lines of code update the properties and events, actually sending the property set (containing all property updates) to the Platform.

    base.updateSubscribedProperties(15000);
    base.updateSubscribedEvents(60000);

Conclusion

Following these steps will enable the Edge to build a property set before sending any property updates to the Platform. The Platform can then rely on caching to process the business logic with no database dependency, which is faster and more efficient than retrieving each value from the database. Finally, the updates are still written to the database, so in the end, there is no functional difference between using a property set and binding each property individually. Please don't hesitate to comment here with any questions about this approach.
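
To make the flow of the data change subscription more concrete, here is a minimal sketch in plain JavaScript that can be run outside of ThingWorx. Everything in it is an assumption for illustration: the property set is modeled as an array of { name, value } objects rather than a NamedVTQ Info Table, the helper names (findRow, evaluateRule, cacheLastValues, onPropertySetChange) are hypothetical, and the rule itself (temperature above a limit and rising while the inlet valve is closed) is invented to show a multi-property rule using the Temperature and InletValve names from the Edge example.

```javascript
// Looks up one row of the property set by property name (stand-in for Find)
function findRow(propertySet, name) {
    for (var i = 0; i < propertySet.length; i++) {
        if (propertySet[i].name === name) {
            return propertySet[i];
        }
    }
    return null;
}

// Hypothetical multi-property rule: alert when the temperature is above the
// limit, still rising compared to the cached last value, and the valve is closed.
function evaluateRule(propertySet, cache, temperatureLimit) {
    var temperature = findRow(propertySet, "Temperature");
    var inletValve = findRow(propertySet, "InletValve");
    if (!temperature || !inletValve) {
        return false; // the rule needs both values from the same update
    }
    var previous = cache.TemperatureLast;
    var rising = previous === undefined || temperature.value > previous;
    return temperature.value > temperatureLimit && inletValve.value === false && rising;
}

// Stores every incoming value under "<name>Last" for the next update to use
function cacheLastValues(cache, propertySet) {
    propertySet.forEach(function (row) {
        cache[row.name + "Last"] = row.value;
    });
}

// One subscription handles the whole update: run the rule against the cached
// values first, then refresh the cache (no database read in between).
function onPropertySetChange(propertySet, cache, temperatureLimit) {
    var alert = evaluateRule(propertySet, cache, temperatureLimit);
    cacheLastValues(cache, propertySet);
    return alert;
}

var cache = {};
var update = [
    { name: "Temperature", value: 90 },
    { name: "InletValve", value: false }
];
console.log(onPropertySetChange(update, cache, 80));
console.log(cache);
```

Because all of the values arrive together, the rule never has to guess whether the other properties are current, which is the problem the single subscription is meant to solve.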

Feb 21, 2020

The New and Improved DGIS Guide to ThingWorx Development Written by Victoria Firewind of the IoT EDC The classic Developing Great IoT Solutions guide has been reskinned and revamped for newer versions of ThingWorx! The same information on how to build a quality IoT application is now available for versions of ThingWorx 9.1+, and now, a complete sample application is included to demonstrate these ideas. Find within the attached archive a PDF with high-level overview information on development and application design geared towards managers and business users, so that everyone can understand the necessary requirements, common terms, and key tips on how to ensure an application is scalable and maintainable right from the very start. Reduce your chances of running into issues between PoC and Go Live by reviewing this information today! Also find within this PDF a series of tutorials which teach not just how to use the ThingWorx software, but which also educate on how to make good application design choices. A basic rules engine for sending real-time notifications is included here, as well as a complete demo application which illustrates each concept in a real-world use case. This Coffee Machine Demo App relies upon the tutorial entities, which can also now be imported directly using the other XML files provided here. This ensures that anyone can review these concepts, regardless of how much time one can commit or how much knowledge one already has on the subject. This is a complex guide, and any issues, questions, or bugs found within can be reported right here on this thread. Happy developing from the IoT EDC!

Apr 13, 2021

Building More Complex Tests in JMeter

Overview

This is the second in a series of articles which help inform how to do user load testing in ThingWorx. This article picks up where the previous left off, continuing with the project created there. The screenshots do appear a little different here because a new “Look and Feel” was selected for the JMeter application (switched from “Metal” to “Windows Classic”) to provide more readable screenshots.

In this guide, we are going to make the very simple project more complex, working towards a better representation of a real load test. The steps below walk you through how to create and configure thread groups and parameterize the processes and procedures defined by each thread group.

Adding More Thread Groups

Within JMeter, thread groups are used to organize the HTTP requests in a test into various processes or procedures, such that different mashups (and all of the HTTP requests required on each) or processes can be executed simultaneously by different thread groups throughout the test. Varying the number of threads in a group is how to vary the number of users accessing that mashup during the test, a number which increases over time in accordance with the ramp up time. The thread group name will also show up in the Summary Report tab at the end of the test, making it easier to parse through and graph the results.

Start by renaming the existing thread groups so that their process or procedure names are recognizable at the end of the test:

Highlight the line which reads “HTTP(S) Test Script Recorder”. (Optional) Add an Include filter to only capture the URLs relevant to your application using the Requests Filtering tab. For example, with the escape character \ necessary for ‘.’, myhost.mycompany.mydomain becomes:

    myhost\.mycompany\.mydomain

Now record a new thread group by clicking the “Start” button:

Once the control box shows up in the top left corner, click to open a browser and access the ThingWorx Navigate application.
Then click on “View Parts List” or some other mashup:

Once the mashup loads, search using a string and/or wildcard, or click on one of the recent results if any exist:

Wait for the mashup to fully load with the details on that part or assembly, and then click “Stop” in the recording controller window:

All of the HTTP requests performed in the process of loading and using this mashup will be added to the JMeter project here:

Next, add a new thread group manually to the project:

Highlight the newly created “Thread Group” (default name) and rename it to something that relates to the nature of that process:

Drag and drop the new collection of requests so that it is considered a part of the new thread group:

Then drag the whole group up so that it is next to the other thread groups in the test:

In more complex projects, different thread groups may be added at different times, and each time, the service calls are all assigned an index (at the end of the request URL, for example: <request>-344). These indexes may not be unique depending on how and when the thread groups were created, especially in more complex tests. The easiest way to fix this issue is to save the test from the JMeter GUI, then open the JMX file in a text editor and perform a find and replace within the relevant section of text. This is usually done using a regular expression for the number. For example, if the request name indexes are numbered -500 through -525, a regular expression to increase them to -700 through -725 would be (in Notepad++):

    Find: -5([0-9])([0-9])
    Replace: -7\1\2

Note that if you do not use a Request Filter, sometimes the recorder will log URLs that are not part of the target application, like these “generate” samplers. These requests are typically made by the browser in the background to track performance, security and errors. These can be deleted:

At other times, you will be repeating steps that are already part of another thread group, for example: logging in.
This genidkey is a part of the login, as you can see if you look back at the login thread group. Because logging in is only necessary once, and it is assumed to be complete by the time the test starts on the second thread group, this entire section can be deleted. To see for sure if a request can be removed because it is called in a previous thread group, do a non-case-sensitive search for the name of the request. All of the requests found in this particular instance were performed in the previous thread group, so this entire category can also be deleted.

Another odd thing you may see (if you do not use a Request Filter with the recording feature) is a set of “blank” requests. The recorder isn’t sure what to call these “non-requests”, so anything like this that isn’t an actual URL within the target application should be deleted.

Static downloads should be disabled or deleted from scale testing since they are usually cached by the user browser client. In this ThingWorx example, there are static “MediaEntities” which can be deleted or disabled. Within the JMeter client there is no good way to highlight and reset them all at once, unfortunately. The easiest way to disable all of these at once is to open the JMX file in a text editor and use a regular expression search and replace to change “enabled=true” to “enabled=false” on those requests. Most text editors have examples on how to use regular expressions within their Help topics. The above example is for Notepad++.

Parameterize Thread Groups

Parameterization is usually the part of creating a JMeter test that takes the most effort and knowledge. Some requests will require the same information for every thread, information which can therefore be defined statically within the JMeter element rather than being parameterized. Some values used within the JMeter test script can be parameterized as inputs in the top level of the test controller, for example: Duration, RampUp time, ApplicationHost, ApplicationPort.
Other values may be unique to only one thread group and could be defined in a User Defined Variables element within that group controller. The value(s) used within a request can also be determined on the fly by the results of earlier requests within a thread group. These request results typically must be post-processed and parameterized for later thread elements to function correctly. The highest level values that are unique to each thread should be inputs from a CSV file that are passed into the threads as parameters, for example Username and Password. Data used within the test is usually parameterized in order to better emulate real world application use by multiple users.

In the following example, we will parameterize the number of users for each thread group by adding a user-defined variable. Start by selecting the new thread group and parameterizing the number of threads (i.e. the number of users accessing this mashup at a time during the test). The way to enter a variable is with syntax like this: ${searchandviewpartstructure_threads}

In this case, make this a user defined variable, or a variable for the whole project, by highlighting “Test Plan” and adding the information there. Begin looking at the samplers to see what types of things need to be parameterized in your test. Consider such things as: thread count (as shown above), ramp up time (also depicted above), duration, timings, roles, URL arguments, info table information, search result information, etc.
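As a rough illustration of the per-thread CSV inputs mentioned above, the sketch below generates a small data file with Username and Password columns plus a search term column. The user naming scheme, the placeholder passwords, the searchString column name and the wildcard prefixes are all assumptions made for the example; real values would come from the accounts and part numbers on the system under test.

```python
import csv
import io
import random

FIELDNAMES = ["Username", "Password", "searchString"]


def build_user_rows(count, seed=0):
    """Build one row of test data per simulated user (all values are placeholders)."""
    rng = random.Random(seed)  # seeded so the same file can be regenerated
    prefixes = ["GC", "WC", "A1", "00"]  # hypothetical part-number prefixes
    rows = []
    for i in range(1, count + 1):
        rows.append({
            "Username": f"loadtest_user{i:03d}",
            "Password": f"ChangeMe{i:03d}!",
            "searchString": rng.choice(prefixes) + "*",  # wildcard search term
        })
    return rows


def write_csv(rows, handle):
    """Write the rows with a header line, as a CSV data file would expect."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_csv(build_user_rows(3), buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the example self-contained; passing an open file handle instead would produce the file that the test reads its inputs from.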
Another example here parameterizes the search parameters for a query by adding an overall search string column to a CSV file (which can then be randomly generated by some other script). First, parameterize the body data of the request by highlighting the request, and changing the value of the desired field to something like this: ${searchString}

Next, define the parameter under the Test Plan and set a default value. Now define the searchString column again as part of the CSV Data Set. Now it can be varied simply by providing different pseudo-random values with wildcards and/or known values in the CSV file.

Post Processors and Extractors

Most JMeter load tests become more complex when the results of one request are sent as parameters into later requests. This is done in JMeter by using Post Processors (Extractors), tools which facilitate extracting information out of the request results so it can be assigned to JMeter parameters. There are many different types of extractors which can process the results of previous requests:

CSS Selector Extractor – commonly used extractor for values returned as html attributes
JSON Extractor – processes JSON objects using JSON Path expressions
BeanShell Post Processor – facilitates using code scripts to process return text when needed
Regular Expression Extractor – JMeter supports use of regular expressions on request results

The JSON Extractor can be used to find and store information like the partOID number for a Windchill part as a parameter in JMeter, which can then be used to build more realistic workflows within the JMeter test. The example below steps you through setting up a JSON Post Processor. Start by right-clicking the request that contains the results of our search. Then click “Add” > “Post Processors” > “JSON Extractor”. The extractor will now show up under that request as a sub-menu item. Select it, and name the variable something easy to reference.
For the JSON Path expressions, pull the object number or some other identifying characteristic out of the search results: $.rows[0].objNumber for example. Another option would be to take information like the partOID number and send that into the search string field, by defining both as properties and having one refer to the other. To pull the partOID out, use a Regular Expression Extractor.

Another thing to parameterize is the summary report result file name. Adding in the number of users and ramp up time produces result files that are easier to find and reference later. We will cover generating and reviewing Summary Reports in full in the next article in this series.

Conclusion

In this article we saw how to create new thread groups, remove extraneous requests from those groups, and reduce the overall ambiguity of which thread groups represent which processes or mashup calls. We also covered how to parameterize the individual requests as well as the summary report. Note that things like the Windchill URL and hostname, search parameters and part IDs, timings, durations, offsets, and anything else that influences the results of the test should not be hard-coded. It is better to create variables for these things to ensure that all of the various simulated activities are configured in the exact same way every time. That way, the system can be tested again and again under various strains and loads until the capabilities of the application are verified.
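To make the JSON Path expression $.rows[0].objNumber from the extractor example more concrete, here is a small sketch of the same lookup done with plain dictionary and list access. The response body is invented for illustration (a real search result will carry many more fields), and the fallback value mirrors the idea of giving the extractor a default for when nothing matches.

```python
import json

# Hypothetical search response; only the shape (rows -> objNumber) matters here.
sample_response = json.loads(
    '{"rows": ['
    '{"objNumber": "0000000042", "name": "Example part"},'
    '{"objNumber": "0000000043", "name": "Another part"}'
    ']}'
)


def first_object_number(response, default="NOT_FOUND"):
    """Equivalent of $.rows[0].objNumber, with a fallback when there are no rows."""
    rows = response.get("rows") or []
    if not rows:
        return default
    return rows[0].get("objNumber", default)


print(first_object_number(sample_response))
```

Checking the fallback path is worthwhile in a load test, because a search that returns no rows would otherwise feed an empty value into every later request that depends on it.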

Jul 30, 2020

May 15, 2020

New Scenario Using Multi-Kepware for Asset Monitoring in Connected Factories A new scenario has been completed for Connected Factory implementations, furthering the IOT EDC's goal of providing a reference library of ThingWorx performance. This scenario builds upon the first, with additional tests being performed to demonstrate the capabilities of multiple Kepware Servers running side-by-side. Horizontal scaling is very common for multi-line factory implementations, so be sure to check out the new scenario in this ever-expanding benchmark document. Note that tests below 10,000 writes per second were not repeated with multiple Kepware Servers, since there is little reason to desire such a configuration in implementations that small. ThingWorx deployment sizing was also held constant throughout these tests to demonstrate the limits of a given configuration. Changes that may improve the results of a failed test (such as adding CPUs or Memory) will be mentioned but not validated as part of this benchmark. Let us know about your applications and how they compare with the data shared here. Happy developing!

Sep 30, 2020

ThingWorx and Azure IoT Hub Benchmark This Azure IoT Hub Reference Benchmark showcases the capabilities of ThingWorx and Microsoft Azure IoT Hub, a cloud-hosted solution backend that facilitates secure and reliable communication between an IoT application such as ThingWorx and the devices it manages. By making use of this third party tool, remote monitoring with ThingWorx has never been simpler. In this benchmark, PTC verifies the reliability and scalability of ingesting data through the Azure IoT Hub into the Azure IoT Hub Connector(s) and ThingWorx Foundation. The preliminary version of this document focuses primarily on how the Azure IoT Hub’s capabilities modify and/or enhance the data ingestion and device management capabilities of ThingWorx. Find the benchmark document attached here, and stay tuned for more reference benchmarks to come!

Aug 21, 2020

ThingWorx Monitoring and Alerting, Part 1 Using Prometheus and Grafana By Tori Firewind, IoT EDC Introduction and Getting Started As ThingWorx has become a more mature product during the lifetime of the IoT EDC, so too have our dev ops recommendations. As we’ve stated throughout many posts now, testing is a key part of ensuring enterprise readiness, and it occurs at every stage of the process: from unit testing to preserve individual service logic, to integration tests which preserve the functionality of the application as a whole, to user and edge load testing and user experience testing, which ensure enterprise readiness. So testing is a critical component, but the process of dev ops never stops. In order to effectively test the system, a comprehensive monitoring solution is also required. Once the application is tested and the changes pushed into production, there is no knowing with certainty that everything will run smoothly indefinitely. Random spikes in usage, server bandwidth or availability, any unforeseeable factors like these can come along and cause issues for a system. If these issues aren’t detected and addressed early, then they can very rapidly morph into much larger problems: outages, data loss, inflated data tables which are hard to revert due to their size. It is critical to detect performance issues on a system as early as possible, to have as much information as is necessary to figure out where the problem is heading, and what may have started it. Monitoring is key to a healthy system. CI/CD stands for “Continuous Integration/Continuous Deployment”, a never-ending cycle of improvement. Testing just once before the initial go live isn’t enough. Each system should have automated tests that run continuously, as well as monitors and alerts which reveal problems sooner. Diagnostic tools play a role as well, being the bridge from the end of the dev ops process cycle back to the beginning (monitoring into planning). 
A good CI/CD dev ops process will ensure that problems are found earlier, fixed more rapidly, and fixed for everyone using the system. In a fully mature dev ops pipeline, issues are anticipated, discovered and researched before they become production outages or critical issues. These investigations or testing follow-ups produce development tasks (usually bugs, but also features at times) which then start the dev ops cycle all over again. This is why a good, efficient dev ops pipeline is needed, one which allows changes to quickly and safely go from development to production. This is also why diagnostic tools play a role in the monitoring piece of the dev ops process. They are the bridge between monitoring and planning. Tools like Dynatrace can be configured to provide call stacks and take thread dumps when issues start to occur, before the system is performing so poorly it needs a restart, which happens automatically in a cluster and can clear out any trace of the issue. Thread dumps are often necessary to diagnosing the root cause of the issue (to permanently fix it), and doing so quickly ensures application stability and availability. That is, after all, the purpose of the dev ops process. Diagnostics is therefore an equally important piece of the dev ops Figure-8-shaped pie, and one which deserves its own spotlight in an article to come. Every piece of the dev ops process must be viewed as equally important in its own way, lest the dev ops cycle get hung up on bottlenecks of its own. A safe and stable system is not one which never experiences issues, it is one which has a good, efficient plan in place to handle recovery and prevention of repetition. A wholesome dev ops process is a happy dev ops process. The Monitoring Stack There are many monitoring options available, but in our experience one of the easiest and most effective monitoring stacks to use with ThingWorx is Prometheus for metrics gathering with Grafana for metrics analysis and review. 
In a mature monitoring stack, Telegraf is also commonly installed on each VM/host to gather the system metrics (like CPU and Memory usage, things we’ve stated are good metrics of system performance and stability in past articles on scale and size testing) and output them in Prometheus format. Prometheus is a highly scalable open-source monitoring framework that contains out of the box monitoring and alert capabilities for Kubernetes-based deployments (not covered in this article). Using Prometheus is very simple because the ThingWorx application exposes a metrics endpoint which is formatted directly for use by Prometheus. There is also built-in alerting in Prometheus, but not the ability to form dashboards for reviewing data or screenshotting it for documentation purposes. That’s where Grafana comes into play. Grafana has a preconfigured Prometheus-type data source and many preconfigured dashboard templates for various applications and services. Telegraf is also easily imported into Grafana, as is shown in the section below. The Prometheus targets in the larger diagram are expanded out on the left. For each target, some tool exports the data in a syntax which Prometheus can scrape. For VMs, this can be Telegraf, for Kubernetes, the Node Exporter. JVM has a JMX Exporter, and other tools like CX Server use Graphite. Many apps already have a Prometheus endpoint built-in, like ThingWorx and Zookeeper. Telegraf is not strictly necessary; the node exporter can also be used on VMs, but Telegraf is the more common choice since it is a more mature dev ops tool. Once Prometheus is scraping the targets, alerting on them can be done with OOTB Prometheus functionality, and dashboards for monitoring can be made easily in Grafana (with built-in support as well). This stack does not include the diagnostics piece, something which triggers thread dumps or the like when issues do occur. There are too many ways to conduct a successful diagnostic piece to cover here. 
How to Get Started

Getting started monitoring a ThingWorx application is incredibly easy in the latest versions. Simply open up a browser, and type in the ThingWorx URL, followed by “/Metrics”. At this endpoint, there is a specially formatted response, containing subsystem and service data, that can be read automatically by the Prometheus monitoring software. In addition to the application metrics, Prometheus can be configured to collect metrics from a node exporter at the (virtualized) operating system or container (Kubernetes) level as well.

If you haven’t already, install Grafana, install Telegraf as a service, and install Docker Desktop. These are the tools required (in addition to ThingWorx of course) to set up a simple sandbox system for familiarization with the monitoring stack recommended by PTC. The easiest way to try Prometheus on a local Windows instance is to use Docker. The command for that will be found below, but first open up Docker Desktop to set contextual parameters that the command line will need.

Then, modify the configuration file for Telegraf or create one (called telegraf.conf in the same folder as the exe file), and put the following into the file (or uncomment it; the default config file has thousands of lines, so just search for “prometheus”):

# Output plugin
[[outputs.prometheus_client]]
  listen = "0.0.0.0:9125"

Alternatively, install the Prometheus Node Exporter tool, which will likely require some additions to the Prometheus config file (not covered here) which we are about to create.

Then, create a configuration file (called prom_config_localhost_scraper.yml in the command to come), and add the following (assuming a standard localhost installation of ThingWorx):

# my global config
global:
  scrape_interval: 45s
  evaluation_interval: 30s
  scrape_timeout: 30s  # overrides the Prometheus default of 10s

rule_files:
  - prom_config_rules.yml

scrape_configs:
  - job_name: thingworx
    static_configs:
      - targets: ['host.docker.internal:8080']
    basic_auth:
      username: "Administrator"
      password: "admin!123456789"
    metrics_path: /Thingworx/Metrics
    scheme: http
    params:
      x-thingworx-session:
        - "false"
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
  - job_name: Telegraf
    # If telegraf is installed, grab stats about the local
    # machine by default.
    static_configs:
      - targets: ['host.docker.internal:9125']

This example configuration uses host.docker.internal instead of localhost as the server target for ThingWorx because ThingWorx is running outside of the Docker container which contains Prometheus. This yml file configures Prometheus to monitor both ThingWorx and itself, as well as the server metrics coming from Telegraf (as long as they are configured to push). It’s a sandbox-only configuration, really, as you wouldn’t want to use the Administrator user, or have the password printed in plain text in the config file in a real system. Also note the need for the x-thingworx-session parameter: without it, a new session would be spawned on every scrape (every 30s or so, whatever the scrape interval is), and these runaway sessions would cause memory issues over time, so we don’t want to use sessions here.

The rules file given here (prom_config_rules.yml) needs to be created separately. This is where all of the alert rules will be defined. These rules determine whether an alert state is happening, but without configuring the alert manager, there won’t be any notification. That isn’t covered here but is covered extensively in the Grafana docs. Here is an alert example:

groups:
  - name: alert.rules
    rules:
      # Alert when used memory stays above the threshold for at least 1 second.
      - alert: HighMemory
        expr: mem_used > 14000000
        for: 1s
        labels:
          severity: page
        annotations:
          summary: "High Memory"
          description: "Localhost Memory Usage is High"

Now, save these files and use Powershell to run the Docker container:

docker run -p 9090:9090 -v C:\<path_to_document>\prom_config_localhost_scraper.yml:/etc/prometheus/prometheus.yml prom/prometheus

It should download Prometheus and install it in that container (if this is the first time), allowing you to very rapidly deploy it to an endpoint of localhost:9090 by default. If there is an error like the one shown below, this means that you forgot to start Docker Desktop (the application) before opening Powershell. Docker Desktop sets system parameters required for containers to run in a command line (in Linux, it should work if Docker is installed for use by the command line, simple as that).

The localhost endpoints are accessible in a browser. ThingWorx defaults to the localhost:8080 endpoint. Prometheus defaults to localhost:9090. Telegraf is on port 9125. Open any of these in a browser tab to see the full monitoring stack. You can see easily if Prometheus is working by clicking “Status” > “Targets” at localhost:9090. If all of the targets appear as blue and say “last scrape” and a time stamp, then they’re working as expected. If they don’t, ensure you have the right ports, that there aren’t any firewall issues (if things aren’t all on localhost), and that everything is running without errors.

The last step in the process here is to install a dashboard tool like Grafana. Once this is installed and running on localhost:3000 (by default), you can display the data from Prometheus with a few configuration steps in the Grafana UI. Hover over the settings icon in the bottom left of the screen, and then click on “Data sources”. Select the “Add data source” button, and then click on Prometheus.
You have to type the URL again (localhost:9090), but most of the defaults will be ok here, and all you have to do is click “Save and test”. Now both targets should appear within Grafana, with their metrics showing up throughout the Grafana UI. This data source is what allows for the building of monitoring dashboards.
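The “Status” > “Targets” check described above can also be scripted, which is handy once the sandbox is restarted often. The sketch below is only an illustration: it assumes the sandbox address localhost:9090 from this walkthrough, reads the target list from the Prometheus HTTP API at /api/v1/targets, and reports any target whose health is not “up”. The function names are our own.

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url="http://localhost:9090"):
    """Download the current target list from the Prometheus HTTP API."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=10) as response:
        return json.load(response)


def unhealthy_targets(payload):
    """Return (job, health, last error) for every active target that is not up."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        (
            target.get("labels", {}).get("job", "?"),
            target.get("health"),
            target.get("lastError", ""),
        )
        for target in active
        if target.get("health") != "up"
    ]
```

With the container from this walkthrough running, calling unhealthy_targets(fetch_targets()) should return an empty list when the thingworx, prometheus and Telegraf jobs are all being scraped successfully.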

Dec 23, 2022

5 Common Mistakes to Developing Scalable IoT Applications by Tori Firewind and the IoT EDC Team Introduction To build scalable applications, it’s necessary to identify common mistakes and avoid them at the early stages of development. In an expert session this past month, the PTC Enterprise Deployment Team elaborated on why scalability is important and how to avoid the common development pitfalls in IoT. That video presentation has been adapted here for visual consumption of the content as well. What is Scalability and Why Does it Matter Enterprise ready applications can scale and easily be maintained, which is important even from day 1 because scalability concerns are the largest cause for delays to Go Lives. Applications balance many competing requirements, and performance testing is crucial to ensure an application is ready for Go Live. However, don't just test how many remote assets can connect at once, but also any metrics that are expected to increase in time, like the number of remote properties per thing, the frequency of reporting from those properties, or the number of users accessing the system at once. Also consider how connecting more assets will affect the user experience and business logic, and not just the ability to ingest data. Common Mistake 1: Edge Property Updates Because ThingWorx is always listening for updates pushed from the Edge and those resources are always in use, pulling updates from the Foundation side wastes resources. Fetch from remote every read is essentially a round trip, so it's slower and more memory intensive, but there are reasons to do it, like if the quality tag is needed since the cache doesn't store it. Say a property is pushed at 11:01, and then there's a network issue at 11:02. If the property is pulled from the cache, it will pull the value sent at 11:01 without any indication of there being a more recent value on the Edge device. 
Most people will use the default options here: read from server cache, which relies on the Edge to push updates, and the VALUE push type, and configuring a threshold is a good idea as well. This way, only those property updates which are truly necessary are sent to the Foundation server. Details on property aspects can be found in KCS Article 252792. This is well documented in another PTC Community post. This approach is necessary and considered a best practice if there is event logic which depends on multiple properties at once. Sending all of the necessary properties to determine if an event should fire in one Infotable ensures there is no need to query the database each time a property update comes in from the Edge, which ensures independent business logic and reduces the load on the database to improve ingestion performance. This is a very broad topic and future articles will address it more specifically. The When Disconnected property aspect is a good way to configure what happens with Edge property values in a mass disconnect scenario. If revenue depends on uptime, consider losing any data that changes while a device is disconnected. All of the updates can be folded into a single value if the changes themselves aren't needed but an updated value is needed to populate remote properties upon reconnect. Many customers will want to keep all of their data, even when a device is offline and use data stores. In this case, consider how much data each Edge device can store (due to memory limitations on the devices themselves), and therefore how long an outage can last before data is lost anyway. Also consider if Foundation can handle massive spikes in activity when this data comes streaming in. Usually, a Connection Server isn't enough. Remember that the more data needs to be kept, the greater the potential for a thundering herd scenario. Handling a thundering herd scenario goes beyond sizing considerations. 
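One common way to soften a thundering herd is for each device to wait a randomized, growing delay before it tries to reconnect. The sketch below illustrates that idea in general terms and is not tied to any particular Edge SDK; the base delay, the cap and the function name are assumptions chosen for the example.

```python
import random


def reconnect_delay(attempt, base_seconds=5.0, cap_seconds=600.0, rng=random):
    """Pick a random wait before reconnect attempt number `attempt` (0-based).

    The upper bound doubles with each failed attempt until it reaches the cap,
    and the actual delay is drawn uniformly between zero and that bound, so
    devices that dropped at the same moment spread out their reconnects.
    """
    exponent = min(attempt, 16)  # keep the power of two small
    ceiling = min(cap_seconds, base_seconds * (2 ** exponent))
    return rng.uniform(0.0, ceiling)


# Example: the delays one device might use across its first five attempts.
device_rng = random.Random()
print([round(reconnect_delay(n, rng=device_rng), 1) for n in range(5)])
```

Because every device draws its own random value, the reconnects arrive as a ramp rather than a single spike, which gives the ingestion queue and the business logic time to keep up.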
It is absolutely crucial to randomize the delay each device will wait before attempting to reconnect. It should be considered a requirement to have the devices connect slowly and "ramp up" over time for multiple reasons. One is that too much data coming in too fast could overwhelm the ingestion queue and result in data loss. Another is that the business logic could demand so many system resources that the Foundation server crashes again and again and cannot be recovered. Turning off the business logic isn't possible if the downtime is unexpected, so definitely rely instead on randomized reconnection times for Edge devices.

Common Mistake 2: Overlooking Differences in HA

To accommodate a shared thing model across many servers, changes had to be made in how the thing model is stored and the model tree is walked by the Foundation servers. Model information is no longer cached at the Thing level, and the model tree is therefore walked every time model information is needed, so the number of times a Thing is directly referenced within each service should be limited (see the Help Center for details). It's best to store whatever information is needed from a Thing in an Infotable, making the Things[thingName] reference a single time, outside of any loops. Storing the property definitions outside of the loop prevents the repetitious Thing references within the service, which otherwise would have occurred twice for each property (for both the name and the description), and then again for every single property on the Thing, a runtime nightmare.

Certain states previously held in memory are now shared across the cluster, like property values, Thing states, and connection statuses. Improvements have been made to minimize the effects of latency on queries, like how they now only return property values on associated Thing Shapes or Thing Templates.
Filtering for properties on implementing Things is still possible, but now there is a specific service to do it, called GetThingPropertyValues (covered in detail in the Help Center). In the script shown above, the first step is a query to get the names of all implementing things of a particular Thing Shape. This is done outside of any loops or queries, so once per service call. Then, an Infotable is built to store what would have been a direct reference to each thing in a traditional loop. This is a very quick loop that doesn't add much by way of runtime since it is all in memory, with no references to the thing model or the database, instead using the results of the first query to build the Infotable. Finally, this thing reference Infotable is passed into the new service GetThingPropertyValues to retrieve all of the property info for all of these things at once, thereby only walking the thing model once. The easiest mistake people would make here is to do a direct thing reference inside of a loop, using code like Things[thingName].Get() over and over again, thereby traversing the thing model repeatedly and adding a lot of runtime. QueryImplementingThingsOptimized is another new service with new parameters for advanced configuration. Searches can now be done on particular networks or to particular depths, and there's an offset parameter that allows for a maximum number of items to be returned starting at any place in the list of Things, where previously if you needed the Things at the end of the list, you had to return all of the Things. All of these options are detailed in the Help Center, as well as the restrictions listed in the image above. Common Mistake 3: Async Service Misuse Async services are sometimes required, say if a user has to trigger many updates on many remote things at once by the click of a button on a mashup that should not be locked up waiting for service completion. 
Too many async service calls, though, result in spikes in activity and competition for resources. To avoid this mistake, do not use async unless strictly necessary, and avoid launching too many async threads in parallel. A thread dump will show how many threads there are and what they are doing.

Common Mistake 4: Thread Pool Overload

Adding more threads to the pool may be beneficial in certain circumstances, like if the threads are waiting on other resources to complete their tasks, look things up in the database (I/O), or unlock data that can only be accessed one thread at a time (property writes). In this case, threads are waiting on other resources, and not the CPU, so adding more threads to the pool can improve performance. However, with too many threads, performance degradation will occur due to increased contention, wasted CPU cycles, and context switches. To check if there are too many or not enough threads in the pool, take thread dumps and time the completion of requests in the system. Also watch the subsystem memory usage, and note that the size of the queue should never approach the max. Also consider monitoring the overall performance of the system (CPU and Memory) with a tool like Grafana, and remember that a good performance test properly exercises all of the business logic and induces threads in a similar way to real world expectations.

Common Mistake 5: Stream Etiquette

Upserts, or updates to database tables, are expensive operations that can interfere with ingestion if they are performed on the wrong tables. This is why Value Stream and Stream data should never be updated by end users of the application. As described in the DGIS document on best practices, aggregation is the key to unlocking optimal performance because it reduces the size of database tables that require upserts. Each data structure shown here has an optimal use in a well-designed ThingWorx application.
Data Tables are great for storing overview information on all of the Things in one view, and queries on this data source are the fastest. Update this data source as often as possible (by timer), allowing enough time for updates to be gathered and any necessary calculations made. Data Tables can also be updated by end users directly because each row locks one at a time during updates. Data Tables should be kept as small as possible to improve performance on mashups, so for instance, consider using one to show all Things per region if there are millions of Things. Roll up information is best stored here to avoid calculations upon mashup load, and while a real-time view of many thousands of things at once is practically impossible, this option allows for a frequently updated overview of many things, which can also drill down to other mashup views that are real-time for one Thing at a time.

Value Streams are best used for data ingestion, and queries to these should be kept to a minimum, largely performed by the roll up logic that populates the Data Tables mentioned above. Queries that chart all of the data coming in are best utilized on individual Thing views so that only a handful of users are querying the same data sources at a time. Also be sure to use start and end dates and make use of the "source" field to improve query performance and create a better user experience. Due to the massive size of the corresponding database tables, it's best to avoid updating Value Streams outside of the data ingestion process altogether.

Streams are similar, but better for storing aggregated, historical data. Usually once per day or per week (outside of business hours if possible), Value Stream data will be smoothed or reduced into fewer data points and then stored into Streams. This allows for data to be stored for longer periods of time on the server without using up as much memory or hurting query performance.
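The reduction of Value Stream data into fewer points can be pictured with a small, generic example. The sketch below averages raw readings into one point per day; it is plain Python rather than a ThingWorx service, and the function name and sample readings are invented for illustration. In an application, the equivalent logic would run in the scheduled roll up that writes into a Stream.

```python
from collections import defaultdict
from datetime import datetime


def daily_averages(points):
    """Reduce (timestamp, value) readings to one averaged point per calendar day."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        buckets[timestamp.date()].append(value)
    # Sort by day so the output is in chronological order.
    return [(day, sum(values) / len(values)) for day, values in sorted(buckets.items())]


# Hypothetical raw readings spanning two days.
readings = [
    (datetime(2022, 12, 1, 8, 0), 1.0),
    (datetime(2022, 12, 1, 20, 0), 3.0),
    (datetime(2022, 12, 2, 9, 30), 10.0),
]
for day, average in daily_averages(readings):
    print(day, average)
```

Averaging is only one possible reduction; keeping the minimum, maximum or last value per bucket follows the same pattern, and the bucket size can shrink to a minute for shorter time spans.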
Then the high volume ingested data sources can be purged frequently, as discussed below.

Infotables are the most memory intensive, and are really designed to hold only a small number of rows at a time, usually to facilitate the business logic. Sometimes they will be stored in Streams or Data Tables if they aren't expected to grow larger (see the DGIS Coffee Machine App for an example). Infotables should never be logged; if they are used to transmit Edge property updates (like in the Property Set Approach), they should be processed into other logged (usually local) properties.

Referring to the properties themselves is how to get real-time information on a mashup, say by using the GetProperties service and its auto-update option, which relies on internal websockets. This should be done on individual Thing views only, and sizing considerations need to be made if there will be many of these websockets open at once, say if there are many end users all viewing real-time data at the same time.

In the newer versions of ThingWorx, the stream data processing settings cannot be updated directly, so find the system object called ThingWorxPersistenceProvider and use the service UpdateStreamDataProcessingSettings. ThingWorx Foundation processes data received from remote devices in batches in order to manage the data flow and reduce database churn. All of these settings configure how large those batches are and how frequently they are flushed to the database (detailed in full in KCS Article 240607). This is very advanced configuration that heavily depends on use case and infrastructure, but some information applies to most deployments: adjusting the scan rate is usually not beneficial; a healthy queue should never approach the max limit; and defaults differ by database because the databases function differently. InfluxDB generally works better when there are fewer processing threads and higher numbers of things per thread, while PostgreSQL can have a lot of threads, preferably with fewer things per thread.
That's why the default values shown here are given as the same number of threads (and this can be changed), but Influx has a larger block size and size threshold because it can handle more items per thread.

Value Streams ingest all data into the Foundation server, and so the database tables that correspond with these data sources grow very large, very quickly and need to be purged often and outside of business hours, usually once a day or once per week. That's why it's important to reduce the data down to fewer points and push them into Streams for historical reference. For a span of years, a single point per day might be enough; for a span of hours, consider one data point per minute. Push aggregated data into Streams and then purge the rest as soon as it is no longer needed.

In Conclusion

Jun 29, 2021

Thread Safe Coding, Part 2: The Database Locker Approach and Comparison

Written by Desheng Xu and edited by @vtielebein

Overview

This is the second post on this topic, describing an alternative to the approach that requires the Java extension. The demo use case here is the same as in the previous post, and there is a section at the end comparing the two approaches.

Database Locker for Thread Safe Coding

The database locker is an advanced topic, so some experience with the database thing is assumed. The following steps demonstrate how to be thread safe with a database thing.

Create New Database Instance, and New Table for counter

It is strongly recommended that a new database instance be created outside of the ThingWorx database schema. This guide will NOT include instructions to create the new database instance. Use the following SQL commands to create a new table:

    DROP TABLE IF EXISTS counters;
    CREATE TABLE counters (
        name VARCHAR(100) UNIQUE,
        value INTEGER NULL,
        PRIMARY KEY (name)
    );
    INSERT INTO counters VALUES ('DemoCounter', 0);

This will create a new table called counters and initialize the first counter, called DemoCounter, with the value 0.
Create a Function to Increase and Return the New counter Value

Use the following sample code to create a table lock function:

    CREATE OR REPLACE FUNCTION IncreaseCounter(counter_name VARCHAR(100), OUT newvalue INTEGER) AS $$
    BEGIN
        -- Block all other access to the table until this transaction ends
        LOCK TABLE counters IN ACCESS EXCLUSIVE MODE;
        SELECT (SELECT value FROM counters WHERE name = $1) + 1 INTO newvalue;
        UPDATE counters SET value = newvalue WHERE name = $1;
    END;
    $$ LANGUAGE plpgsql;

Or use the following SQL command to create a new row level locker function:

    CREATE OR REPLACE FUNCTION IncreaseCounter(counter_name VARCHAR(100), OUT newvalue INTEGER) AS $$
    BEGIN
        -- Lock only the matching row until this transaction ends
        SELECT value INTO newvalue FROM counters WHERE name = $1 FOR UPDATE;
        newvalue := newvalue + 1;
        UPDATE counters SET value = newvalue WHERE name = $1;
    END;
    $$ LANGUAGE plpgsql;

Create a Database Thing

Create a thing with the template "database" within ThingWorx, and use the PostgreSQL Driver to connect to the new database instance created above.

Create New Services in the Database Thing

The service IncreaseCounterDB would be a SQL Query service:

    SELECT * FROM public.IncreaseCounter([[counter_name]]);

counter_name would be the input parameter, a STRING which is marked as required.

The service GetCounterDB would be another SQL Query service:

    SELECT value FROM public.counters WHERE name=[[counter_name]] LIMIT 1;

counter_name would be another input parameter, a STRING which is also marked as required.

The service ResetCounterDB would be a SQL Command service:

    UPDATE public.counters SET value = 0 WHERE name=[[counter_name]];

counter_name is yet another input parameter, also a STRING and also required.

Wrap the Database Thing Service

The above database thing services return an InfoTable rather than an integer. If it's inconvenient to use an InfoTable, wrap the service up into a local Javascript service and return an integer value.
The service IncreaseCounter wraps IncreaseCounterDB and returns an integer value:

    // result: INFOTABLE dataShape: ""
    var query_result = me.IncreaseCounterDB({
        counter_name: 'DemoCounter' /* STRING */
    });
    var result = query_result.rows[0]["newvalue"];

Similarly, wrap GetCounterDB in GetCounter:

    // result: INFOTABLE dataShape: "SingleIntegerDatashape"
    var query_result = me.GetCounterDB({
        counter_name: 'DemoCounter' /* STRING */
    });
    var result = query_result.rows[0]["value"];

And ResetCounterDB in ResetCounter:

    // result: NUMBER
    var query_result = me.ResetCounterDB({
        counter_name: 'DemoCounter' /* STRING */
    });
    var result = 0;

Run the Test Again

If necessary, head back to the previous post to obtain the tool. Then just change the end point and run a new test:

    {
        "host": "twx85.desheng.io",
        "port": 443,
        "protocol": "https",
        "endpoint": "/Thingworx/Things/DatabaseDemo/services/IncreaseCounter",
        "headers": {
            "Content-Type": "application/json",
            "Accept": "application/json",
            "AppKey": "5cafe6eb-adba-41df-a7d6-4fc8088125c1"
        },
        "payload": {},
        "round_break": 50000,
        "req_break": 0,
        "round_size": 50,
        "total_round": 20
    }

Validate the Result

After running the test, execute the service GetCounter to validate the result.

Overall Performance Comparison

The Java Extension performance looks the best here, but the database row lock will perform better if there are multiple counters.

InfoTable Type Property

InfoTable properties have the same thread-safe challenges discussed previously, but they also have some additional challenges due to the way data change events are triggered. This is outside of the scope of this document, but it is worth a very brief mention here. In general, the data change event for an InfoTable fires when the reference to the table is updated, and not the contents of the table. If the values of an InfoTable are updated directly, say by adding or removing a row, then the data change event will not be triggered because the value has technically not changed.
Instead, the InfoTable has to be cloned, then modified, and then assigned back to the Thing so that the reference changes as well. Such additional considerations must be made when using property types other than those shown here.
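The reference-versus-contents behavior can be imitated in plain JavaScript. The sketch below is not ThingWorx code: `WatchedProperty` is a hypothetical stand-in for a Thing property whose change event fires only when a different object is assigned, and the tables are ordinary objects rather than real InfoTables.

```javascript
// Hypothetical stand-in for a property that fires a change event only when
// the assigned reference differs from the current one.
class WatchedProperty {
    constructor(initial) {
        this.current = initial;
        this.changeEvents = 0;
    }
    get value() {
        return this.current;
    }
    set value(next) {
        if (next !== this.current) {
            this.changeEvents += 1; // reference changed: event fires
        }
        this.current = next;
    }
}

const table = new WatchedProperty({ rows: [{ name: "DemoCounter", value: 0 }] });

// In-place edit: contents change, the reference does not, so no event fires.
table.value.rows.push({ name: "OtherCounter", value: 5 });
const eventsAfterInPlaceEdit = table.changeEvents;

// Clone, modify, assign back: the new reference triggers the event.
const clone = { rows: table.value.rows.map((row) => ({ ...row })) };
clone.rows.push({ name: "ThirdCounter", value: 7 });
table.value = clone;
const eventsAfterReassign = table.changeEvents;

console.log(eventsAfterInPlaceEdit, eventsAfterReassign);
```

In this imitation, the in-place edit leaves the event count unchanged, while the clone-and-assign sequence increments it once.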

Dec 10, 2020

Update to Connected Factories Benchmark Scenario Three: One Kepware Server in ThingWorx 9.0 The goal of this scenario is to confirm the same performance in ThingWorx 9.0 as seen in scenario one, where one Kepware Server represented a single factory in version 8.5. Matrix 1 - Slow (15s slow properties, 1s fast) The lower frequency tests performed the same in 9.0. Even the 10k ingestion test, which lies very close to the boundary for a single Kepware Server, passed with no errors. Matrix 2 – Fast (5s slow properties, 500ms fast) These showed similar results, but the 500 thing, 50-10 property test had data loss in 9.0. However, the write rate is much higher than PTC recommends for a single Kepware Server anyway. Matrix 3 – Faster (1s slow properties, 200ms fast) The fastest tests had similar results as well. The larger tests ran with more success with two Kepware Servers (data not shown here). Conclusions ThingWorx 9.0 is similarly capable of ingesting data using Kepware Server. A single instance can still achieve up to 10k wps. Future scenarios will now make use of ThingWorx 9.0. Download the updated draft here!

Oct 22, 2020

Distributed Timer and Scheduler Execution in a ThingWorx High Availability (HA) Cluster Written by Desheng Xu and edited by Mike Jasperson Overview Starting with the 9.0 release, ThingWorx supports an “active-active” high availability (or HA) configuration, with multiple nodes providing redundancy in the event of hardware failures as well as horizontal scalability for workloads that can be distributed across the cluster. In this architecture, one of the ThingWorx nodes is elected as the “singleton” (or lead) node of the cluster. This node is responsible for managing the execution of all events triggered by timers or schedulers – they are not distributed across the cluster. This design has proved challenging for some implementations as it presents a potential for a ThingWorx application to generate imbalanced workload if complex timers and schedulers are needed. However, your ThingWorx applications can overcome this limitation, and still use timers and schedulers to trigger workloads that will distribute across the cluster. This article will demonstrate both how to reproduce this imbalanced workload scenario, and the approach you can take to overcome it. Demonstration Setup For purposes of this demonstration, a two-node ThingWorx cluster was used, similar to the deployment diagram below: Demonstrating Event Workload on the Singleton Node Imagine this simple scenario: You have a list of vendors, and you need to process some logic for one of them at random every few seconds. First, we will create a timer in ThingWorx to trigger an event – in this example, every 5 seconds. Next, we will create a helper utility that has a task that will randomly select one of the vendors and process some logic for it – in this case, we will simply log the selected vendor in the ThingWorx ScriptLog. Finally, we will subscribe to the timer event, and call the helper utility: Now with that code in place, let's check where these services are being executed in the ScriptLog. 
Look at the PlatformID column in the log. Notice that the Timer and the helper utility are always running on the same node, in this case Platform2, which is the current singleton node in the cluster. As the complexity of your helper utility increases, you can imagine how workload will become unbalanced, with the singleton node handling the bulk of this timer-driven workload in addition to the other workloads being spread across the cluster. This workload can be distributed across multiple cluster nodes, but a little more effort is needed to make it happen.

Timers that Distribute Tasks Across Multiple ThingWorx HA Cluster Nodes

This time, let's update our subscription code, using the PostJSON service from the ContentLoader entity to send the service requests to the cluster entry point instead of running them locally.

    const headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "appKey": "INSERT-YOUR-APPKEY-HERE"
    };
    const url = "https://testcluster.edc.ptc.io/Thingworx/Things/DistributeTaskDemo_HelperThing/services/TimerBackend_Service";
    let result = Resources["ContentLoaderFunctions"].PostJSON({
        proxyScheme: undefined /* STRING */,
        headers: headers /* JSON */,
        ignoreSSLErrors: undefined /* BOOLEAN */,
        useNTLM: undefined /* BOOLEAN */,
        workstation: undefined /* STRING */,
        useProxy: undefined /* BOOLEAN */,
        withCookies: undefined /* BOOLEAN */,
        proxyHost: undefined /* STRING */,
        url: url /* STRING */,
        content: {} /* JSON */,
        timeout: undefined /* NUMBER */,
        proxyPort: undefined /* INTEGER */,
        password: undefined /* STRING */,
        domain: undefined /* STRING */,
        username: undefined /* STRING */
    });

Note that the URL used in this example (https://testcluster.edc.ptc.io/Thingworx) is the entry point of the ThingWorx cluster. Replace this value with your cluster's entry point if you want to duplicate this in your own cluster. Now, let's check the result again.
Notice that the helper utility TimerBackend_Service is now running on both cluster nodes, Platform1 and Platform2.

Is this Magic? No! What is Happening Here?

The timer or scheduler itself is still being executed on the singleton node, but now instead of triggering the helper utility locally, the PostJSON service call from the subscription is being routed back to the cluster entry point (the load balancer). As a result, the request is routed (usually round-robin) to any available cluster nodes that are behind the load balancer and reporting as healthy. Usually, the load balancer will be configured to have a cookie-based affinity: the load balancer will route the request to the node that has the same cookie value as the request. Since this PostJSON service call is a RESTful call, any cookie value associated with the response will not be attached to the next request. As a result, the cookie-based affinity will not impact the round-robin routing in this case.

Considerations to Use this Approach

Authentication: As illustrated in the demo, make sure to use an Application Key with an appropriate user assigned in the header. You could alternatively use username/password or a token to authenticate the request, but this could be less ideal from a security perspective.

App Deployment: The hostname in the URL must match the hostname of the cluster entry point. As the URL of your implementation is now part of your code, if you deploy this code from one ThingWorx instance to another, you will need to modify the hostname/port/protocol in the URL. Consider creating a variable in the helper utility which holds the hostname/port/protocol value, making it easier to modify during deployment.
Firewall Rules: If your load balancer has firewall rules which limit the traffic to specific known IP addresses, you will need to determine which IP addresses will be used when a service is invoked from each of the ThingWorx cluster nodes, and then configure the load balancer to allow the traffic from each of these public IP addresses. Alternatively, you could configure an internal IP address endpoint for the load balancer and use the local /etc/hosts name resolution of each ThingWorx node to point to the internal load balancer IP, or register this internal IP in an internal DNS as the cluster entry point.
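For the App Deployment consideration above, one way to keep the entry point out of the service body is to assemble the URL from a single settings object. This is a minimal sketch in plain JavaScript: `clusterEntryPoint` and `buildServiceUrl` are hypothetical names, and where the settings actually live (a property, a configuration table, or elsewhere) is left open.

```javascript
// Hypothetical settings object; only these values change between deployments.
const clusterEntryPoint = {
    protocol: "https",
    host: "testcluster.edc.ptc.io",
    port: 443
};

// Build the REST URL for a service on a Thing behind the cluster entry point.
function buildServiceUrl(entryPoint, thingName, serviceName) {
    const { protocol, host, port } = entryPoint;
    return protocol + "://" + host + ":" + port +
        "/Thingworx/Things/" + encodeURIComponent(thingName) +
        "/services/" + encodeURIComponent(serviceName);
}

const url = buildServiceUrl(
    clusterEntryPoint,
    "DistributeTaskDemo_HelperThing",
    "TimerBackend_Service"
);
console.log(url);
```

With this arrangement, moving the code to another cluster would only require changing the three values in the settings object, not the subscription code itself.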

Dec 1, 2021

Leveraging Dell and VMware for Asset Monitoring in Connected Factories

As an extension of the Connected Factory Reference Benchmark performed on Microsoft Azure, PTC partnered with Dell Technologies in producing this document, a baseline which illustrates the effectiveness of ThingWorx and Kepware when combined with Dell and VMware technologies to create solutions for on-premises and hybrid Connected Factory implementations. Please join us in thanking Bhagyashree Angadi, Brian Anzaldua, Todd Edmunds, Mike Hayes, and the Dell Customer Solution Center team in Limerick, Ireland for working with the IoT Enterprise Deployment Center on this benchmark! This benchmark is of a very similar design to a previous publication, but this time designed specifically with Dell Technologies in mind. In a Dell/VMware architecture, the close proximity of Kepware Server and ThingWorx Foundation provides ideal conditions for network throughput between these components. Combined with the ability to easily monitor and resize virtual machines as your business needs evolve, these hardware configurations can be very effective in on-premises or hybrid deployment scenarios.

Oct 8, 2020

Is your team operating an effective DevOps pipeline? DevOps is an important part of a mature, enterprise ready application, but the process isn't simple. This expert session will focus on what a full DevOps pipeline looks like and how PTC can help to build a seamless pipeline. Join us for our upcoming Expert Session to learn how to create a Docker image, integrate Azure with Docker and Git, and set up a seamless DevOps pipeline. When? Thursday, September 30th 2021 | 11 AM EST Host: Tori Firewind, Senior Engineer in PTC IoT Enterprise Deployment Center Registration link: https://www.ptc.com/en/resources/iiot/webcast/devops-pipeline-thingworx

Sep 22, 2021

ThingWorx Docker Overview and Pitfalls to Avoid by Tori Firewind of the IoT EDC Containers are isolated and can run side-by-side on the same machine, but they share the host OS, making them more efficient in terms of memory usage and scalability. Docker is a great tool for deploying ThingWorx instances because everything is pre-packaged within the Docker image and can be stored in a repository ready for deployment at any time with little configuration required. By using a different container for every component of an application, conflicting dependencies can be avoided. Containers also facilitate the dev ops process, providing consistent application deployments which can be set up, taken down, and tested automatically using scripts. Using containers is advantageous for many reasons: simplified configuration, easier dev ops management, continuous integration and deployment, cost savings, decreased delivery time for new application versions, and many versions of an application running side-by-side without any wasted resources setting them up or tearing them down. The ThingWorx Help Center is a great resource for setting up Docker and obtaining the ThingWorx Docker files from the PTC Software Downloads website. The files provided by PTC handle the creation of the image entirely, simplifying the process immensely. All one has to do is place the ThingWorx version and all of the required dependencies in the staging folder, configure the YML file, and run the build scripts. The Help Center has all of the detailed information required, but there are a few things worth noting here about the configuration process. For one thing, the platform-settings.json file is generated based on the options given in the YML file, so configuration changes made within this configuration file will not persist if the same options aren’t given in the YML file. 
If using Docker Desktop to run an image on a Windows machine, then the configuration options must be given in an ENV file that can be referenced from the command used to start the image. The names of the configuration parameters differ from the platform-settings.json file in ways that are not always obvious, and a full list can be found here. For example, if extension imports need to be enabled on a ThingWorx instance running in Docker, then the EXTPKG_IMPORT_POLICY_ENABLED option must be added to the environment section of the YML file like this:

    environment:
      - "CATALINA_OPTS=-Xms2g -Xmx4g"
      # NOTE: TWX_DATABASE_USERNAME and TWX_DATABASE_PASSWORD for H2 platform must
      # be set to create the initial database, or connect to a previous instance.
      - "TWX_DATABASE_USERNAME=dbadmin"
      - "TWX_DATABASE_PASSWORD=dbadmin"
      - "EXTPKG_IMPORT_POLICY_ENABLED=true"
      - "EXTPKG_IMPORT_POLICY_ALLOW_JARRES=true"
      - "EXTPKG_IMPORT_POLICY_ALLOW_JSRES=true"
      - "EXTPKG_IMPORT_POLICY_ALLOW_CSSRES=true"
      - "EXTPKG_IMPORT_POLICY_ALLOW_JSONRES=true"
      - "EXTPKG_IMPORT_POLICY_ALLOW_WEBAPPRES=true"
      - "EXTPKG_IMPORT_POLICY_ALLOW_ENTITIES=true"
      - "EXTPKG_IMPORT_POLICY_ALLOW_EXTENTITIES=true"
      - "EXTPKG_IMPORT_POLICY_HA_COMPATIBILITY_LEVEL=WARN"
      - "DOCKER_DEBUG=true"
      - "THINGWORX_INITIAL_ADMIN_PASSWORD=Pleasechangemenow"

Note that if the container is started and then stopped in order for changes to the YML file to be made, the license file will need to be renamed from "successful_license_capability_response.bin" to "license_capability_response.bin" so that the Foundation server can rename it. Failing to rename this file may cause an error to appear in the Application Log, and the server to act as if no license was ever installed: "Error reading license feature info for twx_realtime_data_sub".
In Docker Desktop on a Windows machine, create a file with any name and the .env extension (h2.env in the command below) and list the configuration parameters in it, one KEY=value pair per line. Then, reference this environment file when bringing up the machine using the following command in PowerShell:

    docker run -d --env-file h2.env -p 8080:8080 -v ${pwd}/ThingworxPlatform:/ThingworxPlatform -v ${pwd}/ThingworxStorage:/ThingworxStorage -it <image_id>

Notice in this command that the volumes for the ThingworxPlatform and ThingworxStorage folders are specified with the "-v" options. When building the Docker image in Linux, these are given in the YML file under the volumes section like this (only change the path to local mount on the left side of the colon, as the container mount on the right side will never change):

    volumes:
      - ./ThingworxPlatform:/ThingworxPlatform
      - ./ThingworxStorage:/ThingworxStorage
      - ./tomcat-logs:/opt/apache-tomcat/logs

Specifying the volumes this way allows for ThingWorx logs and configuration files to be accessed directly, a crucial requirement for debugging any issues within the Foundation instance. These volumes must be mapped to existing folders (which have write permissions of course) so that if the instance won't come up or there are any other issues which require help from Tech Support, the logs can be copied out and shared. Otherwise, the Docker container is like a black box which obscures what is really going on. There may not be any errors in the Docker logs; the container may just quit without error with no sign of why it won't stay up. Checking the ThingWorx and Tomcat logs is necessary for debugging, so be sure to map these volumes correctly. Once these volumes are mapped and ThingWorx is successfully making use of them, adding a license file to the Docker instance is simple. Use the output in the ThingworxPlatform folder to obtain the device ID, grab a valid license file, and put it right back into that ThingworxPlatform folder, exactly the same way as on a regular instance of ThingWorx.
However, if the Docker image is being used for a dev ops process, a license may not be necessary. The ThingWorx instance will work and allow development for a time before the trial license expires, which normally will be enough time for developers to make their changes, push those changes to a repository, and tear the container down. Another thing worth noting about ThingWorx Docker image creation is that the version of Java supplied in the staging folder must match the compatibility requirements for each version of ThingWorx. This is the version of Java used by the container to run the Foundation server. In versions of ThingWorx 9.2+, this means using the Amazon Corretto version of Java. The image absolutely will not start ThingWorx successfully if older versions of Java are used, even if the scripts do successfully build the image. Also note that in the newer versions of ThingWorx Docker, the ThingWorx Foundation version within the build.env file is used throughout the Docker image creation process. Therefore, while the archive name can be hard-coded to whatever is desired, the version should be left as is, including any additional specifications beyond just the version number. For example, the name of the archive can be given as Thingworx-Platform-H2-9.2.0.zip (a prettier version of the archive name than is used by default), but the PLATFORM_VERSION should still be set to 9.2.0-b30 (which should be how it appears within the build.env file upon download of the ThingWorx Docker files). Paying attention to every note in the Help Center is critically important to using ThingWorx Docker, as the process is extensive and can become very complicated depending on how the image will be used. However, as long as the volumes are specified and the log files accessible, debugging any issues while bringing up a Docker-contained ThingWorx instance is fairly straightforward. Credits: Images borrowed from ThingWorx Docker Containerization Tech Talk by Adrian Petrescu

Jul 30, 2021

Dev Ops is a crucial process that exists in any software setting, whether you plan on it or not. Chaos in the dev ops process, say because less time is spent here than on the shiny new features that are easy to sell, results in bottlenecks in the dev ops process. Bottlenecks reduce efficiency, and leave you open to vulnerabilities as well. The faster you can get a change properly tested and safely into production, the safer and more stable the system is all along. Issues will arise, they always arise. Are you ready for them? Watch this video, see some of these additional links, and think about your dev ops process now, before the fires start! Useful Links: ThingWorx Monitoring and Alerting Using Prometheus and Grafana, Part 1 ThingWorx Monitoring and Alerting Using Prometheus and Grafana, Part 2 Overview of Monitoring Tools and Diagnostics The System Health Timer

Jan 31, 2023

Unlocking the Power of Industrial Data

Presentation by Mike Jasperson, VP of the IoT Enterprise Deployment Center

This video presentation was given at the Digital Transformations in Manufacturing conference of 2021, hosted by Enterprise Digital. In this presentation, Mike Jasperson goes over the benefits of modernizing and consolidating access to time-stamped data that is ingested from equipment and sensors into a central location like ThingWorx. Moving away from monolithic, legacy, and siloed systems, and towards more agile solutions, has never been more critical in order to increase machine, operational, and business efficiencies while also opening up visibility into data systems and infrastructure deployments. This video partners with InfluxData to help customers extract value from IoT data systems, maximizing both performance and operational capabilities of their monitoring systems. To stay competitive in the IoT market, it's important to review the best practices for scaling and testing your industrial metrics solutions, as well as how to get the best performance out of your digital data solutions by using time-series optimized databases like InfluxDB. Open source technologies discussed here are a great way to create modular and upgradable solutions and accelerate IoT innovation.

May 25, 2021

Architecting Reason Code Trees in DPM

Tori Firewind, IoT EDC

What are Machine Codes?

Factory hardware devices communicate status changes to their human operators and other machines (IoT) via machine codes. The manufacturers often determine the machine codes for different types of factory hardware, so those are usually pre-determined. However, how the reason trees map these machine codes to corresponding business logic in ThingWorx is entirely customizable. Knowing the best way to design your reason trees for this purpose can be challenging, so this guide is here to help with your conceptual knowledge. Technical details on using the UI to create, edit, and configure reason codes can be found in the Help Center.

The Tree Trunk

At the highest level of the reason tree, the trunk, there are really 3 categories: Availability (A), Performance or Productivity (P), and Quality (Q). These should look familiar; they are the three dimensions of OEE (Overall Equipment Effectiveness).

Fg 1. Calculation of OEE

Availability refers to long stops, events that stop planned production long enough that it makes sense to track a reason for being down (typically several minutes, but the threshold between a long stop and a short stop can vary depending on the ideal rate of production of materials).

    Availability = Run Time / Planned Production Time

Productivity/Performance really refers to short stops, things that cause the machine to run at less-than-optimal speeds. This can include stops caused by running out of materials for production, doing minor maintenance like switching out a single, easily-changed part, or even frequent breaks due to ill health of an operator. User error can be a cause as well, say if the machine needs a certain heat to produce parts, and the heat keeps fluctuating (requiring the machine to take the time to calibrate for this before starting on production) because operators are smoking out a back door or adjusting thermostat temperatures.

Fg 2. Levels of Runnable Time

Operator influence often is a factor when it comes to the conditions that permit optimal performance from machinery, and every factory may face different challenges. Stops like these are not really outages; the amount of downtime isn't enough to consider the production block entirely unproductive. Production was continuing and ongoing throughout most of the block despite the issues; the rate was just slower than ideal.

    Performance/Productivity = (Total Count / Run Time) / Ideal Run Rate

Quality refers mostly to the number of items that are considered scrap or rework, and it can be split into two categories: start up scrap (that which is expected because the machine is in the process of warming up or being fine-tuned by the operator) and production scrap (things which come out wrong and must be tossed or reworked because the conditions under which they were produced weren't ideal; this is called first-pass yield only, meaning it's only a "good" product if it passes inspection the first time).

    Quality = Good Count / Total Count

The Branches and the Leaves of the Tree

The "leaves" are the reason codes which directly map to machine codes, and the "branches" are the method of categorization that connects them to the trunk. Both the leaves and the branches, the children and the parent nodes of the tree, are split into two states: planned versus unplanned downtime. Changeovers, maintenance, and even scrap can be broken down into this dichotomy. For scrap, there are startup rejects (planned, because the machines have ramp up periods) and production rejects (unplanned, because the conditions weren't ideal). For maintenance, there is planned maintenance, and there are unplanned, small changes that occur on the fly, which result in productivity loss and may also reduce availability in the long run. Small, unplanned changes can occasionally shift into the availability loss category if a simple, quick repair winds up being complex and time-consuming.
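To make the three trunk formulas (Availability, Performance, Quality) concrete, here is a small worked example in plain JavaScript. The function name `computeOee` and the shift figures are made up for illustration, and the sketch assumes the standard definition of OEE as the product of the three dimensions (OEE = A × P × Q).

```javascript
// Hypothetical helper applying the three formulas from this article.
// Times are in minutes; idealRunRate is in parts per minute.
function computeOee({ plannedProductionTime, runTime, totalCount, goodCount, idealRunRate }) {
    const availability = runTime / plannedProductionTime;
    const performance = (totalCount / runTime) / idealRunRate;
    const quality = goodCount / totalCount;
    return {
        availability,
        performance,
        quality,
        oee: availability * performance * quality
    };
}

// Made-up shift: 480 planned minutes, 48 of them lost to long stops.
const shift = computeOee({
    plannedProductionTime: 480,
    runTime: 432,
    totalCount: 17280,
    goodCount: 16416,
    idealRunRate: 50
});

// Availability is 90%, performance 80%, and quality 95% for these inputs.
console.log((shift.oee * 100).toFixed(1) + "%"); // prints "68.4%"
```

Note how three individually respectable percentages multiply out to a much lower overall figure, which is why each branch of the tree needs to map cleanly to exactly one of the three categories.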
A good reason tree can differentiate easily between short and longer stops in order to respond to each in a deliberate way. To start off in the process of architecting your reason tree, try writing the three categories on a board in a common room in an average factory (or several as a survey). Ask operators to stop in over the course of a few days and write various machine codes that they see often and find useful under one of these categories, or more than one if the machine code pops up under different circumstances and can mean different things. Have them write a 10 word justification, if the association isn’t obvious. Gather all of the “leaves” in this way, and then begin to associate them with the “trunk”, forming the “branches”. An example tree can be seen in figure 3 here, with leaves like “Changeover” and “Maintenance” being semi-ambiguous; they could just as easily be seen as unplanned stops. Therefore, there may be multiple reason codes mapping up to the top of the tree in more than one branch, and these can have different categories, which controls how the business logic responds to the different codes. The Help Center has more details about how the events are mapped to types, and each type contains multiple categories, as configured by you when you set-up the DPM model. Fg 3: Different types of changeovers may have different codes, and can map up as either planned or unplanned, but all planned and unplanned stops (long stops) are under the Availability category of the trunk. Similarly, small stops can involve idling, like if there are not enough materials, reduced speed if the conditions are not ideal, or other small stops, usually caused by human error or unforeseeable circumstances. 
Quality loss then refers to the products which fail quality checks, either because the machine still has the wrong paint in the applicator and needs a few rounds to be ready for the next production item, or because the conditions are, again, not ideal, and items wind up scrapped.

Example Reason Tree

Fg 5: Example tree with more specific tags (there may be dozens or hundreds in a full reason tree, though the fewer needed to capture the events we care about, the better).

Theory of Constraints

Fg 6: Theory of constraints wheel, an industry process for gradual OEE improvement in factories that has been adapted into the PTC methodology as well.

While architecting your reason tree, always remember the key purpose: gathering only as much data as necessary to analyze the efficiency of a factory and to identify the bottleneck, or the most limiting factor. The important point is to identify not just the bottleneck that seems the most troublesome, but the one that actually results in the greatest impact to OEE across the entire factory. Without software like DPM and a properly designed reason code tree, the process of improving a factory can be very challenging, involving a lot of guesswork, and sometimes solving one problem at the cost of another. The issue is that these machines produce a LOT of raw data, and humans are not the best tool available to gather and aggregate this data in a consumable way. A good reason tree ensures a smart application that can quickly prioritize the machine (bottleneck) that most impacts production, and not just the machine that functions in the least optimal way.

So, the theory of constraints is really a process for identifying small, incremental changes which together can make a big difference, and fast, in factory OEE. The rate at which this cycle can be completed varies, however. The slower the process of identifying constraints and the less information that is gathered, the slower and less precise the first two steps of this process.
Alternatively, in a traditional constraint identification process, too much information can be a problem as well, due to human limitation, as discussed above. So, DPM is a great benefit in this regard, because it aggregates the data in a consumable, comparable way every 5 minutes, freeing up your human analysts for problem solving and prioritization rather than data gathering and sorting.

Other Key Tips

Also remember that a good tree treats the trunk like a whole unit, with each category occupying a percentage of the overall OEE. After all, look back up at the 3 dimensions of OEE in the equation above. For example, the more you see issues with availability, the less you will see issues with scrap, for the machine simply doesn’t have as much time to produce scrap if it is constantly down. The more you see issues with quality loss, the less you should see of productivity loss, because these modes are inversely related; to say it differently, if a machine is running quickly and seeing few minor maintenance stops, then it is likely to produce more scrap (along with more good product).

Another thing to remember is that even DPM is limited in its capacity to interpret raw data. Even though it is many orders of magnitude more efficient than any human gathering and analysis could ever be, there is an upper limit to how much raw data DPM can ingest and analyze before the system gets very expensive. For this reason, you want to ensure your reason trees use only as many reason codes as are required to capture the OEE of a factory site. This will most likely mean using different codes for different types of things, which is easy to do maintainably across many sites using thing shapes. Keeping things tightly defined and organized is the easiest way to ensure a clean, efficient system for gathering and storing data. Also remember that data will not need to persist very long once DPM is fully operational and adopted by your factories.
DPM ensures that the changes made to the production line to improve efficiency are the highest impact, and the least difficult to implement, meaning that there will be a very rapid return on investment, and a process to ensure future issues are identified and resolved quickly. Data from past issues in the factory won’t be as relevant, and historical data stores can be kept smaller than one might think. It is the power of ingesting data directly into the processing and aggregation process, the automatic reduction of data down into presentable, consumable webpages, that makes DPM and ThingWorx such a great factory solution for optimizing OEE.

Jan 24, 2023

ThingWorx Monitoring and Alerting, Part 2
Using Prometheus and Grafana
By Tori Firewind, IoT EDC

Building Dashboards

To add a panel which monitors some component of the ThingWorx application to a dashboard in Grafana, click to add a new panel. Under “Metrics” in the box at the bottom of the screen, select which ThingWorx metrics you wish to monitor (type “thingworx” in the search box to see them all). For example, select the Platform Subsystem memory in use:

Label filters aren’t necessary, though you may want to filter by instance if you are monitoring multiple instances with the same dashboard. You may also want to take some time to format the Y axis, which by default will show in bytes. Go to the formatting panel on the right side and scroll down to the section called “Standard options”. For the Unit dropdown, start typing “data” and then select “bytes (SI)”. This will automatically determine whether the bytes you’ve provided should really print as MB or GB based on how large the numbers are. Rename the panel, modify it in any other way desired, and then click Apply (last 5 minutes):

Once you add the panel, you can watch the memory usage as it is scraped by selecting the refresh option (10s or 30s, whatever makes sense based on your scrape interval). The viewing window is stored in the URL, so you can generate a report for a specific interval (like when a test was occurring), and then store that result or share it in a more compact way: http://localhost:3000/d/nleucPv4k/thingworx-monitoring?orgId=1&from=1668528038732&to=1668536503953 (absolute timestamps):

Dashboards are just collections of panels which report on all of the various metrics of performance and stability that exist for single components of a system. This is because there can be quite a few metrics worth watching for each individual component. Most of the third-party tools come with their own dashboards, but the ThingWorx component is one which, for now, requires some thought and creativity.
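Because the viewing window lives in the URL as absolute timestamps, a link for a known test interval can also be built programmatically. The following is a minimal sketch: it assumes the `from` and `to` parameters are Unix epoch times in milliseconds (which is what the sample URL above appears to contain), and the helper name `grafana_range_url` is made up for illustration.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def grafana_range_url(dashboard_url: str, start: datetime, end: datetime, org_id: int = 1) -> str:
    """Append an absolute time range (epoch milliseconds) to a dashboard URL."""
    if end <= start:
        raise ValueError("end must be later than start")
    params = {
        "orgId": org_id,
        "from": int(start.timestamp() * 1000),
        "to": int(end.timestamp() * 1000),
    }
    return f"{dashboard_url}?{urlencode(params)}"


# Dashboard path copied from the example URL in the article; the two-hour
# test window below is hypothetical.
url = grafana_range_url(
    "http://localhost:3000/d/nleucPv4k/thingworx-monitoring",
    start=datetime(2022, 11, 15, 16, 0, tzinfo=timezone.utc),
    end=datetime(2022, 11, 15, 18, 0, tzinfo=timezone.utc),
)
print(url)
```

Using timezone-aware datetimes avoids the link silently shifting by the local UTC offset when it is generated on a different machine than the one that ran the test.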
Consider your use case carefully and look over the various subsystems contained within ThingWorx. Each part of an application is localized to specific subsystems, and some are more business critical than others. What will go at the top of your dashboard? Add rows, add panels per row, and see what the many choices are for watching your system.

Don’t forget that with Telegraf running, VM or machine usage metrics are also available for display on a dashboard. Things like overall CPU and memory usage are critical to determining the health of a system, as we have demonstrated in our own reasoning in past benchmarks and scale tests. You can create a panel to monitor the mem_used versus mem_total, like so:

Another metric from Telegraf worth adding is the CPU usage, which should be given “percentage” for the units and which needs a label filter of cpu = cpu-total. If we do some resizing and drag-and-dropping, then we now have the first row of a dashboard:

See how the Platform usage climbs steadily and is purged in a cycle? That is the Java Garbage Collection mechanism, and it’s important to remember to leave room for spikes on top of those peaks.

Data can also be calculated or processed in some way to make it more useful for determining system health and stability. The data in the picture below uses the formula submitted = completed + number queued + number failed. It shows the current queue on the left Y-axis and the max queue on the right (since the two numbers usually are drastically different). It looks pretty, but it doesn’t really tell us much about the system in this format, so let’s do some math and find a representation that is a bit more helpful. Performing a “non-negative derivative” calculation on the submitted and the completed queue counts over time allows us to look at the status of the queue as a velocity.
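To make the “velocity” idea concrete, here is a minimal sketch of a non-negative derivative over scraped counter samples. It is an illustration of the general technique, not Grafana's own implementation: the function name and the sample data are hypothetical, and negative changes (such as a counter resetting after a restart) are simply dropped.

```python
def non_negative_rate(samples: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Turn (timestamp_seconds, counter_value) samples into per-second rates.

    Intervals where the counter went backwards are skipped rather than
    reported as a negative rate.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        change = v1 - v0
        if elapsed <= 0 or change < 0:
            continue
        rates.append((t1, change / elapsed))
    return rates


# Hypothetical scrapes every 10 seconds; the submitted counter resets at t=30.
submitted = [(0, 100), (10, 160), (20, 220), (30, 5), (40, 65)]
completed = [(0, 90), (10, 140), (20, 190), (30, 240), (40, 290)]

submitted_rate = non_negative_rate(submitted)
completed_rate = non_negative_rate(completed)

print("submitted/s:", submitted_rate)
print("completed/s:", completed_rate)
```

In this made-up data the queue receives 6 entries per second but only completes 5 per second, so the two rate lines never meet even though both raw counters keep climbing.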
When the “completed” speed lags behind the “submitted” speed for too long at a time, the queue is filling up and will eventually result in data loss. If we take this one step further and calculate the average of the submitted minus the completed over time, then we can actually predict approximately when the queue will fill up. This can then be displayed on a dashboard in Grafana, or used as the basis for an alert.

What to Monitor

In addition to monitoring the system which ThingWorx runs upon, ThingWorx itself can easily be monitored down to the subsystem level by Prometheus due to the Metrics endpoint. Many applications have support built into the way they format the data for scraping, including the JVM (which exposes Prometheus-formatted metrics with the JMX Exporter) and the OS (which can use the Node Exporter or Telegraf for the same purpose). For these more generic components, there are popular community dashboards which can be downloaded and used in Grafana for data analysis and review.

For ThingWorx, there are different kinds of data to track: subsystem data (see the list on the right) and non-subsystem data. There is queue-based versus non-queue data. These different metrics can collectively characterize the overall health of the application, depending on the use case. For instance, if this is a system with very many connected devices, one metric which may be important to track is the number of total devices defined on the Foundation server vs. the number of devices which are currently connected. If there are relays involved, then many devices suddenly going offline can mean a relay has failed. Another example is if the system sizing depends on an assumption that there will only ever be a fraction of the total number of devices connected at a time. Use cases like these could be monitored easily by keeping track of the total vs. the number of connected devices.
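The queue fill-time prediction mentioned earlier (the average of submitted minus completed over time) can be sketched in a few lines. The function name, parameters, and numbers below are assumptions for illustration; the rates are taken to be entries per second sampled at matching times.

```python
from typing import Optional


def seconds_until_full(
    submitted_rates: list[float],
    completed_rates: list[float],
    current_size: float,
    max_size: float,
) -> Optional[float]:
    """Estimate how long until a queue fills, given per-second rates.

    Returns None when the queue is draining or holding steady on average.
    """
    if not submitted_rates or len(submitted_rates) != len(completed_rates):
        raise ValueError("rate lists must be non-empty and the same length")
    # Average net growth of the queue, in entries per second.
    net_growth = sum(s - c for s, c in zip(submitted_rates, completed_rates)) / len(submitted_rates)
    if net_growth <= 0:
        return None
    return (max_size - current_size) / net_growth


# Hypothetical queue: 400 of 1000 slots used, growing by 1 entry per second.
eta = seconds_until_full([6.0, 6.0, 6.0], [5.0, 5.0, 5.0], current_size=400, max_size=1000)
print(f"Queue full in about {eta:.0f} seconds")
```

An estimate like this is only as good as the averaging window behind it; a short window reacts quickly to bursts, while a longer one gives a steadier number for a dashboard panel.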
Other common indicators of a healthy ThingWorx application might include the value stream and stream queues. These queues should fluctuate over time as the data is ingested and processed, but they should never be growing in size. If the stream queues are growing, then the data is being written to the queues faster than the queues can write to the database. Eventually, when the system runs out of resources to keep track of the queues, data will be lost. Having the stream information displayed in a chart can make it very easy to spot an upward trend in resource usage early on, which can catch a blockage or bottleneck that needs attention before it starts to affect the larger system in catastrophic ways later.

Memory usage information from the various subsystems might be something worth tracking, as well as the event queue. These can indicate that the business logic is functioning with room to handle spikes, and that the server has enough memory to service all three dimensions of an IoT application: the ingestion, the business logic and thing-based alerting, and the user experience and UI.

If file transfers are a key part of the use case, then the number of concurrent transfers, their average speed, and the size of the files can all be tracked and charted in Grafana by making use of the ThingWorx metrics which automatically show up there once you import the Prometheus data source. A mature dashboard used for a production environment might look a little like this:

For further reading about subsystem monitoring, check out the Help Center.

How to Alert

The alerting mechanism built into Prometheus is incredibly easy to configure, so it might be tempting to generate tons of alert rules. However, remember that the more noise a system makes, the harder it is for those monitoring that system to know when action is really required.
Playbooks which document how to respond to alerts, who to contact, how to act, and all the information necessary to handle an alert should be created as an ongoing part of the DevOps process. Alerts should fire with the right severity in the subject line, as well as all of the information about the issue that is currently known, presented in a concise way, so that whoever receives the alert starts thinking about the root cause sooner and recovers the system faster. Those who receive the alert should have the ability to facilitate its resolution, and know who is expected to react to any alerts which come in. In the ThingWorx monitoring stack, Prometheus handles the alert rules and the generation of alerts, but alert filtering and delivery is managed in an external alerts manager.

Generally, you want your alerts to follow a curve. If the current queue size exceeds 50% of the maximum, perhaps that isn’t a huge deal, if the application catches up quickly. How long are spikes in queue processing expected to last? Perhaps if the queue size is over half-full for 10 seconds or 30 seconds, then the queue is falling behind and not catching up. This might be a warning-level alert. When does this become an error? Well, let’s say the queue exceeds 90% of the max queue size. This might warrant an alert the moment it hits the mark. Farther along the curve, it may not take as long before data gets lost. As the severity of the situation increases, the threshold for alerting should increase as well, while the time the condition must persist before alerting shrinks. That way, when errors do alert, it is a sure thing that they require a response immediately.

The alerts are then pushed into the “Alerts Manager” for delivery based on your management rules. The Alerts Manager may decide to withhold warnings altogether, or send them to a much smaller mailing list, whatever filtering helps to ensure the right people receive the right alerts, right when they need them.

In Conclusion, A Healthy Application...
Has stable memory usage that fluctuates predictably and doesn’t grow over time. In a system experiencing mild issues, the memory starts to trend upward:

If left unattended, systems like this may eventually experience outages. Finding the issue this early means there is even time to do some digging, debugging, taking of stack traces, and other such troubleshooting steps before the system must be restarted or recovered. That can really make the difference in identifying and resolving the issue before there are real problems.

One metric which makes for good alerting is the total number of failed stream entries, which can indicate there’s an issue writing to the database even before the queue has started to fill up. Other alerts may include warnings and errors based around percentages of memory used or queues filled, which depend on how long the queues take to fill up and how long the state has been at its increased usage. Prometheus has all of the tools necessary to make this possible across a variety of infrastructures and use cases. Set it up on a local machine and poke around at what ThingWorx metrics are available to meet your monitoring needs.
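The warning and error curve described under How to Alert can be expressed as a small decision function. This is only a sketch of the logic, written in Python rather than as actual Prometheus alert rules: the function name is hypothetical, and the 50% and 90% thresholds and the 30-second hold are the illustrative values from the discussion above, not recommended settings.

```python
from typing import Optional


def queue_alert_severity(
    current_size: float,
    max_size: float,
    seconds_above_warning: float,
    warning_ratio: float = 0.5,
    error_ratio: float = 0.9,
    warning_hold_seconds: float = 30.0,
) -> Optional[str]:
    """Classify a queue reading as None, "warning", or "error".

    Errors fire as soon as the error threshold is reached; warnings fire only
    after the queue has stayed above the warning threshold for a while.
    """
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    fill = current_size / max_size
    if fill >= error_ratio:
        return "error"
    if fill > warning_ratio and seconds_above_warning >= warning_hold_seconds:
        return "warning"
    return None


# Hypothetical readings against a queue with a maximum size of 1000.
print(queue_alert_severity(600, 1000, seconds_above_warning=5))    # brief spike
print(queue_alert_severity(600, 1000, seconds_above_warning=30))   # sustained
print(queue_alert_severity(950, 1000, seconds_above_warning=0))    # nearly full
```

Separating the threshold from the hold time is what keeps short, self-correcting spikes quiet while still escalating immediately when the queue is close to losing data.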

Dec 23, 2022
