

IoT Tips

Interested in learning how others using and/or hosting ThingWorx solutions can comply with various regulatory and compliance frameworks?   Based on inquiries regarding the ability of customers to meet a wide range of obligations, ranging from SOC 2 to ISO 27001 to the Department of Defense’s Cybersecurity Maturity Model Certification (CMMC), PTC's IoT Product Management and EDC teams have collaborated on a set of detailed articles explaining how to do so.   Please check out the ThingWorx Compliance Hub (login required) for more information!
How to Scale Vertically and Horizontally, and When to Use Sharding Written by Mike Jasperson, VP of IOT EDC   Deployment architecture describes the way in which an IOT application is deployed, or where each of the components is hosted on the network. There are deployment architecture considerations to make when scaling up an application. Each approach to deployment expansion can be described by the “eggs in a basket” analogy: vertical scale is like one person carrying a bigger basket, horizontal scale is like one person carrying more baskets, and sharding is like more than one person carrying the baskets (see below).   All of these approaches result in the eggs getting from point A to point B (they all satisfy the use case), but the simplest (vertical scale) is not necessarily the best. Sure, it makes sense on paper for one person to carry everything in one big basket, but that doesn’t ensure that all of the eggs arrive intact. Selecting the right deployment architecture is a way to ensure the use cases are satisfied in the best and most efficient ways possible, with the least amount of application downtime or data loss.   Vertical Scale - a.k.a. "one person carrying a bigger basket" The most common scalability approach is to simply size the IOT server larger, or scale up the server. This might mean the server is given additional CPU cores, faster CPU clock speeds, more memory, faster disks, additional network bandwidth or improved network cards, and so on. This is a very good idea when the application logic is increased in complexity, when more data is therefore needed in memory at a time, or when the processing of said data has to occur as quickly as possible. For example, adding additional devices to the fleet increases the size of the “Thing Model” in the process and will require additional heap memory be available to the Foundation server.   However, there are limitations to this approach. 
Only so many concurrent operations and threads can be performed at once by a single server. Operations trying to read and write to the disks at once can introduce bottlenecks and reduce server performance. Likewise, “one person, one basket” introduces a single-point-of-failure operating risk. If for some reason the server’s performance does degrade or cease altogether, then all of the “eggs” go down with it. Therefore, this approach is important, but usually not sufficient on its own for empowering an enterprise level deployment.   Horizontal Scale - a.k.a. "one person, with two or more baskets" As of ThingWorx version 9.0, Foundation servers can be deployed in clusters, meaning more baskets to carry the eggs. More baskets means that if even one of these servers is active, the application remains up in the event of an individual node failure or maintenance. So clustered deployments are those which facilitate High Availability.   Clustered servers save on some resources, but not others. For instance, every server in a cluster will need to have the same amount of memory, enough to store the entire Thing Model. Each of the multiple baskets in our analogy has to have the same type of eggs. One basket can’t have quail eggs if the rest have chicken eggs. So, each server has to have an identical version of the application, and therefore enough memory to store the entire application.   Also keep in mind that not all application business logic can scale horizontally. Event queues are local to each ThingWorx node, so the events generated within each node are processed locally by that particular node, and not the entire network (examples are timer and scheduler-based activities). Likewise, data ingestion done through an extension or other background process, like MQTT, emits events within a node that therefore must be processed by that particular node, since that's where the events are visible. 
On the other hand, load distribution that happens external to ThingWorx in either the Connection Server (for AlwaysOn based data coming from ThingWorx SDKs, EMS, eMessage agents, or Kepware) or REST API calls through a load balancer (i.e. user activity) will be distributed across the cluster, facilitating greater scaling potential in terms of userbase and mashup complexity. Also note that batched data will be processed by the node that received it, but different batches coming through a connection server or load balancer will still be distributed.   Another consideration with clusters pertains to failure modes. While each node in the cluster shares a cache for many things, Stream and Value Stream queues are only stored locally. In the event of a node failure, other nodes will pick up subsequent requests, but any activity already queued on the failed node will be lost. For use cases where each and every data point is critical, it is important to size each node large enough (in other words, to vertically scale each node) such that queue sizes are constantly kept low and the data within them processed as quickly as possible. Ensuring sufficient network and database throughput to handle concurrent writes from the many clustered nodes is key as well.   Once each node has enough resources to handle local queues, the system is highly available with low risk for outages or data loss. However, when multiple use cases become necessary on single deployments, horizontal scaling may no longer be enough to ensure things run smoothly. If one use case is logic-heavy, something non-time-critical which processes data for later consumption, it can use too many resources and interfere with other, lighter but more time-critical use cases. Clustering alone does not provide the flexibility to prioritize specific operations or use cases over others, but sharding does.   Sharding - a.k.a. 
"more than one person carrying baskets" “Sharding” generally refers to breaking up a larger IOT enterprise implementation into smaller ones, each with its own configuration and resources. More server maintenance and administration may be required for each ThingWorx implementation, but the reduction in risk is worth it. If each of the use cases mentioned above has its own implementation, then any unexpected issues with the more complex, analytical logic will not affect the reaction time of operators to time-sensitive matters in the other use case. In other words, “don’t keep all your eggs in one basket”.   The best places to break up an implementation lie along logical boundaries already accepted by the business. Breaking things down in other ways might look nice on paper, but encouraging widespread adoption in those cases tends to be an uphill battle.   In connected products use cases, options for boundaries could be regional, tied more towards business vertical, or centered around different products or models.  These options can be especially beneficial when data needs to stay in particular countries or regions due to regulatory requirements. In connected operations use cases, the most common logical boundaries would be site-based, with smaller IOT implementations serving just a smaller number of related factories in a particular area. Use-case or product-line boundaries can also make sense here, in-line with the above comments about keeping production-critical or time-sensitive use cases isolated from interference from business-support and analysis use cases.     Ideally, a shard model will put the IOT implementation “closer” to both the devices communicating with it and the users that interact with the data. This minimizes the amount of data to be sent or received over long distances, reducing the impact from bandwidth and latency on performance. 
When determining which approach is best, consider that smaller, more focused implementations offer more flexibility, but are harder to manage. Having different versions of the same applications deployed in multiple places can easily become a maintainability nightmare. It’s therefore best not to combine a regional model with a use case model when it comes to determining sharding boundaries. Also consider using deployment automation tools like Solution Central. These enable tracking and managing version-controlled deployments to multiple IOT implementations, whether they are deployed on-premise, or in the cloud, giving one central location of all source code.   Another benefit to sharding is the focused investment of server resources in a more targeted way. For instance, if one region is larger than another, it may need more CPU and memory. Or, perhaps only part of an application requires High Availability, the time-critical use cases which are best suited to small, clustered deployments. The larger, analysis-centered use cases can then remain non-clustered (but still vertically scaled of course). Sharding can also make access control simpler, as those who need access to only one region or use case can just be given a user account on that particular shard.   However, certain use cases need data from more than one shard in order to operate, turning the data storage and access control benefits into challenges. Luckily, ThingWorx has an excellent toolkit for overcoming such issues. For one thing, REST API calls are readily available in ThingWorx, allowing each shard to exchange information with each other, as well as other enterprise data systems, like ERP or Service Ticketing. Sometimes, lower-level data replication strategies are the way to go, say if downsampling or data transfer from one store to another is necessary, and built-in database tools can more easily handle the workload. 
Most of the time, however, REST API calls are used to define the business logic within the application layer so that copying data actively between shards is unnecessary, using fewer resources to control what information is shared overall.   There are several design approaches for REST API communication between shards, the two most common being the “peer” model and the “layered” model. In a peer model, one shard may call upon another shard using REST whenever it needs more information. In a layered model, there are “front-line” shards which handle most (if not all) of the device communication and time-critical use cases, things which require only the information in one shard to operate. Then there are also “back-line” shards that aggregate data from the many front-line shards, performing any operations that are less time-sensitive and more complex or analytical.   For any of these approaches, it remains important to keep your data archival and purge strategy in mind. It is a best practice in ThingWorx to only retain as much data as is absolutely necessary, purging the rest periodically. If the front-line shards only ever need the last 7 days of raw data for 5 properties, plus the last 52 weeks of min/max/average data, then implementing an approach where each shard computes the min/max/average values and then archives the older data to a shared “data lake” before purging it would be ideal. This data lake then serves as the data store for all back-line shard operations.   There is also the option to consider sharing some infrastructure between ThingWorx instances when using sharding in a deployment, which can create more flexible, scalable architectures, but can also introduce issues where more than one shard is affected when issues occur on only one shard. 
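The retention example above (7 days of raw data, plus weekly min/max/average summaries) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not ThingWorx code: the function names, the in-memory list of (timestamp, value) samples, and the Monday-based week boundary are all hypothetical choices made for the example. In a real deployment the archived rows and summaries would be written to the shared data lake before the purge.

```python
from datetime import datetime, timedelta
from statistics import mean

RAW_RETENTION = timedelta(days=7)  # from the example: keep 7 days of raw data


def week_start(ts):
    """Return midnight on the Monday of the week containing ts (assumed boundary)."""
    monday = ts - timedelta(days=ts.weekday())
    return monday.replace(hour=0, minute=0, second=0, microsecond=0)


def rollup_and_purge(samples, now):
    """Split (timestamp, value) samples into three results.

    keep      - raw samples inside the retention window (stay on the shard)
    archive   - raw samples older than the window (to be sent to the data lake)
    summaries - min/max/avg of the archived samples, grouped per week
    """
    cutoff = now - RAW_RETENTION
    keep = [(ts, v) for ts, v in samples if ts >= cutoff]
    archive = [(ts, v) for ts, v in samples if ts < cutoff]

    values_by_week = {}
    for ts, v in archive:
        values_by_week.setdefault(week_start(ts), []).append(v)

    summaries = {
        week: {"min": min(vals), "max": max(vals), "avg": mean(vals)}
        for week, vals in values_by_week.items()
    }
    return keep, archive, summaries
```

Only the summaries and the recent raw rows remain on the front-line shard in this sketch; everything in `archive` is what a back-line shard would later read from the data lake.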
One common shared infrastructure piece is at the database level: each ThingWorx instance needs its own database, but a single database server instance (such as a PostgreSQL HA cluster) could serve separate database namespaces to multiple ThingWorx instances. This is an attractive option where an existing enterprise-scale database infrastructure with experienced DBAs is already in place.   Similarly, load balancers can often be configured to manage load for multiple servers or URLs. If properly configured, a single load balancer could direct traffic for multiple applications, but it can also create a bottleneck for inbound connections if not properly sized. Load balancers designed for High Availability can also be considered. Apache ZooKeeper is another tool often deployed once for an entire cluster to monitor the health and availability of individual components, or to vote them in or out of operations if problems are detected. With all of these options, remember that sharing infrastructure can reduce overall infrastructure complexity, but at the cost of increased administrative complexity and a greater chance that issues on one ThingWorx system spread to the next.   Bringing it All Together Vertical and Horizontal scale are both effective ways to add more capacity and availability to software implementations, but there are typically some diminishing returns in the investment of additional infrastructure. For example, consider two large, vertically-scaled implementations – one running on a VM with 64 vCPUs and 256 GiB RAM, and another running on a VM with 96 vCPUs and 384 GiB of RAM. While the 96-core server has 1.5 times the compute capacity, in sizing tests with 1 million simulated assets, these two systems tend to fall behind on WebSocket execution at approximately the same point.     
In a horizontal scale example, with two nodes each sized the same (64 vCPU and 256GiB RAM), one would expect High Availability to occur, where one node picks up the other’s slack in a failover scenario. However, what if that singular node can’t handle the entire workload? Should both machines be sized vertically such that either can take on the full load, and if so, then what is the cost-benefit of that? It would be less expensive in this case to have a third server.     Optimizing the deployment architecture for a ThingWorx application will therefore usually involve a blended approach. With more than two nodes, High Availability is more readily obtained, as two servers can almost certainly share the load of the third, failed node. Likewise, some workload aspects do not scale well until multiple additional nodes are added. For instance, spreading out the user load from mashup requests across multiple nodes to give the singleton more resources for the tasks it alone can perform doesn’t have much benefit if there are just two nodes.   However, with horizontal scaling alone, the servers may still need to be vertically scaled larger than is ideal in terms of cost. Each one has to hold the entire Thing Model in memory, which means that sometimes, some of the nodes may be oversized for the tasks actually performed there. Sharding allows for each node to have a different Thing Model as necessary based around what boundaries are selected, which can mean saving on costs by sizing each server only as large as it really needs to be.   So, a combination of approaches is typically the best when it comes to deployment architecture. The key is to break things up as much as possible, but in ways that make sense. 
Determine where the boundaries of the shards will be such that each machine can be as light and focused as possible, while still not introducing more work in terms of user effort (having to access two systems to get the job done), application development (extra code used to maintain multiple systems or exchange information between them), and system administration (monitoring and maintaining multiple enterprise systems).   Find the right balance for your systems, and you’ll maximize your cost-benefit ratio and get the most out of your ThingWorx application. Happy developing!
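The failover question raised above (should each of two nodes be sized to carry the full load, or is a third node cheaper?) comes down to simple arithmetic. The sketch below is a rough illustration only: it assumes the workload divides evenly across surviving nodes and that capacity cost grows linearly, and it ignores the fact that every cluster node must still hold the whole Thing Model in memory. The function names are hypothetical.

```python
def per_node_share_for_failover(nodes, tolerated_failures=1):
    """Fraction of the total workload each node must be able to carry so that
    the surviving nodes can absorb everything when some nodes are down."""
    survivors = nodes - tolerated_failures
    if survivors < 1:
        raise ValueError("need at least one surviving node")
    return 1.0 / survivors


def total_provisioned(nodes, tolerated_failures=1):
    """Total provisioned capacity, as a multiple of the actual workload."""
    return nodes * per_node_share_for_failover(nodes, tolerated_failures)


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, per_node_share_for_failover(n), total_provisioned(n))
```

Under these assumptions, two nodes must each be sized for 100% of the load (200% provisioned in total), while three nodes need only 50% each (150% in total), which is the intuition behind preferring a third server over two oversized ones.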
Thread Safe Coding, Part 2: The Database Locker Approach and Comparison
Written by Desheng Xu and edited by @vtielebein

Overview
This is the second post on this topic, describing an alternate approach to thread safe coding that does not require the Java extension. The demo use case here is the same as in the previous post, and there is a section at the end comparing the two approaches.

Database Locker for Thread Safe Coding
The database locker is an advanced topic, so some experience with the database thing is assumed. The following steps demonstrate how to be thread safe with a database thing.

Create New Database Instance, and New Table for counter
It is strongly recommended that a new database instance be created outside of the ThingWorx database schema. This guide will NOT include instructions to create the new database instance. Use the following SQL commands to create a new table:

    DROP TABLE IF EXISTS counters;
    CREATE TABLE counters (
        name VARCHAR(100) UNIQUE,
        value INTEGER NULL,
        PRIMARY KEY (name)
    );
    INSERT INTO counters VALUES ('DemoCounter', 0);

This will create a new table called counters, initializing the first counter, called DemoCounter, with the value 0.
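Before adding the locking function, it may help to see why a counter needs a lock at all. The Python sketch below is only an analogy, not ThingWorx or PostgreSQL code, and its function names are hypothetical. The first function replays, step by step, the interleaving in which two callers both read the counter before either writes, so one increment is lost. The second serialises the read-increment-write sequence behind a lock, which is the role the database lock plays in the steps that follow.

```python
import threading


def unsafe_two_callers(start=0):
    """Replay a lost update: both callers read before either one writes."""
    value = start
    read_a = value        # caller A reads the counter
    read_b = value        # caller B reads the same value (A has not written yet)
    value = read_a + 1    # A writes start + 1
    value = read_b + 1    # B writes start + 1 again, overwriting A's update
    return value


def locked_increments(workers=50, per_worker=20):
    """Run concurrent increments with the read-modify-write guarded by a lock."""
    value = 0
    lock = threading.Lock()

    def work():
        nonlocal value
        for _ in range(per_worker):
            with lock:            # only one thread at a time may read and write
                value += 1

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return value
```

Two unguarded increments leave the counter at 1 rather than 2, while the locked version always ends at `workers * per_worker`.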
Create a Function to Increase and Return the New counter Value
Use the following sample code to create a table lock function:

    CREATE OR REPLACE FUNCTION IncreaseCounter(counter_name VARCHAR(100), OUT newvalue INTEGER) AS $$
    BEGIN
        -- ACCESS EXCLUSIVE blocks all other access to the table until the transaction ends
        LOCK TABLE counters IN ACCESS EXCLUSIVE MODE;
        SELECT (SELECT value FROM counters WHERE name = $1) + 1 INTO newvalue;
        UPDATE counters SET value = newvalue WHERE name = $1;
    END;
    $$ LANGUAGE plpgsql;

Or use the following SQL command to create a new row level locker function:

    CREATE OR REPLACE FUNCTION IncreaseCounter(counter_name VARCHAR(100), OUT newvalue INTEGER) AS $$
    BEGIN
        -- FOR UPDATE locks only the matching row until the transaction ends
        SELECT value INTO newvalue FROM counters WHERE name = $1 FOR UPDATE;
        newvalue := newvalue + 1;
        UPDATE counters SET value = newvalue WHERE name = $1;
    END;
    $$ LANGUAGE plpgsql;

Create a Database Thing
Create a thing with the template "database" within ThingWorx, and use the PostgreSQL Driver to connect to the new database instance created above.

Create New Services in the Database Thing
The service IncreaseCounterDB would be a SQL Query service:

    SELECT * FROM public.IncreaseCounter([[counter_name]]);

counter_name would be the input parameter, a STRING which is marked as required. The service GetCounterDB would be another SQL Query service:

    SELECT value FROM public.counters WHERE name=[[counter_name]] LIMIT 1;

counter_name would be another input parameter, a STRING which is also marked as required. The service ResetCounterDB would be a SQL Command service:

    UPDATE public.counters SET value = 0 WHERE name=[[counter_name]];

counter_name is yet another input parameter, also a STRING and also required.

Wrap the Database Thing Service
The above database thing service will return an InfoTable, but not an integer. If it's inconvenient to use an InfoTable, wrap the service up into a local JavaScript service and return an integer value.
The service IncreaseCounter wraps IncreaseCounterDB and returns an integer value:

    // result: INFOTABLE dataShape: ""
    var query_result = me.IncreaseCounterDB({
        counter_name: 'DemoCounter' /* STRING */
    });
    var result = query_result.rows[0]["newvalue"];

Similarly, wrap GetCounterDB into GetCounter:

    // result: INFOTABLE dataShape: "SingleIntegerDatashape"
    var query_result = me.GetCounterDB({
        counter_name: 'DemoCounter' /* STRING */
    });
    var result = query_result.rows[0]["value"];

And ResetCounterDB into ResetCounter:

    // result: NUMBER
    var query_result = me.ResetCounterDB({
        counter_name: 'DemoCounter' /* STRING */
    });
    var result = 0;

Run the Test Again
If necessary, head back to the previous post to obtain the tool. Then just change the endpoint and run a new test:

    {
        "host":"",
        "port":443,
        "protocol":"https",
        "endpoint":"/Thingworx/Things/DatabaseDemo/services/IncreaseCounter",
        "headers":{
            "Content-Type":"application/json",
            "Accept": "application/json",
            "AppKey":"5cafe6eb-adba-41df-a7d6-4fc8088125c1"
        },
        "payload":{},
        "round_break":50000,
        "req_break":0,
        "round_size":50,
        "total_round":20
    }

Validate the Result
Run the test, then execute the service GetCounter to validate the result.

Overall Performance Comparison
The Java Extension performance looks the best here, but the database row lock will perform better if there are multiple counters.

InfoTable Type Property
InfoTable properties have the same thread-safe challenges discussed previously, but they also have some additional challenges due to the way data change events are triggered. This is outside of the scope of this document, but it is worth a very brief mention here.

In general, the data change event for an InfoTable fires when the reference to the table is updated, and not the contents of the table. If the values of an InfoTable are updated directly, say by adding or removing a row, then the data change event will not be triggered because the value has technically not changed. 
Instead, the InfoTable has to be cloned, then modified, and then assigned back to the Thing so that the reference changes as well. Such additional considerations must be made when using other property types than those shown here. 
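The clone, modify, and assign-back pattern described above can be pictured with a small Python analogy. This is not ThingWorx code and does not use any ThingWorx API: the `Thing` class, its `table` property, and the `events` counter are hypothetical stand-ins for a property whose change event fires only on assignment.

```python
import copy


class Thing:
    """Toy stand-in: the change event fires only when the property is assigned."""

    def __init__(self, rows):
        self._table = rows
        self.events = 0          # number of change events fired so far

    @property
    def table(self):
        return self._table

    @table.setter
    def table(self, new_rows):
        self._table = new_rows
        self.events += 1         # assignment replaces the reference: fire event


thing = Thing([{"id": 1}])

# In-place edit: the row is added, but no assignment happens, so no event fires.
thing.table.append({"id": 2})
in_place_events = thing.events

# Clone, modify the clone, then assign it back so the reference changes.
updated = copy.deepcopy(thing.table)
updated.append({"id": 3})
thing.table = updated
```

After the in-place edit no event has fired, while the clone-and-assign step fires exactly one, mirroring the behaviour the paragraph above describes for InfoTable properties.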
Announcing the Final Installment   JMeter for ThingWorx, the Comprehensive Guide and Best Practice Tips This is the final post on using JMeter for ThingWorx. Below there are best practice tips for using JMeter and for load testing in general. Attached to this post is a comprehensive guide including all of the information from every post we've made on JMeter, including the tutorials. For a more central source, feel free to download the guide , or see the past posts here: JMeter for ThingWorx (original post) Building More Complex Tests in JMeter Distributed Testing with JMeter Generating and Reviewing JMeter Results   JMeter Best Practice Tips Use Distributed Testing As already mentioned in a previous post, each JMeter client can only handle about 150-250 threads depending on the complexity of the tests, and each client will need around 1 CPU and up to 8 GB of RAM for the Java heap. Some test plans will run with fewer host resources, so resizing the test client VM up or down is often required during test development. Create a batch or shell script to start the multiple JMeter clients for greater ease of use. Use Non-Graphical Mode Non-graphical mode allows the system to scale up higher; client processing uses up resources just to keep the simulation running, but with graphical mode turned off, there is less of an impact on the response times and other results. Graphical mode is essentially only used for debugging. Turn off Embedded Resources This setting reloads all of the typically cached requests over and over; there will be far more download requests, and to the exclusion of other requests, than is helpful. Ensure this box is not checked, especially in the HTTP Requests Defaults element:   Browser caching means that this setting doesn’t actually simulate a proper user load, given that many of the reloaded resources would not be reloaded by actual users. 
Use this incrementally, for one or two HTTP requests only, if there is a reason why those requests might need to download fresh images, scripts, or other resources with each call; for instance, simulate page timeouts using this once per hour or something similar. Using this across the whole project will prevent it from scaling well, while not actually simulating real-world conditions. Avoid Using Listeners Listeners such as the “View Results Tree” use additional resources that may skew the results in misleading ways, reflecting the needs of the clients themselves rather than the actual response times of the server. Many listeners are only for debugging a handful of threads while designing the tests. A list of recommended listeners for different purposes is in the JMeter documentation. Summary Report is the only one you want enabled, as that exports the results as a CSV or similarly formatted file, which can then be used to build reports. JMeter CAN handle SSO JMeter can authenticate into and test an SSO-enabled system. Sometimes the SSO configuration is essential for customers, and they may be quick to assume therefore that they cannot use JMeter, but that's not entirely true. Some external tools that might help with this are BlazeMeter (mentioned again in just a moment) and Fiddler, a good tool for decoding what data a particular SSO setup is exchanging during the authentication process. Use Logic Controllers for Parametrization Parametrization is critical to mirroring a proper user load, and allowing different data sets to be queried or created; the load should seem organic, random in the right ways, with actions occurring at random times, not predictable times, to prevent seeing artificial peaks of usage that don’t represent real usage of the Foundation server. 
Random order controllers direct the threads down different paths based on random dice rolls, allowing for a randomized collection of user activity each time, not something that has to be regenerated like a set of Boolean values that is specified in an input CSV and used to navigate a series of true or false switches. Switches just look for an environment variable to be either 1 or 0, and when a thread hits a switch that’s a 1, it triggers the switch below, running them in the order given under the transaction controller that goes with the switch. In this image, the 1’s and 0’s are given in the CSV input file; randomizing that input file therefore randomizes the execution of the switches too:   Use Commercial Add-Ons There are many external, add-on tools and plugins which enhance JMeter’s capabilities. One such tool is BlazeMeter, which has some free and some paid options to help create better reports, automatically removing many of the “garbage” REST calls (which would otherwise need to be manually deleted), and providing more consumable test reports right out of the box. Other tools and plugins include: Maven, NetBeans, SonarQube, Jenkins, Autometer, Gradle, Amazon EC2, Lightning, IntelliJ IDEA, Cassandra, and Grafana. For more best practice information, see the JMeter Best Practice Manual.   General Load Testing Guidelines Concurrency Requirements – How to Properly Estimate the Size of the Load Test Take a brand new ThingWorx-based app. How will people be accessing the system, and how often? How many are business users? How many are engineers? What do they do? Many assume that every named user in the corporate LDAP will need access to the server, often 10s of thousands of users; this generally drastically oversizes the system. 
Load testing for many thousands of users is very hard and requires a lot of set-up, tuning, and optimization to get right; so if it seems that thousands of users are expected, then validating this claim is important: most customers don’t really have that many concurrent users in an engineering system. Use estimates based on how many people work at which offices, which time zones those people are in, and what kinds of users they are. Do they need access to engineering data? Perhaps there are simpler mashups for them that use fewer resources. One tool for these sorts of estimations that PTC offers is the office time zone overlap Windchill Sizing Calculator (shown here). Other ways to estimate include: Analyzing the business processes, things like how long workloads typically take to complete and how many workloads are generated per day, converted into hour, minute, or second as desired for the peak duration, the length of the test. “Day in the Life” modelling, or considering things like “what does user X do in a day?” Maybe user X checks out some drawings, edits them, and then checks them back in at 4:30. Maybe user Y actually digs into the underlying parts and assemblies, putting in change requests or orders throughout the day, instead of waiting for the end. Models are made based around the types of users. Also consider: What are worst case scenarios? What are the longest running activities? What produces the largest data transfers? What activities have large, heavy database queries? When is the peak overlap of usage? Beginning and end of day downloads and check-ins? Reports that are generated regularly? How do these impact the foreground users? For a simpler estimate, start with a percentage of the named user count; anywhere from 5-15% is a good ballpark percentage. 
Don’t overestimate to feel like the application has been financially worth it; even if everyone is logged in and using it all at once, which is unlikely, load testing for every single user doesn’t take into account the fact that people pause in between clicking on things to think, type emails, get coffee, and so forth. Fewer people than expected are actually doing concurrent activities like loading web pages and updating data streams. Whenever possible, use concurrency data from existing customer systems to guide the estimate for the new system. Legacy systems are great places to start.   Use Grafana to monitor the system side throughout the load test, which is also required to know the test has been successful; also set up Grafana to monitor the application once it goes live, to both prevent and mitigate more rapidly any technical issues with the server. Also remember that PTC Technical Support is here to help! Provide thread dumps with an open case to any TSE, and they will help troubleshoot the tests and review any errors in the ThingWorx or Tomcat logs.    
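The two rules of thumb given in this post (5-15% of named users as a concurrency ballpark, and roughly 150-250 threads per JMeter client) can be combined into a quick back-of-the-envelope calculation. The sketch below is only an illustration: the function names are hypothetical, and the default of 150 threads per client is simply the conservative end of the range quoted earlier.

```python
def estimate_concurrent_users(named_users, low_pct=5, high_pct=15):
    """Return (low, high) concurrent-user estimates, rounded up to whole users."""
    low = (named_users * low_pct + 99) // 100     # integer ceiling of pct%
    high = (named_users * high_pct + 99) // 100
    return low, high


def jmeter_clients_needed(concurrent_users, threads_per_client=150):
    """Number of JMeter clients needed if each can drive threads_per_client threads."""
    return (concurrent_users + threads_per_client - 1) // threads_per_client


if __name__ == "__main__":
    low, high = estimate_concurrent_users(20000)
    print(low, high, jmeter_clients_needed(high))
```

For 20,000 named users this gives an estimate of 1,000 to 3,000 concurrent users, and about 20 JMeter clients for the upper bound at 150 threads each. Real concurrency data from an existing system should replace these ballpark figures wherever it is available.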
New Scenario Using Multi-Kepware for Asset Monitoring in Connected Factories   A new scenario has been completed for Connected Factory implementations, furthering the IOT EDC's goal of providing a reference library of ThingWorx performance. This scenario builds upon the first, with additional tests being performed to demonstrate the capabilities of multiple Kepware Servers running side-by-side. Horizontal scaling is very common for multi-line factory implementations, so be sure to check out the new scenario in this ever-expanding benchmark document.   Note that tests below 10,000 writes per second were not repeated with multiple Kepware Servers, since there is little reason to desire such a configuration in implementations that small. ThingWorx deployment sizing was also held constant throughout these tests to demonstrate the limits of a given configuration. Changes that may improve the results of a failed test (such as adding CPUs or Memory) will be mentioned but not validated as part of this benchmark.   Let us know about your applications and how they compare with the data shared here. Happy developing!
ThingWorx DevOps with Jenkins DevOps as a topic is vast and has been addressed many times throughout the history of the PTC Community. Previous posts address what DevOps is, teach how to make use of DevOps like a pro, announce updates to the PTC Git Extension, and explain why this extension is so helpful to achieving continuous Git integration with ThingWorx.   This post provides a PDF guide on Jenkins integration with ThingWorx, including tutorials with detailed information on how to set up your ThingWorx instance and how to configure your Jenkins Pipeline. The PDF is listed for download separately, but it is also included in the zip with the other required files for the tutorial. The Jenkins Pipeline provided here is intended as an example / starting point for managing your DevOps in ThingWorx and can easily be extended. Please note that this Pipeline is not officially supported by PTC. 
Building More Complex Tests in JMeter Overview This is the second in a series of articles which help inform how to do user load testing in ThingWorx. This article picks up where the previous left off, continuing with the project created there. The screenshots do appear a little differently here because a new “Look and Feel” was selected for the JMeter application (switched from “Metal” to “Windows Classic”) to provide more readable screenshots. In this guide, we are going to make the very simple project more complex, working towards a better representation of a real load test. The steps below walk you through how to create and configure thread groups and parameterize the processes and procedures defined by each thread group.   Adding More Thread Groups Within JMeter, thread groups are used to organize the HTTP requests in a test into various processes or procedures, such that different mashups (and all of the HTTP requests required on each) or processes can be executed simultaneously by different thread groups throughout the test. Varying the number of threads in a group is how to vary the number of users accessing that mashup during the test, a number which increases over time in accordance with the ramp up time. The thread group name will also show up in the Summary Report tab at the end of the test, making it easier to parse through and graph the results. Start by renaming the existing thread groups so that their process or procedure names are recognizable at the end of the test: Highlight the line which reads “HTTP(S) Test Script Recorder”. (Optional) Add an Include filter to only capture the URLs relevant to your application using the Requests Filtering tab. For example, with the escape character \ necessary for ‘.’, myhost.mycompany.mydomain becomes: myhost\.mycompany\.mydomain Now record a new thread group clicking the “Start” button: Once the control box shows up in the top left corner, click to open a browser and access the ThingWorx Navigate application. 
Then click on “View Parts List” or some other mashup: Once the mashup loads, search using a string and/or wildcard, or click on one of the recent results if any exist: Wait for the mashup to fully load with the details on that part or assembly, and then click “Stop” in the recording controller window: All of the HTTP requests performed in the process of loading and using this mashup will be added to the JMeter project here: Next, add a new thread group manually to the project: Highlight the newly created “Thread Group” (default name) and rename it to something that relates to the nature of that process: Drag and drop the new collection of requests so that it is considered a part of the new thread group: Then drag the whole group up so that it is next to the other thread groups in the test: In more complex projects, different thread groups may be added at different times, and each time, the service calls are all assigned an index (at the end of the request URL, for example: <request>-344). These indexes may not be unique depending on how and when the thread groups were created, especially in more complex tests. The easiest way to fix this issue is save the test from the JMeter GUI, then open the JMX file in a text editor and perform a find and replace within the relevant section of text.   This is usually done using a regular expression for the number. For example, if the request name indexes are numbered -500 through -525, a regular expression to increase them to -700 through -725 would be (in Notepad++): Find: -5([0-9])([0-9]) Replace: -7\1\2 Note that if you do not use a Request Filter, sometimes the recorder will log URLs that are not part of the target application, like these “generate” samplers. These URLs are typically happening in the background of the browser to track performance, security and errors. These can be deleted: At other times, you will be repeating steps that are already part of another thread group, for example: logging in. 
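The index find-and-replace described above can also be scripted rather than done by hand in a text editor. The following is a minimal sketch that applies the same pattern to JMX text in Python. The sample request names are hypothetical, and the trailing lookahead is an addition not present in the Notepad++ pattern; it keeps longer numbers such as -5000 from being touched. Note that the pattern covers the whole -500 to -599 range, not only -500 through -525.

```python
import re

# Hypothetical excerpt of request names as they might appear in a JMX file.
SAMPLE_JMX_TEXT = 'testname="GetPartsList-500" testname="GetPartStructure-525"'


def renumber_indexes(text: str) -> str:
    """Shift request-name indexes in the -5xx range to the -7xx range."""
    # Same find/replace as the Notepad++ example, plus a lookahead (an
    # assumption added here) so that "-5000" is not partially rewritten.
    return re.sub(r"-5([0-9])([0-9])(?![0-9])", r"-7\1\2", text)


if __name__ == "__main__":
    print(renumber_indexes(SAMPLE_JMX_TEXT))
```

Work on a copy of the JMX file and reopen the result in the JMeter GUI to confirm that the test plan still loads.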
This genidkey is a part of the login, as you can see if you look back at the login thread group. Because logging in is only necessary once, and it is assumed to be complete by the time the test starts on the second thread group, this entire section can be deleted. To see for sure if a request can be removed because it is called in a previous thread group, do a non-case-sensitive search for the name of the request. All of the requests found in this particular instance were performed in the previous thread group, so this entire category can also be deleted. Another odd thing you may see (if you do not use a Request Filter with the recording feature) are “blank” requests. The recorder isn’t sure what to call these “non-requests”, so anything like this that isn’t an actual URL within the target application should be deleted. Static downloads should be disabled or deleted from scale testing since they are usually cached by the user's browser. In this ThingWorx example, there are static “MediaEntities” which can be deleted or disabled. Unfortunately, there is no good way within the JMeter client to highlight and reset them all at once. The easiest way to disable all of these at once is to open the JMX file in a text editor and use a regular expression search and replace to change “enabled=true” to “enabled=false” on the relevant requests. Most text editors have examples on how to use regular expressions within their Help topics. The above example is for Notepad++.
Parameterize Thread Groups
Parameterization is usually the part of creating a JMeter test that takes the most effort and knowledge. Some requests will require the same information for every thread, information which can therefore be defined statically within the JMeter element rather than being parameterized. Some values used within the JMeter test script can be parameterized as inputs in the top level of the test controller, for example: Duration, RampUp time, ApplicationHost, ApplicationPort.   
Other values may be unique to only one thread group and could be defined in a User Defined Variables element within that group controller. The value(s) used within a request can also be determined on the fly by the results of earlier requests within a thread group. These request results typically must be post-processed and parameterized for later thread elements to function correctly.   The highest level values that are unique to each thread should be inputs from a CSV file that are passed into the threads as parameters, for example Username and Password. Data used within the test is usually parameterized in order to better emulate real world application use by multiple users. In the following example, we will parameterize the number of users for each thread group by adding a user-defined variable.   Start by selecting the new thread group and parameterizing the number of threads (i.e. the number of users accessing this mashup at a time during the test). JMeter variables are referenced with a dollar sign and curly braces, like this: ${searchandviewpartstructure_threads} In this case, make this a user-defined variable within the thread group, or a variable for the whole project, by highlighting “Test Plan” and adding the information there. Begin looking at the samplers to see what types of things need to be parameterized in your test. Consider such things as: thread count (as shown above), ramp up time (also depicted above), duration, timings, roles, URL arguments, info table information, search result information, etc.   
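The per-thread CSV inputs mentioned above (for example Username and Password) can be generated by a small script instead of being typed by hand. The following is a minimal sketch in Python. The account prefix and password format are purely hypothetical, and the accounts themselves would still need to exist in the application under test. The header row is included on the assumption that the CSV Data Set Config element either lists the same variable names or is left to read them from the first line.

```python
import csv
import io


def build_user_csv(user_count: int, prefix: str = "loadtest_user") -> str:
    """Return CSV text with one Username/Password row per simulated user."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Column names match the variables the test would reference,
    # e.g. ${Username} and ${Password}.
    writer.writerow(["Username", "Password"])
    for i in range(1, user_count + 1):
        # Hypothetical naming scheme; replace with real test accounts.
        writer.writerow([f"{prefix}{i:03d}", f"ChangeMe-{i:03d}"])
    return buffer.getvalue()


if __name__ == "__main__":
    # Print three rows; redirect to a file to feed the CSV Data Set.
    print(build_user_csv(3), end="")
```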
Another example here parameterizes the search parameters for a query by adding an overall search string column to a CSV file (which can then be randomly generated by some other script). First, parameterize the body data of the request by highlighting the request, and changing the value of the desired field to something like this: ${searchString} Next, define the parameter under the Test Plan and set a default value. Now define the searchString column again as part of the CSV Data Set. Now it can be varied simply by providing different pseudo-random values with wildcards and/or known values in the CSV file.
Post Processors and Extractors
Most JMeter load tests become more complex when the results of one request are sent as parameters into later requests. This is done in JMeter by using Post Processors (Extractors), tools which facilitate extracting information out of the request results so it can be assigned to JMeter parameters. There are many different types of extractors which can process the results of previous requests:
CSS Selector Extractor – commonly used extractor for values returned as HTML attributes
JSON Extractor – processes JSON objects using JSON Path expressions
BeanShell Post Processor – facilitates using code scripts to process return text when needed
Regular Expression Extractor – applies regular expressions to request results
The JSON Extractor can be used to find and store information like the partOID number for a Windchill part as a parameter in JMeter, which can then be used to build more realistic workflows within the JMeter test. The example below steps you through setting up a JSON Post Processor.   Start by right-clicking the request that contains the results of our search. Then click “Add” > “Post Processors” > “JSON Extractor”. The extractor will now show up under that request as a sub-menu item. Select it, and name the variable something easy to reference. 
For the JSON Path expressions, pull the object number or some other identifying characteristic out of the search results: $.rows[0].objNumber for example. Another option would be to take information like the partOID number and send it into the search string field, by defining both as properties and having one refer to the other. To pull the partOID out, use a Regular Expression Extractor. Another thing to parameterize is the summary report result file name. Adding in the number of users and ramp up time makes the files stored on your machine easier to reference later. We will cover generating and reviewing Summary Reports in full in the next article in this series.
Conclusion
In this article we saw how to create new thread groups, remove extraneous requests from those groups, and reduce the overall ambiguity of which thread groups represent which processes or mashup calls. We also covered how to parameterize the individual requests as well as the summary report. Note that things like the Windchill URL and hostname, search parameters and part IDs, timings, durations, offsets, and anything else that influences the results of the test should not be hard-coded. It is better to create variables for these things to ensure that all of the various simulated activities are configured in the exact same way every time. That way, the system can be tested again and again under various strains and loads until the capabilities of the application are verified.
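To make the JSON Path expression $.rows[0].objNumber from the extractor steps more concrete, the following minimal Python sketch performs the equivalent lookup on a made-up response body. The sample JSON, its field values, and the NOT_FOUND default are all assumptions for illustration; a real search result will contain many more fields, and JMeter performs this lookup itself inside the JSON Extractor.

```python
import json

# Hypothetical search response; real ThingWorx/Windchill results differ.
SAMPLE_RESPONSE = """
{"rows": [
  {"objNumber": "0000000123", "name": "Example Assembly"},
  {"objNumber": "0000000456", "name": "Example Part"}
]}
"""


def first_obj_number(response_text: str, default: str = "NOT_FOUND") -> str:
    """Equivalent of the JSON Path $.rows[0].objNumber, with a fallback."""
    data = json.loads(response_text)
    rows = data.get("rows") or []
    if not rows:
        # Mirrors the idea of a default value when nothing matches.
        return default
    return rows[0].get("objNumber", default)


if __name__ == "__main__":
    print(first_obj_number(SAMPLE_RESPONSE))
```

Checking an expression against a saved response in this way can help confirm that it selects the intended field before the test is run at scale.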
Load Testing through Remote Device Simulation   Designing an enterprise-ready application requires extensive testing and quality assurance. This includes all sorts of tests, of course, from examining the user interface for flaws to verifying there is correct logic in all background services. However, no area of testing is more important than scalability. Load testing is how to test the application to ensure it still functions as desired when remote things are connected and streaming information to the Platform.   Load testing is considered a critical component of the change management process. It is mentioned numerous times throughout PTC best practice documentation. This tutorial will step you through designing a load test using Kepware as a simulator. Kepware is free to download and use in short demos, making it the perfect tool for this type of test.   Start by acquiring the latest version of Kepware from the download site. Click “Download Free Demo” if a license was not included in your PTC product package. The installation of Kepware is simple, and for details, see the Kepware Installation Guide. The tutorial shown here uses Kepware version 6.7 and ThingWorx version 8.4.4. Given that we are testing a ThingWorx application, this tutorial assumes ThingWorx is already installed and configured correctly.   Once Kepware is installed, follow these steps: (This tutorial was developed by Desheng Xu and edited by Victoria Tielebein. 
Exact specifications of the equipment used in both large scale and local tests are given in step VI, which discusses the size of the simulation)   Understand how to configure Kepware as a simulator Go to the Help menu within Kepware, and click on “Driver Help” Select “Simulator” in the pop-up window, and click “OK” Expand “Address Descriptions” and then “Simulation Functions” Select “Ramp Function” to review details about the function needed for this tutorial, as well as information about function syntax Close the window once this information has been reviewed Create a new project in Kepware Click “File” > “New” In case you are connected to runtime, Kepware will allow you to choose to edit this project offline Add a channel in Kepware Channels represent threads which Kepware will use to contact ThingWorx Under “Connectivity”, click “Click to add a channel.” From the drop-down list, select “Simulator” Use all the default settings, selecting “Next” all the way down to “Finish” Next, add one device to the channel Highlight the new channel and click “Click to add a device” (which will appear in the center of the screen) Once again, use the default settings, selecting “Next” all the way down to “Finish” Add a tag to this device Within Kepware, tags represent properties which bind to remote things on the Platform and update with new information over time. Each device will need several tags to simulate remote property updates. The easiest way to add many tags for testing is to create one, and then copy and paste it. 
Highlight the device created in the previous step and click “Click to add static tag”, which appears in the center of the screen For “name” type “tag1” For Address, enter the Ramp function: RAMP(1000,1,2000000,1) The first parameter is the update rate given in milliseconds The next two parameters are the range of values which can be sent The last parameter is the increment or step Together this means that every 1 second, this tag will send a new value that is 1 higher than the previous value to the Platform, starting at 1 and ending at 2 million Ensure the Data Type is given as “DWord” or any type which will be read as a “Number” (and NOT an “Integer”) on the Platform Change the Scan Rate to 250 Then click “OK” Add more devices to the test The most basic set-up is now done: if this project connected to the Platform, one remote thing with one remote property could be used to simulate property updates. That is not very useful for load testing, however. We need many more things than this, and many more properties. The number of tags on each device should match the expected number of remote properties in the application itself. The number of devices in each channel should be large enough that when more channels are created, the number of total devices is close to the target for the application. For example, to simulate 10,000 things, each with 25 remote properties, we need 25 tags per device, 200 devices per channel, and 50 channels. This would require a lot of memory to run and should not be attempted on a local machine. A full test of 40 channels each with 10 devices was performed as shown in the screenshots here. This simulates 10,000 writes per second to the Platform total, or about 400 remote device connections. 
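The sizing arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function name is hypothetical, and the figure of 25 tags per device for the 40-channel test is inferred from the stated totals (10,000 writes per second across 400 devices at one update per second) rather than given explicitly.

```python
def simulated_load(channels: int, devices_per_channel: int,
                   tags_per_device: int, update_ms: int) -> tuple:
    """Return (total devices, expected writes per second) for a project."""
    devices = channels * devices_per_channel
    # Each tag sends one update every update_ms milliseconds.
    writes_per_second = devices * tags_per_device * 1000 / update_ms
    return devices, writes_per_second


if __name__ == "__main__":
    # The test described above: 40 channels x 10 devices, RAMP rate of 1000 ms.
    print(simulated_load(40, 10, 25, 1000))
    # The larger example: 50 channels x 200 devices with 25 tags each.
    print(simulated_load(50, 200, 25, 1000))
```

Running the numbers before building the project helps decide whether the Kepware host has enough memory for the planned number of channels and devices.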
This test used the following hardware specifications:
Kepware machine running Windows 2016 64-bit, 2 cores, 8 GB
ThingWorx Platform machine running Ubuntu 16.04, 4 cores, 16 GB
PostgreSQL 9.6 machine running Ubuntu 16.04, 4 cores, 16 GB
Influx 1.6.3 machine running Ubuntu 16.04, 4 cores, 16 GB
A local test was also run on Windows 10 (64-bit), using the H2 database, with Kepware and ThingWorx running side by side on the machine, 4 cores, 16 GB. This test made use of only 2 channels, with 10 devices each. For local tests to see how the simulation works, this is fine, but a more robust set-up like the above will be needed in a true load test. If there is not enough memory on the machine hosting Kepware, errors like this will appear in the Kepware logs: One or more value change updates lost due to insufficient space in the connection buffer. Once you decide on the number of tags and devices needed, follow the steps below to add them.
To add more tags, copy and paste the existing tag (ctrl+c and ctrl+v work in Kepware for convenience) until there are as many tags as desired
To add more devices, highlight the device in Kepware and copy and paste it as well (click on the channel before pasting)
Then, copy and paste the entire channel until the number of channels, devices, and tags totals the desired load (be sure to click on “Connectivity” before hitting paste this time)
Configure the ThingWorx connection
Right click on Project in the left-hand navigation bar and in the pop-up window that appears, highlight ThingWorx
Change the “Enable” field to “Yes” to activate the other fields
Fill in the details for “Host”, “Port”, “Application Key”, and “Thing name”
Note that the application key will need to be created in ThingWorx and then the value copied in here
The certificate and encryption settings may also need to be adjusted to match your environment
For local set-ups, it is likely that self-signed and all certificates will need to be accepted, so both of those fields will likely need 
to be set to “Yes” (Encryption may need to be disabled as well). In production systems, this should not be the case  Save the project It doesn’t matter too much if this project is saved as encrypted or not, so either enter a password to encrypt the save or select “No encryption” Connect to ThingWorx Click “Runtime” > “Connect…” A pop-up will appear asking if you want to load this project, click “Yes” The connection status should then appear in the bottom portion of the window where the logs are displayed Configure in ThingWorx Login to the ThingWorx Platform Under “Industrial Connections” a thing should appear which is named as indicated in the Kepware configuration step above Click to open this thing and save it Also create a new thing, a value stream for ingesting data from Kepware Create remote things in ThingWorx Import the provided entity into ThingWorx (should appear as a downloadable attachment to this post) Open the KepwareUtil thing and go to the services tab Run the AutoKepwareCreate service to generate remote things on the Platform Give the name of the stream created above so each thing has a place to store property information The IgnoreTemplate flag should be set to false. This allows for the service to create a thing template first, which is then passed to the remote devices. The only reason this would be set to true is if the devices need to be deleted and recreated, but the template does not (then set the flag to true). To delete the devices, use the AutoKepwareDelete service also provided on the KepwareUtil thing Note that the AutoKepwareCreate service is asynchronous, so once it is executed, close the window and check the script logs to see when it completes. The logs will look like: KepwareUtil AutoKepwareCreate task finished!!! 
Check status of remote things Once the things are created, they should automatically connect to the Platform Run the TotalDeviceByTemplateWithTemplate service to see if the things are connected The template given here could be the one created by the AutoKepwareCreate service, or just give it RemoteThings if this is a small local set-up without many remote things on it The number of devices will equal the number of devices per channel times the total number of channels, which in the test shown here, is 400 isConnected will be checked if all of the devices are connected without issue If some of them are not connected, verify in the logs if there are any errors and resolve those before moving on View Ingestion Rate Once the devices are created, their tags should show as numbers (NOT integers), and they should already be updating with new values every second To view the ingestion rate, run the KepwareUtil service AutoKepwareRateSummary Give the thing template name that is created by the AutoKepwareCreate service, which will look like the name of the Kepware thing itself with a “T-“ in the front The start time should be close to the current time, and the periodInMinutes should be large enough to include some of the test (periodInMinutes is used to calculate the end time within the service) Note in the results here that the Average Write Per Second is only 9975 wps, which is close but not exactly what we would expect. This means that there are properties not updating correctly, which requires us to look at the logs and restart some things. If nothing shows up here, despite the Total Connected Things showing correctly, then look at the type of the tags on one of the remote devices. The type must be NUMBER for the query within this service to work, and not INTEGER. If the type of the tags is incorrect, then the type of the tags within Kepware was probably given as something which is not interpreted as a number in ThingWorx. 
Ensure DWord is used for the tags in Kepware Within the script log, look for any devices which show errors as seen in the image below and restart them to get their properties updating correctly Once the ingestion rate equals what is expected (in the case of the test here, 10,000 wps), use the AutoKepwareIngestionStat service on the KepwareUtil thing to see details about each remote device The TimeGapAvg in this service represents the gap between two ingestions in milliseconds, showing any lag that may be present between Kepware and ThingWorx The TimeGapSTD shows the standard deviation of the time gap between two ingestions on any given thing, also indicating lag (the lower this number, the better) The StartTime and EndTime show the first and last timestamp observed in the ThingWorx database during the given duration The totalCount shows the total number of ingested records during the sampling cycle The StartValue and EndValue fields show the first and last value ingested into the tag during the given duration If the ingestion rate is working as expected, and the ramp function is actually sending an update on time (in this case, once each second), then the difference between the EndValue and StartValue should always be equal to the totalCount plus 1. If this doesn’t match up, then there may be data loss or something else wrong with the property updates, which will show as a checked box in the valueException column. It is not enough to ensure that the ingestion rate is correct, as sometimes the rate may fluctuate only by 1 or 2 wps and appear perfect, even while some data is lost. That is why it is important to ensure that there are no valueException boxes showing as checked in the test of the application. If none of these are marked as having failed, then the test was successful and this ingestion rate is acceptable for the application   This tutorial is a very basic way to simulate many remote devices ingesting data into the Platform. 
For this to be a true test of the application, the remote things created in this test will need to be given business logic tasks as well. The AutoKepwareCreate service can be modified to give any template (and not just RemoteThing) to the thing template which is created and subsequently passed into the demo devices. Likewise, the template itself can be created, and then manually modified to look like the actual remote device template in the application, before the rest of the things are created (using the IgnoreTemplate flag in the creation and deletion services, as discussed above).   Ensure that events are triggered as expected and that subscriptions to property updates are in place on the thing template before creating the demo things. Make use of the subsystem monitor to ensure that the event, value stream, and stream queues do not grow so large that the Platform cannot keep up with the requests (for details about tuning the stream and value stream processing subsystems, see PTC’s best practice documentation). Also be sure to load some of the mashups to see how they perform while the ingestion test is happening. This will test whether or not the ingestion rate and business logic of the application can function side by side without errors, data loss, or performance issues.
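As a rough illustration of what the TimeGapAvg and TimeGapSTD figures discussed in the tutorial represent, the following Python sketch computes the average and standard deviation of the gaps between consecutive ingestion timestamps. It is not the implementation of the AutoKepwareIngestionStat service; the function name is hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def gap_stats(timestamps_ms):
    """Return (average gap, standard deviation of gaps) in milliseconds.

    timestamps_ms is a sorted list of ingestion times for one property.
    Returns None when fewer than two timestamps are available.
    """
    gaps = [later - earlier
            for earlier, later in zip(timestamps_ms, timestamps_ms[1:])]
    if not gaps:
        return None
    # Population standard deviation is an assumption made for this sketch.
    return mean(gaps), pstdev(gaps)


if __name__ == "__main__":
    # A steady one-second update rate gives an average gap of 1000 ms
    # with no deviation.
    print(gap_stats([0, 1000, 2000, 3000]))
    # A delayed update shows up as a larger average and a non-zero deviation.
    print(gap_stats([0, 1000, 3000]))
```

With a one-second ramp function, an average gap well above 1000 ms or a growing deviation suggests lag between Kepware and ThingWorx.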
User Load Testing in ThingWorx Java Client Tutorial Written by Tori Firewind, IoT EDC   Introduction As stated in previous posts, user load testing is a critical component of ensuring a ThingWorx solution is Enterprise-ready. Even a sturdy new feature that seems to function well in development can run into issues once larger loads are thrown into the mix. That's why no piece of code should be considered production-ready until it has undergone not just unit and integration testing (detailed in our Comprehensive DevOps Guide), but also load testing that ensures a positive user experience and an adequately sized server to facilitate the user load.    The EDC has spent quite a few posts detailing the process of setting up an accurate, real-world testing suite using JMeter for ThingWorx. In this piece, we detail an alternative approach that makes use of the Java Spring Boot Framework to call rest requests against the ThingWorx server and simulate the user load. This Java Client tutorial produces a very immature user load client, one which would still take a lot of development to function as flexibly as the JMeter tutorial counterpart. For Java developers, however, this is still a very attractive approach; it allows for more custom, robust testing suites that come only as an investment made in a solid testing tool.   For someone experienced in Java, the risk is smaller of overlooking some aspect of simulation that JMeter may have handled automatically. For example, JMeter automatically creates more than one HTTP session, and it's much easier to implement randomized user logins instead of one account. 
The Java Client could do it with some extra work (not demonstrated here), but it uses just the Administrator login by default for a quick and dirty sort of load test, one focused less on the customer experience and more on server and database performance under the strain of the user requests (the method used in our sizing guidance, for instance, to see if a server is sized correctly).   The amount of time required to develop a Java Client isn't so bad for a Java developer, and when compared with learning the JMeter Framework, might be a better investment. A tool like this can handle a greater number of threads on a single testing VM; JMeter caps out around 250 threads per client on an 8Gb VM (under ideal conditions), while a Java Client can have thousands of threads easily. Likewise, a Java Client has less memory overhead than JMeter, less concern for garbage collection, and less likelihood that influence from heap memory management will affect the test results.   However, remember that everything in a Java Client has to be built from scratch and maintained over time. That means that beyond the basic tutorial here, there needs to be some kind of metrics gathering and analysis tool implemented (JMeter has built-in reporting tools), the calls need to be randomized, and not called at set intervals like they are here (which is not a very accurate representation of user load compared to a real-world scenario), and the number of users accessing the system at once should probably vary over time (to resemble peak usage hours). JMeter has a recording tool to ensure all the necessary REST requests to simulate a mashup load are made, so great care has to be taken to ensure all of the necessary REST calls for a mashup are made by the Java Client if a true simulation is called for by that approach.    
Java Client Tutorial   Conclusion Neither a Java Client nor a JMeter testing suite is inherently better than the other, and both have their place within PTC's various testing processes. The best test of all is to stand up any sort of user load testing client, either of these approaches, at the same time as the UAT or QA user experience testing. QA testers who load and click about on mashups in true, user fashion can then see most accurately how the mashups will perform and what the users will experience in the Enterprise-ready, production application once the changes go out.
The DPM User Experience Written by Tori Firewind, IoT EDC Team   As discussed in a previous post, DPM is a tool designed to be beneficial at all levels of a company, from the operators monitoring automated data on production events from the factory machines themselves, to the production supervisors who need to establish, task out, and track machine maintenance and improvement measures. DPM also engages the continuous improvement and plant leadership, by providing a standardized way to monitor performance that ultimately rolls up to the executive level. The end users of DPM are therefore diverse both in how they access DPM, and how they make use of its various features.   One of the perks to building DPM on top of the ThingWorx Foundation is that many of the webpages (called “mashups”) within ThingWorx are already responsive, and any  which aren’t responsive OOTB can be modified and custom designed for different size viewing screens to ensure that if necessary, end users can access DPM   from a variety of locations and devices. Most of the time, end users will be accessing mashups from hard-wired dashboards mounted on the actual devices,    or from wireless laptops which have standard size screens with standard resolutions. For use cases involving phones or tablets, however, it may be necessary to see how DPM will perform across a variety of bandwidth and latency conditions. Often, cellular or satellite connection is a must to facilitate field team cooperation, and 5G networks often result in worsened performance.   So, to demonstrate the influence of bandwidth and latency on the responsiveness of DPM, the Production Dashboard was loaded in the Google Chrome browser repeatedly under varying conditions. This dashboard is the webpage most operators and field users would access to log event information and production details (so it is widely used by end users). 
This provides a sort of benchmark of the DPM solution, something which indicates what can be expected and tells us a few things about how DPM should be deployed and configured.   Latency was introduced by hosting the servers involved in the test in different regions (all Azure cloud hosted servers, one in US East, one in US West, and one in Japan East). Bandwidth limitations were introduced using a tool on the PC, with either no limit or a limit of 4 megabits/second.   Browser caching was turned on and off as well, to simulate the difference between new and return users; new users would not have the webpage cached, so their load times are expected to be longer. Tomcat compression was also configured in half of the runs to demonstrate the importance of compression for optimal performance.   Each of these 24 scenarios (3 regions, 2 bandwidth settings, 2 caching settings, and 2 compression settings) was then tested 10 times from each location, and the actual data can be found in the attached benchmark document (a working solution benchmark, which is not designed to be referenced directly, as matters of infrastructure may influence the exact performance of the solution). Even with bandwidth limitations, every region sees better performance for return users versus new users, which may be important to note. Because DPM field users typically access DPM often, the return user time is a better indicator of adoption, and those numbers look great in our simulations. Notice the top line, which shows the very worst of mobile performance: what happens over networks with bandwidth limitations when Tomcat Compression is not enabled. Load times vary only slightly for regular networks when Tomcat Compression is enabled, but compression vastly improves performance across regions and on mobile networks, so it is highly recommended (instructions on how to enable it are below).   Key Takeaways Latency and bandwidth impact DPM performance in exactly the way one would expect of a web application. 
While any DPM server can be accessed from any region, regions with more latency will experience delays proportional to the amount of latency. In the chart here, find the three regions represented three times by three different colors (different from the charts above): The three different shades of each color represent the different regions Green represents the optimal configuration settings (Tomcat compression enabled, caching turned on) for returning users with bandwidth limitations (i.e. mobile networks like 5G) Blue shows first-time page visitors with no bandwidth limitations Purple shows first-time visitors that do have bandwidth limitations The uncompressed first-time load for mobile users (those with bandwidth limitations imposed) within the same region is also given to demonstrate the importance of enabling Tomcat Compression (load times only get worse without compression the farther the region) Notice how the green series has lower load times across the board than the blue one, meaning that return users even with bandwidth limitations have better performance across every region than new users. Also notice how the gap is larger between lighter colors and darker colors, where the darker the color, the farther the region from the DPM servers. This indicates that network latency has a more significant influence on performance versus bandwidth, with only longer running transactions like file uploads seeing a significant performance hit when on a network with bandwidth limitations.  Find out how to enable tomcat compression  and review the full solution benchmark in the document attached.  
Generating and Reviewing JMeter Results Overview The 4th in a series of articles on load testing with JMeter, this one covers pushing the limits of a test to see how much the application can handle, as well as generating and analyzing reports once the testing completes. This article rounds off the basics of JMeter, such that anyone should be able to perform enterprise-level load testing after reviewing the content here.    Multiple criteria can be used to evaluate results, including: response time (as monitored both by JMeter, and by some other tool on the system side) throughput number of errors resource saturation CPU, Memory, disk, and network utilization Depending on use case, some of these may be considered more important than others. For instance, some customers don't care if users wait a while for results to appear on the page (response time), because they set their users' expectations and mitigate the experience with well-designed loading graphics. With response times secondary, the real issues center around data loss or system outages, with resource utilization and number of errors becoming the more important indicators of system health. Request and database timeout errors are more important indicators, as they occur most often when resources are saturated and there is data loss.   It is typical for many customers to find preventing data loss and/or promoting data integrity to be more important than preventing long response times. Consider which of these factors is most important to your use case as you determine what kind of information to gather and review in your reports.   How to Create Client-Side Reports in JMeter Creating reports for the client-side data is very simple using JMeter, both from the command line and within the UI (as shown in the tutorial below). 
These reports have graphical displays of response times, information about the number and type of response errors, and other performance criteria used to gauge the success or failure of a load test. Follow these steps to generate an index file which, when opened in your browser of choice, will show all of the relevant JMeter data.

Tutorial:
Create an empty directory in which to store reports:
Start the JMeter test with these options, or run these commands after the fact, to generate the HTML report:
To generate the report once a test has already completed, use:
jmeter -g <outputfile.jtl/csv> -o <path to output folder for html report>
To start a test so that the report is generated when it finishes, use:
jmeter -n -t <test JMX file> -l <outputfile.jtl/csv> -e -o <Path to output folder>
Running the above commands will generate these files:
When the test is complete, the many JMeter client consoles will look like this:
Go ahead and close the windows to terminate the clients once they are finished. Optionally, you can run multiple tests sequentially using the same jmeter-server windows.
Click on the “index.html” file to open the results viewing window:

At any time, modify the settings of this “HTML dashboard” using the details from the JMeter user manual. The manual describes many options for these dashboards, as well as recommendations on how to group and format the results in ways which best convey the success or failure of the test, based on the custom requirements of the application and how granular the view needs to be. Most of the time, the default settings work fine, showing something similar to this:
The charts aren’t labeled very well here, so click on the Response Times submenu:
This page may take some time to render if there is a lot of data:
Next, scroll down to see all the requests that occurred and sort them by how long they took to complete. Anything which took over 5 seconds (or more, depending on what is expected) should be investigated as part of the post-test analysis.
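The 5-second review described above can also be scripted against the results file itself. The following is a minimal sketch, not part of the original tutorial; it assumes the results were saved as CSV with JMeter's default header row, so that each sample has a `label` column and an `elapsed` column in milliseconds. The function name, the threshold default, and the inline sample data are hypothetical.

```python
import csv
import io

def slow_requests(jtl_file, threshold_ms=5000):
    """Return (label, elapsed_ms) pairs slower than threshold_ms, slowest first."""
    reader = csv.DictReader(jtl_file)
    slow = [(row["label"], int(row["elapsed"]))
            for row in reader
            if int(row["elapsed"]) > threshold_ms]
    return sorted(slow, key=lambda pair: pair[1], reverse=True)

# Tiny inline sample standing in for a real results file (hypothetical data).
SAMPLE = io.StringIO(
    "timeStamp,elapsed,label,responseCode,success\n"
    "1600000000000,420,Login,200,true\n"
    "1600000001000,7300,LoadMashup,200,true\n"
    "1600000002000,5100,QueryParts,200,true\n"
)

if __name__ == "__main__":
    for label, elapsed in slow_requests(SAMPLE):
        print(f"{label}: {elapsed} ms")
```

For a real run, pass an open file handle for the .jtl/.csv results file instead of the inline sample, and adjust the threshold to whatever response time is expected for the application.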
Does something need to be tuned or optimized? This is how to tell which request is holding things up for your customers.  There is also a chart that shows the overview, grouping the response times by how long they took to demonstrate the health of the system more concretely. Typically, the bars look something like this:  This represents expected behavior, where most of the requests are quite fast, and then there are a few that had errors or took a bit longer. This is pretty typical for web activity. You can also generate the report through the main JMeter client: Give it a results file and an output directory to generate the same index file: There are log files in each of the JMeter client directories called “jmeter-server.log”: These files may show the wrong timezone, but the elapsed times are correct, and they will show when the JMeter clients started, how many threads they ran, which servers were which, and if there were any errors. Not all errors will mean a failed test, so review anything that appears and determine what is expected. Consider designing a batch script to gather all of these logs together, or even analyze them automatically to extract only relevant information.     How to Create Server-Side Results in DynaTrace Collecting data from the environment, including CPU usage, Memory utilization (used vs. total), Garbage Collection times and other metrics of system health on the server, will require the use of an external tool. PTC’s official tool for this is called DynaTrace (PTC System Monitor), shown here. PTC offers a runtime license for DynaTrace to anyone who buys certain products, including Kepware Server, ThingWorx Foundation and Navigate, Windchill, Integrity, and more. Read more information about DevOps on the PTC Community, and stay tuned for more articles on the subject to come from the EDC.   
Another option would be something like telegraf and Grafana (from the previous blog post), which facilitate the option to create dashboards around the data output specific to the needs of the application, which can still be monitored even once the application goes live. It can certainly be worth it to use such a tool for monitoring the server-side, but the set-up takes more time. Likewise, many VMs have monitoring faculties for CPU usage and memory utilization built-in, but DynaTrace also has visualization, consolidation of system elements, and other features that make it easy to use right out of the box. See the screenshots below for some examples on how to use DynaTrace, and be sure to review PTC’s full documentation here.   The example shown here is a ThingWorx Navigate system, with Windchill and ThingWorx Foundation set up side-by-side. This chart shows the overall response times of the server-side of the system. JMeter collects the statistics on what the client looks like, while another tool is required to collect the server-side metrics like CPU usage and Memory utilization, things that indicate the health of the VM or computer hosting the clients. An older version of DynaTrace is depicted here, available for free for all ThingWorx customers from the PTC Downloads Site (under various product listings).   In DynaTrace, you can build new dashboards using PurePaths: You can also look at the response times for each service, but be sure to change the response limit to a large number so that all the results are returned. Changing the response limit to a large number to ensure all of the results show in the PurePaths dashboard.   Highlighted here in DynaTrace is the longest service that ran, which in this case took 95 seconds to fully respond: More specific analysis of this service can now begin. Perhaps it needs to be tuned, or otherwise optimized to handle the number of threads, i.e. the number of users. 
Perhaps the system needs more resources or the VM isn’t large enough for the test. Perhaps more JMeter clients and system resources are required. Something will explain this long response time, and that will inform as to what work might still remain before this system can scale up to the enterprise level.   How to Use the Test Results Load Testing often means scaling the test up a little more each time until the system eventually breaks, or the target performance is reached. Within JMeter, this won’t mean increasing the overall number of threads per one JMeter client, but instead, scaling horizontally to other JMeter clients (as covered in the previous blog post). Now that the remote or distributed clients are configured and the test running, how do we know when the test is beginning to fail?   It turns out that this answer is not a simple one. Which results are considered desirable will vary from one customer to the next based on many factors, and analyzing the test results is a massive topic all on its own. However, there is one thing that any customer would care to review, and that is the response time overview chart found within the JMeter reports. This chart can be used to compare the performance of the majority of threads against a baseline, indicating the point at which the test begins to fail, i.e. the point at which the limits of the system are reached.   The easiest way to determine a good standard response time for a load test, a baseline, is to start with a single JMeter client and record the response times for just 1-5 threads. You can record the response times for individual requests, particularly queries and other services with expected long response times, or the average response times across all requests or groups of requests, if the performance of some mashups are more important than others.   
This approach is better than relying on the response times seen in a browser because HTML pages load differently when rendered in a browser, with differing graphical resource requirements than what is requested in JMeter. Note that some customers will also manually record response times within a separate browser-based test scenario during load testing as either a sanity check or as part of their overall benchmarking in order to further validate the scalability of the application, but this wouldn’t involve JMeter given that browsers load things differently and cross-comparison is a bad idea.   Once the baseline response times are established, start increasing the thread counts across the many JMeter clients until you see the response times go up on average. PTC’s standard criteria for load testing is exceeded when the average response times are roughly doubled, or when the system seems overwhelmed with the user load on the server side (which is what to look out for in DynaTrace or the external system monitor). At this point, the application is said to have reached a bottleneck, which could be a simple tuning problem, or it could be saturated by resource requirements. Either way, the bottleneck is proof that the system can’t take any more threads without users beginning to notice and the response times approaching an unreasonable delay.   Other criteria can be used as well, say if any one thread takes more than 5 seconds to respond. Also ensure there are no unexpected errors, as gateway errors represent failed tests too. Sometimes there will be errors even when the test is successful, though, so consider monitoring the error percentage, a column in the Summary Report tab of JMeter, to see what is normal. The throughput column may also be something to monitor. Many watch for increases in throughput as the thread count increases to ensure there is no degradation in performance (which may indicate hardware or sizing constraints).   
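The baseline comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: the "roughly doubled" criterion comes from the text, while the function name, the 1% error threshold, and the idea of passing in plain lists of response times are hypothetical choices made for the example.

```python
from statistics import mean

def load_test_verdict(baseline_ms, current_ms, error_count, sample_count,
                      slowdown_limit=2.0, max_error_pct=1.0):
    """Compare a load-test run against baseline response times.

    slowdown_limit=2.0 mirrors the "average response times roughly doubled"
    criterion; max_error_pct is an assumed placeholder, since the normal
    error percentage has to be established per application.
    """
    ratio = mean(current_ms) / mean(baseline_ms)
    error_pct = 100.0 * error_count / sample_count
    return {
        "slowdown_ratio": ratio,
        "error_pct": error_pct,
        "limit_reached": ratio >= slowdown_limit or error_pct > max_error_pct,
    }

if __name__ == "__main__":
    # Baseline recorded with 1-5 threads, then a heavier run (made-up numbers).
    print(load_test_verdict([400, 600], [900, 1200], error_count=0, sample_count=100))
```

The same check can be applied per request or per group of requests when some mashups matter more than others.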
The Summary Report will look something like this, with thread group results from all of the clients appearing side by side, differentiated from each other by the unique port: Conclusions Generating and reviewing reports within JMeter is straight-forward and easily customizable. Be sure to also monitor the system itself using an external tool like DynaTrace, PTC’s official System Monitor, which has a lot of value considering how easy it is to use out of the box. If the system looks healthy on the server side and the response times are within an acceptable range on the client side, then the application is ready for enterprise use. Be sure to generate a baseline for response times within JMeter, remembering that browsers have different loading processes than JMeter, and not to cross-compare.   This article constitutes the end of the basics. The final article to come will talk about more advanced test design features and best practices, so stay tuned!
Distributed Testing with JMeter

Overview
Running JMeter at the scale required by most customers demands considerations beyond those discussed in the previous two articles. At scale, a test may need to simulate thousands of users, which requires more than one JMeter client to be set up on one or many hosts, as shown here in the 3rd JMeter article, a tutorial on Distributed Testing.

Distributed Testing
Remote Testing configuration in which the main JMeter client is located at one IP address, controlling the rest as they step through their own copies of the JMeter tests, based on their own unique data files as necessary, to simulate a user load across a network, a series of regions, or simply across many machines if limited by the size of the physical hardware [JMeter link for this image in text body below]

One key aspect of a proper JMeter load test is distributed or remote testing, i.e. making use of more than one JMeter client at a time to simulate the user load on the Application server. There are many reasons to make use of a network of clients such as this, like mimicking cross-region user access to the Foundation server, simulating different levels of latency for different users, and increasing the overall number of users which can contribute to the load test, while minimizing the performance cost of hosting that many threads on any single server.

A single JMeter client has a practical limit of 150-250 threads across all groups and requires about 1 CPU and 8 GB of RAM. Beyond this point, the amount of garbage collection and other processing each client has to do is substantial. As the client processes its own data and sends requests to the Application server at the same time, there are diminishing returns, and the responses begin to take longer (or errors start occurring) simply because of resource starvation within the client process rather than on the Application server.
Therefore, distributed testing is required for most customers doing larger load tests using JMeter. Many applications will have more than a few hundred users and/or will have users accessing the system from a variety of regions and networks, each of which could have significantly different network latency. So, in order to work with the limitations of the JMeter executable and address regional concerns, distributed or remote testing is typically required for almost all of PTC’s customers who scale test with JMeter.      With a simple (monolithic) distributed test, all of the JMeter clients are located on the same host and share an IP address, but each must be configured with a unique RMI port to connect to the controlling process. If these are located on a VM, then the resource specifications can merely be increased and the VM sized larger as necessary to ensure the network of JMeter clients runs as expected. Each JMeter client requires around 8 GB for its heap size and 1 CPU (with some additional resources for the host operating system). Multi-hosted testing becomes the required option when limited by physical hardware (or a relatively small VM hardware host). If there are only 4-core, 32-GB machines, then plan for a machine per every 3 JMeter clients. If simulating thousands of users, this could mean half a dozen machines or more are required, which can still sometimes work out to be more cost efficient than one large, 256 GB, VM hosted in the cloud. Using many hosts in physical locations can also simulate regions with different network characteristics.      A tutorial for distributed testing across one host is shown here. For more information, see the Apache web articles on each topic: Remote Testing and Distributed Testing Step by Step.     Tutorial: Step Up Distributed Test on One Host Copy the source directory for the whole JMeter project and rename it however many times as required. 
Here there are 22 JMeter clients side-by-side on a single, 256-GB VM (3000+ users):   Each directory (shown above) is identical, except that the “” files (found in the bin directory in each project) have unique settings, namely the RMI port:     Each JMeter client must contain a copy of the same test scripts found on the main server:   In the “” file for the main server, specify the IPs and ports for each remote/distributed client (under remote_hosts), as shown: In this image, the IPs are all the same, with just the port differing from client to client. Here only 4 clients are in use, with the rest commented out for future tests. This is how to scale up and test incrementally more users each time. Just add another server to add another 150-250 users, until eventually the target number of users is reached, or the server is saturated. These IPs will differ if doing a true remote test, with each being the server location of the JMeter client within the same network. The combination of IP address and port will all still need to be unique, and communication between the overall jmeter controller and the clients over the RMI ports needs to be allowed by the network/firewalls. Note that the number of users is set using the parameter under “Test Plan” which was set-up last time. This value represents the number of users by specifying the number of threads per thread group, and it can remain the same for every client or vary accordingly, if for instance one region is smaller than another. The “Test Plan” parameters are shown here:   To optionally start all of the clients at once in preparation for test execution, create a basic batch or shell script which goes to the bin directory of each agent and calls the start command: “jmeter-server”. In this image from a Windows JMeter host, only the first few agents are in use, but removing the “rem” to uncomment the other start command lines in this file would add more servers to be started. 
Note how the Java parameter for java.rmi.server.hostname must match the main JMeter client network configuration here for them to connect (see Apache links above for more information). This will start each of them in their own CMD window, which, once closed, will terminate the JMeter client processes.

Parameters like the thread count and rampUp time in the main test script apply to each client process, so the overall load scales with the number of clients. For example, 100 users and a 300-second rampUp with 4 clients results in 400 overall user threads that are all logged in after 300 seconds.

Once all clients are running, click Remote Start All to start the test across every server from the GUI (usually for debugging), or execute the test from the command line:
jmeter -n -r -t <test.jmx> -l <results.jtl>

The main server sends the actions to the remote clients to run, so all the clients need is input parameters. For instance, a CSV file may exist in each directory which has different data from client to client, to create pseudo-random user loads and represent different kinds of user activity. The file shown in this image is different, and unique, in each of the client directories:

Conclusion
Here, we learned how to horizontally scale the load test, setting up more JMeter clients to facilitate larger, more complete user loads. We also discussed the difference between distributed and remote testing, and how the former is easier to set up and use, especially on VMs, but the latter might be better for simulating region differences and the impact of network latency. The latter will likely also be required if there are hardware constraints to consider, since each JMeter client needs about 8 GB for its heap, and another 8 GB of memory and a core or two are needed per every 3 JMeter clients for communication and data processing. Stay tuned for the next article on generating and reviewing the results of the load tests.
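The sizing guidance from this article can be turned into a quick planning calculation. The sketch below is illustrative only: the 150-250 threads per client and the 3 clients per 4-core, 32-GB machine come from the text above, while the function names, the example IP address, and the starting port number are hypothetical.

```python
import math

def plan_clients(total_users, threads_per_client=150, clients_per_host=3):
    """Return (clients, hosts) needed for a target number of simulated users."""
    clients = math.ceil(total_users / threads_per_client)
    hosts = math.ceil(clients / clients_per_host)
    return clients, hosts

def total_threads(threads_per_group, client_count):
    """Each client runs the full test plan, so thread counts multiply."""
    return threads_per_group * client_count

def remote_hosts_value(ip, first_port, client_count):
    """Build a comma-separated ip:port list for clients sharing one host.

    The IP and port range are placeholders; every ip:port pair must be unique.
    """
    return ",".join(f"{ip}:{first_port + i}" for i in range(client_count))

if __name__ == "__main__":
    print(plan_clients(3000))                      # conservative 150 threads per client
    print(total_threads(100, 4))                   # the 100 users x 4 clients example
    print(remote_hosts_value("10.0.0.5", 1099, 4))
```

Using the conservative end of the thread range gives headroom on the client side, so that slow responses can be attributed to the Application server rather than to a starved JMeter process.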
Remote Monitoring of Assets in Connected Factories

As stated in the previous reference benchmark, one of the missions of the IOT Enterprise Deployment Center (EDC) is to showcase how real-world IOT business problems are solved. Our goal is that these benchmarks can be used as a reference or baseline for architects working on their own implementations, showing not only a successful at-scale implementation, but also what happens when that same implementation is pushed to, or even past, its limits.

The second in this series is attached here, this time reflecting a Connected Factory implementation. ThingWorx was deployed alongside Kepware Server, with the number of things, the number of properties, and the write rate for those properties being varied to once again test the capabilities of a remote monitoring use case, but this time in a Connected Factory setting. The business logic was kept simple to ensure it was not the limiting factor, as the throughput between Kepware Server and ThingWorx was pushed to the limit. See firsthand the capabilities of Kepware Server and ThingWorx Foundation to handle implementations centered around real-time data reporting.

More Connected Factory implementations will be added to this document in time, with multiple Kepware Server deployments and other scenarios to come. Please feel free to use this community post to ask any questions about our approach and discuss any design, deployment, and simulation factors.
Still not sure what the reference benchmarks have to offer? Check out this short video abstract reviewing the purpose of the reference benchmarks, some notes on how to read the guide, and information about what these guides will have to offer. Then, check out the reference benchmark for remote monitoring here!
Load Testing through C SDK Remote Device Simulation in ThingWorx   As discussed in the EDC's previous article, load or stress testing a ThingWorx application is very important to the application development process and comes highly recommended by PTC best practices. This article will show how to do stress testing using the ThingWorx C SDK at the Edge side. Attached to this article is a download containing a generic C SDK application and accompanying simulator software written in python. This article will discuss how to unpack everything and move it to the right location on a Linux machine (Ubuntu 16.04 was used in this tutorial and sudo privileges will be necessary). To make this a true test of the Edge software, modify the C SDK code provided or substitute in any custom code used in the Edge devices which connect to the actual application.   It is assumed that ThingWorx is already installed and configured correctly. Anaconda will be downloaded and installed as a part of this tutorial. Note that the simulator only logs at the "error" level on the SDK side, and the data log has been disabled entirely to save resources. For any questions on this tutorial, reach out to the author Desheng Xu from the EDC team (@DeShengXu).   Background: Within ThingWorx, most things represent remote devices located at the Edge. These are pieces of physical equipment which are out in the field and which connect and transmit information to the ThingWorx Platform. Each remote device can have many properties, which can be bound to local properties. In the image below, the example property "Pressure" is bound to the local property "Pressure". The last column indicates whether the property value should be stored in a time series database when the value changes. Only "Pressure" and "TotalFlow" are stored in this way.  A good stress test will have many properties receiving updates simultaneously, so for this test, more properties will be added. 
An example shown here has 5 integers, 3 numbers, 2 strings, and 1 sin signal property.   Installation: Download Python 3 if it isn't already installed Download Anaconda version 5.2 Sometimes managing multiple Python environments is hard on Linux, especially in Ubuntu and when using an Azure VM. Anaconda is a very convenient way to manage it. Some commands which may help to download Anaconda are provided here, but this is not a comprehensive tutorial for Anaconda installation and configuration. Download Anaconda curl -O  Install Anaconda (this may take 10+ minutes, depending on the hardware and network specifications) bash​ To activate the Anaconda installation, load the new PATH environment variable which was added by the Anaconda installer into the current shell session with the following command: source ~/.bashrc​ Create an environment for stress testing. Let's name this environment as "stress" conda create -n stress python=3.7​ Activate "stress" environment every time you need to use source activate stress​  Install the required Python modules Certain modules are needed in the Python environment in order to run the  file: psutil, requests. 
Use the following commands to install these (if using Anaconda as installed above):
conda install -n stress -c anaconda psutil
conda install -n stress -c anaconda requests
Unpack the download attached here called Unzip  and move it into the /opt folder (if another folder is used, remember to change the path in the simulator.json file later)
Assign your current user full access to this folder (this command assumes the current user is called ubuntu):
sudo chown -R ubuntu:ubuntu /opt/csim
Move the C SDK source folder to the lib folder
Use the following command:
sudo mv /opt/csim/csdkbuild/ /usr/lib
You may also have to grant a+x permissions to all files in this folder
Update the configuration file for the simulator
Open /opt/csim/simulator.json (or whatever path is used instead)
Edit this file to meet your environment needs, based on the information below
Familiarize yourself with the file and its options
Use the following command to get option information:
python --help

Set-Up Test Scenario:
Plan your test
Each simulator instance will have 8 remote properties by default (as shown in the picture in the Background section). More properties can be added for stress test purposes in the simulator.json file.
For the simulator to run 1k writes per second to a time series database, use the following configuration information (note that for this test, a machine with 4 cores and 16 GB of memory was used; greater hardware specifications may be required for a larger test):
Ignore the default 8 properties, which have random update patterns and produce results that are difficult to check later.
Instead, create "canary properties" for each thing (where canary refers to the nature of a thing to notify others of danger, in the same way canaries were used in mine shafts) Add 25 properties for each thing: 10 integer properties 5 number properties 5 string properties 5 sin properties (signals) Set the scan rate to 5000 ms, making it so that each of these 25 properties will update every 5 seconds. To get a writes per second rate of 1k, we therefore need 200 devices in total, which is specified by the start and end number lines of the configuration file The simulator.json  file should look like this: Canary_Int: 10 Canary_Num: 5 Canary_Str: 5 Canary_Sin: 5 Start_Number: 1 End_Number: 200​ Run the simulator Enter the /opt/csim  folder, and execute the following command: python ./simulator.json -i​ You should be able to see a screen like this: Go to ThingWorx to check if there is a dummy thing (under Remote Things in the Monitoring section): This indicates that the simulator is running correctly and connected to ThingWorx Create a Value Stream and point it at the target database Create a new thing and call it "SimulatorDummyThing" Once this is created successfully and saved, a message should pop up to say that the device was successfully connected Bind the remote properties to the new thing Click the "Properties and Alerts" tab Click "Manage Bindings" Click "Add all properties" Click "Done" and then "Save" The properties should begin updating immediately (every 5 seconds), so click "Refresh" to check Create a Thing Template from this thing Click the "More" drop-down and select "Create ThingTemplate" Give the template a name (ensure it matches what is defined in the simulator.json  file) and save it Go back and delete the dummy thing created in Step 4, as now we no longer need it Clean up the simulator Use the following command: python ./simulator.json -k​ Output will look like this: Create 200 things in ThingWorx for the stress test Verify the information in the 
simulator.json  file (especially the start and end numbers) is correct Use the following command to create all things: python ./simulator.json -c​ The output will look like this: Verify the things have also been created in ThingWorx: Now you are ready for the stress test   Run Stress Test: Use the following command to start your test: python ./simulator.json -l​ or python ./simulator.json --launch The output in the simulator will look like this: Monitor the Value Stream writing status in the Monitoring section of ThingWorx:   Stop and Clean Up: Use the following command to stop running all instances: python ./simulator -k​ If you want to clean up all created dummy things, then use this command: python ./simulator -d​ To re-initiate the test at a later date, just repeat the steps in the "Run Stress Test" section above, or re-configure the test by reviewing the steps in the "Set-Up Test Scenario" section   That concludes the tutorial on how to use the C SDK in a stress or load test of a ThingWorx application. Be sure to modify the created Thing Template (created in step 6 of the "Set-Up Test Scenario" section) with any business logic required, for instance events and alerts, to ensure a proper test of the application. 
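The arithmetic behind the 200-device figure in the "Set-Up Test Scenario" section can be checked with a short calculation. This is a sketch for planning other test sizes; the function names are hypothetical, and it assumes every canary property writes exactly once per scan interval.

```python
import math

def writes_per_second(devices, properties_per_device, scan_rate_ms):
    """Total property writes per second across all simulated devices."""
    return devices * properties_per_device * 1000 / scan_rate_ms

def devices_needed(target_wps, properties_per_device, scan_rate_ms):
    """Number of devices required to reach a target write rate."""
    per_device = properties_per_device * 1000 / scan_rate_ms
    return math.ceil(target_wps / per_device)

if __name__ == "__main__":
    # 25 canary properties (10 int + 5 num + 5 str + 5 sin), 5000 ms scan rate.
    print(devices_needed(1000, 25, 5000))      # devices for 1k writes per second
    print(writes_per_second(200, 25, 5000))    # rate produced by 200 devices
```

Each device contributes 25 writes every 5 seconds, or 5 writes per second, which is why the Start_Number and End_Number values span 200 devices for a 1k writes-per-second target.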
Setting Up the Azure Load Balancer with a ThingWorx High Availability Deployment

Purpose
In this post, one of PTC’s most experienced ThingWorx deployment architects, Desheng Xu, explains the steps to configure Azure Load Balancer with ThingWorx when deployed in a High Availability architectural model.

This approach has been used successfully on customer implementations for several ThingWorx 7.x and 8.x versions. However, with some of the improvements planned for ThingWorx High Availability architecture in the next major release, this best practice will likely change (so keep an eye out for updates to come).

Azure Load Balancer
The overview article What is Azure Load Balancer? from Microsoft will give you a high-level understanding of load balancers in general, as well as the capabilities and limitations of Azure Load Balancer itself. For our purposes, we will leverage Azure Load Balancer's capability to manage incoming internet traffic to ThingWorx Platform virtual machine (VM) instances. This configuration is known as a Public Load Balancer.

Important Note: Different load balancers operate at different “layers” of the OSI Model. Azure Load Balancer operates at Layer 4 (Transport Layer), so it is indifferent to the specific TCP payload. As a result, you must either configure both the front-end and back-end to work on SSL, or configure both of them to work on non-SSL communications. “SSL Termination” or “TLS Offload” is not supported by Azure Load Balancer.

Azure offers multiple different load balancing solutions. If you need some guidance on choosing the right one for you, I highly recommend reviewing the Microsoft DevBlog post Azure Load Balancing Solutions: A guide to help you choose the correct option.

High-Level Diagram: ThingWorx High Availability with Azure Load Balancer
To keep this article focused, we will not go into the setup of ThingWorx in a High Availability architecture.
It will be assumed that ThingWorx is working correctly and the ZooKeeper cluster is managing failover for the Platform instances as expected. For more details on setting up this configuration, the best place to start would be the High Availability Administrator’s Guide.   Planning In this installation, let's assume we have following plan (you will likely need to change these values for your own implementation): Azure Load Balancer will have a public facing domain name: Azure Load Balancer will have a public IP: ThingWorx Platform VM instance 1 has a local computer name, like: vm1 ThingWorx Platform VM instance 2 has a local computer name, like: vm2   ThingWorx Preparation By default, the ThingWorx Platform provides a healthcheck end point at /Thingworx/Admin/HA/LeaderCheck , which can only be accessed with a credential configured in platform-settings.json : "HASettings": { "LoadBalancerBase64EncodedCredentials":"QWRtaW5pc3RyYXRvcjphZG1pbg==" } However, Azure Load Balancer does not permit this Health Check with a credential with current versions of ThingWorx. As a workaround, you can create a pings.jsp (using the attached JSP example code) in the Tomcat folder $CATALINA_HOME/webapps/docs . This workaround will no longer be needed in ThingWorx 8.5 and newer releases.   There are two lines that likely need to be modified to meet your situation: The hostname in final String probeURL (line 10) must match your end point domain name. It's in our example, don’t forget to replace this with your real hostname! You also need to add a line in your local hosts file and point this domain name to . For example: The credential in final String base64EncodedCredential (line 14) must match the credential configured in platform-settings.json. Additionally: Don't forget to make the JSP file accessible and executable by the user who starts Tomcat service for ThingWorx. These changes must be applied to both ThingWorx Platform VM instances. 
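The LoadBalancerBase64EncodedCredentials value shown above is the Base64 encoding of a username:password pair (the example value decodes to Administrator:admin). A minimal sketch for producing or checking such a value follows; the helper names are hypothetical, and it assumes the setting always uses the plain username:password form.

```python
import base64

def encode_credentials(username, password):
    """Base64-encode a 'username:password' pair for platform-settings.json."""
    raw = f"{username}:{password}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")

def decode_credentials(encoded):
    """Reverse of encode_credentials; useful for checking an existing value."""
    username, _, password = base64.b64decode(encoded).decode("utf-8").partition(":")
    return username, password

if __name__ == "__main__":
    print(encode_credentials("Administrator", "admin"))
    print(decode_credentials("QWRtaW5pc3RyYXRvcjphZG1pbg=="))
```

Base64 is an encoding rather than encryption, so the value in platform-settings.json and in the JSP file should be protected with file permissions like any other credential.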
Tomcat needs to be configured to support SSL on a specific port. In this example, SSL will be enabled on port 8443. Please make sure a similar configuration is included in $CATALINA_HOME/conf/server.xml:

    <Connector protocol="org.apache.coyote.http11.Http11NioProtocol"
               port="8443" maxThreads="200"
               scheme="https" secure="true" SSLEnabled="true"
               keystoreFile="/opt/yourcertificate.pfx" keystorePass="dontguess"
               clientAuth="false" sslProtocol="TLS" keystoreType="PKCS12"/>

The values in keystoreFile and keystorePass will need to be changed for your implementation. While the PKCS12 format is used in the above example, you can use a different certificate format, as long as it is supported by Tomcat (example: JKS format). All other parameters, like maxThreads, are just examples - you should adjust them to meet your requirements.

How to Verify
Before configuring the load balancer, verify that the health check workaround is working as expected on both ThingWorx Platform instances. You can use the following command to verify:

    curl -I

The expected result from the active node should look like:

    HTTP/1.1 200

There will be three or more lines in the output, depending on your instance configuration, but you should be able to see the keyword: HTTP/1.1 200.

The expected result from the passive node should look like:

    HTTP/1.1 503

Load Balancer Configuration

Step 1: Select SKU
Search for “load balancer” in the Azure Marketplace and select Load Balancer from Microsoft. Verify the correct vendor before you create a Load Balancer.

Step 2: Create load balancer
To create a proper load balancer, make sure to read Microsoft’s What is Azure Load Balancer? overview to understand the differences between “basic” and “standard” SKU offerings. If your IT policy only requires SSL communication to the outside but doesn't require SSL communication in a health probe, then the “basic” SKU should be adequate (not considering zone redundancy).
You have to decide the following parameters:
- Region
- Type (public or internal)
- SKU (basic or standard)
- IP address
- Public IP address name
- Availability zone

PTC cannot provide specific recommendations for these parameters – you will need to choose them based on your specific business needs, or consult Microsoft for available offerings in your region.

Step 3: Start to configure
Once a load balancer is successfully created by Azure, you should be able to see:

Step 4: Confirm frontend IP
Click frontend IP configuration on the left side and you should be able to see the public IP address configuration. Please make sure to register this IP with your domain name ( in our example) in your Domain Name Server (DNS). If you are unfamiliar with DNS configuration, you should consult with the administrator of your DNS server. If you are using Azure DNS, this Quickstart article on creating Azure DNS Zones and records may help.

Step 5: Configure Backend pools
Click Backend pools and click “Add” to add a backend pool definition. Select a name for your Backend pool (using ThingworxBackend in our example). The next step is to choose the Virtual network.

Once you select the Virtual network, you can choose which VM (or VMs) you want to put behind this load balancer. The VM should be the ThingWorx VM instance.

In a high availability architecture, you will typically need to choose two instances to put behind this load balancer.

Please Note: The “Virtual machine status” column in this table only shows VM status, but not ThingWorx status. ThingWorx running status will be determined by the health probe configured in the next step.

Step 6: Configure Health Probe
The Health Probe will be used to determine the ThingWorx Platform’s running status. When a ThingWorx Platform instance is running as the leader, it will give HTTP status code 200 during a health probe. The Azure Load Balancer will rely on this status code to determine if the platform is running properly.
When a ThingWorx platform VM is not responding, offline, or not the leader in a High Availability setup, this health probe will respond with an HTTP status code other than 200.

For the health probe, select HTTPS for the protocol. In our example port 8443 is used, though another port can be selected if necessary. Then, provide the “/docs/pings.jsp” we created earlier as the probe’s path. You may need to change this path value if you put this file in a different location.

Step 7: Configure Load balancing rules
Select “Load balancing rules” from the left side and click “Add”.

Select TCP as the protocol. In our example we are using 443 as the front-end port and 8443 as the back-end port. You can choose other port numbers if necessary.

Reminder: Azure Load Balancer is a layer 4 (Transport Layer) router – it cannot differentiate between HTTP or HTTPS requests. It will simply forward requests from front-end to back-end, based on the port-forwarding rules defined.

Session persistence is not critical for current versions of ThingWorx as only one active node is currently permitted in a High Availability architecture. In the future, selecting Client IP may be required to support active-active architectures.

Step 8: Verify health probe
Once you complete this configuration, you can go to the $CATALINA_HOME/logs folder and monitor the latest local_access log. You should see entries similar to those pictured below - HTTP 200 responses should be observed from the ThingWorx leader node, and HTTP 503 responses should be observed from the ThingWorx passive node. In the example below, is the internal IP Address of the load balancer in the current region.

Step 9: Network Security Group rules to access Azure Load Balancer
On its own, Azure Load Balancer does not have a network access policy – it simply forwards all requests to the back-end pool.
Therefore, the appropriate Network Security Group associated with the backend resources within the resource group should have a policy to allow access to the destination port of the backend ThingWorx Foundation server (shown as 8443 here, for example). The following image displays an inbound security rule that will accept traffic from any source, and direct it to port 443 of the IP Address for the Azure Load Balancer.

Enjoy!!
With the above settings, you should be able to access ThingWorx via: (replacing with the hostname you have selected).

Q&A

Can I configure the health probe running on a port other than the traffic port (8443) in this case?
Yes – if desired you can use a different port for the health probe configuration.

Can I use a protocol other than HTTPS for the health probe?
Yes – you can use a different protocol in the health probe configuration, but you will need to develop your own functional equivalent to the pings.jsp example in this article for the protocol you choose.

Can I configure ZooKeeper to support the health probe?
No – the purpose of the health probe is to inform the Load Balancer which node is providing service (the leader), not to select a leader. In a High Availability architecture, ZooKeeper determines which VM is the leader and talking with the database. This approach will change in future releases where multiple ThingWorx instances are actively processing requests.

How well does Azure Load Balancer scale?
This question is best answered by Microsoft – as a starting point, we recommend reading the DevBlog post: Azure Load Balancing Solutions: A guide to help you choose the correct option.

How do I access logs for Azure Load Balancer?
This question is best answered by Microsoft – as a starting point, we recommend reviewing the Microsoft article Azure Monitor logs for public Basic Load Balancer.

Do I need to configure anything specifically for WebSocket and/or AlwaysOn communication?
No – Azure Load Balancer is a Layer 4 (Transport Layer) router - it only handles TCP traffic forwarding.

Can I leverage this load balancer to access all VMs behind it via ssh?
Yes – you could configure Inbound NAT rules for this. If you require specific help in configuring this, the question is best answered by Microsoft. As a starting point, we recommend reviewing the Microsoft tutorial Configure port forwarding in Azure Load Balancer using the portal.

Can I view current health probe status on a portal?
No – unfortunately there is currently no way to do this with Azure Load Balancer.
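Since the probe status is not visible in the portal, the Tomcat access log from Step 8 is the practical place to look. The Python sketch below tallies probe responses by status code. It is only an illustration: it assumes the access log uses Tomcat's common log pattern (host, two placeholder fields, timestamp, request line, status, bytes), the sample lines and IP addresses are invented, and summarize_probes is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed layout: host ident user [timestamp] "METHOD path protocol" status bytes
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)


def summarize_probes(lines, probe_path="/docs/pings.jsp"):
    """Count responses to the probe path, grouped by HTTP status code."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not follow the assumed pattern
        path = match.group("path").split("?", 1)[0]
        if path == probe_path:
            counts[int(match.group("status"))] += 1
    return dict(counts)


# Invented sample lines for illustration only
SAMPLE = [
    '10.0.0.9 - - [15/Nov/2022:16:00:05 +0000] "GET /docs/pings.jsp HTTP/1.1" 200 2',
    '10.0.0.9 - - [15/Nov/2022:16:00:10 +0000] "GET /docs/pings.jsp HTTP/1.1" 200 2',
    '10.0.0.7 - - [15/Nov/2022:16:00:11 +0000] "GET /Thingworx/Home HTTP/1.1" 200 5120',
    '10.0.0.9 - - [15/Nov/2022:16:00:15 +0000] "GET /docs/pings.jsp HTTP/1.1" 503 2',
]

if __name__ == "__main__":
    print(summarize_probes(SAMPLE))
```

On a healthy pair of nodes, the leader's log should show only 200 responses for the probe path and the passive node's log only 503; a change from one to the other on the same node marks a failover.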
The natively exposed ThingWorx Platform performance metrics can be extremely valuable for understanding overall platform performance and certain core subsystem operations. However, because ThingWorx is a development platform, these metrics don't give any visibility into what your built solution is or is not doing.

Here is an amazing little trick that you can use to embed custom performance metrics into your application so that they show up automatically in your Prometheus monitoring system. What you do with these metrics is up to your creativity (with some constraints of course). Imagine a request counter for specific services which may be incredibly important or costly to run, or an exception metric that is incremented each time you catch an exception, or a query result size metric that informs you of how much data is being queried from the database.

Refer to Resources > MetricsServices:
- GetCounterMetric
- GetGaugeMetric
- IncrementCounterMetric
- DecrementCounterMetric
- SetGaugeMetric

You'll need to give your metric a name - identified by key - and this is meant to be in dotted notation, which will then be converted to underscores when the metric is exposed on the OpenMetrics endpoint. Use sections/domains in the dotted notation to structure your metrics in line with your application design.

COUNTER type metrics are the most commonly used and relate to things happening through time. They are an index which will get timestamped as they're collected by Prometheus so that you will be able to look back in time and analyse and investigate what happened when and what the scale or impact was. After-the-fact functions and queries will need to be applied to make these metrics most useful (delta over time, increase, rate per second).

Common examples of counter type metrics are: requests, executions, bytes transferred, rows queried, seconds elapsed, execution time.
    // Add 1 to the node-level counter identified by key
    Resources["MetricServices"].IncrementCounterMetric({
        basetype: "LONG",
        value: 1,
        key: "__PTC_Reported.integration.mes.requests",
        aggregate: false
    });

GAUGE type metrics are the point-in-time status of some thing being measured.

Common gauge type metrics are: CPU load/utilization, memory utilization, free disk space, used disk space, busy/active threads.

    // Set the cluster-level gauge identified by key to its current value
    Resources["MetricServices"].SetGaugeMetric({
        basetype: "NUMBER",
        value: 12,
        key: "__PTC_Reported.Users.ConnectedOperatorCount",
        aggregate: true
    });

Be aware of the aggregate flag, as it will make this custom metric cluster level, which can have some unintended consequences. Normally you always want performance metrics for the specific node, as you then see what work is happening where and can confirm that it is being properly distributed within the cluster. There are some situations, however, where you might want the cluster aggregation, like with this count of concurrently connected operators.

Happy Monitoring!
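The "after-the-fact" calculations mentioned for counter metrics (increase and rate per second) can be illustrated outside of Prometheus. The Python sketch below is a simplified approximation, not Prometheus's actual implementation: it sums the positive steps between scraped samples, treats any drop in value as a counter restart from zero, and does no extrapolation at the window edges. The function names and sample values are hypothetical.

```python
def counter_increase(samples):
    """Total increase of a counter over a list of (timestamp_s, value) samples.

    A drop in value is treated as a restart from zero, so the new value
    itself is counted as the increase for that step.
    """
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        total += current - previous if current >= previous else current
    return total


def counter_rate(samples):
    """Average per-second rate of increase over the sampled window."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed if elapsed > 0 else 0.0


if __name__ == "__main__":
    # Invented scrape results, 15 s apart; the node restarts before t=30
    scraped = [(0, 10), (15, 25), (30, 5), (45, 20)]
    print(counter_increase(scraped))  # 15 + 5 + 15
    print(counter_rate(scraped))
```

This is why a counter only ever needs to be incremented from application code: the raw value is rarely read directly, and the useful numbers come from comparing samples over time.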
Dev Ops is a crucial process that exists in any software setting, whether you plan on it or not. Chaos in the dev ops process, say because less time is spent here than on the shiny new features that are easy to sell, results in bottlenecks. Bottlenecks reduce efficiency, and leave you open to vulnerabilities as well. The faster you can get a change properly tested and safely into production, the safer and more stable the system is all along.

Issues will arise, they always arise. Are you ready for them? Watch this video, see some of these additional links, and think about your dev ops process now, before the fires start!

Useful Links:
- ThingWorx Monitoring and Alerting Using Prometheus and Grafana, Part 1
- ThingWorx Monitoring and Alerting Using Prometheus and Grafana, Part 2
- Overview of Monitoring Tools and Diagnostics
- The System Health Timer
Architecting Reason Code Trees in DPM
Tori Firewind, IoT EDC

What are Machine Codes?
Factory hardware devices communicate status changes to their human operators and other machines (IoT) via machine codes. The manufacturers often determine the machine codes for different types of factory hardware, so those are often pre-determined. However, how the reason trees map these machine codes to corresponding business logic in ThingWorx is entirely customizable. Knowing the best way to design your reason trees for this purpose can be challenging, so this guide is here to help with your conceptual knowledge. Technical details on using the UI to create, edit, and configure reason codes can be found in the Help Center.

The Tree Trunk
At the highest level of the reason tree, the trunk, there are really three categories: Availability (A), Performance or Productivity (P), and Quality (Q). These should look familiar; they are the three dimensions of OEE (Overall Equipment Effectiveness).

Fg 1. Calculation of OEE

Availability refers to long stops, events that stop planned production long enough that it makes sense to track a reason for being down (typically several minutes, but the threshold between a long stop and a short stop can vary depending on the ideal rate of production of materials).

    Availability = Run Time / Planned Production Time

Productivity/Performance really refers to short stops, things that cause the machine to run at less-than-optimal speeds. This can include stops caused by running out of materials for production, doing minor maintenance like switching out a single, easily-changed part, or even frequent breaks due to ill health of an operator. User error can be a cause as well, say if the machine needs a certain heat to produce parts, and the heat keeps fluctuating (requiring the machine to take the time to calibrate for this before starting on production) because operators are smoking out a back door or adjusting thermostat temperatures.
Fg 2. Levels of Runnable Time

Operator influence often is a factor when it comes to the conditions that permit optimal performance from machinery, and every factory may face different challenges. Stops like these are not really outages; the amount of downtime isn’t enough to consider the production block entirely unproductive. Production was continuing and ongoing throughout most of the block despite the issues; the rate was just slower than ideal.

    Performance/Productivity = (Total Count / Run Time) / Ideal Run Rate

Quality refers mostly to the number of items that are considered scrap or rework, and it can be split into two categories: start up scrap (that which is expected because the machine is in the process of warming up or being fine-tuned by the operator) and production scrap (things which come out wrong and must be tossed or reworked because the conditions under which they were produced weren’t ideal; this is called first-pass yield only, meaning it's only a "good" product if it passes inspection the first time).

    Quality = Good Count / Total Count

The Branches and the Leaves of the Tree
The “leaves” are the reason codes which directly map to machine codes, and the “branches” are the method of categorization that connects them to the trunk. Both the leaves and the branches, the child and the parent nodes of the tree, are split into two states: planned versus unplanned downtime. Changeovers, maintenance, and even scrap can be broken down into this dichotomy.

For scrap, there are startup rejects (planned, because the machines have ramp up periods) and production rejects (unplanned, because the conditions weren’t ideal). For maintenance there is planned and unplanned, small changes that occur on the fly that result in productivity loss, and maybe also reduce availability in the long run.
Small, unplanned changes can occasionally shift into the availability loss category if a simple, quick repair winds up being complex and time-consuming. A good reason tree can differentiate easily between short and longer stops in order to respond to each in a deliberate way.

To start off in the process of architecting your reason tree, try writing the three categories on a board in a common room in an average factory (or several as a survey). Ask operators to stop in over the course of a few days and write various machine codes that they see often and find useful under one of these categories, or more than one if the machine code pops up under different circumstances and can mean different things. Have them write a 10-word justification if the association isn’t obvious. Gather all of the “leaves” in this way, and then begin to associate them with the “trunk”, forming the “branches”.

An example tree can be seen in figure 3 here, with leaves like “Changeover” and “Maintenance” being semi-ambiguous; they could just as easily be seen as unplanned stops. Therefore, there may be multiple reason codes mapping up to the top of the tree in more than one branch, and these can have different categories, which control how the business logic responds to the different codes. The Help Center has more details about how the events are mapped to types, and each type contains multiple categories, as configured by you when you set up the DPM model.

Fg 3: Different types of changeovers may have different codes, and can map up as either planned or unplanned, but all planned and unplanned stops (long stops) are under the Availability category of the trunk. Similarly, small stops can involve idling, like if there are not enough materials, reduced speed if the conditions are not ideal, or other small stops, usually caused by human error or unforeseeable circumstances.
Quality loss then refers to the products which fail quality checks, either because the machine still has the wrong paint in the applicator and needs a few rounds to be ready for the next production item, or because the conditions are, again, not ideal, and items wind up scrapped.

Example Reason Tree
Fg 5: Example tree with more specific tags (there may be dozens or hundreds in a full reason tree, though the fewer are needed to capture the events we care about, the better).

Theory of Constraints
Fg 6: Theory of constraints wheel: an industry process for gradual OEE improvement in factories that has been adapted into the PTC methodology as well.

While architecting your reason tree, always remember the key purpose: gathering only as much data as necessary to analyze the efficiency of a factory and to identify the bottleneck, or the most limiting factor. The important point is to identify not just the bottleneck that seems the most troublesome, but the one that actually results in the greatest impact to OEE across the entire factory.

Without software like DPM, and a properly designed reason code tree, the process of improving a factory can be very challenging, involving a lot of guesswork, and sometimes solving one problem at the cost of another. The issue is that these machines produce a LOT of raw data, and humans are not the best tool available to gather and aggregate this data in a consumable way. A good reason tree ensures a smart application that can quickly prioritize the machine (bottleneck) that most impacts production, and not just the machine that functions in the least optimal way.

So, the theory of constraints is really a process for identifying small, incremental changes, which together can make a big difference, and fast, in factory OEE. The rate at which this cycle can be completed varies, however.
The slower the process of identifying constraints and the less information that is gathered, the slower and less precise the first two steps of this process. Alternatively, in a traditional constraint identification process, too much information can be a problem as well, due to human limitation, as discussed above. So, DPM is a great benefit in this regard, because it aggregates the data into a consumable, comparable form every 5 minutes, freeing up your human analysts for problem solving and prioritization, and not data gathering and sorting.

Other Key Tips
Also remember that a good tree treats the trunk like a whole unit, with each category occupying a percentage of the overall OEE. After all, look back up at the three dimensions of OEE in the equations above. For example, the more you see issues with availability, the less you will see issues with scrap, for the machine simply doesn’t have as much time to produce scrap if it is constantly down. The more you see issues with quality loss, the less you should see of productivity loss, because these are simply inversely proportional modes; to say it differently, if a machine is running quickly and seeing few minor maintenance stops, then it is likely to produce more scrap (as well as more good product).

Another thing to remember is that even DPM is limited in its capacity to interpret raw data. Even while many orders of magnitude more efficient than any human gathering and analysis could ever be, there is an upper limit to how much raw data DPM can ingest and analyze before the system gets very expensive. For this reason, you want to ensure your reason trees use only as many reason codes as are required to capture the OEE of a factory site. This will most likely mean using different codes for different types of things, which is easy to do maintainably across many sites using thing shapes. Keeping things tightly defined and organized is the easiest way to ensure a clean, efficient system for gathering and storing data.
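To make the trunk, branch, and leaf vocabulary concrete, here is a small conceptual sketch in Python. It is not how DPM stores or configures reason trees (that is done through the UI described in the Help Center); the nested dictionary, the reason names, and the machine codes such as M100 are all invented for illustration. It shows one point from the discussion above: the same machine code can legitimately map up to the trunk through more than one branch.

```python
# Trunk category -> branch -> leaf reason -> machine codes (all hypothetical)
REASON_TREE = {
    "Availability": {
        "Planned": {"Changeover": ["M100"], "Planned maintenance": ["M110"]},
        "Unplanned": {"Breakdown": ["M200"], "Unplanned changeover": ["M100"]},
    },
    "Performance": {
        "Small stops": {"Material shortage": ["M300"]},
        "Reduced speed": {"Temperature drift": ["M310"]},
    },
    "Quality": {
        "Startup rejects": {"Warm-up scrap": ["M400"]},
        "Production rejects": {"Failed inspection": ["M410"]},
    },
}


def paths_for_code(tree, machine_code):
    """Return every (trunk, branch, leaf) path that a machine code maps to."""
    paths = []
    for trunk, branches in tree.items():
        for branch, leaves in branches.items():
            for leaf, codes in leaves.items():
                if machine_code in codes:
                    paths.append((trunk, branch, leaf))
    return paths


def leaf_count(tree):
    """Number of leaf reason codes, a rough measure of tree size."""
    return sum(len(leaves) for branches in tree.values() for leaves in branches.values())


if __name__ == "__main__":
    print(paths_for_code(REASON_TREE, "M100"))
    print(leaf_count(REASON_TREE))
```

Keeping an eye on something like the leaf count is one simple way to notice when a tree is growing beyond what is needed to capture OEE.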
  Also remember that data will not need to persist very long once DPM is fully operational and adopted by your factories. DPM ensures that the changes made to the production line to improve efficiency are the highest impact, and the least difficult to implement, meaning that there will be a very rapid return on investment, and a process to ensure future issues are identified and resolved quickly. Data from past issues in the factory won’t be as relevant, and historical data stores can be kept smaller than one might think. It is the power of ingesting data directly into the processing and aggregation process, the automatic reduction of data down into presentable, consumable webpages, that makes DPM and ThingWorx such a great factory solution for optimizing OEE.
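As a worked example of the three formulas from The Tree Trunk section, the Python sketch below computes Availability, Performance, and Quality for one shift and multiplies them together, which is the standard definition of OEE. The shift figures are invented, and the function name and parameter names are illustrative only; DPM performs its own calculations.

```python
def oee_components(planned_production_time, run_time, total_count, good_count, ideal_run_rate):
    """Return (availability, performance, quality, oee) for one production block.

    Times are in minutes and ideal_run_rate is in parts per minute, so the
    units cancel and every result is a fraction between 0 and 1 for
    realistic inputs.
    """
    if min(planned_production_time, run_time, total_count, ideal_run_rate) <= 0:
        raise ValueError("times, counts and rate must be positive")
    availability = run_time / planned_production_time
    performance = (total_count / run_time) / ideal_run_rate
    quality = good_count / total_count
    return availability, performance, quality, availability * performance * quality


if __name__ == "__main__":
    # Invented shift: 480 planned minutes, 80 minutes of long stops,
    # ideal rate 10 parts/minute, 3600 parts made, 3420 passed first inspection.
    a, p, q, oee = oee_components(480, 400, 3600, 3420, 10)
    print(f"Availability {a:.1%}, Performance {p:.1%}, Quality {q:.1%}, OEE {oee:.2%}")
```

With these figures the shift loses about 17% to availability, 10% to performance, and 5% to quality, for an OEE of 71.25%. Seeing the three factors side by side is what lets the largest loss be prioritized.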
ThingWorx Monitoring and Alerting, Part 2
Using Prometheus and Grafana
By Tori Firewind, IoT EDC

Building Dashboards
To add a panel which monitors some component of the ThingWorx application to a dashboard in Grafana, click to add a new panel. Under “Metrics” in the box at the bottom of the screen, select what ThingWorx metrics you wish to monitor (type “thingworx” in the search box to see them all). For example, select the Platform Subsystem memory in use:

Label filters aren’t necessary, though you may want to sort by instance if you are monitoring multiple ones with the same dashboard. You may also want to take some time to format the Y axis, which by default will show in bytes. Go to the formatting panel on the right side and scroll down to the section called “Standard options”. For the Unit dropdown, start typing “data” and then select “bytes (SI)”. This will automatically determine if the bytes you’ve provided should really print as MB or GB based on how large the numbers are.

Rename the panel, modify it in any other way desired, and then click Apply (last 5 minutes):

Once you add the panel, you can watch the memory usage as it is scraped by selecting the refresh option (10s or 30s, whatever makes sense based on your scrape interval).

The viewing window is stored in the URL, so that you can generate a report for a specific interval (like when a test was occurring), and then store that result or share it in a more compact way: http://localhost:3000/d/nleucPv4k/thingworx-monitoring?orgId=1&from=1668528038732&to=1668536503953 (absolute timestamps):

Dashboards are just collections of panels which report on all of the various metrics of performance and stability that exist for single components of a system. This is because there can be quite a few metrics worth watching for each individual component.
Most of the third-party tools come with their own dashboards, but the ThingWorx component is one which for now, requires some thought and creativity.     Consider your use case carefully and look over the various subsystems contained within ThingWorx. Each part of an application is localized to specific subsystems, and some are more business critical than others. What will go at the top of your dashboard? Add rows, add panels per row, and see what the many choices are for watching your system.     Don’t forget that with Telegraf running, VM or machine usage metrics are also available for display on a dashboard. Things like overall CPU and Memory usage are critical to determining the health of a system, as we have demonstrated in our own reasoning in past benchmarks and scale tests. You can create a panel to monitor the mem_used versus mem_total, like so:     Another metric from Telegraf worth adding is the CPU usage, which should be given “percentage” for the units and which needs a label filter of cpu = cpu-total. If we do some resizing and drag-and-dropping, then we now have the first row of a dashboard:     See how the Platform usage climbs steadily and is purged in a cycle? That is the Java Garbage Collection mechanism, and it’s important to remember to leave room for spikes on top of those peaks. Data can also be calculated or processed in some way to make it more useful for determining system health and stability.     The data in the picture below uses the formula submitted = completed + number queued + number failed. It shows the current queue on the left Y-axis and the max queue on the right (since the two numbers usually are drastically different). It looks pretty, but it doesn’t really tell us much about the system in this format, so let’s do some math and find a representation that is a bit more helpful.     
Performing a “non-negative derivative” calculation over the submitted and the completed queue counts over time allows us to look at the status of the queue as a velocity. When the “completed” speed lags behind the “submitted” speed for too long at a time, the queue is filling up and will eventually result in data loss.

If we take this one step further and calculate the average of the submitted minus the completed over time, then we can actually predict approximately when the queue will fill up. This can then be displayed on a dashboard in Grafana, or used as the basis for an alert.

What to Monitor
In addition to monitoring the system which ThingWorx runs upon, ThingWorx itself can easily be monitored down to the subsystems level by Prometheus due to the Metrics endpoint. Many applications have support built into the way they format the data for scraping, including the JVM (which exposes Prometheus-formatted metrics with the JMX Exporter) and the OS (which can use the Node Exporter or Telegraf for the same purpose). For these more generic components, there are popular community dashboards which can be downloaded and used in Grafana for data analysis and review.

For ThingWorx, there are different kinds of data to track: subsystem data (see the list on the right) and non-subsystem data. There’s queue based versus non-queue data. These different metrics can collectively characterize the overall health of the application, depending on the use case.

For instance, if this is a system with very many connected devices, one metric which may be important to track is the number of total devices defined on the Foundation server vs. the number of devices which are currently connected. If there are relays involved, then many devices suddenly going offline can mean a relay has failed. Another example is if the system sizing depends on an assumption that there will only ever be a fraction of the total number of devices connected at a time.
Use cases like these could be monitored easily by keeping track of the total vs. the number of connected devices.

Other common indicators of a healthy ThingWorx application might include the value stream and stream queues. These queues should fluctuate over time as the data is ingested and processed, but they should never be growing in size. If the stream queues are growing, then that means the data is writing to the queues faster than the queues can write to the database. Eventually, when the system runs out of resources to keep track of the queues, data will be lost. Having the stream information displayed in a chart can make it very easy to spot an upward trend in resource usage early on, which can catch a blockage or bottleneck that needs attention before it starts to affect the larger system in catastrophic ways later.

Memory usage information from the various subsystems might be something worth tracking, as well as the event queue. These can indicate that the business logic is functioning with room to handle spikes, and that the server has enough memory to service all three dimensions of an IoT application: the ingestion, the business logic and thing-based alerting, and the user experience and UI. If file transfers are a key part of the use case, then the number of concurrent transfers, the average speed of them, the size of the files, all of this kind of stuff can be tracked and charted in Grafana by making use of the ThingWorx metrics which automatically show up there once you import the Prometheus data source.

A mature dashboard used for a production environment might look a little like this:

For further reading about subsystem monitoring, check out the Help Center.

How to Alert
The alerting mechanism built into Prometheus is incredibly easy to configure, so it might be tempting to generate tons of alert rules.
However, remember that the more noise a system makes, the harder it is for those monitoring that system to know when action is really required. Playbooks which document how to respond to alerts, who to contact, how to act, and all the information necessary to handle an alert, should be created as an ongoing part of the DevOps process.     Alerts should fire with the right severity in the subject line, as well as all of the information about the issue that is currently known, presented in a concise way, so that whoever receives the alert starts thinking about the root cause sooner and recovers the system faster. Those who receive the alert should have the ability to facilitate its resolution, and know who is expected to react to any alerts which come in.     In the ThingWorx monitoring stack, Prometheus handles the alert rules and the generation of alerts, but alert filtering and delivery is managed in an external alerts manager.     Generally, you want your alerts to follow a curve. If the current queue size exceeds 50% of the maximum, perhaps that isn’t a huge deal, if the application catches up quickly. How long are spikes in queue processing expected to last? Perhaps if the queue size is over half-full for 10 seconds, 30 seconds, then that means the queue is falling behind and not catching up. Ok, so this might be a warning level alert. When does this become an error? Well, let’s say the queue exceeds 90% of the max queue size. This might want to alert the moment it hits the mark. Now, farther along the curve, it may not take as long before data gets lost.     As the severity of the situation increases, the threshold for alerting should increase as well. That way when errors do alert, it is a sure thing that they require a response immediately. The alerts are then pushed into the “Alerts Manager” for delivery based on your management rules. 
The Alerts Manager may decide to withhold warnings altogether, or send them to a much smaller mailing list, whatever filtering helps to ensure the right people receive the right alerts, right when they need them.

In Conclusion, A Healthy Application...
Has stable memory usage that fluctuates predictably and doesn’t grow over time. In a system experiencing mild issues, the memory starts to trend upward:

If left unattended, systems like this may eventually experience outages. Finding the issue this early means there is even time to do some digging, debugging, taking of stack traces, and other such troubleshooting steps before the system must be restarted or recovered. That can really make the difference in identifying and resolving the issue before there are real problems.

One metric which makes for good alerting is the total number of failed stream entries, which can indicate there’s an issue writing to the database even before the queue has started to fill up. Other alerts may include warnings and errors based around percentages of memory used or queues filled, which depend on how long the queues take to fill up and how long the state has been at its increased usage.

Prometheus has all of the tools necessary to make this possible across a variety of infrastructures and use cases. Set it up on a local machine and poke around at what ThingWorx metrics are available to meet your monitoring needs.
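The queue calculations from Building Dashboards and the alert curve from How to Alert can be sketched together in a few lines of Python. This is an illustration of the reasoning only, not Prometheus or Grafana configuration. The function names are hypothetical, negative rates are simply dropped (one common reading of "non-negative derivative"), and the 50% / 90% / 30-second thresholds are just the example numbers from the text, not recommendations.

```python
def non_negative_rates(samples):
    """Per-second rates between consecutive (timestamp_s, value) samples.

    Negative rates (for example after a restart resets the count) are dropped.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 <= t0:
            continue
        rate = (v1 - v0) / (t1 - t0)
        if rate >= 0:
            rates.append(rate)
    return rates


def seconds_until_full(submitted, completed, current_size, max_size):
    """Estimate time until the queue is full, or None if it is not growing."""
    sub, comp = non_negative_rates(submitted), non_negative_rates(completed)
    if not sub or not comp:
        return None
    net_fill_rate = sum(sub) / len(sub) - sum(comp) / len(comp)
    if net_fill_rate <= 0:
        return None
    return (max_size - current_size) / net_fill_rate


def queue_severity(fill_ratio, seconds_above_half, warn_after=30.0):
    """Error immediately at 90% full; warning only after a sustained 50%."""
    if fill_ratio >= 0.9:
        return "error"
    if fill_ratio > 0.5 and seconds_above_half >= warn_after:
        return "warning"
    return "ok"


if __name__ == "__main__":
    submitted = [(0, 0), (10, 100), (20, 200)]   # 10 entries/s arriving
    completed = [(0, 0), (10, 80), (20, 160)]    # 8 entries/s written out
    print(seconds_until_full(submitted, completed, current_size=400, max_size=1000))
    print(queue_severity(0.6, 10), queue_severity(0.6, 45), queue_severity(0.95, 0))
```

The design choice worth noting is that the warning depends on duration while the error does not: a short spike above half full is expected, but a queue at 90% leaves too little headroom to wait and see.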
ThingWorx Monitoring and Alerting, Part 1
Using Prometheus and Grafana
By Tori Firewind, IoT EDC

Introduction and Getting Started

As ThingWorx has become a more mature product during the lifetime of the IoT EDC, so too have our dev ops recommendations. As we’ve stated throughout many posts now, testing is a key part of ensuring enterprise readiness, and it occurs at every stage of the process: from unit testing to preserve individual service logic, to integration tests which preserve the functionality of the application as a whole, to user and edge load testing and user experience testing, which ensure enterprise readiness. So testing is a critical component, but the process of dev ops never stops. In order to effectively test the system, a comprehensive monitoring solution is also required.

Once the application is tested and the changes pushed into production, there is no knowing with certainty that everything will run smoothly indefinitely. Random spikes in usage, changes in server bandwidth or availability, or any other unforeseeable factors can come along and cause issues for a system. If these issues aren’t detected and addressed early, then they can very rapidly morph into much larger problems: outages, data loss, inflated data tables which are hard to revert due to their size. It is critical to detect performance issues on a system as early as possible, and to have as much information as is necessary to figure out where the problem is heading and what may have started it. Monitoring is key to a healthy system.

CI/CD stands for “Continuous Integration/Continuous Deployment”, a never-ending cycle of improvement. Testing just once before the initial go live isn’t enough. Each system should have automated tests that run continuously, as well as monitors and alerts which reveal problems sooner. Diagnostic tools play a role as well, being the bridge from the end of the dev ops process cycle back to the beginning (monitoring into planning).
A good CI/CD dev ops process will ensure that problems are found earlier, fixed more rapidly, and fixed for everyone using the system.

In a fully mature dev ops pipeline, issues are anticipated, discovered and researched before they become production outages or critical issues. These investigations or testing follow-ups produce development tasks (usually bugs, but also features at times) which then start the dev ops cycle all over again. This is why a good, efficient dev ops pipeline is needed, one which allows changes to quickly and safely go from development to production.

This is also why diagnostic tools play a role in the monitoring piece of the dev ops process. They are the bridge between monitoring and planning. Tools like Dynatrace can be configured to provide call stacks and take thread dumps when issues start to occur, before the system is performing so poorly that it needs a restart, which happens automatically in a cluster and can clear out any trace of the issue.

Thread dumps are often necessary for diagnosing the root cause of an issue (to permanently fix it), and doing so quickly ensures application stability and availability. That is, after all, the purpose of the dev ops process. Diagnostics is therefore an equally important piece of the dev ops Figure-8-shaped pie, and one which deserves its own spotlight in an article to come.

Every piece of the dev ops process must be viewed as equally important in its own way, lest the dev ops cycle get hung up on bottlenecks of its own. A safe and stable system is not one which never experiences issues; it is one which has a good, efficient plan in place to handle recovery and prevent repetition. A wholesome dev ops process is a happy dev ops process.
The Monitoring Stack

There are many monitoring options available, but in our experience one of the easiest and most effective monitoring stacks to use with ThingWorx is Prometheus for metrics gathering with Grafana for metrics analysis and review. In a mature monitoring stack, Telegraf is also commonly installed on each VM/host to gather the system metrics (like CPU and memory usage, things we’ve stated are good metrics of system performance and stability in past articles on scale and size testing) and output them in Prometheus format.

Prometheus is a highly scalable open-source monitoring framework that contains out-of-the-box monitoring and alert capabilities for Kubernetes-based deployments (not covered in this article). Using Prometheus is very simple because the ThingWorx application exposes a metrics endpoint which is formatted directly for use by Prometheus. There is also built-in alerting in Prometheus, but not the ability to form dashboards for reviewing data or screenshotting it for documentation purposes. That’s where Grafana comes into play. Grafana has a preconfigured Prometheus-type data source and many preconfigured dashboard templates for various applications and services. Telegraf is also easily imported into Grafana, as is shown in the section below.

The Prometheus targets in the larger diagram are expanded out on the left. For each target, some tool exports the data in a syntax which Prometheus can scrape. For VMs, this can be Telegraf; for Kubernetes, the Node Exporter. The JVM has a JMX Exporter, and other tools like CX Server use Graphite. Many apps already have a Prometheus endpoint built in, like ThingWorx and Zookeeper. Telegraf is not strictly necessary; the node exporter can also be used on VMs, but Telegraf is the more common choice since it is a more mature dev ops tool.
Once Prometheus is scraping the targets, alerting on them can be done with OOTB Prometheus functionality, and dashboards for monitoring can be made easily in Grafana (with built-in support as well). This stack does not include the diagnostics piece, something which triggers thread dumps or the like when issues do occur. There are too many ways to conduct a successful diagnostic piece to cover here.

How to Get Started

Getting started monitoring a ThingWorx application is incredibly easy in the latest versions. Simply open up a browser and type in the ThingWorx URL, followed by “/Metrics”. This endpoint returns a specially formatted response containing subsystem and service data, which the Prometheus monitoring software can read automatically. In addition to the application metrics, Prometheus can be configured to collect metrics from a node exporter at the (virtualized) operating system or container (Kubernetes) level as well.

If you haven’t already, install Grafana, install Telegraf as a service, and install Docker Desktop. These are the tools required (in addition to ThingWorx of course) to set up a simple sandbox system for familiarization with the monitoring stack recommended by PTC. The easiest way to try Prometheus on a local Windows instance is to use Docker. The command for that will be found below, but first open up Docker Desktop to set contextual parameters that the command line will need. Then, modify the configuration file for Telegraf or create one (called telegraf.conf, in the same folder as the exe file), and put the following into the file (or uncomment it; the default config file has thousands of lines, so just search for “prometheus”):

    # Output plugin
    [[outputs.prometheus_client]]
      listen = ""

The listen value sets the address and port on which Telegraf serves its metrics; it has to match the port of the Telegraf target in the Prometheus configuration (9125 in this walkthrough).

Alternatively, install the Prometheus Node Exporter tool, which will likely require some additions to the Prometheus config file (not covered here) which we are about to create.
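To get a feel for what the “/Metrics” endpoint described above returns, the response can also be read with a short script. The sketch below is a simplified reader for the Prometheus text format, not a complete parser, and it uses only the Python standard library. The function names are hypothetical, the credentials and base URL are whatever your sandbox uses, and the x-thingworx-session query parameter mirrors the scrape configuration in this article.

```python
import base64
import re
from urllib.request import Request, urlopen

# Metric name, optional {labels}, value, optional timestamp.
_SAMPLE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$")


def parse_metrics(text):
    """Map each series (name plus label set) to its float value."""
    series = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore anything this simplified sketch cannot read
        name, labels, value = match.group(1), match.group(2) or "", match.group(3)
        series[name + labels] = float(value)
    return series


def fetch_metrics(base_url, username, password):
    """Fetch <base_url>/Metrics with basic auth and without creating a session."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = Request(
        f"{base_url}/Metrics?x-thingworx-session=false",
        headers={"Authorization": f"Basic {token}"},
    )
    with urlopen(request, timeout=10) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

For a default local installation, something like fetch_metrics("http://localhost:8080/Thingworx", user, password) would return a dictionary of the exposed series, which is a quick way to browse the available metric names before building alerts.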
Then, create a configuration file (called prom_config_localhost_scraper.yml in the command to come) and add the following (assuming a standard localhost installation of ThingWorx):

    # my global config
    global:
      scrape_interval: 45s
      evaluation_interval: 30s
      scrape_timeout: 30s  # overrides the Prometheus default of 10s

    rule_files:
      - prom_config_rules.yml

    scrape_configs:
      - job_name: thingworx
        static_configs:
          - targets: ['host.docker.internal:8080']
        basic_auth:
          username: "Administrator"
          password: "admin!123456789"
        metrics_path: /Thingworx/Metrics
        scheme: http
        params:
          x-thingworx-session:
            - "false"
      - job_name: prometheus
        static_configs:
          - targets: ['localhost:9090']
      - job_name: Telegraf
        # If telegraf is installed, grab stats about the local
        # machine by default.
        static_configs:
          - targets: ['host.docker.internal:9125']

This example configuration uses host.docker.internal instead of localhost as the target for ThingWorx because ThingWorx is running outside of the Docker container which contains Prometheus. This yml file configures Prometheus to monitor both ThingWorx and itself, as well as the server metrics coming from Telegraf (as long as Telegraf is configured to expose them). It’s a sandbox-only configuration, really, as you wouldn’t want to use the Administrator user, or have the password printed in plain text in the config file, in a real system. Also note the need for the x-thingworx-session parameter: without it, each scrape (every 45s here, or whatever the scrape interval is) would spawn a new session, and these runaway sessions would result in memory issues over time.

The rules file referenced here (prom_config_rules.yml) needs to be created separately. This is where all of the alert rules will be defined. The rules determine whether an alert state is happening, but without configuring the alert manager, there won’t be any notification. That isn’t covered here but is covered extensively in the Grafana docs.
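Two details of the configuration above are easy to check with a few lines of Python: Prometheus rejects a scrape_timeout that is longer than the scrape_interval, and the scrape interval determines how many sessions would pile up per day if sessions were not disabled. The sketch below is an illustration with hypothetical function names; it handles simple Prometheus-style durations such as 45s or 1m30s.

```python
import re

_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}
_PART = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")


def parse_duration(text):
    """Convert a duration such as '45s' or '1m30s' into seconds."""
    position = 0
    total = 0.0
    for match in _PART.finditer(text):
        if match.start() != position:  # unexpected characters between parts
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    if position == 0 or position != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total


def timing_problems(scrape_interval, scrape_timeout):
    """Return a list of human-readable problems with the two settings."""
    problems = []
    if parse_duration(scrape_timeout) > parse_duration(scrape_interval):
        problems.append("scrape_timeout is greater than scrape_interval")
    return problems


def scrapes_per_day(scrape_interval):
    """Number of scrapes, and so potential sessions, in 24 hours."""
    return 86400 / parse_duration(scrape_interval)
```

With the values used here, timing_problems("45s", "30s") reports nothing, and scrapes_per_day("45s") comes to 1920, which shows why leaving sessions enabled for the scraper adds up quickly.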
Here is an alert example:

    groups:
      - name: alert.rules
        rules:
          # Alert when memory used stays above the threshold for at least 1 second.
          - alert: HighMemory
            expr: mem_used > 14000000
            for: 1s
            labels:
              severity: page
            annotations:
              summary: "High Memory"
              description: "Localhost Memory Usage is High"

Now, save these files and use PowerShell to run the Docker container:

    docker run -p 9090:9090 -v C:\<path_to_document>\prom_config_localhost_scraper.yml:/etc/prometheus/prometheus.yml prom/prometheus

Note that this command mounts only the scraper configuration. Because rule_files gives a relative path, Prometheus looks for prom_config_rules.yml next to prometheus.yml inside the container (in /etc/prometheus/), so the rules file needs its own -v mount to that location before the alert rule can load.

The command should download Prometheus and install it in that container (if this is the first time), allowing you to very rapidly deploy it to an endpoint of localhost:9090 by default. If the command fails with a Docker error at this point, the most likely cause is that Docker Desktop (the application) was not started before opening PowerShell. Docker Desktop sets system parameters required for containers to run in a command line (in Linux, it should work if Docker is installed for use by the command line, simple as that).

The localhost endpoints are accessible in a browser. ThingWorx defaults to the localhost:8080 endpoint. Prometheus defaults to localhost:9090. Telegraf is on port 9125. Open any of these in a browser tab to see the full monitoring stack. You can easily see if Prometheus is working by clicking “Status” > “Targets” at localhost:9090.

If all of the targets appear as blue and say “last scrape” and a time stamp, then they’re working as expected. If they don’t, ensure you have the right ports, that there aren’t any firewall issues (if things aren’t all on localhost), and that everything is running without errors.

The last step in the process here is to install a dashboard tool like Grafana. Once this is installed and running on localhost:3000 (by default), you can display the data from Prometheus with a few configuration steps in the Grafana UI. Hover over the settings icon in the bottom left of the screen, and then click on “Data sources”.
Select the “Add data source” button, and then click on Prometheus. You have to type the URL again (localhost:9090), but most of the defaults will be ok here, and all you have to do is click “Save and test”.

Now the scraped targets should appear within Grafana, with their metrics showing up throughout the Grafana UI. This data source is what allows for the building of monitoring dashboards.
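The target check done earlier on the “Status” > “Targets” page can also be scripted against the Prometheus HTTP API, which is handy when a target stops showing up in Grafana. The sketch below assumes the sandbox Prometheus at localhost:9090 and reads the /api/v1/targets endpoint; the function names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url="http://localhost:9090"):
    """Return the decoded JSON body of the Prometheus targets API."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=10) as response:
        return json.load(response)


def unhealthy_targets(payload):
    """List (job, scrape URL, last error) for every active target that is not up."""
    problems = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") != "up":
            problems.append(
                (
                    target.get("labels", {}).get("job", "?"),
                    target.get("scrapeUrl", "?"),
                    target.get("lastError", ""),
                )
            )
    return problems
```

Calling unhealthy_targets(fetch_targets()) returns an empty list when every job in the scrape configuration is being scraped successfully; otherwise the last error usually points at a wrong port, bad credentials, or a firewall rule.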