IoT Tips

Interested in learning how others using and/or hosting ThingWorx solutions can comply with various regulatory and compliance frameworks? Based on inquiries regarding the ability of customers to meet a wide range of obligations – ranging from SOC 2 to ISO 27001 to the Department of Defense's Cybersecurity Maturity Model Certification (CMMC) – PTC's IoT Product Management and EDC teams have collaborated on a set of detailed articles explaining how to do so. Please check out the ThingWorx Compliance Hub (support.ptc.com login required) for more information!
View full tip
The New and Improved DGIS Guide to ThingWorx Development Written by Victoria Firewind of the IoT EDC   The classic Developing Great IoT Solutions guide has been reskinned and revamped for newer versions of ThingWorx! The same information on how to build a quality IoT application is now available for versions of ThingWorx 9.1+, and now, a complete sample application is included to demonstrate these ideas.    Find within the attached archive a PDF with high-level overview information on development and application design geared towards managers and business users, so that everyone can understand the necessary requirements, common terms, and key tips on how to ensure an application is scalable and maintainable right from the very start. Reduce your chances of running into issues between PoC and Go Live by reviewing this information today!   Also find within this PDF a series of tutorials which teach not just how to use the ThingWorx software, but which also educate on how to make good application design choices. A basic rules engine for sending real-time notifications is included here, as well as a complete demo application which illustrates each concept in a real-world use case. This Coffee Machine Demo App relies upon the tutorial entities, which can also now be imported directly using the other XML files provided here. This ensures that anyone can review these concepts, regardless of how much time one can commit or how much knowledge one already has on the subject.   This is a complex guide, and any issues, questions, or bugs found within can be reported right here on this thread. Happy developing from the IoT EDC!
View full tip
By Tim Atwood and Dave Bernbeck, Edited by Tori Firewind
Adapted from the March 2021 Expert Session Produced by the IoT Enterprise Deployment Center

The primary purpose of monitoring is to determine when your application may be exhausting the available resources. Knowledge of the infrastructure limits helps establish these monitoring boundaries, determining straightforward thresholds that indicate an app has gone too far. The four main areas to monitor in this way are CPU, Memory, Networking, and Disk.

For the CPU, we want to know how many cores are available to the application and potentially the temperature of each core or other indicators of overtaxation. For Memory, we want to know how much RAM is available for the application. For Networking, we want to know the network throughput, the available bandwidth, and how capable the network cards are in general. For Disk, we keep track of the read and write rates of the disks used by the application, as well as how much space remains on them.

There are several major infrastructure categories which reflect common modes of operation for ThingWorx applications. One is Bare Metal, which relies upon the traditional use of hardware to connect the operating system directly to the hardware, with no intermediary. Limits of the hardware in this case can normally be found in the manufacturer's specifications, within the operating system settings, or on record with the IT department. The IT team is a great resource for obtaining these limits in general, and it also keeps track of such things in VMware and other virtualized infrastructure models.

VMware is an intermediary between the operating system and the hardware, and often its limits are determined based on the sizing of the application and set by the IT team when the infrastructure is established. These can often be resized as needed, and the IT team will be well aware of the limits here, often already monitoring some of the performance themselves. This is especially so if Cloud Providers are used, given that these are scaled-up virtualizations which are configured in easy-to-use cloud portals. These two infrastructure models can also be resized as needed.

Lastly, Containers can be used to designate operating system resources as needed, in a much more specific way that better supports the sharing of resources across multiple systems. Here the limits are defined in the configuration files or charts that define the container.

The difficulties here center around learning what the limits are, especially in the case of network and disk usage. Network bandwidth can fluctuate, and increased latency and network congestion can occur at random times for seemingly no reason. Most monitoring scenarios can therefore make do with collecting network send and receive rates, as well as disk read and write rates, measured on the server.

Cloud Providers like Azure provide VM and disk sizing options that allow you to select exactly what you need, but for network throughput or network IO, the choices are not as varied. Network IO tends to increase with the size of the VM, proportional to the number of CPU cores and the amount of Memory, so this may mean that a VM has to be oversized for the user load, for the bulk of the application, in order to accommodate a large or noisy edge fleet. The next sections list the operating metrics and common thresholds used for each.
We often use these thresholds in our own simulations here at PTC, but note that each use case is different, and each situation should be analyzed individually before determining set limits of performance.

Generally, you will want to monitor: % utilization of all CPU cores, leaving plenty of room for spikes in activity; total and used memory, ensuring total memory remains constant throughout and used memory remains below a reasonable percentage of the total, which for smaller systems (16 GB and lower) means leaving around 20% of Memory for the OS, and for larger systems, usually around 3-4 GB.

For disks, monitor the read and write rates and ensure there is ample free space for spikes, avoiding any situation that might result in system downtime; for networking, monitor the send and receive rates, which should stay below 70% or so of capacity, again to leave room for spikes.

In any monitoring situation, consistently high utilization should trigger concern and an investigation into what's happening. Were new assets added? Has any recent change caused regression or other issues? Any recent changes should be inspected, and the infrastructure sizing should be considered as well.

For ThingWorx-specific monitoring, we look at max queue sizes, entries performed, pool sizes, alerts, submitted task counts, and anything that might indicate some kind of data loss. We want the queues to be consistently cleared out to reduce the risk of losing data in the case of an interruption, and to ensure there is no reason for resource use to build up and cause issues over time.

In order for a monitoring set-up to be truly helpful, it needs to make certain information easily accessible to administrative users of the application. Any metrics that are applicable to performance need to be processed and recorded in a location that can be accessed quickly and easily from wherever the admins are. They should know the health of the application at a glance, without needing to drill down a lot to be made aware of issues. Likewise, the alerts that happen should be meaningful, with minimal false alarms, and it is best if this is configurable by the admins from within the application via some sort of rules engine (see the DGIS guide, soon to be released in version 9.1). The monitoring tool should also be able to save the system history and export it for further analysis, all in the name of reducing future downtime and creating a stable, enterprise system.

This dashboard (above) is a good example of how to roll up a number of performance criteria into health indicators for various aspects of the application. Here there is a Green-Yellow-Red color-coding system for issues like web requests taking longer than 30s, 3 minutes, or more to respond.

Grafana is another application used for monitoring internally by our team. The easy dashboard creation feature and built-in chart modes make this tool easy to get started with, and certainly easy to refer to from a central location over time. Setting this up is helpful for load testing and for making an application ready, but it is also beneficial for continued monitoring post-go-live, which is why it is a worthy investment. Our team usually builds a link based on the start and end time of tests for each simulation performed, with all of the various servers being monitored by one Grafana server, one reference point.

Consider using PTC Performance Advisor (also called DynaTrace) to help monitor these kinds of things more easily.
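As a rough illustration of the ThingWorx-side metrics mentioned above (queue sizes, submitted tasks, and so on), the sketch below polls the event and value stream processing subsystems and logs a warning when a queue metric grows large. It assumes the GetPerformanceMetrics service is available on those subsystems in your ThingWorx release and that the returned rows carry name/value pairs; the threshold and field names are illustrative, so verify them against your version before treating this as more than a starting point.

// Rough sketch: poll ThingWorx subsystem metrics and flag queues that are building up.
// Assumes GetPerformanceMetrics is available on these subsystems in your ThingWorx release
// and that the returned InfoTable rows carry name/value pairs -- verify both before relying on this.
var queueWarningThreshold = 100000; // illustrative threshold; tune per use case
var subsystemsToCheck = ["EventProcessingSubsystem", "ValueStreamProcessingSubsystem"];

for (var s = 0; s < subsystemsToCheck.length; s++) {
    var subsystemName = subsystemsToCheck[s];
    var metrics = Subsystems[subsystemName].GetPerformanceMetrics();
    for (var i = 0; i < metrics.rows.length; i++) {
        var row = metrics.rows[i];
        // Flag anything that looks like a growing queue for admin attention.
        if (String(row.name).toLowerCase().indexOf("queue") >= 0 && Number(row.value) > queueWarningThreshold) {
            logger.warn(subsystemName + ": " + row.name + " = " + row.value);
        }
    }
}

Running something like this from a scheduler subscription, or scraping the same values into Grafana, provides the at-a-glance health view described above.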
When most administrators think of monitoring, they think of reading and reacting to dashboards, alerts, and reports. Rarely does the idea of benchmarking come to mind as a monitoring activity, and yet having successful benchmarks of system performance can be a crucial part of knowing whether an application is functioning as expected before there are major issues. Benchmarks also look at the response time of the server and can better enable tracking of the actual end user experience. The best option is to automate such tests using JMeter or other applications, producing a daily snapshot of user performance that can anticipate future issues and create a more reliable experience for end users over time.

JMeter also has the option to build custom reports. JMeter is good for simulating the user load, which often makes up most of the server load of a ThingWorx application, especially considering that ingestion is typically optimized independently and given the most thought. The most unexpected issues tend to pop up within the application itself, after the project has gone live.

Shown here (right) is an example benchmark from a Windchill application, one which is published by PTC to facilitate comparison between optimized test systems and real-life performance. Likewise, DynaTrace is depicted here, showing an automated baseline (using Smart URL Detection) on Response Time (median and 90th percentile) as well as Failure Rate. We can also look at Throughput and compare it with the expected value range based on historical throughput data.

Monitoring typically increases system performance and availability, but its other advantage is to provide faster, more effective troubleshooting. Establish a systematic process or checklist to step through when problems occur, something that is organized to be done quickly, but that still takes the time to find and fix the underlying problems. This will prevent issues from happening again and again, and it polishes the system periodically as problems occur, so that the stability and integrity of the system only improves over time. Push for real solutions if possible, not band-aids, even if more downtime is needed up front; it is always better to have planned downtime up front than unplanned downtime down the line. Close any monitoring gaps when issues do occur, which is the valid RCA response if not enough information was captured to actually diagnose or resolve the issue.

PTC Tech Support developed a diagnostic data gathering query for Oracle that customers can use, found in our knowledgebase. This is an example of RCA troubleshooting that looks at different database factors, reporting on which queries perform the worst based on the input criteria. Another example of troubleshooting is for the Java JVM, where we look at all of the things listed here (below) in an automated, documented process that then generates a report for easy end user consumption.

Don't hesitate to reach out to PTC Technical Support in advance to go over your RCA processes, to review benchmark discrepancies between what PTC publishes and what your real-life systems show, and to ensure your monitoring is adequate to maintain system stability and availability at all times.
View full tip
How to Scale Vertically and Horizontally, and When to Use Sharding
Written by Mike Jasperson, VP of IOT EDC

Deployment architecture describes the way in which an IOT application is deployed, or where each of the components is hosted on the network. There are deployment architecture considerations to make when scaling up an application. Each approach to deployment expansion can be described by the "eggs in a basket" analogy: vertical scale is like one person carrying a bigger basket, horizontal scale is like one person carrying more baskets, and sharding is like more than one person carrying the baskets (see below).

All of these approaches result in the eggs getting from point A to point B (they all satisfy the use case), but the simplest (vertical scale) is not necessarily the best. Sure, it makes sense on paper for one person to carry everything in one big basket, but that doesn't ensure that all of the eggs arrive intact. Selecting the right deployment architecture is a way to ensure the use cases are satisfied in the best and most efficient ways possible, with the least amount of application downtime or data loss.

Vertical Scale - a.k.a. "one person carrying a bigger basket"
The most common scalability approach is to simply size the IOT server larger, or scale up the server. This might mean the server is given additional CPU cores, faster CPU clock speeds, more memory, faster disks, additional network bandwidth or improved network cards, and so on. This is a very good idea when the application logic is increased in complexity, when more data is therefore needed in memory at a time, or when the processing of said data has to occur as quickly as possible. For example, adding additional devices to the fleet increases the size of the "Thing Model" in the process and will require additional heap memory to be available to the Foundation server.

However, there are limitations to this approach. Only so many concurrent operations and threads can be performed at once by a single server. Operations trying to read and write to the disks at once can introduce bottlenecks and reduce server performance. Likewise, "one person, one basket" introduces a single-point-of-failure operating risk. If for some reason the server's performance does degrade or cease altogether, then all of the "eggs" go down with it. Therefore, this approach is important, but usually not sufficient on its own for empowering an enterprise-level deployment.

Horizontal Scale - a.k.a. "one person, with two or more baskets"
As of ThingWorx version 9.0, Foundation servers can be deployed in clusters, meaning more baskets to carry the eggs. More baskets means that as long as even one of these servers is active, the application remains up in the event of an individual node failure or maintenance. So clustered deployments are those which facilitate High Availability.

Clustered servers save on some resources, but not others. For instance, every server in a cluster will need to have the same amount of memory, enough to store the entire Thing Model. Each of the multiple baskets in our analogy has to have the same type of eggs. One basket can't have quail eggs if the rest have chicken eggs. So, each server has to have an identical version of the application, and therefore enough memory to store the entire application.

Also keep in mind that not all application business logic can scale horizontally.
Event queues are local to each ThingWorx node, so the events generated within each node are processed locally by that particular node, and not the entire network (examples are timer and scheduler-based activities). Likewise, data ingestion done through an extension or other background process, like MQTT, emits events within a node that therefore must be processed by that particular node, since that's where the events are visible. On the other hand, load distribution that happens external to ThingWorx in either the Connection Server (for AlwaysOn based data coming from ThingWorx SDKs, EMS, eMessage agents, or Kepware) or REST API calls through a load balancer (i.e. user activity) will be distributed across the cluster, facilitating greater scaling potential in terms of userbase and mashup complexity. Also note that batched data will be processed by the node that received it, but different batches coming through a connection server or load balancer will still be distributed.   Another consideration with clusters pertains to failure modes. While each node in the cluster shares a cache for many things, Stream and Value Stream queues are only stored locally. In the event of a node failure, other nodes will pick up subsequent requests, but any activity already queued on the failed node will be lost. For use cases where each and every data point is critical, it is important to size each node large enough (in other words, to vertically scale each node) such that queue sizes are constantly kept low and the data within them processed as quickly as possible. Ensuring sufficient network and database throughput to handle concurrent writes from the many clustered nodes is key as well.   Once each node has enough resources to handle local queues, the system is highly available with low risk for outages or data loss. However, when multiple use cases become necessary on single deployments, horizontal scaling may no longer be enough to ensure things run smoothly. If one use case is logic-heavy, something non-time-critical which processes data for later consumption, it can use too many resources and interfere with other, lighter but more time-critical use cases. Clustering alone does not provide the flexibility to prioritize specific operations or use cases over others, but sharding does.   Sharding - a.k.a. "more than one person carrying baskets" “Sharding” generally refers to breaking up a larger IOT enterprise implementation into smaller ones, each with its own configuration and resources. More server maintenance and administration may be required for each ThingWorx implementation, but the reduction in risk is worth it. If each of the use cases mentioned above has its own implementation, then any unexpected issues with the more complex, analytical logic will not affect the reaction time of operators to time-sensitive matters in the other use case. In other words, “don’t keep all your eggs in one basket”.   The best places to break up an implementation lie along logical boundaries already accepted by the business. Breaking things down in other ways might look nice on paper, but encouraging widespread adoption in those cases tends to be an uphill battle.   In connected products use cases, options for boundaries could be regional, tied more towards business vertical, or centered around different products or models.  These options can be especially beneficial when data needs to stay in particular countries or regions due to regulatory requirements. 
In connected operations use cases, the most common logical boundaries would be site-based, with smaller IOT implementations serving just a smaller number of related factories in a particular area. Use-case or product-line boundaries can also make sense here, in-line with the above comments about keeping production-critical or time-sensitive use cases isolated from interference from business-support and analysis use cases.     Ideally, a shard model will put the IOT implementation “closer” to both the devices communicating with it and the users that interact with the data. This minimizes the amount of data to be sent or received over long distances, reducing the impact from bandwidth and latency on performance. When determining which approach is best, consider that smaller, more focused implementations offer more flexibility, but are harder to manage. Having different versions of the same applications deployed in multiple places can easily become a maintainability nightmare. It’s therefore best not to combine a regional model with a use case model when it comes to determining sharding boundaries. Also consider using deployment automation tools like Solution Central. These enable tracking and managing version-controlled deployments to multiple IOT implementations, whether they are deployed on-premise, or in the cloud, giving one central location of all source code.   Another benefit to sharding is the focused investment of server resources in a more targeted way. For instance, if one region is larger than another, it may need more CPU and memory. Or, perhaps only part of an application requires High Availability, the time-critical use cases which are best suited to small, clustered deployments. The larger, analysis-centered use cases can then remain non-clustered (but still vertically scaled of course). Sharding can also make access control simpler, as those who need access to only one region or use case can just be given a user account on that particular shard.   However, certain use cases need data from more than one shard in order to operate, turning the data storage and access control benefits into challenges. Luckily, ThingWorx has an excellent toolkit for overcoming such issues. For one thing, REST API calls are readily available in ThingWorx, allowing each shard to exchange information with each other, as well as other enterprise data systems, like ERP or Service Ticketing. Sometimes, lower-level data replication strategies are the way to go, say if downsampling or data transfer from one store to another is necessary, and built-in database tools can more easily handle the workload. Most of the time, however, REST API calls are used to define the business logic within the application layer so that copying data actively between shards is unnecessary, using fewer resources to control what information is shared overall.   There are several design approaches for REST API communication between shards, the two most common being the “peer” model and the “layered” model. In a peer model, one shard may call upon another shard using REST whenever it needs more information. In a layered model, there are “front-line” shards which handle most (if not all) of the device communication and time-critical use cases, things which require only the information in one shard to operate. Then there are also “back-line” shards that aggregate data from the many front-line shards, performing any operations that are less time-sensitive and more complex or analytical.   
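To make the peer and layered models more concrete, here is a minimal, hedged sketch of one shard calling a service on another shard over REST from within a ThingWorx service. The shard URL, AppKey, Thing, and service names are placeholders, and the ContentLoaderFunctions parameters should be verified against the ThingWorx version you run.

// Minimal sketch of a peer-model REST call from one shard to another.
// The URL, AppKey, Thing, and service names below are placeholders -- substitute your own,
// and confirm the ContentLoaderFunctions parameters against your ThingWorx version.
var params = {
    url: "https://shard-emea.example.com/Thingworx/Things/AssetSummary/Services/GetRegionSummary",
    headers: {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "appKey": "00000000-0000-0000-0000-000000000000" // placeholder application key
    },
    content: { region: "EMEA" }, // placeholder service input
    timeout: 30                  // seconds; keep cross-shard calls bounded
};

// Returns the remote shard's response without replicating its data locally.
var result = Resources["ContentLoaderFunctions"].PostJSON(params);

In a layered model, the same kind of call would typically run on a back-line shard on a schedule, aggregating results from several front-line shards rather than querying them on demand.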
For any of these approaches, it remains important to keep your data archival and purge strategy in mind. It is a best practice in ThingWorx to only retain as much data as is absolutely necessary, purging the rest periodically. If the front-line shards only ever need the last 7 days of raw data for 5 properties, plus the last 52 weeks of min/max/average data, then implementing an approach where each shard computes the min/max/average values and then archives the older data to a shared "data lake" before purging it would be ideal. This data lake then serves as the data store for all back-line shard operations.

There is also the option to consider sharing some infrastructure between ThingWorx instances when using sharding in a deployment, which can create more flexible, scalable architectures, but which can also mean that more than one shard is affected when issues occur on only one of them. For instance, a common shared infrastructure piece is at the database level; each ThingWorx instance needs its own database, but a single database server instance (such as a PostgreSQL HA cluster) could serve separate database namespaces to multiple ThingWorx instances. This is an attractive option where an existing enterprise-scale database infrastructure with experienced DBAs is already in place.

Similarly, load balancers can often be configured to manage load for multiple servers or URLs. If properly configured, a single load balancer could direct traffic for multiple applications, but it can also create a bottleneck for inbound connections if not properly sized. Load balancers designed for High Availability can also be considered. Apache Zookeeper is another tool often deployed once for an entire cluster to monitor the health and availability of individual components, or to vote them in or out of operation if problems are detected. With all of these options, remember that sharing infrastructure can reduce the overall infrastructure complexity, but it increases the chances of an issue on one ThingWorx system spreading to the next, and it comes at the cost of increased administrative complexity.

Bringing it All Together
Vertical and Horizontal scale are both effective ways to add more capacity and availability to software implementations, but there are typically some diminishing returns on the investment in additional infrastructure. For example, consider two large, vertically-scaled implementations – one running on a VM with 64 vCPUs and 256 GiB RAM, and another running on a VM with 96 vCPUs and 384 GiB of RAM. While the 96-core server has 1.5 times the compute capacity, in sizing tests with 1 million simulated assets, these two systems tend to fall behind on WebSocket execution at approximately the same point.

In a horizontal scale example, with two nodes each sized the same (64 vCPUs and 256 GiB RAM), one would expect High Availability, where one node picks up the other's slack in a failover scenario. However, what if that single node can't handle the entire workload? Should both machines be sized vertically such that either can take on the full load, and if so, what is the cost-benefit of that? It would be less expensive in this case to have a third server.

Optimizing the deployment architecture for a ThingWorx application will therefore usually involve a blended approach. With more than two nodes, High Availability is more readily obtained, as two servers can almost certainly share the load of the third, failed node.
Likewise, some workload aspects do not scale well until multiple additional nodes are added. For instance, spreading out the user load from mashup requests across multiple nodes, to give the singleton more resources for the tasks it alone can perform, doesn't have much benefit if there are just two nodes.

However, with horizontal scaling alone, the servers may still need to be vertically scaled larger than is ideal in terms of cost. Each one has to hold the entire Thing Model in memory, which means that sometimes some of the nodes may be oversized for the tasks actually performed there. Sharding allows each node to have a different Thing Model as necessary based on the boundaries selected, which can mean saving on costs by sizing each server only as large as it really needs to be.

So, a combination of approaches is typically the best when it comes to deployment architecture. The key is to break things up as much as possible, but in ways that make sense. Determine where the boundaries of the shards will be such that each machine can be as light and focused as possible, while still not introducing more work in terms of user effort (having to access two systems to get the job done), application development (extra code used to maintain multiple systems or exchange information between them), and system administration (monitoring and maintaining multiple enterprise systems).

Find the right balance for your systems, and you'll maximize your cost-benefit ratio and get the most out of your ThingWorx application. Happy developing!
View full tip
Thread Safe Coding, Part 2: The Database Locker Approach and Comparison
Written by Desheng Xu and edited by @vtielebein

Overview
This is the second post on this topic, describing an alternate approach to thread safe coding which does not require the Java extension. The demo use case here is the same as in the previous post, and there is a section at the end comparing the two approaches.

Database Locker for Thread Safe Coding
The database locker is an advanced topic, so some experience with the database thing is assumed. The following steps demonstrate how to be thread safe with a database thing.

Create New Database Instance, and New Table for counter
It is strongly recommended that a new database instance be created outside of the ThingWorx database schema. This guide will NOT include instructions to create the new database instance. Use the following SQL commands to create a new table:

DROP TABLE IF EXISTS counters;
CREATE TABLE counters (
    name VARCHAR(100) UNIQUE,
    value INTEGER NULL,
    PRIMARY KEY(name)
);
INSERT INTO counters VALUES ('DemoCounter', 0);

This will create a new table called counters, initializing the first counter, called DemoCounter, with the value 0.

Create a Function to Increase and Return the New counter Value
Use the following sample code to create a table lock function:

CREATE OR REPLACE FUNCTION IncreaseCounter(counter_name VARCHAR(100), OUT newvalue INTEGER) AS $$
BEGIN
    LOCK TABLE counters IN ACCESS EXCLUSIVE MODE;
    SELECT (SELECT value FROM counters WHERE name = $1) + 1 INTO newvalue;
    UPDATE counters SET value = newvalue WHERE name = $1;
END;
$$ language plpgsql;

Or use the following SQL command to create a new row level locker function:

CREATE OR REPLACE FUNCTION IncreaseCounter(counter_name VARCHAR(100), OUT newvalue INTEGER) AS $$
BEGIN
    SELECT value INTO newvalue FROM counters WHERE name = $1 FOR UPDATE;
    newvalue := newvalue + 1;
    UPDATE counters SET value = newvalue WHERE name = $1;
END;
$$ language plpgsql;

Create a Database Thing
Create a thing with the template "database" within ThingWorx, and use the PostgreSQL driver to connect to the new database instance created above.

Create New Services in the Database Thing
The service IncreaseCounterDB would be a SQL Query service:

SELECT * FROM public.IncreaseCounter([[counter_name]]);

counter_name would be the input parameter, a STRING which is marked as required. The service GetCounterDB would be another SQL Query service:

SELECT value FROM public.counters WHERE name = [[counter_name]] LIMIT 1;

counter_name would be another input parameter, a STRING which is also marked as required. The service ResetCounterDB would be a SQL Command service:

UPDATE public.counters SET value = 0 WHERE name = [[counter_name]];

counter_name is yet another input parameter, also a STRING and also required.

Wrap the Database Thing Service
The above database thing services return an InfoTable, not an integer. If it's inconvenient to use an InfoTable, wrap each service up into a local Javascript service that returns an integer value.
The service IncreaseCounter wraps IncreaseCounterDB and returns an integer value:

// result: INFOTABLE dataShape: ""
var query_result = me.IncreaseCounterDB({
    counter_name: 'DemoCounter' /* STRING */
});
var result = query_result.rows[0]["newvalue"];

Similarly, GetCounter wraps GetCounterDB:

// result: INFOTABLE dataShape: "SingleIntegerDatashape"
var query_result = me.GetCounterDB({
    counter_name: 'DemoCounter' /* STRING */
});
var result = query_result.rows[0]["value"];

And ResetCounter wraps ResetCounterDB:

// result: NUMBER
var query_result = me.ResetCounterDB({
    counter_name: 'DemoCounter' /* STRING */
});
var result = 0;

Run the Test Again
If necessary, head back to the previous post to obtain the tool. Then just change the endpoint and run a new test:

{
    "host": "twx85.desheng.io",
    "port": 443,
    "protocol": "https",
    "endpoint": "/Thingworx/Things/DatabaseDemo/services/IncreaseCounter",
    "headers": {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "AppKey": "5cafe6eb-adba-41df-a7d6-4fc8088125c1"
    },
    "payload": {},
    "round_break": 50000,
    "req_break": 0,
    "round_size": 50,
    "total_round": 20
}

Run:

Validate the Result
Execute the service GetCounter to validate the result:

Overall Performance Comparison
The Java Extension performance looks the best here, but the database row lock will perform better if there are multiple counters.

InfoTable Type Property
InfoTable properties have the same thread-safe challenges discussed previously, but they also have some additional challenges due to the way data change events are triggered. This is outside of the scope of this document, but it is worth a very brief mention here.

In general, the data change event for an InfoTable fires when the reference to the table is updated, and not the contents of the table. If the values of an InfoTable are updated directly, say by adding or removing a row, then the data change event will not be triggered because the value has technically not changed. Instead, the InfoTable has to be cloned, then modified, and then assigned back to the Thing so that the reference changes as well. Such additional considerations must be made when using other property types than those shown here.
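A minimal sketch of that clone-modify-reassign pattern is shown below; "IssueLog" is a placeholder InfoTable property name and the row fields are illustrative, so adapt them to your own data shape.

// Sketch of the clone-modify-reassign pattern for an InfoTable property.
// "IssueLog" and its row fields are placeholders -- adapt them to your own data shape.
var updated = me.IssueLog.clone();   // copy the table so the property reference will change

updated.AddRow({                     // modify the copy, not the live property
    timestamp: new Date(),
    message: "New issue recorded"
});

me.IssueLog = updated;               // reassigning the reference is what fires the data change event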
View full tip
Hi All

Our expert session, ThingWorx Flow Overview, is tomorrow! Click the link below to register, and remember to talk about it to colleagues that might benefit from its content.

Expert Session: ThingWorx Flow Overview
Date and Time: December 10th, 8h00 EST
Duration: 1 hour
Host: Antony Moffa; Vinay Vaidya - ThingWorx IoT Platform Senior Directors
Registration Here: https://www.ptc.com/en/customer-success/expert-sessions-for-thingworx-foundation-webcasts

See you there!

Here are other upcoming sessions that might be of interest:
Upgrade to ThingWorx 9 – How to Plan / Evaluate Impacts
This session will highlight the key points you should evaluate to properly plan your upgrade to ThingWorx 9
Register Here
Active-Active Clustering
This session will cover the main aspects of the High Availability Clustering feature launched with the ThingWorx 9.0 release
Register Here
View full tip
Thread Safe Coding, Part 1: The Java Extension Approach
Written by Desheng Xu and edited by @vtielebein

Overview
Time and again, customers report that one of their favorite ThingWorx features is the ability to write custom service logic in JavaScript. However, the JavaScript language doesn't have a built-in semaphore locker mechanism, nothing to enable thread-safe concurrent processing like you find in the Java language. This article demonstrates why thread safe coding is necessary and how to use the Java Extension for this purpose. Part 2 presents an alternative approach using database lockers.

Demo Use Case
Let's use a highly abstracted use case to demo thread-safe code practices: There are tens of machines in a factory, and a PLC will emit a signal to indicate when an issue happens during run-time. The customer expects to have a dashboard that shows today's total count of issues from all machines in real-time. The customer also expects that a timestamp of each issue can be logged (regardless of the machine). Similar use cases might be to:
Show the total product counts from each sub-line in the current shift.
Show the total rentals of bicycles from all remote sites.
Show the total issues of distant cash machines across the country.

Modeling
Thing: DashboardCounter, which includes:
1 Property: name:counter, type:integer, logged:true, default value:0
3 Services:
IncreaseCounter(): increase the counter value by 1
GetCounter(): return the current counter value
ResetCounter(): set the counter value to 0
1 Subscription: a subscription to the data change event of the property counter, which will print the new value and timestamp to the log.

GetCounter
var result = me.counter;

IncreaseCounter
me.counter = me.counter + 1;
var result = me.counter;

ResetCounter
me.counter = 0;
var result = 0;

Subscription MonitorCounter
logger.info(eventData.newValue.value + ":" + eventData.newValue.time.getTime());

ValueStream
For simplicity, the value stream entity is not included in the attachment. Please go ahead and assign a value stream to this Thing to monitor the property values.

Test Tool
A small test tool mulreqs is attached here, along with some extensions and ThingWorx entities that are useful. The mulreqs tool reads its configuration file from the OS environment variable MULTI_REQUEST_CONFIG.

In Linux/MacOS: export MULTI_REQUEST_CONFIG="./config.json"

In the config.json file, you can use the following configuration:

{
    "host": "twx85.desheng.io",
    "port": 443,
    "protocol": "https",
    "endpoint": "/Thingworx/Things/DashboardCounter/services/IncreaseCounter",
    "headers": {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "AppKey": "5cafe6eb-adba-41df-a7d6-4fc8088125c1"
    },
    "payload": {},
    "round_break": 50000,
    "req_break": 0,
    "round_size": 50,
    "total_round": 20
}

host, port, protocol, and headers together identify the ThingWorx server.
endpoint defines which service is called during the test.
payload is not in use at this moment, but you have to keep it here.
total_round is how many rounds of the test you want to run.
round_size defines how many requests will be sent simultaneously during each round.
round_break is the pause time between rounds in microseconds, so 50000 in the above example means 50 ms.
req_break is the delay between requests; "0" means requests to the server will happen simultaneously.

The expectation from the above configuration is that the service executes a total of 20*50 = 1,000 times.
So, we can expect that if the initial value is 0, then the counter should be 1000 at the end, and if the value stream is clean initially, then the value stream should have a history from 1 to 1000.

Run Test
Use the following command to perform the test: .<your path>/mulreqs
Execution output will look like:

Check Result
You will be surprised that the final value is 926 instead of 1000. (Caution: this value will be different in different tests, and it can be any value in the range of 1 to 1000.) Now, look at the value stream by using QueryPropertyHistory. There are many values missing here, and while the total count can vary in different tests, it is unlikely to be exactly the last value (926). Notice that the last few values are 926, 925, 921, and 918. The values 919, 920, 922, 923, and 924 are all missing. So next we check if there are any errors in the script log, and there are none. There are only the print statements we deliberately placed in the logs. So, we have observed two symptoms here:
The final value of the counter property doesn't have the expected value.
The value stream doesn't have the expected history of the counter property changes.
What's the reason behind each symptom, and which one is a thread-safe issue?

Understanding Timestamp Granularity
ThingWorx facilitates the collection of time series data and solutions centered around such data by allowing for use of the timestamp as the primary key. However, a timestamp will always have a minimal granularity definition when you process it. In ThingWorx, the minimal granularity or unit of a timestamp is one millisecond.

Looking at the log we generated from the subscription again, we see that several data points (922, 923, 924, 925) have the same timestamp (1596089891147), which is GMT Thursday, July 30, 2020, 6:18:11.147 AM. When each of these data points is flushed into the database, the later data points overwrite the earlier ones since they all have the same timestamp. So, data point 922 went into the value stream first, and then was overwritten by data point 923, and then 924, and then 925. The next data point in the value stream is 926, which has a new timestamp (1596089891148), 1 ms behind the previous one. Therefore, data points 925 and 926 are stored while 922, 923, and 924 are not. These lost data points are therefore NOT a thread-safe issue.

The reason why some of these data points have the same timestamp in this example is because multiple machines write to the same value stream. The right approach is to log data points at the individual machine level, with a different value stream per machine.

However, what happens if one machine emits data too frequently? If data points from the same machine still have a timestamp clash issue, then the signal frequency is too high. The recommended approach would be to down-sample the update frequency, as any frequency higher than 1000 Hz will result in unexpected results like these.

Real Thread Safe Issue from Demo Use Case
The final value of the counter being an arbitrary random number is the real thread-safe coding issue. If we take a look at the code again:
me.counter = me.counter + 1;
This piece of code can be split into three steps:
Step 1: read the current value of me.counter
Step 2: increase this value
Step 3: set me.counter to the new value
In a multi-threaded environment, not performing the above three steps as a single atomic operation will lead to a race condition.
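To make the race visible, here is that same line expanded into its three steps, with a hypothetical interleaving of two concurrent invocations shown in the comments:

// The single statement 'me.counter = me.counter + 1;' is really three operations.
// A hypothetical interleaving of two concurrent service calls:
//   Thread A reads 41 -> Thread B reads 41 -> A writes 42 -> B writes 42 (one increment is lost)
var current = me.counter;   // Step 1: read the current value
var next = current + 1;     // Step 2: compute the new value
me.counter = next;          // Step 3: write it back -- not atomic with Step 1
var result = next;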
The way to solve this issue is to use a locking mechanism to serialize access to the property, which will acquire the lock, perform the three operations, and then release the lock. This can be done using either the Java Extension or the database thing to leverage the database lock mechanism.

Use Java Extension to Handle Thread Safe Challenge
This tutorial assumes that the Eclipse plug-in for ThingWorx extension development is already installed. The following will guide you through creating a simple Java extension step by step:

Create a Java Extension Project
Choose the minimal ThingWorx version to support and select the corresponding SDK. Let's name it JavaExtLocker, though it's best to use lower-case in the project name.

Add a ThingWorx Template in the src Folder
Right-click the src folder and add a Thing Template.

Add a Thing Property
Right-click the Java source file created in the above step, click the menu option called Thingworx Source, then select Add Property.

Add Three Services: IncreaseCounter, GetCounter, ResetCounter
Right-click the Java source file and select the menu option called Thingworx Source, then select Add Service. See above for the IncreaseCounter service details. Repeat these same steps to add GetCounter and ResetCounter.

(Optionally) Add a Generated Serial ID

Add Code to the Three Services

@SuppressWarnings("deprecation")
@ThingworxServiceDefinition(name = "IncreaseCounter", description = "", category = "", isAllowOverride = false, aspects = {"isAsync:false" })
@ThingworxServiceResult(name = "Result", description = "", baseType = "INTEGER", aspects = {})
public synchronized Integer IncreaseCounter() throws Exception {
    _logger.trace("Entering Service: IncreaseCounter");
    int current_value = ((IntegerPrimitive) (this.getPropertyValue("Counter"))).getValue();
    current_value++;
    this.setPropertyValue("Counter", new IntegerPrimitive(current_value));
    _logger.trace("Exiting Service: IncreaseCounter");
    return current_value;
}

@ThingworxServiceDefinition(name = "GetCounter", description = "", category = "", isAllowOverride = false, aspects = {"isAsync:false" })
@ThingworxServiceResult(name = "Result", description = "", baseType = "INTEGER", aspects = {})
public synchronized Integer GetCounter() throws Exception {
    _logger.trace("Entering Service: GetCounter");
    int current_value = ((IntegerPrimitive) (this.getPropertyValue("Counter"))).getValue();
    _logger.trace("Exiting Service: GetCounter");
    return current_value;
}

@SuppressWarnings("deprecation")
@ThingworxServiceDefinition(name = "ResetCounter", description = "", category = "", isAllowOverride = false, aspects = {"isAsync:false" })
@ThingworxServiceResult(name = "Result", description = "", baseType = "INTEGER", aspects = {})
public synchronized Integer ResetCounter() throws Exception {
    _logger.trace("Entering Service: ResetCounter");
    this.setPropertyValue("Counter", new IntegerPrimitive(0));
    _logger.trace("Exiting Service: ResetCounter");
    return 0;
}

The key here is the synchronized modifier, which is what allows Java to control the multi-threading and prevent data loss.

Build the Application
Use 'gradle build' to generate a build of the extension.
Import the Extension into ThingWorx

Create a Thing Based on the New Thing Template

Check the New Thing Property and Service Definition

Use the Same Test Tool to Run the Test Again

{
    "host": "twx85.desheng.io",
    "port": 443,
    "protocol": "https",
    "endpoint": "/Thingworx/Things/DeoLockerThing/services/IncreaseCounter",
    "headers": {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "AppKey": "5cafe6eb-adba-41df-a7d6-4fc8088125c1"
    },
    "payload": {},
    "round_break": 50000,
    "req_break": 0,
    "round_size": 50,
    "total_round": 20
}

Just change the endpoint to point to the new thing.

Check the Test Result
Repeat the same test several times to ensure the results are consistent and expected (and don't forget to reset the counter between tests).

Summary of Java Extension Approach
The Java extension approach shown here uses the synchronized keyword to make the sequence of operations thread safe. Other options are to use a ReentrantLock or Semaphore for the same purpose, but the synchronized keyword approach is much cleaner.

However, the Java extension locker will NOT work in the 9.0 horizontal architecture, since Java doesn't have a distributed locker. IgniteLocker wouldn't work in the current horizontal architecture, either. So if a thread-safe counter is needed in version 9.0+ horizontal architecture, leverage the database thing, as discussed in Part 2.
View full tip
Update to Connected Factories Benchmark   Scenario Three: One Kepware Server in ThingWorx 9.0 The goal of this scenario is to confirm the same performance in ThingWorx 9.0 as seen in scenario one, where one Kepware Server represented a single factory in version 8.5.   Matrix 1 - Slow (15s slow properties, 1s fast) The lower frequency tests performed the same in 9.0. Even the 10k ingestion test, which lies very close to the boundary for a single Kepware Server, passed with no errors. Matrix 2 – Fast (5s slow properties, 500ms fast) These showed similar results, but the 500 thing, 50-10 property test had data loss in 9.0. However, the write rate is much higher than PTC recommends for a single Kepware Server anyway.     Matrix 3 – Faster (1s slow properties, 200ms fast) The fastest tests had similar results as well. The larger tests ran with more success with two Kepware Servers (data not shown here).   Conclusions ThingWorx 9.0 is similarly capable of ingesting data using Kepware Server. A single instance can still achieve up to 10k wps. Future scenarios will now make use of ThingWorx 9.0.   Download the updated draft here!
View full tip
Announcing the Final Installment

JMeter for ThingWorx, the Comprehensive Guide and Best Practice Tips
This is the final post on using JMeter for ThingWorx. Below are best practice tips for using JMeter and for load testing in general. Attached to this post is a comprehensive guide including all of the information from every post we've made on JMeter, including the tutorials. For a more central source, feel free to download the guide, or see the past posts here:
JMeter for ThingWorx (original post)
Building More Complex Tests in JMeter
Distributed Testing with JMeter
Generating and Reviewing JMeter Results

JMeter Best Practice Tips
Use Distributed Testing
As already mentioned in a previous post, each JMeter client can only handle about 150-250 threads depending on the complexity of the tests, and each client will need around 1 CPU and up to 8 GB of RAM for the Java heap. Some test plans will run with fewer host resources, so resizing the test client VM up or down is often required during test development. Create a batch or shell script to start the multiple JMeter clients for greater ease of use.

Use Non-Graphical Mode
Non-graphical mode allows the system to scale up higher; client processing uses up resources just to keep the simulation running, but with graphical mode turned off, there is less of an impact on the response times and other results. Graphical mode is essentially only used for debugging.

Turn off Embedded Resources
This setting reloads all of the typically cached requests over and over; there will be far more download requests, to the exclusion of other requests, than is helpful. Ensure this box is not checked, especially in the HTTP Request Defaults element:

Browser caching means that this setting doesn't actually simulate a proper user load, given that many of the reloaded resources would not be reloaded by actual users. Use this incrementally, for one or two HTTP requests only, if there is a reason why those requests might need to download fresh images, scripts, or other resources with each call; for instance, simulate page timeouts using this once per hour or something similar. Using this across the whole project will prevent it from scaling well, while not actually simulating real-world conditions.

Avoid Using Listeners
For instance, the "View Results Tree" listener uses additional resources that may skew the results in disingenuous ways, reflecting the needs of the clients themselves and not the actual response times of the server. Many listeners are only for debugging a handful of threads while designing the tests. A list of recommended listeners for different purposes is in the JMeter documentation. Summary Report is the only one you want enabled, as that exports the results as a csv or similarly formatted file, which can then be used to build reports.

JMeter CAN Handle SSO
JMeter can authenticate into and test an SSO-enabled system. Sometimes the SSO configuration is essential for customers, and they may be quick to assume therefore that they cannot use JMeter, but that's not entirely true. Some external tools that might help with this are BlazeMeter (mentioned again in just a moment) and Fiddler, a good tool for decoding what data a particular SSO setup is exchanging during the authentication process.
Use Logic Controllers for Parametrization
Parametrization is critical to mirroring a proper user load and allowing different data sets to be queried or created; the load should seem organic, random in the right ways, with actions occurring at random times, not predictable times, to prevent artificial peaks of usage that don't represent real usage of the Foundation server. Random order controllers direct the threads down different paths based on random dice rolls, allowing for a randomized collection of user activity each time, not something that has to be regenerated like a set of Boolean values that is specified in an input CSV and used to navigate a series of true or false switches. Switches just look for an environment variable to be either 1 or 0, and when a thread hits a switch that's a 1, it triggers the switch below, running them in the order given under the transaction controller that goes with the switch. In this image, the 1's and 0's are given in the CSV input file; randomizing that input file therefore randomizes the execution of the switches too:

Use Commercial Add-Ons
There are many external, add-on tools and plugins which enhance JMeter's capabilities. One of them is BlazeMeter, which has some free and some paid options that automatically remove much of the "garbage" REST calls (which would otherwise need to be manually deleted) and provide more consumable test reports right out of the box. Other tools and plugins include:
Maven
Netbeans
SonarQube
Jenkins
Autometer
Gradle
Amazon EC2
Lightning
IntelliJ IDEA
Cassandra
Grafana
For more best practice information, see the JMeter Best Practice Manual.

General Load Testing Guidelines
Concurrency Requirements – How to Properly Estimate the Size of the Load Test
Take a brand new ThingWorx-based app. How will people be accessing the system, and how often? How many are business users? How many are engineers? What do they do? Many assume that every named user in the corporate LDAP will need access to the server, often 10s of thousands of users; this generally drastically oversizes the system. Load testing for many thousands of users is very hard and requires a lot of set-up, tuning, and optimization to get right; so if it seems that thousands of users are expected, then validating this claim is important: most customers don't really have that many concurrent users in an engineering system.

Use estimates based on how many people work at which offices, which time zones those people are in, and what kinds of users they are. Do they need access to engineering data? Perhaps there are simpler mashups for them that use fewer resources. One tool that PTC offers for these sorts of estimations is the Windchill Sizing Calculator (shown here), which accounts for office time zone overlap.

Other ways to estimate include:
Analyzing the business processes, things like how long workloads typically take to complete and how many workloads are generated per day, converted into hours, minutes, or seconds as desired for the peak duration, the length of the test.
"Day in the Life" modelling, or considering things like "what does user X do in a day?" Maybe user X checks out some drawings, edits them, and then checks them back in at 4:30. Maybe user Y actually digs into the underlying parts and assemblies, putting in change requests or orders throughout the day, instead of waiting for the end. Models are made based around the types of users.
Also consider: What are the worst case scenarios?
What are the longest running activities? What produces the largest data transfers? What activities have large, heavy database queries? When is the peak overlap of usage? Beginning- and end-of-day downloads and check-ins? Reports that are generated regularly? How do these impact the foreground users?

For a simpler estimate, start with a percentage of the named user count; anywhere from 5-15% is a good ballpark (a quick worked example follows below). Don't overestimate just to feel like the application has been financially worth it; even if everyone is logged in and using it all at once, which is unlikely, load testing for every single user doesn't take into account the fact that people pause in between clicking on things to think, type emails, get coffee, and so forth. Fewer people than expected are actually doing concurrent activities like loading web pages and updating data streams. Whenever possible, use concurrency data from existing customer systems to guide the estimate for the new system. Legacy systems are great places to start.

Use Grafana to monitor the system side throughout the load test, which is also required to know the test has been successful; also set up Grafana to monitor the application once it goes live, to both prevent technical issues with the server and mitigate them more rapidly.

Also remember that PTC Technical Support is here to help! Provide thread dumps with an open case to any TSE, and they will help troubleshoot the tests and review any errors in the ThingWorx or Tomcat logs.
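As the worked example promised above, here is a minimal back-of-the-envelope sketch of the 5-15% rule of thumb; every input value is illustrative and should be replaced with numbers from your own organization.

// Back-of-the-envelope concurrency estimate using the 5-15% rule of thumb.
// All inputs are illustrative placeholders -- substitute your own figures.
var namedUsers = 8000;            // total named accounts in the corporate LDAP
var concurrencyRatio = 0.10;      // 10% of named users active at once (5-15% ballpark)
var peakOverlapFactor = 1.5;      // extra load during time zone overlap or start-of-shift peaks

var steadyStateUsers = Math.round(namedUsers * concurrencyRatio);   // ~800 concurrent users
var peakUsers = Math.round(steadyStateUsers * peakOverlapFactor);   // ~1200 at peak

// Size the JMeter thread count (and number of client VMs) around the peak figure,
// remembering that each JMeter client handles roughly 150-250 threads.
var jmeterClientsNeeded = Math.ceil(peakUsers / 200);               // ~6 JMeter clients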
View full tip
Leveraging Dell and VMWare for Asset Monitoring in Connected Factories

As an extension of the Connected Factory Reference Benchmark performed on Microsoft Azure, PTC partnered with Dell Technologies in producing this document, a baseline which illustrates the effectiveness of ThingWorx and Kepware when combined with Dell and VMWare technologies to create solutions for on-premises and hybrid Connected Factory implementations. Please join us in thanking Bhagyashree Angadi, Brian Anzaldua, Todd Edmunds, Mike Hayes, and the Dell Customer Solution Center team in Limerick, Ireland for working with the IOT Enterprise Deployment Center on this benchmark!

This benchmark is of a very similar design to a previous publication, but this time designed specifically with Dell Technologies in mind. In a Dell/VMWare architecture, the close proximity of Kepware Server and ThingWorx Foundation provides ideal conditions for network throughput between these components. Combined with the ability to easily monitor and resize virtual machines as your business needs evolve, these hardware configurations can be very effective in on-premises or hybrid deployment scenarios.
View full tip
New Scenario Using Multi-Kepware for Asset Monitoring in Connected Factories   A new scenario has been completed for Connected Factory implementations, furthering the IOT EDC's goal of providing a reference library of ThingWorx performance. This scenario builds upon the first, with additional tests being performed to demonstrate the capabilities of multiple Kepware Servers running side-by-side. Horizontal scaling is very common for multi-line factory implementations, so be sure to check out the new scenario in this ever-expanding benchmark document.   Note that tests below 10,000 writes per second were not repeated with multiple Kepware Servers, since there is little reason to desire such a configuration in implementations that small. ThingWorx deployment sizing was also held constant throughout these tests to demonstrate the limits of a given configuration. Changes that may improve the results of a failed test (such as adding CPUs or Memory) will be mentioned but not validated as part of this benchmark.   Let us know about your applications and how they compare with the data shared here. Happy developing!
View full tip
Generating and Reviewing JMeter Results

Overview
The 4th in a series of articles on load testing with JMeter, this one covers pushing the limits of a test to see how much the application can handle, as well as generating and analyzing reports once the testing completes. This article rounds off the basics of JMeter, such that anyone should be able to perform enterprise-level load testing after reviewing the content here.

Multiple criteria can be used to evaluate results, including:
- response time (as monitored both by JMeter and by some other tool on the system side)
- throughput
- number of errors
- resource saturation (CPU, Memory, disk, and network utilization)
Depending on the use case, some of these may be considered more important than others. For instance, some customers don't care if users wait a while for results to appear on the page (response time), because they set their users' expectations and mitigate the experience with well-designed loading graphics. With response times secondary, the real issues center around data loss or system outages, with resource utilization and the number of errors becoming the more important indicators of system health. Request and database timeout errors are especially important indicators, as they occur most often when resources are saturated and there is data loss.

It is typical for many customers to find preventing data loss and/or promoting data integrity to be more important than preventing long response times. Consider which of these factors is most important to your use case as you determine what kind of information to gather and review in your reports.

How to Create Client-Side Reports in JMeter
Creating reports for the client-side data is very simple using JMeter, both from the command line and within the UI (as shown in the tutorial below). These reports have graphical displays of response times, information about the number and type of response errors, and other criteria of performance used to gauge the success or failure of a load test. Follow these steps to generate an index file, which, when opened in your browser of choice, will show all of the relevant JMeter data.

Tutorial:
1. Create an empty directory in which to store reports.
2. Start the JMeter test with report generation enabled, or run the commands after the fact to generate the HTML report. Once a test completes, use: jmeter -g <outputfile.jtl/csv> -o <path to output folder for html report>. To start a test with the correct command for report generation, use: jmeter -n -t <test JMX file> -l <outputfile.jtl/csv> -e -o <path to output folder>. Running the above commands will generate the report files.
3. When the test is complete, the many JMeter client consoles will look like this: go ahead and close the windows to terminate them once they are finished. Optionally, you can run multiple tests sequentially using the same jmeter-server windows.
4. Click on the "index.html" file to open the results viewing window.

At any time, modify the settings of this "HTML dashboard" using the details from the JMeter user manual. The manual describes many options for these dashboards, as well as recommendations on how to group and format the results in ways which best convey the success or failure of the test, based on the custom requirements of the application and how granular the view needs to be.
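If result files accumulate across several test runs, the report generation step shown above can also be scripted. The following is only a sketch (Python, with hypothetical directory names, and assuming the jmeter launcher is on the system PATH); it loops over the saved .jtl files and produces one HTML dashboard folder per file using the same jmeter -g command:

import subprocess
from pathlib import Path

results_dir = Path("results")    # hypothetical folder containing the saved .jtl files
reports_dir = Path("reports")    # a separate report folder is created per results file
reports_dir.mkdir(exist_ok=True)

for jtl in sorted(results_dir.glob("*.jtl")):
    out = reports_dir / jtl.stem            # e.g. reports/run_50users_300rampup
    # jmeter -g <results file> -o <new, empty output folder>
    subprocess.run(["jmeter", "-g", str(jtl), "-o", str(out)], check=True)
    print(f"Report generated at {out}/index.html")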
Most of the time, the default settings work ok, showing something similar to this: The charts aren’t labeled very well here, so click on the Response Times submenu: This page may take some time to render if there is a lot of data: Next, scroll down to see all the requests that occurred and sort them by how long they took to complete. Anything which took over 5 seconds (or more depending on what is expected) should be investigated as part of the post-test analysis. Does something need to be tuned or optimized? This is how to tell which request is holding things up for your customers.  There is also a chart that shows the overview, grouping the response times by how long they took to demonstrate the health of the system more concretely. Typically, the bars look something like this:  This represents expected behavior, where most of the requests are quite fast, and then there are a few that had errors or took a bit longer. This is pretty typical for web activity. You can also generate the report through the main JMeter client: Give it a results file and an output directory to generate the same index file: There are log files in each of the JMeter client directories called “jmeter-server.log”: These files may show the wrong timezone, but the elapsed times are correct, and they will show when the JMeter clients started, how many threads they ran, which servers were which, and if there were any errors. Not all errors will mean a failed test, so review anything that appears and determine what is expected. Consider designing a batch script to gather all of these logs together, or even analyze them automatically to extract only relevant information.     How to Create Server-Side Results in DynaTrace Collecting data from the environment, including CPU usage, Memory utilization (used vs. total), Garbage Collection times and other metrics of system health on the server, will require the use of an external tool. PTC’s official tool for this is called DynaTrace (PTC System Monitor), shown here. PTC offers a runtime license for DynaTrace to anyone who buys certain products, including Kepware Server, ThingWorx Foundation and Navigate, Windchill, Integrity, and more. Read more information about DevOps on the PTC Community, and stay tuned for more articles on the subject to come from the EDC.   Another option would be something like telegraf and Grafana (from the previous blog post), which facilitate the option to create dashboards around the data output specific to the needs of the application, which can still be monitored even once the application goes live. It can certainly be worth it to use such a tool for monitoring the server-side, but the set-up takes more time. Likewise, many VMs have monitoring faculties for CPU usage and memory utilization built-in, but DynaTrace also has visualization, consolidation of system elements, and other features that make it easy to use right out of the box. See the screenshots below for some examples on how to use DynaTrace, and be sure to review PTC’s full documentation here.   The example shown here is a ThingWorx Navigate system, with Windchill and ThingWorx Foundation set up side-by-side. This chart shows the overall response times of the server-side of the system. JMeter collects the statistics on what the client looks like, while another tool is required to collect the server-side metrics like CPU usage and Memory utilization, things that indicate the health of the VM or computer hosting the clients. 
An older version of DynaTrace is depicted here, available for free for all ThingWorx customers from the PTC Downloads Site (under various product listings).   In DynaTrace, you can build new dashboards using PurePaths: You can also look at the response times for each service, but be sure to change the response limit to a large number so that all the results are returned (as shown here, raising the response limit ensures all of the results show in the PurePaths dashboard).   Highlighted here in DynaTrace is the longest service that ran, which in this case took 95 seconds to fully respond: More specific analysis of this service can now begin. Perhaps it needs to be tuned or otherwise optimized to handle the number of threads, i.e. the number of users. Perhaps the system needs more resources, or the VM isn't large enough for the test. Perhaps more JMeter clients and system resources are required. Something will explain this long response time, and that will indicate what work might still remain before this system can scale up to the enterprise level.

How to Use the Test Results
Load testing often means scaling the test up a little more each time until the system eventually breaks, or the target performance is reached. Within JMeter, this won't mean increasing the overall number of threads per one JMeter client, but instead scaling horizontally to other JMeter clients (as covered in the previous blog post). Now that the remote or distributed clients are configured and the test is running, how do we know when the test is beginning to fail?

It turns out that this answer is not a simple one. Which results are considered desirable will vary from one customer to the next based on many factors, and analyzing the test results is a massive topic all on its own. However, there is one thing that any customer would care to review, and that is the response time overview chart found within the JMeter reports. This chart can be used to compare the performance of the majority of threads against a baseline, indicating the point at which the test begins to fail, i.e. the point at which the limits of the system are reached.

The easiest way to determine a good standard response time for a load test, a baseline, is to start with a single JMeter client and record the response times for just 1-5 threads. You can record the response times for individual requests, particularly queries and other services with expected long response times, or the average response times across all requests or groups of requests, if the performance of some mashups is more important than others.

This approach is better than relying on the response times seen in a browser, because HTML pages load differently when rendered in a browser, with different graphical resource requirements than what is requested in JMeter. Note that some customers will also manually record response times within a separate browser-based test scenario during load testing, either as a sanity check or as part of their overall benchmarking, in order to further validate the scalability of the application; but this wouldn't involve JMeter, given that browsers load things differently and cross-comparison is a bad idea.

Once the baseline response times are established, start increasing the thread counts across the many JMeter clients until you see the response times go up on average.
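To compare a full run against that baseline, the per-request averages can be pulled straight out of the JMeter results file rather than read off the charts. The following is only a sketch (Python, assuming the results were saved in CSV format with the default header row, which includes the label, elapsed, and success columns):

import csv
from collections import defaultdict

totals = defaultdict(lambda: {"count": 0, "elapsed": 0, "errors": 0})

# hypothetical results file, e.g. produced by: jmeter -n -t test.jmx -l results.jtl
with open("results.jtl", newline="") as f:
    for row in csv.DictReader(f):
        stats = totals[row["label"]]
        stats["count"] += 1
        stats["elapsed"] += int(row["elapsed"])          # response time in milliseconds
        stats["errors"] += 0 if row["success"] == "true" else 1

for label, s in sorted(totals.items()):
    avg_ms = s["elapsed"] / s["count"]
    err_pct = 100.0 * s["errors"] / s["count"]
    print(f"{label}: avg {avg_ms:.0f} ms over {s['count']} samples, {err_pct:.1f}% errors")

Comparing these averages against the single-client baseline makes the failure point much easier to spot.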
PTC's standard criteria for load testing are exceeded when the average response times roughly double, or when the system seems overwhelmed by the user load on the server side (which is what to look out for in DynaTrace or the external system monitor). At this point, the application is said to have reached a bottleneck, which could be a simple tuning problem, or it could be saturated by resource requirements. Either way, the bottleneck is proof that the system can't take any more threads without users beginning to notice and the response times approaching an unreasonable delay.

Other criteria can be used as well, say if any one thread takes more than 5 seconds to respond. Also ensure there are no unexpected errors, as gateway errors represent failed tests too. Sometimes there will be errors even when the test is successful, though, so consider monitoring the error percentage, a column in the Summary Report tab of JMeter, to see what is normal. The throughput column may also be something to monitor. Many watch for increases in throughput as the thread count increases, to ensure there is no degradation in performance (which may indicate hardware or sizing constraints).

The Summary Report will look something like this, with thread group results from all of the clients appearing side by side, differentiated from each other by the unique port:

Conclusions
Generating and reviewing reports within JMeter is straightforward and easily customizable. Be sure to also monitor the system itself using an external tool like DynaTrace, PTC's official System Monitor, which has a lot of value considering how easy it is to use out of the box. If the system looks healthy on the server side and the response times are within an acceptable range on the client side, then the application is ready for enterprise use. Be sure to generate a baseline for response times within JMeter, remembering that browsers have different loading processes than JMeter, and not to cross-compare the two.

This article constitutes the end of the basics. The final article to come will talk about more advanced test design features and best practices, so stay tuned!
View full tip
Distributed Testing with JMeter

Overview
Running JMeter at the scale required by most customers demands additional considerations beyond those discussed in the previous two articles. At scale, a test may need to simulate thousands of users, which will require more than one JMeter client, set up on one or many hosts, as shown in this 3rd JMeter article, a tutorial on distributed testing.

Distributed Testing: a remote testing configuration in which the main JMeter client is located at one IP address and controls the rest as they step through their own copies of the JMeter tests, based on their own unique data files as necessary, to simulate a user load across a network, a series of regions, or simply across many machines if limited by the size of the physical hardware.

One key aspect of a proper JMeter load test is distributed or remote testing, i.e. making use of more than one JMeter client at a time to simulate the user load on the Application server. There are many reasons to make use of a network of clients such as this, like mimicking cross-region user access to the Foundation server, simulating different levels of latency for different users, and increasing the overall number of users which can contribute to the load test, while minimizing the performance cost of hosting that many threads on any single server.

A single JMeter client has a practical limit of 150-250 threads across all groups and requires about 1 CPU and 8 GB of RAM. Beyond this point, the amount of garbage collection and other processing each client has to do becomes substantial. As the client processes its own data and sends requests to the Application server at the same time, there are diminishing returns, and the responses begin to take longer (or errors start occurring) simply because of resource starvation within the client process rather than on the Application server. Therefore, distributed testing is required for most customers doing larger load tests using JMeter. Many applications will have more than a few hundred users and/or will have users accessing the system from a variety of regions and networks, each of which could have significantly different network latency, so in order to work within the limitations of the JMeter executable and address regional concerns, distributed or remote testing is typically required for almost all of PTC's customers who scale test with JMeter.

With a simple (monolithic) distributed test, all of the JMeter clients are located on the same host and share an IP address, but each must be configured with a unique RMI port to connect to the controlling process. If these are located on a VM, then the resource specifications can merely be increased and the VM sized larger as necessary to ensure the network of JMeter clients runs as expected. Each JMeter client requires around 8 GB for its heap size and 1 CPU (with some additional resources for the host operating system). Multi-hosted testing becomes the required option when limited by physical hardware (or a relatively small VM hardware host). If there are only 4-core, 32-GB machines, then plan for a machine per every 3 JMeter clients. If simulating thousands of users, this could mean half a dozen machines or more are required, which can still sometimes work out to be more cost efficient than one large, 256-GB VM hosted in the cloud. Using many hosts in physical locations can also simulate regions with different network characteristics.
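Before stepping through the tutorial below, note that the per-client copies and unique RMI ports can also be generated with a small script rather than by hand. The following is only a sketch (Python, with hypothetical directory names; verify the server_port property name against your own jmeter.properties before relying on it):

import re
import shutil
from pathlib import Path

source = Path("JMeterProject")    # assumption: the master copy of the test project
clients = 4                       # how many side-by-side clients to create
base_port = 1099                  # first RMI port; each client gets its own

hosts = []
for i in range(clients):
    target = Path(f"JMeterProject_{i + 1}")
    if not target.exists():
        shutil.copytree(source, target)
    props = target / "bin" / "jmeter.properties"
    port = base_port + i
    text = props.read_text()
    # set (or append) the RMI port this jmeter-server instance will listen on
    if re.search(r"^\s*#?\s*server_port=", text, flags=re.M):
        text = re.sub(r"^\s*#?\s*server_port=.*$", f"server_port={port}", text, count=1, flags=re.M)
    else:
        text += f"\nserver_port={port}\n"
    props.write_text(text)
    hosts.append(f"127.0.0.1:{port}")

# paste this line into the controller's jmeter.properties (the remote_hosts setting)
print("remote_hosts=" + ",".join(hosts))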
A tutorial for distributed testing across one host is shown here. For more information, see the Apache web articles on each topic: Remote Testing and Distributed Testing Step by Step.

Tutorial: Set Up a Distributed Test on One Host
1. Copy the source directory for the whole JMeter project and rename it as many times as required. Here there are 22 JMeter clients side-by-side on a single, 256-GB VM (3000+ users).
2. Each directory (shown above) is identical, except that the "jmeter.properties" files (found in the bin directory in each project) have unique settings, namely the RMI port.
3. Each JMeter client must contain a copy of the same test scripts found on the main server.
4. In the "jmeter.properties" file for the main server, specify the IPs and ports for each remote/distributed client (under remote_hosts), as shown: In this image, the IPs are all the same, with just the port differing from client to client. Here only 4 clients are in use, with the rest commented out for future tests. This is how to scale up and test incrementally more users each time. Just add another server to add another 150-250 users, until eventually the target number of users is reached, or the server is saturated. These IPs will differ if doing a true remote test, with each being the server location of the JMeter client within the same network. The combination of IP address and port will still need to be unique for each client, and communication between the overall JMeter controller and the clients over the RMI ports needs to be allowed by the network/firewalls. Note that the number of users is set using the parameter under "Test Plan" which was set up last time. This value represents the number of users by specifying the number of threads per thread group, and it can remain the same for every client or vary accordingly, if for instance one region is smaller than another. The "Test Plan" parameters are shown here.
5. To optionally start all of the clients at once in preparation for test execution, create a basic batch or shell script which goes to the bin directory of each agent and calls the start command: "jmeter-server". In this image from a Windows JMeter host, only the first few agents are in use, but removing the "rem" to uncomment the other start command lines in this file would add more servers to be started. Note how the Java parameter for java.rmi.server.hostname must match the main JMeter client network configuration here for them to connect (see the Apache links above for more information). This will start each of them in its own CMD window, which, once closed, will terminate the JMeter client process. Parameters like rampUp time within the main test script will scale with the number of client processes. For example, 100 users and a 300-second rampUp with 4 clients results in 400 overall user threads that are all logged in after 300 seconds.
6. Once all clients are running, click Remote Start All to start the test across every server from the GUI (usually for debugging), or execute the test using the command line: jmeter -n -r -t <test.jmx> -l <results.jtl>

The main server sends the actions to the remote clients to run, so all the clients need are their input parameters. For instance, a CSV file may exist in each directory which has different data from client to client, to create pseudo-random user loads and represent different kinds of user activity.
The file shown in this image is different, and unique, in each of the client directories.

Conclusion
Here, we learned how to horizontally scale the load test, setting up more JMeter clients to facilitate larger, more complete user loads. We also discussed the difference between distributed and remote testing: the former is easier to set up and use, especially on VMs, but the latter might be better for simulating region differences and the impact of network latency. The latter will likely also be required if there are hardware constraints to consider, since each JMeter client needs about 8 GB for its heap, with roughly another 8 GB (or a core or two of comparable capacity) needed per every 3 JMeter clients for the communication and processing of data. Stay tuned for the next article on generating and reviewing the results of the load tests.
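As a final aside, the per-client data files described above can be produced by dealing the rows of one master CSV out across the client directories, so that no two clients simulate the same users. This is only a sketch (Python, with hypothetical file and directory names):

from pathlib import Path

master = Path("users_master.csv")                        # assumption: one master data file
client_dirs = sorted(Path(".").glob("JMeterProject_*"))  # one directory per JMeter client

header, *rows = master.read_text().splitlines()

for i, client in enumerate(client_dirs):
    # every Nth row goes to client i, so each client receives a different slice of users
    subset = rows[i::len(client_dirs)]
    (client / "users.csv").write_text("\n".join([header] + subset) + "\n")
    print(f"{client.name}: {len(subset)} users")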
View full tip
ThingWorx DevOps with Jenkins

DevOps as a topic is vast and has been addressed many times throughout the history of the PTC Community. Previous posts address what DevOps is, teach how to make use of DevOps like a pro, announce updates to the PTC Git Extension, and explain why this extension is so helpful for achieving continuous Git integration with ThingWorx.   This post provides a PDF guide on Jenkins integration with ThingWorx, including tutorials with detailed information on how to set up your ThingWorx instance and how to configure your Jenkins Pipeline. The PDF is listed for download separately, but it is also included in the zip with the other required files for the tutorial. The Jenkins Pipeline provided here is intended as an example / starting point for managing your DevOps in ThingWorx and can easily be extended. Please note that this Pipeline is not officially supported by PTC.
View full tip
ThingWorx and Azure IoT Hub Benchmark This Azure IoT Hub Reference Benchmark showcases the capabilities of ThingWorx and Microsoft Azure IoT Hub, a cloud-hosted solution backend that facilitates secure and reliable communication between an IoT application such as ThingWorx and the devices it manages. By making use of this third party tool, remote monitoring with ThingWorx has never been simpler.   In this benchmark, PTC verifies the reliability and scalability of ingesting data through the Azure IoT Hub into the Azure IoT Hub Connector(s) and ThingWorx Foundation. The preliminary version of this document focuses primarily on how the Azure IoT Hub’s capabilities modify and/or enhance the data ingestion and device management capabilities of ThingWorx.   Find the benchmark document attached here, and stay tuned for more reference benchmarks to come!
View full tip
Remote Monitoring of Assets in Connected Factories

As stated in the previous reference benchmark, one of the missions of the IOT Enterprise Deployment Center (EDC) is to showcase how real-world IOT business problems are solved. Our goal is that these benchmarks can be used as a reference or baseline for architects working on their own implementations, showing not only a successful at-scale implementation, but also what happens when that same implementation is pushed to, or even past, its limits.

The second in this series is attached here, this time reflecting a Connected Factory implementation. ThingWorx was deployed alongside Kepware Server, with the number of things, the number of properties, and the write rate for those properties being varied to once again test the capabilities of a remote monitoring use case, but this time in a Connected Factory setting. The business logic was kept simple to ensure it was not the limiting factor, as the throughput between Kepware Server and ThingWorx was pushed to the limit. See firsthand the capabilities of Kepware Server and ThingWorx Foundation to handle implementations centered around real-time data reporting.

More Connected Factory implementations will be added to this document in time, with multiple Kepware Server deployments and other scenarios to come. Please feel free to use this community post to ask any questions about our approach and discuss any design, deployment, and simulation factors.
View full tip
Building More Complex Tests in JMeter Overview This is the second in a series of articles which help inform how to do user load testing in ThingWorx. This article picks up where the previous left off, continuing with the project created there. The screenshots do appear a little differently here because a new “Look and Feel” was selected for the JMeter application (switched from “Metal” to “Windows Classic”) to provide more readable screenshots. In this guide, we are going to make the very simple project more complex, working towards a better representation of a real load test. The steps below walk you through how to create and configure thread groups and parameterize the processes and procedures defined by each thread group.   Adding More Thread Groups Within JMeter, thread groups are used to organize the HTTP requests in a test into various processes or procedures, such that different mashups (and all of the HTTP requests required on each) or processes can be executed simultaneously by different thread groups throughout the test. Varying the number of threads in a group is how to vary the number of users accessing that mashup during the test, a number which increases over time in accordance with the ramp up time. The thread group name will also show up in the Summary Report tab at the end of the test, making it easier to parse through and graph the results. Start by renaming the existing thread groups so that their process or procedure names are recognizable at the end of the test: Highlight the line which reads “HTTP(S) Test Script Recorder”. (Optional) Add an Include filter to only capture the URLs relevant to your application using the Requests Filtering tab. For example, with the escape character \ necessary for ‘.’, myhost.mycompany.mydomain becomes: myhost\.mycompany\.mydomain Now record a new thread group clicking the “Start” button: Once the control box shows up in the top left corner, click to open a browser and access the ThingWorx Navigate application. Then click on “View Parts List” or some other mashup: Once the mashup loads, search using a string and/or wildcard, or click on one of the recent results if any exist: Wait for the mashup to fully load with the details on that part or assembly, and then click “Stop” in the recording controller window: All of the HTTP requests performed in the process of loading and using this mashup will be added to the JMeter project here: Next, add a new thread group manually to the project: Highlight the newly created “Thread Group” (default name) and rename it to something that relates to the nature of that process: Drag and drop the new collection of requests so that it is considered a part of the new thread group: Then drag the whole group up so that it is next to the other thread groups in the test: In more complex projects, different thread groups may be added at different times, and each time, the service calls are all assigned an index (at the end of the request URL, for example: <request>-344). These indexes may not be unique depending on how and when the thread groups were created, especially in more complex tests. The easiest way to fix this issue is save the test from the JMeter GUI, then open the JMX file in a text editor and perform a find and replace within the relevant section of text.   This is usually done using a regular expression for the number. 
For example, if the request name indexes are numbered -500 through -525, a regular expression to increase them to -700 through -725 would be (in Notepad++): Find: -5([0-9])([0-9]) Replace: -7\1\2 Note that if you do not use a Request Filter, sometimes the recorder will log URLs that are not part of the target application, like these “generate” samplers. These URLs are typically happening in the background of the browser to track performance, security and errors. These can be deleted: At other times, you will be repeating steps that are already part of another thread group, for example: logging in. This genidkey is a part of the login, as you can see if you look back at the login thread group. Because logging in is only necessary once, and it is assumed to be complete by the time the test starts on the second thread group, this entire section can be deleted: To see for sure if a request can be removed because it is called in a previous thread group, do a non-case-sensitive search for the name of the request: All of the requests found in this particular instance were performed in the previous thread group, so therefore this entire category can also be deleted: Another odd thing you may see (if you do not use a Request Filter with the recording feature) are “blank” requests like these: The recorder isn’t sure what to call these “non-requests”, so anything like this that isn’t an actual URL within the target application should be deleted. Static downloads should be disabled or deleted from scale testing since they are usually cached by the user browser client. In this ThingWorx example, there are static “MediaEntites” which can be deleted or disabled: Within the JMeter client there is no good way to highlight and reset them all at once, unfortunately. The easiest way to remove all of these at once is to open the JMX file in a text editor and use regex expressions for search and replace “enabled=true” with “enabled=false”. Most text editors have examples on how to use regular expressions within their Help topics. The above example is for Notepad++. Parameterize Thread Groups Parameterization is usually the part of creating a JMeter test that takes the most effort and knowledge. Some requests will require the same information for every thread, information which can therefore be defined statically within the JMeter element rather than being parameterized. Some values used within the JMeter test script can be parameterized as inputs in the top level of the test controller, for example: Duration, RampUp time, ApplicationHost, ApplicationPort.   Other values may be unique to only one thread group and could be defined in a User Defined Variables element within that group controller. The value(s) used within a request can also be determined on the fly by the results of earlier requests within a thread group. These request results typically must be post-processed and parameterized for later thread elements to function correctly.   The highest level values that are unique to each thread should be inputs from a CSV file that are passed into the threads as parameters, for example Username and Password. Data used within the test is usually parameterized in order to better emulate real world application use by multiple users. In the following example, we will parameterize the number of users for each thread group by adding a user- defined variable.   Start by selecting the new thread group and parametrizing the number of threads (i.e. the number of users accessing this mashup at a time during the test). 
The way to enter a variable is with syntax like this: ${searchandviewpartstructure_threads}. In this case, make this a user-defined variable, or a variable for the whole project, by highlighting "Test Plan" and adding the information there. Begin looking at the samplers to see what types of things need to be parameterized in your test. Consider such things as: thread count (as shown above), ramp up time (also depicted above), duration, timings, roles, URL arguments, info table information, search result information, etc.

Another example here parameterizes the search parameters for a query by adding an overall search string column to a CSV file (which can then be randomly generated by some other script): First, parameterize the body data of the request by highlighting the request and changing the value of the desired field to something like this: ${searchString}. Next, define the parameter under the Test Plan and set a default value. Now define the searchString column again as part of the CSV Data Set. It can then be varied simply by providing different pseudo-random values, with wildcards and/or known values, in the CSV file.

Post Processors and Extractors
Most JMeter load tests become more complex when the results of one request are sent as parameters into later requests. This is done in JMeter by using Post Processors (Extractors), tools which facilitate extracting information out of the request results so it can be assigned to JMeter parameters. There are many different types of extractors which can process the results of previous requests:
- CSS Selector Extractor – a commonly used extractor for values returned as HTML attributes
- JSON Extractor – processes JSON response data using JSON Path expressions
- BeanShell Post Processor – facilitates using code scripts to process return text when needed
- Regular Expression Extractor – JMeter supports the use of regular expressions on request results

The JSON Extractor can be used to find and store information like the partOID number for a Windchill part as a parameter in JMeter, which can then be used to build more realistic workflows within the JMeter test. The example below steps you through setting up a JSON Post Processor.

Start by right-clicking the request that contains the results of our search. Then click "Add" > "Post Processors" > "JSON Extractor", as shown in the image below: The extractor will now show up under that request as a sub-menu item. Select it, and name the variable something easy to reference. For the JSON Path expressions, pull the object number or some other identifying characteristic out of the search results: $.rows[0].objNumber for example. Another option would be to take information like the partOID number and send that into the search string field, by defining both as properties and having one refer to the other. To pull the partOID out, use a Regular Expression Extractor: Another thing to parameterize is the summary report result file name. Adding in the number of users and ramp up time results in files that are easier to reference later when they are stored on your machine. We will cover generating and reviewing Summary Reports in full in the next article in this series.

Conclusion
In this article we saw how to create new thread groups, remove extraneous requests from those groups, and reduce the overall ambiguity of which thread groups represent which processes or mashup calls. We also covered how to parameterize the individual requests as well as the summary report.
Note that things like the Windchill URL and hostname, search parameters and part IDs, timings, durations, offsets, and anything else that influences the results of the test should not be hard-coded. It is better to create variables for these things to ensure that all of the various simulated activities are configured in the exact same way every time. That way, the system can be tested again and again under various strains and loads until the capabilities of the application are verified.
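As a closing appendix, the bulk enable/disable edit described earlier in this article (flipping enabled=true to enabled=false on static requests such as the MediaEntities downloads) can be scripted instead of done by hand in a text editor. This is only a sketch (Python, with a hypothetical test plan file name), and it assumes the usual JMX layout where the testname attribute appears before enabled in each element's opening tag:

import re
from pathlib import Path

jmx = Path("MyLoadTest.jmx")    # hypothetical test plan saved from JMeter
keyword = "MediaEntities"       # disable any element whose testname contains this text

text = jmx.read_text()

def disable(match):
    # flip only the enabled flag inside the matched opening tag
    return match.group(0).replace('enabled="true"', 'enabled="false"')

pattern = re.compile(r'<[^>]*testname="[^"]*' + re.escape(keyword) + r'[^"]*"[^>]*enabled="true"[^>]*>')
text, count = pattern.subn(disable, text)

jmx.write_text(text)
print(f"Disabled {count} element(s) whose testname contains '{keyword}'")

As always, keep a backup of the original JMX file and spot-check the result in the JMeter GUI before running the test.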
View full tip
JMeter for ThingWorx Overview Apache JMeter is an open-source tool designed for load testing and measuring the performance of a web application. JMeter has a wide range of features to facilitate this testing, including support for a variety of server and protocol types, a full-featured testing IDE with the ability to record the test steps from both a browser or a native application, and built-in debugging tools. Information about JMeter can be found on Apache’s website.   Working with JMeter is not always intuitive, but it also isn’t that much harder than regular software development. Take some time to explore the official Apache JMeter Documentation and figure out where things go and how to mechanically make use of the JMeter IDE. Then step through this tutorial to create a basic test that logins to ThingWorx, accesses a mashup, and clicks on a few widgets. This is the first in a series to come, courtesy of IoT EDC Engineer Tim Atwood ( @atwood ) and the whole EDC team.   Installation Download JMeter from Apache’s website. Unpack the archive and copy the files to a desired location. Run the application by double clicking on the “ApacheJMeter.jar” file within the bin directory. JMeter is now installed and ready to use. Creating a Test Set up a proxy in your browser of choice (or on the OS in settings).   Select the green “templates” icon in JMeter, and then select “Recording” for the template.   Configure the recording template to point towards your ThingWorx Navigate or Foundation server, then click “Create”. Hit “Start” under the “HTTP(S) Test Script Recorder” tab of the new JMeter project. Make sure the port is set correctly under Global Settings.   A pop-up box will appear that always stays visible on top of the active browser window, so that the recording can be controlled and stopped at any time. Leave the “Transaction name” field empty so that each transaction recorded by the software is automatically named after the web request (this helps differentiate one from the other, and they can each be renamed later).   Open your browser, and navigate (via direct URL if possible, to keep things simple) to the mashup you wish to test. Login and let the page load. Click on anything you’d like on the mashup to capture the activity of that test. Then click “Stop” on the pop-up recorder window to stop the recording. Each transaction will be assigned an index as well, and the source code behind each of these transactions can be reviewed and manually modified in the main JMeter window. Here is the login request for instance:   The HTTP Authorization Manager is used to automatically authorize a defined user login for the thread to any of the Base URLs listed. In this case, though, there are two separate servers being accessed during the test, and one may need to be added manually:   Save the project before continuing, as manual modifications come next.   Within the task page as you do the recording, a set of parameters or body data will be recorded. Modifying this is how you want to parametrize the test scenario, variables like the username and password. To simulate logging in as other users, you have to parameterize this, and not rely on the administrator account name and password entered into the browser.   
Rename the task controller to “MyTasks” or something more easily identified than the long string it has now:   Some recorded items like static images and stylesheets will be non-essential, things the browser processes for better graphical representation, but which are often cached and do not greatly affect the scalability results of the test. These can be highlighted and disabled all at once:   Also ensure that any cascading stylesheets have been disabled. Enable the “View Results Tree” to ensure you can review the results of the test script during the editing phase. However, this “Listener” element has a high memory footprint during test execution, so it should be disabled before running an actual scale test.   Next we need to parametrize the user login information and pull it from a csv file.   The colon means that “Administrator” is the default user to use for login.   You can add other properties as well, like ramp up time, run time, number of users, and protocols to use. The ramp up time determines how quickly the threads are allocated for the test, which if done slowly enough, prevents the thundering herd scenario. In more complex scenarios, logic controllers can be inserted to control the flow of the test. This allows for options such as if-then conditions for different user permissions, or parameter-based routes for better randomization of actions in different threads. This will be covered in more detail in a future article.   Pre- and Post-Processors can be used as well, with the latter being used here much more than the former, to extract information from the response, in order to then use that as part of the variables going into one of the follow up requests. For example, see the script in this image: This one has a variable that it extracts from the object number property, defined in the CSV file, and converts it into another variable that is used in subsequent scripts. This script uses the object number reference to pull the name out of the body data and make the request, which is then post-processed by a bunch of these extractors. One is a JSON extractor which is trying to get an ID out of the JSON response. There is a regular expression extractor and a bean shell post-processor, which populates some variables based on what it responded with. Once it extracts all of the variables from the response to this particular request (GetSearchResults in this case), it then tailors the additional requests based on these. -   Customize the script according to the needs of your own application. Alternate between recording and manually modifying the recording code to ensure the test performs exactly as required and from the perspective of different users with different permissions. Also vary the type of activity performed on the mashup. Highlight the “View Results Tree” tab and click the green start button at the top of the window to see the results appear.     If you are getting an unauthorized message, ensure that the scope is right for the login information, which may require moving the “HTTP Authentication Manager” component around in the project. Be sure to check the URLs and credentials entered for each type of user. Occasionally the recorder will insert a long authentication string into the URL, and you want to manually set the URL for the credentials to the most generic URL possible for the server. 
This can be parametrized too: Referencing the CSV file defined here: Which looks like this for a more complicated scenario (covered in the future):  The columns here represent the username, password, object number in Windchill, and object name in Windchill, as well as the wait time used to vary the way the logic is executed and some extra variables which differentiate for the switches what to do to create a more varied and realistic test.   Conclusion Following these steps again and again on the various mashups throughout an application can ensure that a script for each web page and each type of user on each web page is created and added to the testing suite. This results in a load test that is perfectly representative of the real-world user load placed on an application. Load testing is a critical part of the development lifecycle in any application, and ThingWorx is no exception. Any further questions about the capabilities of JMeter not covered here, can be answered by the whole JMeter user manual, found on the Apache website. Future articles will include some basic scripts that test basic things, which can serve as an example for more complex ThingWorx JMeter script development. Here is an example of one tool PTC uses for internal QA of ThingWorx, designed to load test a Navigate application (specifically its built-in mashups):   Something similar to this tool may be available for public use later this summer. In the meantime, feel free to use the tutorial above to create scripts of your own. Any issues building your custom load tests in JMeter can be discussed right here on this thread with our JMeter experts. Happy developing!
View full tip
When to Include InfluxDB in the ThingWorx Development Lifecycle (this article is also available for download as a PDF attached)   The Short Answer InfluxDB is a time series database designed specifically for data ingestion. Historically, InfluxDB has been viewed as a high-scale expansion option for ThingWorx: a way to ensure the application works as intended, even when scaled up to the enterprise level. This is certainly one way to view it, because when there are many, many remote things, each with a lot of properties writing to the Platform at short intervals, then InfluxDB is a sure choice. However, what about in smaller applications? Is there still a benefit to using an optimized data ingestion tool in any case? The short answer is: yes, there is!   Using InfluxDB for optimized data ingestion is a good idea even in smaller-sized applications, especially if there are plans to scale the application up in the future. It is far better to design the application around InfluxDB from the start than to adjust the data model of the application later on when an optimized data ingestion process is required. PostgreSQL and InfluxDB simply handle the storage of data in different ways, with the former functioning better with many Value Streams, and the latter with fewer Value Streams. Switching the way data is retained and referenced later, when the application is already on the larger side, causes delays in growing the application larger and adding more devices. Likewise, if the Platform reaches its ingestion limits in a production environment, there can be costly downtime and data loss while a proper solution (which likely involves reworking the application to work optimally with InfluxDB) is implemented.   Don’t think that InfluxDB is for expansion only; it is an optimized ingestion database that has benefits at every level of the application development lifecycle. From the end to end, InfluxDB can ensure reliable data ingestion, reduced risk of data loss, and reduced memory and CPU used by the deployment overall. Preliminary sizing and benchmark data is provided in this article to explain these recommendations. Consider how ThingWorx is ingesting data now, how much CPU and other resources are used just for acquiring the data, and perhaps InfluxDB would seem a benefit to improve application performance.   The Long Answer In order to uncover just how beneficial InfluxDB can be in any size application, the IoT Enterprise Deployment Center has run some simulations with small and medium sized applications. The use case in the simulation is simple with user requests coming from a collection of basic mashups and data ingestion coming from various numbers of things, each with a collection of “fast” and “slow” properties which update at different rates. This synthetic load of data does not include a more complete application scenario, so the memory and CPU usage shown here should not be used as sizing recommendations. For those types of recommendations, stay tuned for the soon-to-release ThingWorx 9.0 Sizing (or check out the current 8.5 Sizing Guide).   Comparing Runs When determining the health of the ThingWorx Platform, there are several categories to inspect: Value Stream Queue Rate and Queue Size, HTTP Requests, and the overall Memory and CPU use for each server. Using Grafana to store the metrics results in charts like those below which can easily be compared and contrasted, and used to evaluate which hardware configuration results in the best performance. 
The size of the numbers on the vertical axis indicates the total amount of resources used for that metric, while the slope or trend of each chart indicates bottlenecks and inadequate resource allocation for the use case.   In this case, all darker charts represent data from PostgreSQL ONLY configurations, while the lighter charts represent the InfluxDB instances. Because this is not a sizing guide, whether each of these charts comes from the small or medium run is unimportant as long as they match (for valid comparisons between runs with Influx and without it). The smaller run had something like 20k Things, and the larger closer to 60k, both with 275 total Platform users (25 Admins) and 3 mashups, which were each called at various refresh rates over the course of the 1-hour testing period. Note that in the PostgreSQL ONLY instances, there were more Thing Templates and corresponding Value Streams. This change is necessary between runs because only with fewer Value Streams does InfluxDB begin to demonstrate notable improvements.   The most important thing to note is that the lighter charts clearly demonstrate better performance for both size runs. Each section below will break down what the improvement looks like in the charts to show how to use Grafana to verify the best performance.

Value Stream Queue
The vertical axis on the Value Stream Queue Rate chart shows how many total writes per second (WPS) the Platform can handle. The average is 10 WPS higher using InfluxDB in both scenarios, and InfluxDB is also much more stable, meaning that the writes happen more reliably. The Value Stream Queue Size chart demonstrates how well the writes within the queue are processed. Both of these are necessary to determine the health of data ingestion.   If the queue size were to increase and trend upward in the lighter Queue Size chart, then that would mean the Platform couldn't handle the higher ingestion rate. However, since the Queue Size is stable and close to 0 the entire time, it is clear that the Platform is capable of clearing out the Value Stream Queue immediately and reliably throughout the entire test. FIGURE 1 – THESE REPRESENT THE DATA GETTING STORED INTO THE DATA PROVIDER. NOTE: THE FORMER IS MUCH LOWER THAN THE LATTER.   FIGURE 2 – NOTE THE DATA LOSS IN THE NON-INFLUX INSTANCE (THE QUEUE IN GREEN REACHES THE MAX IN YELLOW). THE INFLUX INSTANCE HAS LESS TROUBLE CLEARING OUT THE QUEUE, AS DEMONSTRATED BY THE CONSISTENTLY LOW QUEUE SIZE.

HTTP Requests
Taking the strain of ingestion off of the Platform's primary database frees its resources up for other activities. This in turn improves the performance and reliability of the Platform in responding to HTTP requests, those which in a typical application are used to aggregate data into smaller data stores (depending on the use case) and which render the mashups for the end users. The business logic and mashups can be more complex when there is one database designated for ingestion (InfluxDB) and one for everything else (PostgreSQL). FIGURE 3 – THE DARKER CHART SHOWS A LOT OF CHOPPINESS, MEANING THAT WHILE THE PLATFORM WAS RESPONDING THE WHOLE TIME, IT WAS NOT DOING SO RELIABLY. THE SMOOTHER SECOND CHART SHOWS HOW MUCH EASIER THE PLATFORM CAN HANDLE THESE REQUESTS WHEN THE LOAD IS DISTRIBUTED INTELLIGENTLY ACROSS MULTIPLE SERVERS, EACH OPTIMIZED FOR THE TYPE OF DATA THEY RECEIVE. THE "STAIRCASE" SHAPE OCCURS BECAUSE THE SIMULATOR INCREASES THE WORK LOAD EVERY 10 MINUTES UNTIL IT BREAKS.
Likewise, the nature of Postgres lends well towards this differentiation, given that there are many more database tables required for supporting the HTTP requests, something Postgres does well. That leaves Influx to handle the time-series data and ingestion, and those are the primary strengths of that software as well. So, splitting the load across multiple servers in this way results in smaller server sizes overall, each which is stream-lined and optimized to handle exactly what it is given by the Platform.   Note that in both of these charts, there are no bad requests, so both would seem to be successful runs. However, as future charts will demonstrate more clearly, there is a catastrophic failure when the load is increased around 12:30p. The simulation ends before the server begins to show any real symptoms of the issue, and that is why there are no bad requests. The maximum Operations Per Second (OPS) in the Hardware Specifications and Performance section is taken from before the failure begins.   Clearly the InfluxDB instance has better performance given that the average Operations Per Second (OPS) is substantially higher, nearly 4 times what is seen in the PostgreSQL ONLY instance. Obviously how well the Platform manages the business logic and mashup loading will depend on a lot of factors. In this test scenario, the OPS was increased by increasing the mashup refresh rate on the InfluxDB instances (which could handle over double the operations). Likewise, the number of Stream writes to the PostgreSQL database could be double what it was when PostgreSQL was the only database. Therefore, configuring InfluxDB for the data ingestion and leaving Postgres for the rest of the application certainly makes the load much easier on the Platform, and the same would be true even in a much more complex scenario.   Memory and CPU The important thing here is to keep the memory use low enough that any spikes in usage won’t cause a server malfunction. CPU Usage should stay at or below around 75%, and Memory should never exceed around 80% of the total allocated to the server. The sizing guides can help determine what this allocation of memory needs to be.   Of note in these charts is the slight, upward slope of the CPU usage in the darker chart, indicating the start of a catastrophic failure, and the difference in the total memory needed for the ThingWorx Platform and Postgres servers when Influx is used or not. As is apparent, the servers use much less memory when the database load is split up intelligently across multiple servers.   FIGURE 4 – THE THINGWORX CPU IS ABOUT THE SAME HERE AS IN THE INFLUXDB CONFIGURATION BELOW BUT LOOK AT HOW MUCH MORE MEMORY BOTH THE PLATFORM AND THE POSTGRES DATABASE NEED ALLOCATED TO THEM IN THIS CONFIGURATION (64 GB A PIECE). ALSO NOTE THE JUMP IN CPU AND MEMORY USAGE AFTER 12:30P. THIS IS REFERENCED IN THE PREVIOUS SECTION, AND THE SLOPE UPWARD OF THE USAGE AFTER THAT POINT INDICATES THE START OF A CATASROPHIC FAILURE. THE TEST ENDS TOO SOON TO SEE ANY SYMPTOMS OF FAILURE, BUT IT IS A SURE THING AFTER THE INCREASE IN LOAD AROUND 12:30P. FIGURE 5 – INFLUX NEEDS AN EXTRA SERVER, BUT THE SIZE OF THE INFLUX AND POSTGRES SERVERS TOGETHER IS LESS THAN HALF THE SIZE AS THAT REQUIRED FOR THE SINGLE POSTGRES DATABASE IN THE POSTGRES ONLY CONFIGURATION (8 GB). THINGWORX IS SMALLER TOO (32 GB).     Hardware Specifications and Performance These are the exact specifications for each simulated instance, broken down by size and whether InfluxDB is configured or not. 
Note that some of the hardware specifications may be more than is necessary, depending on the real-world use case. As stated previously, this document is not a sizing guide (use the official ThingWorx Sizing Guide). Note that the maximum numbers of WPS and OPS are shown here. The maximums are higher in the InfluxDB scenarios, meaning that even with smaller-sized servers, the InfluxDB configurations can handle much greater loads.

Summary
In conclusion, if InfluxDB may at some point be needed in the lifecycle of an application, because the expected number of things or the number of properties on each thing is large enough that it would otherwise reach the limits of the Platform, then InfluxDB should be used from the very start. There are benefits to using InfluxDB for data ingestion at every size, from performance to reliability, and of course the improved scalability as well.

Reworking the application for use with InfluxDB later on can be costly and cause delays. This is why the benefits and costs associated with an InfluxDB-centric hardware configuration should be considered from the start. More servers are required for InfluxDB, but each of these servers can be sized smaller (depending on the use case), and all of this will affect the overall cost of hosting the ThingWorx application. The benefits of InfluxDB are especially pronounced when used in conjunction with clusters, which will be demonstrated fully in the 9.0 Sizing Guide (soon to be released). If InfluxDB is used to interface with the clusters, then there are even more resources to spare for user requests.

It is considered ThingWorx best practice for high-ingestion customers to make use of InfluxDB in applications of any size. Note, though, that this will mean the number of Value Streams per Influx database will need to be limited to single digits. We hope this helps, and from everyone here at the EDC, happy developing!
View full tip
ThingWorx 8.5 Architecture Deployment Guide The ThingWorx 8.5 Architecture Deployment Guide has recently been updated with a few bug fixes and semantic corrections. Note that older guides are still available as well. Look forward to the 9.0 Deployment Guide coming soon.   Happy developing!
View full tip
Still not sure what the reference benchmarks have to offer? Check out this short video abstract reviewing the purpose of the reference benchmarks, some notes on how to read the guide, and information about what these guides will have to offer. Then, check out the reference benchmark for remote monitoring here!
View full tip