IoT Tips

ThingWorx Monitoring and Alerting, Part 1: Using Prometheus and Grafana
By Tori Firewind, IoT EDC

Introduction and Getting Started
As ThingWorx has matured during the lifetime of the IoT EDC, so too have our dev ops recommendations. As we've stated throughout many posts now, testing is a key part of ensuring enterprise readiness, and it occurs at every stage of the process: from unit testing to preserve individual service logic, to integration tests which preserve the functionality of the application as a whole, to user and edge load testing and user experience testing, which ensure enterprise readiness. Testing is a critical component, but the dev ops process never stops there. In order to effectively test the system, a comprehensive monitoring solution is also required.

Once the application is tested and the changes are pushed into production, there is no knowing with certainty that everything will run smoothly indefinitely. Random spikes in usage, dips in server bandwidth or availability, or any number of unforeseeable factors can come along and cause issues for a system. If these issues aren't detected and addressed early, they can rapidly morph into much larger problems: outages, data loss, or inflated data tables which are hard to revert due to their size. It is critical to detect performance issues on a system as early as possible, and to have as much information as necessary to figure out where the problem is heading and what may have started it. Monitoring is key to a healthy system.

CI/CD stands for "Continuous Integration/Continuous Deployment", a never-ending cycle of improvement. Testing just once before the initial go-live isn't enough. Each system should have automated tests that run continuously, as well as monitors and alerts which reveal problems sooner. Diagnostic tools play a role as well, being the bridge from the end of the dev ops cycle back to the beginning (monitoring into planning). A good CI/CD dev ops process ensures that problems are found earlier, fixed more rapidly, and fixed for everyone using the system.

In a fully mature dev ops pipeline, issues are anticipated, discovered, and researched before they become production outages or critical issues. These investigations or testing follow-ups produce development tasks (usually bugs, but also features at times) which then start the dev ops cycle all over again. This is why a good, efficient dev ops pipeline is needed, one which allows changes to go from development to production quickly and safely.

This is also why diagnostic tools play a role in the monitoring piece of the dev ops process; they are the bridge between monitoring and planning. Tools like Dynatrace can be configured to provide call stacks and take thread dumps when issues start to occur, before the system is performing so poorly that it needs a restart, which happens automatically in a cluster and can clear out any trace of the issue.

Thread dumps are often necessary for diagnosing the root cause of an issue (so it can be permanently fixed), and capturing them quickly ensures application stability and availability. That is, after all, the purpose of the dev ops process. Diagnostics is therefore an equally important slice of the figure-8-shaped dev ops pie, and one which deserves its own spotlight in an article to come.

Every piece of the dev ops process must be viewed as equally important in its own way, lest the dev ops cycle get hung up on bottlenecks of its own.
A safe and stable system is not one which never experiences issues; it is one which has a good, efficient plan in place for recovering from issues and preventing their repetition. A wholesome dev ops process is a happy dev ops process.

The Monitoring Stack
There are many monitoring options available, but in our experience one of the easiest and most effective monitoring stacks to use with ThingWorx is Prometheus for metrics gathering with Grafana for metrics analysis and review. In a mature monitoring stack, Telegraf is also commonly installed on each VM/host to gather the system metrics (like CPU and memory usage, which we've stated are good indicators of system performance and stability in past articles on scale and size testing) and output them in Prometheus format.

Prometheus is a highly scalable open-source monitoring framework that contains out-of-the-box monitoring and alert capabilities for Kubernetes-based deployments (not covered in this article). Using Prometheus is very simple because the ThingWorx application exposes a metrics endpoint which is formatted directly for use by Prometheus. There is also built-in alerting in Prometheus, but not the ability to build dashboards for reviewing data or screenshotting it for documentation purposes. That's where Grafana comes into play. Grafana has a preconfigured Prometheus-type data source and many preconfigured dashboard templates for various applications and services. Telegraf is also easily imported into Grafana, as shown in the section below.

The Prometheus targets in the larger diagram are expanded out on the left. For each target, some tool exports the data in a syntax which Prometheus can scrape. For VMs, this can be Telegraf; for Kubernetes, the Node Exporter. The JVM has a JMX Exporter, and other tools like CX Server use Graphite. Many apps already have a Prometheus endpoint built in, like ThingWorx and Zookeeper. Telegraf is not strictly necessary; the Node Exporter can also be used on VMs, but Telegraf is the more common choice since it is a more mature dev ops tool.

Once Prometheus is scraping the targets, alerting on them can be done with out-of-the-box Prometheus functionality, and dashboards for monitoring can be made easily in Grafana (which has built-in Prometheus support as well). This stack does not include the diagnostics piece, something which triggers thread dumps or the like when issues do occur. There are too many ways to conduct a successful diagnostic piece to cover here.

How to Get Started
Getting started monitoring a ThingWorx application is incredibly easy in the latest versions. Simply open up a browser and type in the ThingWorx URL, followed by "/Metrics". At this endpoint, there is a specially formatted response which contains subsystem and service data and which can be read automatically by the Prometheus monitoring software. In addition to the application metrics, Prometheus can be configured to collect metrics from a node exporter at the (virtualized) operating system or container (Kubernetes) level as well.

If you haven't already, install Grafana, install Telegraf as a service, and install Docker Desktop. These are the tools required (in addition to ThingWorx, of course) to set up a simple sandbox system for familiarization with the monitoring stack recommended by PTC. The easiest way to try Prometheus on a local Windows instance is to use Docker. The command for that is given below, but first open up Docker Desktop to set the contextual parameters that the command line will need.
Then, modify the configuration file for Telegraf, or create one (called telegraf.conf, in the same folder as the exe file), and put the following into it (or uncomment it; the default config file has thousands of lines, so just search for "prometheus"):

# Output plugin
[[outputs.prometheus_client]]
  listen = "0.0.0.0:9125"

Alternatively, install the Prometheus Node Exporter tool, which will likely require some additions to the Prometheus config file (not covered here) which we are about to create.

Then, create a configuration file (called prom_config_localhost_scraper.yml in the command to come), and add the following (assuming a standard localhost installation of ThingWorx):

# my global config
global:
  scrape_interval: 45s
  evaluation_interval: 30s
  scrape_timeout: 30s
  # scrape_timeout is set to the global default (10s).

rule_files:
  - prom_config_rules.yml

scrape_configs:
  - job_name: thingworx
    static_configs:
      - targets: ['host.docker.internal:8080']
    basic_auth:
      username: "Administrator"
      password: "admin!123456789"
    metrics_path: /Thingworx/Metrics
    scheme: http
    params:
      x-thingworx-session:
        - "false"
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
  - job_name: Telegraf
    # If telegraf is installed, grab stats about the local
    # machine by default.
    static_configs:
      - targets: ['host.docker.internal:9125']

This example configuration uses host.docker.internal instead of localhost for the ThingWorx target because ThingWorx is running outside of the Docker container which contains Prometheus. This yml file configures Prometheus to monitor both ThingWorx and itself, as well as the server metrics coming from Telegraf (as long as they are configured to push). It's a sandbox-only configuration, really, as you wouldn't want to use the Administrator user, or have the password printed in plain text in the config file, in a real system. Also note the need for the x-thingworx-session parameter: runaway sessions which spawn on every scrape (every 45 seconds in this configuration) will result in memory issues over time, so we don't want to use sessions here.

The rules file referenced here (prom_config_rules.yml) needs to be created separately. This is where all of the alert rules are defined; they determine whether an alert state is happening, but without configuring the Alertmanager, there won't be any notification. That isn't covered here but is covered extensively in the Prometheus and Grafana docs. Here is an example alert rule:

groups:
  - name: alert.rules
    rules:
      # Alert for any instance where memory usage is high.
      - alert: HighMemory
        expr: mem_used > 14000000
        for: 1s
        labels:
          severity: page
        annotations:
          summary: "High Memory"
          description: "Localhost Memory Usage is High"

Now, save these files and use PowerShell to run the Docker container:

docker run -p 9090:9090 -v C:\<path_to_document>\prom_config_localhost_scraper.yml:/etc/prometheus/prometheus.yml prom/prometheus

This will download the Prometheus image and run it in a container (if this is the first time), allowing you to very rapidly deploy it to an endpoint of localhost:9090 by default. If there is an error like the one shown below, it means that you forgot to start Docker Desktop (the application) before opening PowerShell. Docker Desktop sets the system parameters required for containers to run from a command line (in Linux, it works as long as Docker is installed for command-line use).
The localhost endpoints are accessible in a browser. ThingWorx defaults to localhost:8080, Prometheus to localhost:9090, and Telegraf is on port 9125. Open any of these in a browser tab to see the full monitoring stack. You can easily see whether Prometheus is working by clicking "Status" > "Targets" at localhost:9090. If all of the targets appear as blue and show "last scrape" with a timestamp, then they're working as expected. If they don't, ensure you have the right ports, that there aren't any firewall issues (if things aren't all on localhost), and that everything is running without errors.

The last step in the process here is to install a dashboard tool like Grafana. Once this is installed and running on localhost:3000 (by default), you can display the data from Prometheus with a few configuration steps in the Grafana UI. Hover over the settings icon in the bottom left of the screen, and then click on "Data sources". Select the "Add data source" button, and then click on Prometheus. You have to type the URL again (localhost:9090), but most of the defaults will be fine here, and all you have to do is click "Save and test".

Now both targets should appear within Grafana, with their metrics showing up throughout the Grafana UI. This data source is what allows for the building of monitoring dashboards.
Distributed Timer and Scheduler Execution in a ThingWorx High Availability (HA) Cluster
Written by Desheng Xu and edited by Mike Jasperson

Overview
Starting with the 9.0 release, ThingWorx supports an "active-active" high availability (or HA) configuration, with multiple nodes providing redundancy in the event of hardware failures as well as horizontal scalability for workloads that can be distributed across the cluster.

In this architecture, one of the ThingWorx nodes is elected as the "singleton" (or lead) node of the cluster. This node is responsible for managing the execution of all events triggered by timers or schedulers – they are not distributed across the cluster. This design has proved challenging for some implementations, as it presents the potential for a ThingWorx application to generate an imbalanced workload if complex timers and schedulers are needed. However, your ThingWorx applications can overcome this limitation and still use timers and schedulers to trigger workloads that will distribute across the cluster. This article will demonstrate both how to reproduce this imbalanced workload scenario and the approach you can take to overcome it.

Demonstration Setup
For purposes of this demonstration, a two-node ThingWorx cluster was used, similar to the deployment diagram below.

Demonstrating Event Workload on the Singleton Node
Imagine this simple scenario: you have a list of vendors, and you need to process some logic for one of them at random every few seconds. First, we will create a timer in ThingWorx to trigger an event – in this example, every 5 seconds. Next, we will create a helper utility that has a task that will randomly select one of the vendors and process some logic for it – in this case, we will simply log the selected vendor in the ThingWorx ScriptLog. Finally, we will subscribe to the timer event and call the helper utility.

Now with that code in place, let's check where these services are being executed in the ScriptLog. Look at the PlatformID column in the log and notice that the Timer and the helper utility are always running on the same node – in this case Platform2, which is the current singleton node in the cluster. As the complexity of your helper utility increases, you can imagine how the workload will become unbalanced, with the singleton node handling the bulk of this timer-driven workload in addition to the other workloads being spread across the cluster. This workload can be distributed across multiple cluster nodes, but a little more effort is needed to make it happen.

Timers that Distribute Tasks Across Multiple ThingWorx HA Cluster Nodes
This time let's update our subscription code – using the PostJSON service from the ContentLoader entity to send the service requests to the cluster entry point instead of running them locally.
const headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "appKey": "INSERT-YOUR-APPKEY-HERE"
};
const url = "https://testcluster.edc.ptc.io/Thingworx/Things/DistributeTaskDemo_HelperThing/services/TimerBackend_Service";
let result = Resources["ContentLoaderFunctions"].PostJSON({
    proxyScheme: undefined /* STRING */,
    headers: headers /* JSON */,
    ignoreSSLErrors: undefined /* BOOLEAN */,
    useNTLM: undefined /* BOOLEAN */,
    workstation: undefined /* STRING */,
    useProxy: undefined /* BOOLEAN */,
    withCookies: undefined /* BOOLEAN */,
    proxyHost: undefined /* STRING */,
    url: url /* STRING */,
    content: {} /* JSON */,
    timeout: undefined /* NUMBER */,
    proxyPort: undefined /* INTEGER */,
    password: undefined /* STRING */,
    domain: undefined /* STRING */,
    username: undefined /* STRING */
});

Note that the URL used in this example (https://testcluster.edc.ptc.io/Thingworx) is the entry point of the ThingWorx cluster. Replace this value to match your cluster's entry point if you want to duplicate this in your own cluster.

Now, let's check the result again. Notice that the helper utility TimerBackend_Service is now running on both cluster nodes, Platform1 and Platform2.

Is this Magic? No! What is Happening Here?
The timer or scheduler itself is still being executed on the singleton node, but now, instead of triggering the helper utility locally, the PostJSON service call from the subscription is routed back to the cluster entry point – the load balancer. As a result, the request is routed (usually round-robin) to any available cluster node that is behind the load balancer and reporting as healthy.

Usually, the load balancer will be configured with cookie-based affinity: the load balancer will route the request to the node that has the same cookie value as the request. Since this PostJSON service call is a RESTful call, any cookie value associated with the response will not be attached to the next request. As a result, cookie-based affinity will not impact the round-robin routing in this case.

Considerations to Use this Approach
Authentication: As illustrated in the demo, make sure to use an Application Key with an appropriate user assigned in the header. You could alternatively use username/password or a token to authenticate the request, but this could be less ideal from a security perspective.

App Deployment: The hostname in the URL must match the hostname of the cluster entry point. As the URL of your implementation is now part of your code, if you deploy this code from one ThingWorx instance to another, you will need to modify the hostname/port/protocol in the URL. Consider creating a variable in the helper utility which holds the hostname/port/protocol value, making it easier to modify during deployment.

Firewall Rules: If your load balancer has firewall rules which limit the traffic to specific known IP addresses, you will need to determine which IP addresses will be used when a service is invoked from each of the ThingWorx cluster nodes, and then configure the load balancer to allow the traffic from each of these public IP addresses. Alternatively, you could configure an internal IP address endpoint for the load balancer and use the local /etc/hosts name resolution of each ThingWorx node to point to the internal load balancer IP, or register this internal IP in an internal DNS as the cluster entry point.
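To make the App Deployment consideration above more concrete, here is a minimal sketch (not from the original article) of how the subscription might build the URL from a configuration property instead of hard-coding it. The property name (clusterEntryPoint) is a hypothetical addition to the helper Thing, and the appKey would still need to come from a secure source in a real system.

// Hypothetical property: a String property 'clusterEntryPoint' on the helper Thing
// holding protocol://hostname[:port], e.g. "https://testcluster.edc.ptc.io",
// so the value can be changed per environment without editing this code.
const entryPoint = Things["DistributeTaskDemo_HelperThing"].clusterEntryPoint;
const url = entryPoint + "/Thingworx/Things/DistributeTaskDemo_HelperThing/services/TimerBackend_Service";

const headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "appKey": "INSERT-YOUR-APPKEY-HERE"
};

// Same PostJSON call as above; only the url parameter changes.
let result = Resources["ContentLoaderFunctions"].PostJSON({
    headers: headers /* JSON */,
    url: url /* STRING */,
    content: {} /* JSON */
});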
Persistent vs. Logged Properties
By Mike Jasperson, VP of IoT EDC

Executive Summary
ThingWorx provides several different "aspects" (or storage options) for how property values are saved. These options each have different implications for performance and scalability, and understanding those implications is important for designing a scalable IoT solution.

Persistent properties are best used for non-telemetry data which will change infrequently (for example, only a few times in a day) and where historical values are not required. When overused, persistent properties can put significant pressure on the database layer of your ThingWorx implementation, leading to poor performance of your IoT application. As the number of Things in your IoT application scales up, the quantity and frequency of persistent properties per Thing need to be carefully considered.

Logged properties are best used for telemetry data where historical values need to be retained, but also for any other value that is expected to change frequently. Logged properties create some additional requirements: a process for handling null/default values after restarts, more disk space, and a data retention policy. There are benefits as well, though, like more flexibility and scalability for the ingestion of larger volumes of data.

Persistent + logged properties perform the database operations of both aspects. Combined use should be very limited – only properties that update infrequently (a few times a day) and that must be in memory in the event of a ThingWorx restart.

In-memory only properties are neither persistent nor logged – they are not stored to the database. These properties can greatly improve scale for values that need to be available for the application to drive UIs or to compute other derived values that will be stored. However, high-frequency updates of in-memory properties can create scale challenges in HA (high availability) ThingWorx configurations, where memory state needs to be constantly shared between multiple ThingWorx nodes.

Find a complete summary as well as example cases in the document attached.
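As a quick illustration of the in-memory pattern described above, here is a minimal sketch (not from the original document) of a service that reads frequently updated in-memory properties, computes a derived value, and writes only that result to a logged property. The Thing and property names are hypothetical placeholders.

// 'currentFlowRate' and 'currentPressure' are assumed in-memory only properties,
// updated frequently by the edge; 'loggedEfficiency' is assumed to be a logged property.
var me = Things["PumpAsset_001"];

var flow = me.currentFlowRate;
var pressure = me.currentPressure;

// Derived KPI computed entirely in memory.
var efficiency = (pressure > 0) ? (flow / pressure) : 0;

// Only the derived value is written to a logged property, so the database sees one
// write per execution of this service instead of one write per raw edge update.
me.loggedEfficiency = efficiency;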
5 Common Mistakes to Developing Scalable IoT Applications
by Tori Firewind and the IoT EDC Team

Introduction
To build scalable applications, it's necessary to identify common mistakes and avoid them at the early stages of development. In an expert session this past month, the PTC Enterprise Deployment Team elaborated on why scalability is important and how to avoid the common development pitfalls in IoT. That video presentation has been adapted here for visual consumption of the content as well.

What is Scalability and Why Does it Matter
Enterprise-ready applications can scale and be easily maintained, which is important even from day 1 because scalability concerns are the largest cause of delays to go-lives. Applications balance many competing requirements, and performance testing is crucial to ensure an application is ready for go-live. However, don't just test how many remote assets can connect at once; also test any metrics that are expected to increase over time, like the number of remote properties per Thing, the frequency of reporting from those properties, or the number of users accessing the system at once. Also consider how connecting more assets will affect the user experience and business logic, and not just the ability to ingest data.

Common Mistake 1: Edge Property Updates
Because ThingWorx is always listening for updates pushed from the Edge and those resources are always in use, pulling updates from the Foundation side wastes resources. Fetching from remote on every read is essentially a round trip, so it is slower and more memory intensive, but there are reasons to do it, like if the quality tag is needed, since the cache doesn't store it. Say a property is pushed at 11:01, and then there's a network issue at 11:02. If the property is pulled from the cache, it will return the value sent at 11:01 without any indication that there is a more recent value on the Edge device. Most people will use the default options here: read from server cache (which relies on the Edge to push updates) and the VALUE push type, and configuring a threshold is a good idea as well. This way, only those property updates which are truly necessary are sent to the Foundation server. Details on property aspects can be found in KCS Article 252792.

Another option, the property set approach, is well documented in another PTC Community post. This approach is necessary and considered a best practice if there is event logic which depends on multiple properties at once. Sending all of the properties needed to determine whether an event should fire in one Infotable ensures there is no need to query the database each time a property update comes in from the Edge, which keeps the business logic independent and reduces the load on the database to improve ingestion performance. This is a very broad topic, and future articles will address it more specifically.

The When Disconnected property aspect is a good way to configure what happens with Edge property values in a mass disconnect scenario. If revenue depends on uptime, consider losing any data that changes while a device is disconnected. All of the updates can be folded into a single value if the changes themselves aren't needed but an updated value is needed to populate remote properties upon reconnect. Many customers will want to keep all of their data, even when a device is offline, and will use data stores. In this case, consider how much data each Edge device can store (due to memory limitations on the devices themselves), and therefore how long an outage can last before data is lost anyway.
Also consider whether Foundation can handle massive spikes in activity when this data comes streaming in. Usually, a Connection Server isn't enough. Remember that the more data needs to be kept, the greater the potential for a thundering herd scenario.

Handling a thundering herd scenario goes beyond sizing considerations. It is absolutely crucial to randomize the delay each device will wait before attempting to reconnect. It should be considered a requirement to have the devices connect slowly and "ramp up" over time, for multiple reasons. One is that too much data coming in too fast could overwhelm the ingestion queue and result in data loss. Another is that the business logic could demand so many system resources that the Foundation server crashes again and again and cannot be recovered. Turning off the business logic isn't possible if the downtime is unexpected, so definitely rely instead on randomized reconnection times for Edge devices.

Common Mistake 2: Overlooking Differences in HA
To accommodate a shared thing model across many servers, changes had to be made in how the thing model is stored and how the model tree is walked by the Foundation servers. Model information is no longer cached at the Thing level, and the model tree is therefore walked every time model information is needed, so the number of times a Thing is directly referenced within each service should be limited (see the Help Center for details).

It's best to store whatever information is needed from a Thing in an Infotable, making the Things[thingName] reference a single time, outside of any loops. Storing the property definitions outside of the loop prevents repeated Thing references within the service, which otherwise would have occurred twice for each property (for both the name and the description), and then again for every single property on the Thing – a runtime nightmare.

Certain states previously held in memory are now shared across the cluster, like property values, Thing states, and connection statuses. Improvements have been made to minimize the effects of latency on queries, like how they now only return property values on associated Thing Shapes or Thing Templates. Filtering for properties on implementing Things is still possible, but now there is a specific service to do it, called GetThingPropertyValues (covered in detail in the Help Center).

In the pattern this service enables, the first step is a query to get the names of all implementing Things of a particular Thing Shape. This is done outside of any loops or queries, so once per service call. Then, an Infotable is built to store what would have been a direct reference to each Thing in a traditional loop. This is a very quick loop that doesn't add much runtime, since it is all in memory, with no references to the thing model or the database, instead using the results of the first query to build the Infotable. Finally, this thing-reference Infotable is passed into the new service GetThingPropertyValues to retrieve all of the property information for all of these Things at once, thereby walking the thing model only once. The easiest mistake to make here is a direct Thing reference inside of a loop, using code like Things[thingName].Get() over and over again, thereby traversing the thing model repeatedly and adding a lot of runtime. A sketch of this optimized pattern is shown below.
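Since the original script screenshot is not reproduced here, the following is a minimal sketch of that pattern under stated assumptions: the Thing Shape name ("Asset.Shape"), the data shape used for the reference Infotable, and the exact resource and parameter names for GetThingPropertyValues are illustrative placeholders; check the Help Center entry for the exact signature in your version.

// Sketch only; verify service locations and parameter names against the Help Center.
// 1) One query, outside of any loop, for all implementing Things of the Thing Shape.
let implementingThings = ThingShapes["Asset.Shape"].QueryImplementingThings({ maxItems: 10000 });

// 2) Build an in-memory Infotable of thing references: a fast loop with no thing
//    model or database access ("ThingRefs", "EntityReference", and the 'name' field
//    are assumptions for this sketch).
let thingRefs = Resources["InfoTableFunctions"].CreateInfoTableFromDataShape({
    infoTableName: "ThingRefs",
    dataShapeName: "EntityReference"
});
for (let i = 0; i < implementingThings.rows.length; i++) {
    thingRefs.AddRow({ name: implementingThings.rows[i].name });
}

// 3) Walk the thing model once: retrieve the property values for all of these Things
//    in a single call, instead of calling Things[name].Get() inside the loop.
result = Resources["EntityServices"].GetThingPropertyValues({
    thingReferences: thingRefs,
    maxItems: 10000
});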
QueryImplementingThingsOptimized is another new service, with new parameters for advanced configuration. Searches can now be done on particular networks or to particular depths, and there's an offset parameter that allows a maximum number of items to be returned starting at any place in the list of Things, where previously, if you needed the Things at the end of the list, you had to return all of the Things. All of these options, as well as the restrictions, are detailed in the Help Center.

Common Mistake 3: Async Service Misuse
Async services are sometimes required, say if a user has to trigger many updates on many remote Things at once by the click of a button on a mashup that should not be locked up waiting for service completion. Too many async service calls, though, result in spikes in activity and competition for resources. To avoid this mistake, do not use async unless strictly necessary, and avoid launching too many async threads in parallel. A thread dump will show how many threads there are and what they are doing.

Common Mistake 4: Thread Pool Overload
Adding more threads to the pool may be beneficial in certain circumstances, like if the threads are waiting on other resources to complete their tasks, looking things up in the database (I/O), or waiting to unlock data that can only be accessed by one thread at a time (property writes). In these cases, threads are waiting on other resources, not the CPU, so adding more threads to the pool can improve performance. With too many threads, however, performance degradation will occur due to increased contention, wasted CPU cycles, and context switches.

To check whether there are too many or not enough threads in the pool, take thread dumps and time the completion of requests in the system. Also watch the subsystem memory usage, and note that the size of the queue should never approach the max. Consider monitoring the overall performance of the system (CPU and memory) with a tool like Grafana, and remember that a good performance test properly exercises all of the business logic and induces threads in a way similar to real-world expectations.

Common Mistake 5: Stream Etiquette
Upserts, or updates to database tables, are expensive operations that can interfere with ingestion if they are performed on the wrong tables. This is why Value Stream and Stream data should never be updated by end users of the application. As described in the DGIS document on best practices, aggregation is the key to unlocking optimal performance because it reduces the size of the database tables that require upserts. Each data structure has an optimal use in a well-designed ThingWorx application.

Data Tables are great for storing overview information on all of the Things in one view, and queries on this data source are the fastest. Update this data source as often as possible (by timer), allowing enough time for updates to be gathered and any necessary calculations made. Data Tables can also be updated by end users directly because each row locks one at a time during updates. Data Tables should be kept as small as possible to improve performance on mashups, so for instance, consider using one to show all Things per region if there are millions of Things. Roll-up information is best stored here to avoid calculations upon mashup load, and while a real-time view of many thousands of Things at once is practically impossible, this option allows for a frequently updated overview of many Things, which can also drill down to other mashup views that are real-time for one Thing at a time.
Value Streams are best used for data ingestion, and queries against them should be kept to a minimum, largely performed by the roll-up logic that populates the Data Tables mentioned above. Queries that chart all of the data coming in are best utilized on individual Thing views, so that only a handful of users are querying the same data sources at a time. Also be sure to use start and end dates and make use of the "source" field to improve query performance and create a better user experience. Due to the massive size of the corresponding database tables, it's best to avoid updating Value Streams outside of the data ingestion process altogether.

Streams are similar, but better for storing aggregated, historical data. Usually once per day or per week (outside of business hours if possible), Value Stream data will be smoothed or reduced into fewer data points and then stored into Streams. This allows data to be kept for longer periods of time on the server without using as much memory or hurting query performance. The high-volume ingested data sources can then be purged frequently, as discussed below.

Infotables are the most memory intensive and are really designed to hold only a small number of rows at a time, usually to facilitate the business logic. Sometimes they will be stored in Streams or Data Tables if they aren't expected to grow larger (see the DGIS Coffee Machine App for an example). Infotables should never be logged; if they are used to transmit Edge property updates (like in the property set approach), they should be processed into other logged (usually local) properties.

Referring to the properties themselves is how to get real-time information on a mashup, say by using the GetProperties service and its auto-update option, which relies on internal websockets. This should be done on individual Thing views only, and sizing considerations need to be made if there will be many of these websockets open at once, say if many end users are all viewing real-time data at the same time.

In newer versions of ThingWorx, the stream data processing settings cannot be updated directly; instead, find the system object called ThingWorxPersistenceProvider and use the service UpdateStreamDataProcessingSettings. ThingWorx Foundation processes data received from remote devices in batches in order to manage the data flow and reduce database churn. All of these settings configure how large those batches are and how frequently they are flushed to the database (detailed in full in KCS Article 240607). This is very advanced configuration that depends heavily on use case and infrastructure, but some guidance applies to most people: adjusting the scan rate is usually not beneficial; a healthy queue should never approach the max limit; and defaults differ by database because the databases function differently. InfluxDB generally works better when there are fewer processing threads and higher numbers of Things per thread, while PostgreSQL can have a lot of threads, preferably with fewer Things per thread. That's why the default values are given as the same number of threads (and this can be changed), but Influx has a larger block size and size threshold because it can handle more items per thread.

Value Streams ingest all data into the Foundation server, so the database tables that correspond with these data sources grow very large, very quickly, and need to be purged often and outside of business hours, usually once a day or once per week.
That's why it's important to reduce the data down to fewer points and push them into Streams for historical reference. For a span of years, a single point per day might be enough; for a span of hours, consider a data point per minute. Push aggregated data into Streams and then purge the rest as soon as it is no longer needed.

In Conclusion
Avoiding these five common mistakes – chatty Edge property updates, overlooking the differences in HA, async misuse, thread pool overload, and poor stream etiquette – goes a long way toward an application that ingests, processes, and displays data smoothly as it scales. A minimal sketch of the timer-driven rollup pattern described above follows as a closing illustration.
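This sketch is hypothetical and not from the original article: the Thing, Stream, property, and field names are placeholders, and a real implementation would handle many assets, batching, and error handling. It simply illustrates the flow described above – query recent Value Stream data, reduce it to one aggregated point, store that point in a Stream, and leave the raw data to be purged separately.

// Intended to run on a timer or scheduler, outside of business hours.
let end = new Date();
let start = dateAddDays(end, -1);   // aggregate the last day of raw, ingested data

// Query the logged (Value Stream-backed) history for one asset.
let raw = Things["PumpAsset_001"].QueryPropertyHistory({
    startDate: start,
    endDate: end,
    maxItems: 100000
});

// Reduce the raw points to a single aggregated value (a simple average here).
let total = 0;
for (let i = 0; i < raw.rows.length; i++) {
    total += raw.rows[i].temperature;   // 'temperature' is an assumed logged property
}
let avg = raw.rows.length > 0 ? total / raw.rows.length : 0;

// Store the single aggregated point into a Stream for long-term history
// ('DailyRollupStream' and its fields are assumed to exist for this sketch).
let rollupStream = Things["DailyRollupStream"];
let values = rollupStream.CreateValues();
values.AddRow({ assetName: "PumpAsset_001", avgTemperature: avg });
rollupStream.AddStreamEntry({
    values: values,
    source: "PumpAsset_001",
    timestamp: end
});

// Raw data older than the retention window would then be purged by a separate
// scheduled service, keeping the ingestion tables small.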
By Tim Atwood and Dave Bernbeck, edited by Tori Firewind
Adapted from the March 2021 Expert Session produced by the IoT Enterprise Deployment Center

The primary purpose of monitoring is to determine when your application may be exhausting the available resources. Knowledge of the infrastructure limits helps establish these monitoring boundaries, determining straightforward thresholds that indicate an app has gone too far. The four main areas to monitor in this way are CPU, Memory, Networking, and Disk.

For the CPU, we want to know how many cores are available to the application and potentially what the temperature is for each, or other indicators of overtaxation. For Memory, we want to know how much RAM is available for the application. For Networking, we want to know the network throughput, the available bandwidth, and how capable the network cards are in general. For Disk, we keep track of the read and write rates of the disks used by the application as well as how much space remains on them.

There are several major infrastructure categories which reflect common modes of operation for ThingWorx applications. One is bare metal, which relies upon the traditional use of hardware to connect directly between operating system and hardware, with no intermediary. Limits of the hardware in this case can be found in manufacturing specifications, within the operating system settings, and usually on record somewhere within the IT department. The IT team is a great resource for obtaining these limits in general, and they also keep track of such things in VMware and virtualized infrastructure models.

VMware is an intermediary between the operating system and the hardware, and often its limits are determined based on the sizing of the application and set by the IT team when the infrastructure is established. These can often be resized as needed, and the IT team will be well aware of the limits here, often monitoring some of the performance themselves already. This is especially so if cloud providers are used, given that these are scaled-up virtualizations which are configured in easy-to-use cloud portals; these two infrastructure models can also be resized as needed.

Lastly, containers can be used to designate operating system resources as needed, in a much more specific way that better supports the sharing of resources across multiple systems. Here the limits are defined in the configuration files or charts that define the container.

The difficulties here center around learning what the limits are, especially in the case of network and disk usage. Network bandwidth can fluctuate, and increased latency and network congestion can occur at random times for seemingly no reason. Most monitoring scenarios can therefore make do with collecting network send and receive rates, as well as disk read and write rates, measured on the server.

Cloud providers like Azure offer VM and disk sizing options that allow you to select exactly what you need, but for network throughput or network IO, the choices are not as varied. Network IO tends to increase with the size of the VM, proportional to the number of CPU cores and the amount of memory, so this may mean that a VM has to be oversized for the user load and the bulk of the application in order to accommodate a large or noisy edge fleet. The next section lists the operating metrics and common thresholds used for each.
We often use these thresholds in our own simulations here at PTC, but note that each use case is different, and each situation should be analyzed individually before determining set limits of performance.

Generally, you will want to monitor: the % utilization of all CPU cores, leaving plenty of room for spikes in activity; and total and used memory, ensuring total memory remains constant throughout and used memory remains below a reasonable percentage of the total, which for smaller systems (16 GB and lower) means leaving around 20% of memory for the OS, and for larger systems, usually around 3-4 GB. For disks, monitor the read and write rates and ensure there is ample free space for spikes, to avoid any situation that might result in system downtime. For networking, monitor the send and receive rates, which should stay below 70% or so of capacity, again to leave room for spikes.

In any monitoring situation, consistently high utilization should trigger concern and an investigation into what's happening. Were new assets added? Has any recent change caused regression or other issues? Any recent changes should be inspected, and the infrastructure sizing should be considered as well.

For ThingWorx-specific monitoring, we look at max queue sizes, entries performed, pool sizes, alerts, submitted task counts, and anything that might indicate some kind of data loss. We want the queues to be consistently cleared out to reduce the risk of losing data in the case of an interruption, and to ensure there is no reason for resource use to build up and cause issues over time.

In order for a monitoring set-up to be truly helpful, it needs to make certain information easily accessible to administrative users of the application. Any metric that is applicable to performance needs to be processed and recorded in a location that can be accessed quickly and easily from wherever the admins are. They should know the health of the application at a glance, without needing to drill down a lot to be made aware of issues. Likewise, the alerts that happen should be meaningful, with minimal false alarms, and it is best if this is configurable by the admins from within the application via some sort of rules engine (see the DGIS guide, soon to be released in version 9.1). The monitoring tool should also be able to save the system history and export it for further analysis, all in the name of reducing future downtime and creating a stable, enterprise system.

The dashboard shown above is a good example of how to roll up a number of performance criteria into health indicators for various aspects of the application. Here there is a green-yellow-red color-coding system for issues like web requests taking longer than 30 seconds, 3 minutes, or more to respond.

Grafana is another application used for monitoring internally by our team. The easy dashboard creation feature and built-in chart modes make this tool super easy to get started with, and certainly easy to refer to from a central location over time. Setting this up is helpful for load testing and making an application ready, but it is also beneficial for continued monitoring post-go-live, and hence it is a worthy investment. Our team usually builds a link based on the start and end time of tests for each simulation performed, with all of the various servers being monitored by one Grafana server, one reference point.

Consider using PTC Performance Advisor (also called DynaTrace) to help monitor these kinds of things more easily.
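As a small, ThingWorx-side illustration of the queue monitoring described earlier in this article, the sketch below polls the event processing queue metrics on a schedule and logs a warning when the queue approaches its maximum. This is a hedged example: the subsystem used, the metric row names ('queueSize', 'maxQueueSize'), and the 70% threshold are assumptions for illustration and should be checked against your version's Help Center.

// Sketch of a scheduled check against the event processing queue metrics.
const QUEUE_WARN_THRESHOLD = 0.7;   // warn at 70% of the configured max (placeholder)

let metrics = Subsystems["EventProcessingSubsystem"].GetPerformanceMetrics();

let queueSize = 0;
let maxQueueSize = 1;
for (let i = 0; i < metrics.rows.length; i++) {
    if (metrics.rows[i].name === "queueSize")    queueSize = metrics.rows[i].value;
    if (metrics.rows[i].name === "maxQueueSize") maxQueueSize = metrics.rows[i].value;
}

if (maxQueueSize > 0 && (queueSize / maxQueueSize) > QUEUE_WARN_THRESHOLD) {
    // In a real system this would feed an alert or dashboard, not just the ScriptLog.
    logger.warn("Event queue at " + queueSize + " of " + maxQueueSize +
                "; investigate before entries start to be dropped.");
}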
When most administrators think of monitoring, they think of reading and reacting to dashboards, alerts, and reports. Rarely does the idea of benchmarking come to mind as a monitoring activity, and yet having solid benchmarks of system performance can be a crucial part of knowing whether an application is functioning as expected before there are major issues. Benchmarks also look at the response time of the server and better enable tracking of the actual end-user experience. The best option is to automate such tests using JMeter or other applications, producing a daily snapshot of user performance that can anticipate future issues and create a more reliable experience for end users over time.

JMeter, which also has the option to build custom reports, is good for simulating the user load, which often makes up most of the server load of a ThingWorx application, especially considering that ingestion is typically optimized independently and given the most thought. The most unexpected issues tend to pop up within the application itself, after the project has gone live.

Shown here is an example benchmark from a Windchill application, one which is published by PTC to facilitate comparison between optimized test systems and real-life performance. Likewise, DynaTrace is depicted, showing an automated baseline (using Smart URL Detection) on response time (median and 90th percentile) as well as failure rate. We can also look at throughput and compare it with the expected value range based on historical throughput data.

Monitoring typically increases system performance and availability, but its other advantage is to provide faster, more effective troubleshooting. Establish a systematic process or checklist to step through when problems occur, something that is organized to be done quickly but still takes the time to find and fix the underlying problems. This will prevent issues from happening again and again, and polish the system periodically as problems occur, so that the stability and integrity of the system only improve over time. Push for real solutions if possible, not band-aids, even if more downtime is needed up front; it is always better to have planned downtime up front than unplanned downtime down the line. Close any monitoring gaps when issues do occur, which is the valid RCA response if not enough information was captured to actually diagnose or resolve the issue.

PTC Tech Support developed a diagnostic data gathering query for Oracle that customers can use, found in our knowledge base. This is an example of RCA troubleshooting that looks at different database factors, reporting on which queries perform the worst based on the input criteria. Another example of troubleshooting is for the Java JVM, where we look at a set of JVM metrics in an automated, documented process that then generates a report for easy end-user consumption.

Don't hesitate to reach out to PTC Technical Support in advance to go over your RCA processes, to review benchmark discrepancies between what PTC publishes and what your real-life systems show, and to ensure your monitoring is adequate to maintain system stability and availability at all times.
New Scenario Using Multi-Kepware for Asset Monitoring in Connected Factories

A new scenario has been completed for Connected Factory implementations, furthering the IoT EDC's goal of providing a reference library of ThingWorx performance. This scenario builds upon the first, with additional tests performed to demonstrate the capabilities of multiple Kepware Servers running side by side. Horizontal scaling is very common for multi-line factory implementations, so be sure to check out the new scenario in this ever-expanding benchmark document.

Note that tests below 10,000 writes per second were not repeated with multiple Kepware Servers, since there is little reason to desire such a configuration in implementations that small. ThingWorx deployment sizing was also held constant throughout these tests to demonstrate the limits of a given configuration. Changes that may improve the results of a failed test (such as adding CPUs or memory) will be mentioned but not validated as part of this benchmark.

Let us know about your applications and how they compare with the data shared here. Happy developing!
The IoT Enterprise Deployment Center’s goal is to create and share knowledge around the best practices for architecting, designing, and deploying successful, enterprise-scale ThingWorx IoT solutions.

To accomplish this goal, the EDC team takes a “real world” approach, using simulated IoT assets and users to benchmark the capabilities of different ThingWorx deployment configurations. First, each implementation is pushed to its limit in an effort to establish real-world baselines, metrics which can be used to help customers determine which architecture choices will work for their custom needs. Then, each implementation is pushed beyond its limits, providing useful insight into where and why things fail, and illuminating potential implementation changes which could push the boundaries further.

Through the simulation testing to come, the EDC will be publishing the resulting benchmarks for all to see! These benchmarks will include details on implementation goals and performance metrics for different stages of deployment. Additionally, best-practice articles which illustrate how to deploy the different architectural components (those referenced within the benchmarks) will also be posted, highlighting the optimal approach to integrating everything into the ThingWorx platform.

Stay tuned to see more about just how versatile the ThingWorx platform can be! We look forward to discussing these findings as they are published right here on the PTC Community.
Dev ops is a crucial process that exists in any software setting, whether you plan for it or not. Chaos in the dev ops process (say, because less time is spent here than on the shiny new features that are easy to sell) results in bottlenecks, and bottlenecks reduce efficiency and leave you open to vulnerabilities as well. The faster you can get a change properly tested and safely into production, the safer and more stable the system is all along.

Issues will arise; they always do. Are you ready for them? Watch this video, see some of these additional links, and think about your dev ops process now, before the fires start!

Useful links:
ThingWorx Monitoring and Alerting Using Prometheus and Grafana, Part 1
ThingWorx Monitoring and Alerting Using Prometheus and Grafana, Part 2
Overview of Monitoring Tools and Diagnostics
The System Health Timer
Architecting Reason Code Trees in DPM
Tori Firewind, IoT EDC

What are Machine Codes?
Factory hardware devices communicate status changes to their human operators and to other machines (IoT) via machine codes. The manufacturers of the factory hardware usually determine these machine codes, so they are often pre-determined. However, how the reason trees map these machine codes to corresponding business logic in ThingWorx is entirely customizable. Knowing the best way to design your reason trees for this purpose can be challenging, so this guide is here to help with the conceptual knowledge. The technical details of using the UI to create, edit, and configure reason codes can be found in the Help Center.

The Tree Trunk
At the highest level of the reason tree, the trunk, there are really three categories: Availability (A), Performance or Productivity (P), and Quality (Q). These should look familiar; they are the three dimensions of OEE (Overall Equipment Effectiveness).
(Fig. 1: Calculation of OEE)

Availability refers to long stops, events that stop planned production long enough that it makes sense to track a reason for being down (typically several minutes, though the threshold between a long stop and a short stop can vary depending on the ideal rate of production of materials).

Availability = Run Time / Planned Production Time

Productivity/Performance really refers to short stops, things that cause the machine to run at less-than-optimal speeds. This can include stops caused by running out of materials for production, doing minor maintenance like switching out a single, easily changed part, or even frequent breaks due to ill health of an operator. User error can be a cause as well, say if the machine needs a certain heat to produce parts, and the heat keeps fluctuating (requiring the machine to take the time to recalibrate before resuming production) because operators are smoking out a back door or adjusting thermostat temperatures.
(Fig. 2: Levels of Runnable Time)

Operator influence is often a factor when it comes to the conditions that permit optimal performance from machinery, and every factory may face different challenges. Stops like these are not really outages; the amount of downtime isn't enough to consider the production block entirely unproductive. Production was continuing and ongoing throughout most of the block despite the issues; the rate was just slower than ideal.

Performance/Productivity = (Total Count / Run Time) / Ideal Run Rate

Quality refers mostly to the number of items that are considered scrap or rework, and it can be split into two categories: start-up scrap (that which is expected because the machine is in the process of warming up or being fine-tuned by the operator) and production scrap (things which come out wrong and must be tossed or reworked because the conditions under which they were produced weren't ideal; this is first-pass yield only, meaning a product is only "good" if it passes inspection the first time).

Quality = Good Count / Total Count

The Branches and the Leaves of the Tree
The "leaves" are the reason codes which directly map to machine codes, and the "branches" are the method of categorization that connects them to the trunk. Both the leaves and the branches, the children and the parent nodes of the tree, are split into two states: planned versus unplanned downtime. Changeovers, maintenance, and even scrap can be broken down into this dichotomy.
For scrap, there are start-up rejects (planned, because the machines have ramp-up periods) and production rejects (unplanned, because the conditions weren't ideal). For maintenance there is planned and unplanned: small changes that occur on the fly result in productivity loss, and maybe also reduce availability in the long run. Small, unplanned changes can occasionally shift into the availability-loss category if a simple, quick repair winds up being complex and time-consuming. A good reason tree can differentiate easily between short and longer stops in order to respond to each in a deliberate way.

To start off the process of architecting your reason tree, try writing the three categories on a board in a common room in an average factory (or several, as a survey). Ask operators to stop in over the course of a few days and write various machine codes that they see often and find useful under one of these categories, or more than one if the machine code pops up under different circumstances and can mean different things. Have them write a ten-word justification if the association isn't obvious. Gather all of the "leaves" in this way, and then begin to associate them with the "trunk", forming the "branches".

An example tree can be seen in Figure 3, with leaves like "Changeover" and "Maintenance" being semi-ambiguous; they could just as easily be seen as unplanned stops. Therefore, there may be multiple reason codes mapping up to the top of the tree in more than one branch, and these can have different categories, which controls how the business logic responds to the different codes. The Help Center has more details about how the events are mapped to types, and each type contains multiple categories, as configured by you when you set up the DPM model.
(Fig. 3: Different types of changeovers may have different codes, and can map up as either planned or unplanned, but all planned and unplanned stops (long stops) are under the Availability category of the trunk.)

Similarly, small stops can involve idling, like if there are not enough materials; reduced speed, if the conditions are not ideal; or other small stops, usually caused by human error or unforeseeable circumstances. Quality loss then refers to the products which fail quality checks, either because the machine still has the wrong paint in the applicator and needs a few rounds to be ready for the next production item, or because the conditions are, again, not ideal, and items wind up scrapped.

Example Reason Tree
(Fig. 5: An example tree with more specific tags; there may be dozens or hundreds in a full reason tree, though the fewer needed to capture the events we care about, the better.)

Theory of Constraints
(Fig. 6: The theory of constraints wheel, an industry process for gradual OEE improvement in factories that has been adapted into the PTC methodology as well.)

While architecting your reason tree, always remember the key purpose: gathering only as much data as necessary to analyze the efficiency of a factory and to identify the bottleneck, or the most limiting factor. The important point is to identify not just the bottleneck that seems the most troublesome, but the one that actually results in the greatest impact to OEE across the entire factory. Without software like DPM, and a properly designed reason code tree, the process of improving a factory can be very challenging, involving a lot of guesswork, and sometimes solving one problem at the cost of another.
The issue is that these machines produce a LOT of raw data, and humans are not the best tool available to gather and aggregate this data in a consumable way. A good reason tree ensures a smart application that can quickly prioritize the machine (bottleneck) that most impacts production, and not just the machine that functions in the least optimal way.

So, the theory of constraints is really a process for identifying small, incremental changes which together can make a big difference, and fast, in factory OEE. The rate at which this cycle can be completed varies, however. The slower the process of identifying constraints and the less information that is gathered, the slower and less precise the first two steps of this process. Alternatively, in a traditional constraint-identification process, too much information can be a problem as well, due to the human limitations discussed above. DPM is a great benefit in this regard, because it aggregates the data in a consumable, comparable way every 5 minutes, freeing up your human analysts for problem solving and prioritization, not data gathering and sorting.

Other Key Tips
Also remember that a good tree treats the trunk like a whole unit, with each category occupying a percentage of the overall OEE. After all, look back up at the three dimensions of OEE in the equations above. For example, the more you see issues with availability, the less you will see issues with scrap, for the machine simply doesn't have as much time to produce scrap if it is constantly down. The more you see issues with quality loss, the less you should see of productivity loss, because these are inversely proportional modes; to say it differently, if a machine is running quickly and seeing few minor maintenance stops, then it is likely to produce more scrap (as well as more good product).

Another thing to remember is that even DPM is limited in its capacity to interpret raw data. Even though it is many magnitudes more efficient than any human gathering and analysis could ever be, there is an upper limit to how much raw data DPM can ingest and analyze before the system gets very expensive. For this reason, you want to ensure your reason trees use only as many reason codes as are required to capture the OEE of a factory site. This will most likely mean using different codes for different types of things, which is easy to do maintainably across many sites using Thing Shapes. Keeping things tightly defined and organized is the easiest way to ensure a clean, efficient system for gathering and storing data.

Also remember that data will not need to persist very long once DPM is fully operational and adopted by your factories. DPM ensures that the changes made to the production line to improve efficiency are the highest impact and the least difficult to implement, meaning that there will be a very rapid return on investment, and a process to ensure future issues are identified and resolved quickly. Data from past issues in the factory won't be as relevant, and historical data stores can be kept smaller than one might think. It is the power of ingesting data directly into the processing and aggregation process, the automatic reduction of data down into presentable, consumable webpages, that makes DPM and ThingWorx such a great factory solution for optimizing OEE.
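To tie the three OEE dimensions above together, here is a small calculation sketch; the input values and variable names are hypothetical, and the formulas simply restate the equations from the beginning of this article, with OEE as the product of the three dimensions.

// Inputs (illustrative numbers) for one production block:
let plannedProductionTime = 480;  // minutes in the planned block
let runTime = 430;                // minutes actually running (long stops excluded)
let idealRunRate = 2.0;           // parts per minute at ideal speed
let totalCount = 780;             // parts produced
let goodCount = 740;              // parts passing inspection on the first pass

// The three OEE dimensions, as defined above:
let availability = runTime / plannedProductionTime;                // long stops
let performance = (totalCount / runTime) / idealRunRate;           // short stops / slow cycles
let quality = goodCount / totalCount;                              // first-pass yield

// Overall Equipment Effectiveness is the product of the three.
let oee = availability * performance * quality;

logger.info("A=" + availability.toFixed(2) + " P=" + performance.toFixed(2) +
            " Q=" + quality.toFixed(2) + " OEE=" + (oee * 100).toFixed(1) + "%");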
View full tip
ThingWorx Monitoring and Alerting, Part 2 Using Prometheus and Grafana By Tori Firewind, IoT EDC Building Dashboards     To add a panel which monitors some component of the ThingWorx application to a dashboard in Grafana, click to add a new panel. Under “Metrics” in the box at the bottom of the screen, select what ThingWorx metrics you wish to monitor (type “thingworx” in the search box to see them all). For example, select the Platform Subsystem memory in use:     Label filters aren’t necessary, though you may want to sort by instance if you are monitoring multiple ones with the same dashboard. You may also want to take some time to format the Y axis, which by default will show in bytes. Go to the formatting panel on the right side and scroll down to the section called “Standard options”. For the Unit dropdown, start typing “data” and then select “bytes (SI)”. This will automatically determine if the bytes you’ve provided should really print as MB or GB based on how large the numbers are.     Rename the panel, modify it in any other way desired, and then click Apply (last 5 minutes):     Once you add the panel, you can watch the memory usage as it is scraped by selecting the refresh option (10s or 30s, whatever makes sense based on your scrape interval).     The viewing window is stored in the URL, so that you can generate a report for a specific interval (like when a test was occurring), and then store that result or share it in a more compact way: http://localhost:3000/d/nleucPv4k/thingworx-monitoring?orgId=1&from=1668528038732&to=1668536503953  (absolute timestamps):     Dashboards are just collections of panels which report on all of the various metrics of performance and stability that exist for single components of a system. This is because there can be quite a few metrics worth watching for each individual component. Most of the third-party tools come with their own dashboards, but the ThingWorx component is one which for now, requires some thought and creativity.     Consider your use case carefully and look over the various subsystems contained within ThingWorx. Each part of an application is localized to specific subsystems, and some are more business critical than others. What will go at the top of your dashboard? Add rows, add panels per row, and see what the many choices are for watching your system.     Don’t forget that with Telegraf running, VM or machine usage metrics are also available for display on a dashboard. Things like overall CPU and Memory usage are critical to determining the health of a system, as we have demonstrated in our own reasoning in past benchmarks and scale tests. You can create a panel to monitor the mem_used versus mem_total, like so:     Another metric from Telegraf worth adding is the CPU usage, which should be given “percentage” for the units and which needs a label filter of cpu = cpu-total. If we do some resizing and drag-and-dropping, then we now have the first row of a dashboard:     See how the Platform usage climbs steadily and is purged in a cycle? That is the Java Garbage Collection mechanism, and it’s important to remember to leave room for spikes on top of those peaks. Data can also be calculated or processed in some way to make it more useful for determining system health and stability.     The data in the picture below uses the formula submitted = completed + number queued + number failed. It shows the current queue on the left Y-axis and the max queue on the right (since the two numbers usually are drastically different). 
It looks pretty, but it doesn’t really tell us much about the system in this format, so let’s do some math and find a representation that is a bit more helpful.     Performing a “non-negative derivative” calculation over the submitted and the completed queue counts over time allows for us to look at the status of the queue as a velocity. When the “complete” speed appears behind the “submitted” speed for too long at a time, then that means the queue is filling up and will eventually result in data loss.     If we take this one step further and calculate the average of the submitted minus the completed over time, then we can actually predict approximately when the queue will fill up. This can then be displayed on a dashboard in Grafana, or used as the basis for an alert.   What to Monitor     In addition to monitoring the system which ThingWorx runs upon, ThingWorx itself can easily be monitored down to the subsystems level by Prometheus due to the Metrics endpoint. Many applications have support built into the way they format the data for scraping, including the JVM (which exposes Prometheus-formatted metrics with the JMX Exporter) and the OS (which can use the Node Exporter or Telegraf for the same purpose). For these more generic components, there are popular community dashboards which can be downloaded and used in Grafana for data analysis and review.     For ThingWorx, there’s different kinds of data to track: subsystem data (see the list on the right) and non-subsystem data. There’s queue based versus non-queue data. These different metrics can collectively characterize the overall health of the application, depending on the use case.     For instance, if this is a system with very many connected devices, one metric which may be important to track is the number of total devices defined on the Foundation server vs. the number of devices which are currently connected. If there are relays involved, then many devices suddenly going offline can mean a relay has failed. Another example is if the system sizing depends on an assumption that there will only ever be a fraction of the total number of devices connected at a time. Use cases like these could be monitored easily by keeping track of the total vs. the number of connected devices.     Other common indicators of a healthy ThingWorx application might include the value stream and stream queues. These queues should fluctuate over time as the data is ingested and processed, but they should never be growing in size. If the stream queues are growing, then that means the data is writing to the queues faster than the queues can write to the database. Eventually, when the system runs out of resources to keep track of the queues, data will be lost. Having the stream information displayed in a chart can make it very easy to spot an upward trend in resource usage early on, which can catch a blockage or bottleneck that needs attention before it starts to affect the larger system in catastrophic ways later.     Memory usage information from the various subsystems might be something worth tracking, as well as the event queue. These can indicate that the business logic is functioning with room to handle spikes, and that the server has enough memory to service all three dimensions of an IoT application: the ingestion, the business logic and thing-based alerting, and the user experience and UI. 
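For ad hoc checks outside of Grafana, the same queue-velocity calculation can be expressed as queries against the Prometheus HTTP API. This is a minimal sketch, assuming Prometheus runs locally; the metric names are placeholders, so check the output of your ThingWorx Metrics endpoint (or the Grafana metric browser) for the real ones. Note that rate() is the Prometheus analog of the "non-negative derivative" described above.

PROM=http://localhost:9090

# Submission rate vs. completion rate over the last 5 minutes
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=rate(thingworx_stream_entries_submitted_total[5m])'
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=rate(thingworx_stream_entries_processed_total[5m])'

# If this difference stays positive for long, the queue is filling faster than it drains
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=rate(thingworx_stream_entries_submitted_total[5m]) - rate(thingworx_stream_entries_processed_total[5m])'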
If file transfers are a key part of the use case, then the number of concurrent transfers, the average speed of them, the size of the files, all of this kind of stuff can be tracked and charted in Grafana by making use of the ThingWorx metrics which automatically show up there once you import the Prometheus data source.     A mature dashboard used for a production environment might look a little like this: For further reading about subsystem monitoring, check out the Help Center.   How to Alert     The alerting mechanism built-in to Prometheus is incredibly easy to configure, so it might be tempting to generate tons of alert rules. However, remember that the more noise a system makes, the harder it is for those monitoring that system to know when action is really required. Playbooks which document how to respond to alerts, who to contact, how to act, and all the information necessary to handle an alert, should be created as an ongoing part of the DevOps process.     Alerts should fire with the right severity in the subject line, as well as all of the information about the issue that is currently known, presented in a concise way, so that whoever receives the alert starts thinking about the root cause sooner and recovers the system faster. Those who receive the alert should have the ability to facilitate its resolution, and know who is expected to react to any alerts which come in.     In the ThingWorx monitoring stack, Prometheus handles the alert rules and the generation of alerts, but alert filtering and delivery is managed in an external alerts manager.     Generally, you want your alerts to follow a curve. If the current queue size exceeds 50% of the maximum, perhaps that isn’t a huge deal, if the application catches up quickly. How long are spikes in queue processing expected to last? Perhaps if the queue size is over half-full for 10 seconds, 30 seconds, then that means the queue is falling behind and not catching up. Ok, so this might be a warning level alert. When does this become an error? Well, let’s say the queue exceeds 90% of the max queue size. This might want to alert the moment it hits the mark. Now, farther along the curve, it may not take as long before data gets lost.     As the severity of the situation increases, the threshold for alerting should increase as well. That way when errors do alert, it is a sure thing that they require a response immediately. The alerts are then pushed into the “Alerts Manager” for delivery based on your management rules. The Alerts Manager may decide to withhold warnings altogether, or send them to a much smaller mailing list, whatever filtering helps to ensure the right people receive the right alerts, right when they need them.   In Conclusion, A Healthy Application...     Has stable memory usage that fluctuates predictably and doesn’t grow over time. In a system experiencing mild issues, the memory starts to trend upward:     If left unattended, systems like this may eventually experience outages. Finding the issue this early means there is even time to do some digging, debugging, taking of stack traces, and other such troubleshooting steps before the system must be restarted or recovered. That can really make the difference in identifying and resolving before there are real problems.     One metric which makes for good alerting is the total number of failed stream entries, which can indicate there’s an issue writing to the database even before the queue has started to fill up. 
Other alerts may include warnings and errors based around percentages of memory used or queues filled, which depend on how long the queues take to fill up and how long the state has been at its increased usage.     Prometheus has all of the tools necessary to make this possible across a variety of infrastructures and use cases. Set it up on a local machine and poke around at what ThingWorx metrics are available to meet your monitoring needs.
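As a concrete starting point, a warning/critical pair along the curve described above might be sketched like this; the metric names, thresholds, and file path are placeholders to be adapted to your own queues and scrape interval.

# Write an example rule file, then reference it under rule_files: in prometheus.yml and reload Prometheus.
cat > thingworx-queue-rules.yml <<'EOF'
groups:
  - name: thingworx-queues
    rules:
      - alert: StreamQueueFillingWarning
        expr: thingworx_stream_queue_size / thingworx_stream_queue_max > 0.5
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "Stream queue over 50% full for 30s on {{ $labels.instance }}"
      - alert: StreamQueueFillingCritical
        expr: thingworx_stream_queue_size / thingworx_stream_queue_max > 0.9
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Stream queue over 90% full on {{ $labels.instance }}"
EOF

# Validate the rule file before reloading Prometheus
promtool check rules thingworx-queue-rules.yml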
View full tip
The IoT Building Block Design Framework By Victoria Firewind and Ward Bowman, Sr. Director of the IoT EDC   Building Block Overview As detailed quite extensively on its own designated Help Center page, building blocks are the future of scalable and maintainable IoT architecture. They are a way to organize development and customization of ThingWorx solutions into modular, well-defined components or packages. Each building block serves a specific purpose and exists as independently as possible from other modules. Some blocks facilitate external data integration, some user interface features, and others the manipulation or management of different kinds of equipment. There are really no limits to how custom a ThingWorx solution can be, and customizations are often a major hurdle to a well-oiled dev ops pipeline. It’s therefore crucial for us all to use a standard framework, to ensure that each piece of customization is insular, easy-to-replace, and much more maintainable. This is the foundation of good IoT application design.   PTC’s Building Block Framework At PTC, building blocks are broken down in a couple different ways: categories and types. The category of a building block is primarily in reference to its visibility and availability for use by the greater ThingWorx community. We use our own framework here at PTC, so our solution offerings are based around solution-specific building blocks, things which we provide as complete, SaaS solutions. These solution-specific building blocks combine up into single solutions like DPM, offerings which require a license, but can then easily be deployed to a number of systems. PTC solutions provide many, complex, OOTB features, like the Production Dashboard of DPM, and the OperationKPI building block.   Anyone doing any sort of development with ThingWorx, however, does still have access to many other building blocks, included with the domain specific building block category. These are the pieces used to build the solution building blocks like DPM, and they can be used to build other more custom solutions as well. Take the OperationKPI block which can vary greatly from one customer to the next in how it is calculated or analyzed. The pieces used to build the version that ships with DPM are right there, for instance, Shift and ReasonCode. They are designed to have minimal dependencies themselves, meant to be used as dependencies by custom blocks which do custom versions of the module logic found within the PTC solution-specific blocks. Then there are the common blocks as well, and these are used even more widely for things like user management and database connectivity.   The type of the building block refers really to where that building block falls in the greater design. A UI building block consumes data and displays it, so that is the View of a classic MVC design pattern. However, sometimes user input is needed, so perhaps that UI building block will depend on another block. This other block could have a utility entity that fuels UI logic, and the benefit to having that be a separate block is then the ease with which it can be subbed out from one ThingWorx instance to the next.   Let’s say there are regional differences in how people make use of technologies. If the differences are largely driven by what data is available from physical devices at a particular site, then perhaps these differences require different services to process user input or queries on the same UI mashups used across every site. 
Well in this case, having the UI block stand alone is smart, because then the Model and Controller blocks can be abstracted out and instantiated differently at different sites.   The two types that largely define the Model and Controller of the classic MVC are called Abstract and Implementation building blocks, and the purposes are intertwined, but distinct: abstract building blocks expose the common API endpoints which allow for the implementation blocks to vary so readily. The implementation blocks are then those which actually alter the data model, which is what happens when the InitializeSolution method is called. In that service, they are given everything they need to generate their data tables and data constraints, so that they are ready to be used once the devices are connected.   Occasionally, UI differences will also be necessary when factories or regions have different ways of doing things. When this happens, even mashups can be abstracted using abstract building blocks. Modular mashups which can be combined into larger ones can be provided in an abstract building block, and these can be used to form custom mashups from common parts across many sites in different implementation blocks all based on the same abstract one.   The last type of building block is the standard building block, which is the most generic. This one is not intended to be overridden, often serving as the combination of other building blocks into final solutions. It is also the most basic combination of components necessary to adhere to the building block design framework and interface with the shared deployment infrastructure. The necessary components include a project entity, which contains all of the entities for the building block, an entry point, which contains the metadata (name, description, version, list of dependencies, etc.) and overrides the service for automatic model generation (DeployComponent), and the building block manager. The dashed arrows indicate an entity implements another, and the solid arrows, extension. The manager is the primary service layer for the building block, and it makes most of the implementation decisions. It also has all of the information required to configure menus and select which contained mashups to use when combining modular mashups into larger views. The manager really consists of 3 entities: a thing template with the properties and configuration tables, a thing which makes use of these, and a thing shape which defines all of the services (which can often be overridden) which the manager thing may make use of.   Most building blocks will also contain security entities which handle user permissions, like groups which can be updated with users. Anyone requiring access to the contents of a particular building block then simply needs to be added to the right groups later on. This as well as model logic entities like thing shapes can be used concurrently to use organizational security for visibility controls on individual equipment, allowing some users to see some machines and not others, and so on.   All of the Managers must be registered in the DefaultGlobalManagerConfiguration table on the PTC.BaseManager thing, or in the ManagerConfiguration table of any entity that implements the PTC.Base.ConfigManagement_TS thing shape. Naming conventions should also be kept standard across blocks, and details on those best practices can be found in the Build Block Help Center.   
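To make the manager's role as the service layer concrete, here is a hypothetical example of invoking a service on a building block's manager thing through the standard ThingWorx REST endpoint. The server URL, application key, entity name, and service name are illustrative only; substitute your own manager entity and one of the services defined on its manager thing shape.

TWX="https://twx.example.com/Thingworx"
APPKEY="<application-key-with-access-to-the-block>"

curl -s -X POST "$TWX/Things/MyCo.MyBlock.Manager/Services/GetMenuConfiguration" \
  -H "appKey: $APPKEY" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{}'

Because the services live on the manager's thing shape, an implementation block can override them without changing how callers (UI blocks or other blocks) invoke them.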
Extending the Data Model Using Building Blocks Customizing the data model using building blocks is pretty straight-forward, but there are some design considerations to be made. One way to customize the data model is to add custom properties to existing data model entities. For instance, let’s say you need an additional field to keep track of the location of a job order, and so you could add a City field to PTC.JobOrder.JobOrder_AP data shape. However, doing this also requires substantial modification to the PTC.JobOrder.Manager thing, if the new data shape field is meant to interface with the database.   So this method is not as straight-forward as it may seem, but it is the easiest solution in an “object-oriented” design pattern, one where the data varies very little from site to site, but the logic that handles that data does vary. In this design pattern, there will be a thing shape for each implementing building block that handles the same data shape the abstract building block uses, just in different ways. A common use case for why this may happen is if the UI components vary from site to site or region to region, and the logic that powers them must also vary, but the data source is relatively consistent.   Another way to extend the data model is to add custom data shapes and custom managers to go with them. This requires you to create a new custom building block, one which extends the base entry point as needed for the type of block. This method may seem more complicated than just extending a data shape, but it is also easier to do programmatically: all you have to do is create a bunch of entities (thing templates, thing shapes, etc.) which implement base entities. A fresh data shape can be created with complete deliberation for the use case, and then services which already exist can be overridden to handle the new data shape instead. It is a cleaner, more automation friendly approach.   However, the database will still need to be updated, and this time CRUD services are necessary as well, those for creating and managing instances of the data shape. So, this option is not really less effort in the short term. In the long-term, though, scripts that automate much of the process for building block generation can be used to quickly and easily allow for development of new modules in a more complex ThingWorx solution. The ideal is to use abstract blocks defined by the nature of the data shape which each of their implementing blocks will use to hook the view into the data model.   This “data-driven” design pattern is one which involves creating a brand-new block for each customization to the data model. The functionality of each of these blocks centers around what the data table and constraints must look like for that data store, and the logic to handle the different data types should vary within each implementation block, but override the common abstract block interface so that any data source can be plugged into the thing model, and ThingWorx will know what to do with it.   The requests can then contain whatever information they contain (like MQTT), and which logic is chosen to process that data will be selected based on what information is received by the ThingWorx solution for each message. Data shapes can be abstracted in this way, so that a single subscription need exist to the data shape used in the abstract building block, and its logic knows how to call whatever functions are necessary based on whatever data it receives. 
This allows for very maintainable API creation for both data ingestion and event processing.
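As a sketch of that data-driven ingestion path (all entity, service, and field names here are hypothetical), an edge integration might post its payload to a single ingestion service exposed by the abstract block, and the implementation block's overriding logic decides how to route it based on the content:

TWX="https://twx.example.com/Thingworx"
APPKEY="<application-key-for-the-edge-integration>"

curl -s -X POST "$TWX/Things/MyCo.Ingestion.Manager/Services/IngestEvent" \
  -H "appKey: $APPKEY" \
  -H "Content-Type: application/json" \
  -d '{ "payload": { "assetId": "Press-07", "eventType": "reject", "reasonCode": "R-113", "quantity": 3 } }'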
View full tip
Using the Solution Central API Pitfalls to Avoid by Victoria Firewind, IoT EDC   Introduction The Solution Central API provides a new process for publishing ThingWorx solutions that are developed or modified outside of the ThingWorx Platform. For those building extensions, using third party libraries, or who just are more comfortable developing in an IDE external to ThingWorx, the SC API makes it simple to still utilize Solution Central for all solution management and deployment needs, according to ThingWorx dev ops best practices. This article hones in one some pitfalls that may arise while setting up the infrastructure to use the SC API and assumes that there is already AD integration and an oauth token fetcher application configured for these requests.   CURL One of the easiest ways to interface with the SC API is via cURL. In this way, publishing solutions to Solution Central really involves a series of cURL requests which can be scripted and automated as part of a mature dev ops process. In previous posts, the process of acquiring an oauth token is demonstrated. This oauth token is good for a few moments, for any number of requests, so the easiest thing to do is to request a token once before each step of the process.   1. GET info about a solution (shown) or all solutions (by leaving off everything after "solutions" in the URL)     $RESULT=$(curl -s -o test.zip --location --request GET "https://<your_sc_url>/sc/api/solutions/org.ptc:somethingoriginal12345:1.0.0/files/SampleTwxExtension.zip" ` --header "Authorization: Bearer $ACCESS_TOKEN" ` --header 'Content-Type: application/json' ` )     Shown here in the URL, is the GAV ID (Group:Artifact_ID:Version). This is shown throughout the Swagger UI (found under Help within your Solution Central portal) as {ID}, and it includes the colons. To query for solutions, see the different parameter options available in the Swagger UI found under Help in the SC Portal (cURL syntax for providing such parameters is shown in the next example).   Potential Pitfall: if your solution is not published yet, then you can get the information about it, where it exists in the SC repo, and what files it contains, but none of the files will be downloadable until it is published. Any attempt to retrieve unpublished files will result in a 404.   2. Create a new solution using POST     $RESULT=$(curl -s --location --request POST "https://<your_sc_url>/sc/api/solutions" ` --header "Authorization: Bearer $ACCESS_TOKEN" ` --header 'Content-Type: application/json' ` -d '"{\"groupId\": \"org.ptc\", \"artifactId\": \"somethingelseoriginal12345\", \"version\": \"1.0.0\", \"displayName\": \"SampleExtProject\", \"packageType\": \"thingworx-extension\", \"packageMetadata\": {}, \"targetPlatform\": \"ThingWorx\", \"targetPlatformMinVersion\": \"9.3.1\", \"description\": \"\", \"createdBy\": \"vfirewind\"}"' )     It will depend on your Powershell or Bash settings whether or not the escape characters are needed for the double quotes, and exact syntax may vary. If you get a 201 response, this was successful.   Potential Pitfalls: the group ID and artifact ID syntax are very particular, and despite other sources, the artifact ID often cannot contain capital letters. The artifact ID has to be unique to previously published solutions, unless those solutions are first deleted in the SC portal. The created by field does not need to be a valid ThingWorx username, and most of the parameters given here are required fields.   3.  
PUT the files into the project

$RESULT=$(curl -L -v --location --request PUT "https://<your_sc_url>/sc/api/solutions/org.ptc:somethingelseoriginal12345:1.0.0/files" `
  --header "Authorization: Bearer $ACCESS_TOKEN" `
  --header 'Accept: application/json' `
  --header 'x-sc-primary-file:true' `
  --header 'Content-MD5:08a0e49172859144cb61c57f0d844c93' `
  --header 'x-sc-filename:SampleTwxExtension.zip' `
  -d "@SampleTwxExtension.zip"
)

$RESULT=$(curl -L --location --request PUT "https://<your_sc_url>/sc/api/solutions/org.ptc:somethingelseoriginal12345:1.0.0/files" `
  --header "Authorization: Bearer $ACCESS_TOKEN" `
  --header 'Accept: application/json' `
  --header 'Content-MD5:fa1269ea0d8c8723b5734305e48f7d46' `
  --header 'x-sc-filename:SampleTwxExtension.sha' `
  -d "@SampleTwxExtension.sha"
)

This is really TWO requests, because both the archive of source files and its hash have to be sent to Solution Central for verifying authenticity. In addition to the hash file being sent separately, the MD5 checksum on both the source file archive and the hash has to be provided, as shown here with the header parameter "Content-MD5". This will be a unique hex string that represents the contents of the file, and it will be calculated by Azure as well to ensure the file contains what it should.

There are a few ways to calculate the MD5 checksums and the hash: scripts can be created which use built-in Windows tools like certutil to run a few commands and manually save the hash string to a file:

certutil -hashfile SampleTwxExtension.zip MD5
certutil -hashfile SampleTwxExtension.zip SHA256
# By some means, save this SHA value to a file named SampleTwxExtension.sha
certutil -hashfile SampleTwxExtension.sha MD5

Another way is to use Java to generate the SHA file and calculate the MD5 values:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.security.NoSuchAlgorithmException;
import org.apache.commons.codec.digest.DigestUtils;

public class Main {
    private static String pathToProject = "C:\\Users\\vfirewind\\eclipse-workspace\\SampleTwxExtension\\build\\distributions";
    private static String fileName = "SampleTwxExtension";

    public static void main(String[] args) throws NoSuchAlgorithmException, FileNotFoundException {
        String zip_filename = pathToProject + "\\" + fileName + ".zip";
        String sha_filename = pathToProject + "\\" + fileName + ".sha";
        File zip_file = new File(zip_filename);
        FileInputStream zip_is = new FileInputStream(zip_file);
        try {
            // Calculate the MD5 of the zip file (hex string for the Content-MD5 header)
            String md5_zip = DigestUtils.md5Hex(zip_is);
            System.out.println("------------------------------------");
            System.out.println("Zip file MD5: " + md5_zip);
            System.out.println("------------------------------------");
        } catch(IOException e) {
            System.out.println("[ERROR] Could not calculate MD5 on zip file named: " + zip_filename + "; " + e.getMessage());
            e.printStackTrace();
        }
        try {
            // Calculate the hash of the zip and write it to a file
            // (reopen the stream, since the MD5 calculation above already consumed it)
            FileInputStream zip_is2 = new FileInputStream(zip_file);
            String sha = DigestUtils.sha256Hex(zip_is2);
            File sha_output = new File(sha_filename);
            FileWriter fout = new FileWriter(sha_output);
            fout.write(sha);
            fout.close();
            System.out.println("[INFO] SHA: " + sha + "; written to file: " + fileName + ".sha");
            // Now calculate MD5 on the hash file
            FileInputStream sha_is = new FileInputStream(sha_output);
            String md5_sha = DigestUtils.md5Hex(sha_is);
            System.out.println("------------------------------------");
            System.out.println("SHA file MD5: " + md5_sha);
            System.out.println("------------------------------------");
        } catch (IOException e) {
            System.out.println("[ERROR] Could not calculate MD5 on file name: " + sha_filename + "; " + e.getMessage());
            e.printStackTrace();
        }
    }
}

This method requires the use
of a third party library called the commons codec. Be sure to add this not just to the class path for the Java project, but if building as a part of a ThingWorx extension, then to the build.gradle file as well:     repositories { mavenCentral() } dependencies { compile fileTree(dir:'twx-lib', include:'*.jar') compile fileTree(dir:'lib', include:'*.jar') compile 'commons-codec:commons-codec:1.15' }       Potential Pitfalls: Solution Central will only accept MD5 values provided in hex, and not base64. The file paths are not shown here, as the archive file and associated hash file shown here were in the same folder as the cURL scripts. The @ syntax in Powershell is very particular, and refers to reading the contents of the file, in this case, or uploading it to SC (and not just the string value that is the name of the file). Every time the source files are rebuilt, the MD5 and SHA values need to be recalculated, which is why scripting this process is recommended.   4. Do another PUT request to publish the project      $RESULT=$(curl -L --location --request PUT "https://<your_sc_url>/sc/api/solutions/org.ptc:somethingelseoriginal12345:1.0.0/publish" ` --header "Authorization: Bearer $ACCESS_TOKEN" ` --header 'Accept: application/json' ` --header 'Content-Type: application/json' ` -d '"{\"publishedBy\": \"vfirewind\"}"' )     The published by parameter is necessary here, but it does not have to be a valid ThingWorx user for the request to work. If this request is successful, then the solution will show up as published in the SC Portal:    Other Pitfalls Remember that for this process to work, the extensions within the source file archive must contain certain identifiers. The group ID, artifact ID, and version have to be consistent across a couple of files in each extension: the metadata.xml file for the extension and the project.xml file which specifies which projects the extensions belong to within ThingWorx. If any of this information is incorrect, the final PUT to publish the solution will fail.   Example Metadata File:     <?xml version="1.0" encoding="UTF-8"?> <Entities> <ExtensionPackages> <ExtensionPackage artifactId="somethingoriginal12345" dependsOn="" description="" groupId="org.ptc" haCompatible="false" minimumThingWorxVersion="9.3.0" name="SampleTwxExtension" packageVersion="1.0.0" vendor=""> <JarResources> <FileResource description="" file="sampletwxextension.jar" type="JAR"></FileResource> </JarResources> </ExtensionPackage> </ExtensionPackages> <ThingPackages> <ThingPackage className="SampleTT" description="" name="SampleTTPackage"></ThingPackage> </ThingPackages> <ThingTemplates> <ThingTemplate aspect.isEditableExtensionObject="false" description="" name="SampleTT" thingPackage="SampleTTPackage"></ThingTemplate> </ThingTemplates> </Entities>       Example Projects XML File:     <?xml version="1.0" encoding="UTF-8"?> <Entities> <Projects> <Project artifactId="somethingoriginal12345" dependsOn="{&quot;extensions&quot;:&quot;&quot;,&quot;projects&quot;:&quot;&quot;}" description="" documentationContent="" groupId="org.ptc" homeMashup="" minPlatformVersion="" name="SampleExtProject" packageVersion="1.0.0" projectName="SampleExtProject" publishResult="" state="DRAFT" tags=""> </Project> </Projects> </Entities>       Another large issue that may come up is that requests often fail with a 500 error and without any message. There are often more details in the server logs, which can be reviewed internally by PTC if a support case is opened. 
Common causes of 500 errors include missing parameter values that are required, including invalid characters in the parameter strings, and using an API URL which is not the correct endpoint for the type of request. Another large cause of 500 errors is providing MD5 or hash values that are not valid (a mismatch will show differently).    Another common error is the 400 error, which happens if any of the code that SC uses to parse the request breaks. A 400 error will also occur if the files are not being opened or uploaded correctly due to some issue with the @ syntax (mentioned above).  Another common 400 error is a mismatch between the provided MD5 value for the zip or SHA file, and the one calculated by Azure ("message: Md5Mismatch"), which can indicate that there has been some corruption in the content of the upload, or simply that the MD5 values aren't being calculated correctly. The files will often say they have 100% uploaded, even if they aren't complete, errors appear in the console, or the size of the file is smaller than it should be if it were a complete upload (an issue with cURL).   Conclusion Debugging with cURL can be a challenge. Note that adding "-v" to a cURL command provides additional information, such as the number of bytes in each request and a reprint of the parameters to ensure they were read correctly. Even still, it isn't always possible for SC to indicate what the real cause of an issue is. There are many things that can go wrong in this process, but when it goes right, it goes very right. The SC API can be entirely scripted and automated, allowing for seamless inclusion of externally-developed tools into a mature dev ops process.
View full tip
User Load Testing in ThingWorx Java Client Tutorial Written by Tori Firewind, IoT EDC   Introduction As stated in previous posts, user load testing is a critical component of ensuring a ThingWorx solution is Enterprise-ready. Even a sturdy new feature that seems to function well in development can run into issues once larger loads are thrown into the mix. That's why no piece of code should be considered production-ready until it has undergone not just unit and integration testing (detailed in our Comprehensive DevOps Guide), but also load testing that ensures a positive user experience and an adequately sized server to facilitate the user load.    The EDC has spent quite a few posts detailing the process of setting up an accurate, real-world testing suite using JMeter for ThingWorx. In this piece, we detail an alternative approach that makes use of the Java Spring Boot Framework to call rest requests against the ThingWorx server and simulate the user load. This Java Client tutorial produces a very immature user load client, one which would still take a lot of development to function as flexibly as the JMeter tutorial counterpart. For Java developers, however, this is still a very attractive approach; it allows for more custom, robust testing suites that come only as an investment made in a solid testing tool.   For someone experienced in Java, the risk is smaller of overlooking some aspect of simulation that JMeter may have handled automatically. For example, JMeter automatically creates more than one HTTP session, and it's much easier to implement randomized user logins instead of one account. The Java Client could do it with some extra work (not demonstrated here), but it uses just the Administrator login by default for a quick and dirty sort of load test, one focused less on the customer experience and more on server and database performance under the strain of the user requests (the method used in our sizing guidance, for instance, to see if a server is sized correctly).   The amount of time required to develop a Java Client isn't so bad for a Java developer, and when compared with learning the JMeter Framework, might be a better investment. A tool like this can handle a greater number of threads on a single testing VM; JMeter caps out around 250 threads per client on an 8Gb VM (under ideal conditions), while a Java Client can have thousands of threads easily. Likewise, a Java Client has less memory overhead than JMeter, less concern for garbage collection, and less likelihood that influence from heap memory management will affect the test results.   However, remember that everything in a Java Client has to be built from scratch and maintained over time. That means that beyond the basic tutorial here, there needs to be some kind of metrics gathering and analysis tool implemented (JMeter has built-in reporting tools), the calls need to be randomized, and not called at set intervals like they are here (which is not a very accurate representation of user load compared to a real-world scenario), and the number of users accessing the system at once should probably vary over time (to resemble peak usage hours). JMeter has a recording tool to ensure all the necessary REST requests to simulate a mashup load are made, so great care has to be taken to ensure all of the necessary REST calls for a mashup are made by the Java Client if a true simulation is called for by that approach.    
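Whatever the language, the heart of such a client is a loop of authenticated REST requests against the ThingWorx server. As a rough illustration only (shown here as bash and cURL for brevity, with placeholder URL, application key, and entity/service names; the Java client described in the tutorial below issues equivalent requests through its own HTTP library), a simulated user load might look like this:

TWX="https://twx.example.com/Thingworx"
APPKEY="<load-test-application-key>"

# Each background job below acts as one simulated user, repeatedly invoking a service
# that backs the mashup under test.
for user in $(seq 1 50); do
  (
    for request in $(seq 1 100); do
      curl -s -o /dev/null -w "user $user: %{http_code} in %{time_total}s\n" \
        -X POST "$TWX/Things/Demo.Dashboard.Helper/Services/GetDashboardData" \
        -H "appKey: $APPKEY" -H "Content-Type: application/json" -H "Accept: application/json" \
        -d '{}'
      sleep 5   # fixed think time; a real test should randomize this
    done
  ) &
done
wait

A real client should also randomize logins and think times and capture the response codes and timings for later analysis, as discussed above.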
Java Client Tutorial

Conclusion
Neither a Java Client nor a JMeter testing suite is inherently better than the other, and both have their place within PTC's various testing processes. The best test of all is to stand up any sort of user load testing client, either of these approaches, at the same time as the UAT or QA user experience testing. QA testers who load and click about on mashups in true, user fashion can then see most accurately how the mashups will perform and what the users will experience in the Enterprise-ready, production application once the changes go out.
View full tip
Solution Central and Azure Active Directory Written by: Tori Firewind, IoT EDC   As we’ve said in a previous post, Solution Central (SC) is a crucial part of any mature dev ops pipeline. In its latest version (3.1.0), it manages custom solutions even more easily due to the SC API. However, this comes with a few requirements that can be a little tricky.   One of the more complex configuration pieces for using this new API involves a cloud-hosted Active Directory (AD) application within Azure AD. In order to make use of the new API, your organization’s users must exist on an Azure AD tenant that is separate from the PTC tenant. Most PTC customers are placed on this PTC-owned tenant by default, so additional configuration may be required to set up the AD instance within Azure before the SC API can be used. The type of tenant has to be Azure, of course, as that is what PTC uses: a multi-tenancy Azure infrastructure for all authentication.   In this usage, “tenant” is a cloud-hosting, infrastructure term which essentially refers to an Azure VM hosting an Active Directory server, as well as managing the many AD components that would otherwise require a lot of oversight. Users within that AD server are grouped so that only those solutions published by their own organizations are visible within Solution Central. In this way, the term “tenant” can also refer to a partition of some kind of data; a Solution Central tenant includes the users, the solutions, and is sort of like a sub-tenant of the larger PTC infrastructure. PTC uses a global Solution Central tenant for deployment of its own solutions, like DPM (Digital Performance Management), which is therefore available to every user in the PTC Azure AD tenant or a tenant which has been connected using the tutorial below.   There are good reasons to want to use your own Azure AD integrated into Solution Central that do not involve using the SC API. For one thing, it allows for direct control over the users that have access; otherwise, a ticket to PTC support is necessary every time a user needs access granted or removed. The tutorial below is a great reference for anyone using Solution Central, with steps 1-4 offering an easy guide for integrating your own Azure AD and simplifying your user management.   To use the SC API, however, there are additional steps required to create an application for retrieving an oauth token. This application needs the “solution-publisher” role or else bad request/forbidden errors will pop up. This role is automatically available in your Azure AD tenant once it is linked to the PTC AD tenant, but it does have to be manually assigned to the application, whose sole function is to request this access token for authenticating requests against the SC server.   The Azure application you need to create is essentially a plugin with permission to publish solutions against the SC API. It functions a little like a “login” in database terminology, serving the function of authenticating to an endpoint (in this case not tied to a user or any kind of identity). This application must be able to request an access token successfully and return that to the scripts which call upon it in order to perform the project creation and publication requests. The tutorial below steps you through how to create this application and request all of your solutions via the SC API.   
Tutorial: Create Azure AD Application to Access SC via API
1. Create a new tenant in Azure AD for users who will have access to the SC API.
2. Create a ticket with PTC to provide the tenant ID and begin the onboarding process.
3. Once that process completes, you will receive an email with a custom link to log in to the Solution Central portal for your organization, where only your solutions exist.
4. Users will then need to be added as "Custom Global Administrators" on the SC enterprise application within Azure Portal to grant them access to log in to Solution Central in a browser.
5. In order to be able to use the SC API, however, more work needs to be done; an application to request an oauth token must be created in Azure Portal:
   - Navigate to "App registrations" from the Azure AD page in Azure Portal and then click "New registration".
   - Enter the application name (ours is called "gradle-plugin-tokenfetcher" in the examples shown here), and select the "single tenant" radio button.
   - Enter a URL for the redirect URI and, for "Select a platform", select "Web".
   - Click "Register", and wait for the application created notification.
6. Now we need to add permissions to see Solution Central:
   - Open the app from under "App registrations" and select the "API permissions" tab.
   - Click "Add a permission" and, in the pop-up window that appears, select the "APIs my organization uses" tab, which should have PTC Solution Central listed if this tenant has been linked to the PTC tenant per step 2 above.
   - Select "Application Permissions" and then check the "solution-publisher" role.
   - Click "Add permissions". The application now has access to Solution Central as a publisher, able to send solution publications over the API and not just via the Platform interface.
7. The application is ready now, so we can create a request to test that it has access to SC. Using Windows Powershell or something similar, create a script to make a couple of cURL requests against first the Azure AD and then the Solution Central API (attached as well, and sketched below). A few notes on where to find the values the script needs:
   - The tenant ID and the directory ID are the same value (listed on the tenant under "Overview"), and the client ID and application ID are the same as well (listed on the application, shown here), so don't be confused by terminology.
   - The client app secret is the "Value" provided by the system when a new client secret is created under "Certificates & Secrets" (remember to copy this as soon as it is made and before clicking away, as after that, it will no longer be visible).
   - The Solution Central app ID can be found under "App registrations", listed within the details for the "solution-publisher" role.
   - This access token request piece must be done before every request to the SC API, so this secret should be kept in some kind of password management tool (or a global environment variable in GitLab) so that it isn't found anywhere in the source code.
8. If the script pings the SC API successfully, then a list of solutions will be returned.
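Here is a minimal sketch of what that script can look like, shown in bash syntax (the attached example uses PowerShell). The tenant ID, client ID, secret, and scope values are placeholders for the values gathered in the steps above; the exact scope string depends on how the Solution Central application is exposed in your tenant, and the sketch assumes the jq utility is available for parsing the JSON token response.

TENANT_ID="<your-azure-ad-tenant-id>"
CLIENT_ID="<token-fetcher-application-id>"
CLIENT_SECRET="<client-secret-value>"
SC_URL="https://<your_sc_url>"

# 1. Request an access token from Azure AD using the client credentials flow
ACCESS_TOKEN=$(curl -s -X POST "https://login.microsoftonline.com/$TENANT_ID/oauth2/v2.0/token" \
  --data-urlencode "grant_type=client_credentials" \
  --data-urlencode "client_id=$CLIENT_ID" \
  --data-urlencode "client_secret=$CLIENT_SECRET" \
  --data-urlencode "scope=<solution-central-app-id-uri>/.default" \
  | jq -r '.access_token')

# 2. Use the token to list the solutions visible to your tenant
curl -s "$SC_URL/sc/api/solutions" \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  -H 'Content-Type: application/json'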
View full tip
The DPM User Experience Written by Tori Firewind, IoT EDC Team   As discussed in a previous post, DPM is a tool designed to be beneficial at all levels of a company, from the operators monitoring automated data on production events from the factory machines themselves, to the production supervisors who need to establish, task out, and track machine maintenance and improvement measures. DPM also engages the continuous improvement and plant leadership, by providing a standardized way to monitor performance that ultimately rolls up to the executive level. The end users of DPM are therefore diverse both in how they access DPM, and how they make use of its various features.   One of the perks to building DPM on top of the ThingWorx Foundation is that many of the webpages (called “mashups”) within ThingWorx are already responsive, and any  which aren’t responsive OOTB can be modified and custom designed for different size viewing screens to ensure that if necessary, end users can access DPM   from a variety of locations and devices. Most of the time, end users will be accessing mashups from hard-wired dashboards mounted on the actual devices,    or from wireless laptops which have standard size screens with standard resolutions. For use cases involving phones or tablets, however, it may be necessary to see how DPM will perform across a variety of bandwidth and latency conditions. Often, cellular or satellite connection is a must to facilitate field team cooperation, and 5G networks often result in worsened performance.   So, to demonstrate the influence of bandwidth and latency on the responsiveness of DPM, the Production Dashboard was loaded in the Google Chrome browser repeatedly under varying conditions. This dashboard is the webpage most operators and field users would access to log event information and production details (so it is widely used by end users). This provides a sort of benchmark of the DPM solution, something which indicates what can be expected and tells us a few things about how DPM should be deployed and configured.   Latency was introduced by hosting the servers involved in the test in different regions (all Azure cloud hosted servers, one in US East, one US West, and one in Japan East). Bandwidth was introduced using a tool on the PC with either no bandwidth or 4 megabits/second.   Browser caching was turned on and off as well, to simulate the difference between new and return users; new users would not have the webpage cached, so their load times are expected to be longer. Tomcat compression was also configured in half of the runs to demonstrate the importance of compression for optimal performance.   Each of these 24 scenarios was then tested 10 times from each location, and the actual data can be found in the attached benchmark document (a working  solution benchmark, which is not designed to be referenced directly, as matters of infrastructure may influence the exact performance of the solution).  Even with bandwidth, every region sees better performance for return users versus new users, which may be important to note. However, because DPM field users most commonly access DPM often, the return user time is a better indicator of adoption, and those numbers look great in our simulations. Notice the top line which shows the very worst of mobile performance, what happens over networks with bandwidth when Tomcat Compression is not enabled. 
Load times vary only slightly on regular networks when Tomcat compression is enabled, and compression vastly improves performance across regions and on mobile networks, so it is highly recommended (instructions on how to enable it are below).

Key Takeaways
Latency and bandwidth impact DPM performance in exactly the way one would expect of a web application. While any DPM server can be accessed from any region, regions with more latency will experience delays proportional to that latency. In the chart here, each scenario is represented by a different color (different from the charts above), and the three shades of each color represent the three regions:
- Green represents the optimal configuration settings (Tomcat compression enabled, caching turned on) for returning users with bandwidth limitations (i.e. mobile networks like 5G)
- Blue shows first-time page visitors with no bandwidth limitations
- Purple shows first-time visitors that do have bandwidth limitations
- The uncompressed first-time load for mobile users (those with bandwidth limitations imposed) within the same region is also given, to demonstrate the importance of enabling Tomcat compression (load times only get worse without compression the farther away the region)

Notice how the green series has lower load times across the board than the blue one, meaning that return users, even with bandwidth limitations, have better performance in every region than new users. Also notice how the gap is larger between lighter colors and darker colors, where the darker the color, the farther the region from the DPM servers. This indicates that network latency has a more significant influence on performance than bandwidth, with only longer-running transactions like file uploads seeing a significant performance hit on a network with bandwidth limitations.

Find out how to enable Tomcat compression and review the full solution benchmark in the document attached.
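Since the attached document covers the server-side configuration, here is a quick client-side check that compression is actually being applied (a sketch only; the URL is a placeholder for your own DPM/ThingWorx server, and any text or HTML resource served by the platform will do):

# Request a resource with gzip accepted and inspect only the response headers
curl -s -o /dev/null -D - -H "Accept-Encoding: gzip" \
  "https://<your-dpm-server>/Thingworx/Composer/index.html" | grep -i "content-encoding"

# "Content-Encoding: gzip" in the output means compression is on; no output means it is not.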
View full tip
Announcing: ThingWorx Solution Central 3.1.0 and its New API
Written by: Tori Firewind of the IoT EDC

Solution Central 3.1.0
ThingWorx Solution Central (SC) is the solution management tool for ThingWorx and Digital Performance Management (DPM), the latest version of which (DPM 1.1) can now be deployed directly from the PTC Solutions menu of SC. Streamlining packaging strategies and ensuring efficient solution deployment can now be done for all kinds of ThingWorx solutions, even those with heavy customization. The new API allows for updated building blocks from within the PTC Solutions menu to be easily discovered and deployed right alongside custom solutions. Even the most advanced developers can now house their deployment management process within SC.

As discussed in a previous article, Solution Central forms a necessary part of a mature DevOps pipeline, usually as a set of services within Foundation which finalize and publish the solution to the Solution Central servers. The recommendation to utilize Solution Central from within Foundation remains a best practice for ThingWorx DevOps, because the vast majority of solutions benefit from using the ThingWorx APIs, which scan and check for dependencies and proper XML formatting on each included entity.

Packaging and publishing the solution from within ThingWorx is the easiest and most straightforward way recommended by PTC, but it is now possible to publish to Solution Central using a standard API for those who need to publish from Jenkins or other build jobs. If there are legacy extensions, third-party tool dependencies, or other customizations within the ThingWorx application, then it may be beneficial to use this new API instead.

The new API also allows for editable extensions and entities within a published solution, though PTC still recommends avoiding this as a general practice. It is usually better for purposes of maintainability and ease of upgrades to just publish the solution again (with an incremented version) each time any changes are made.

How to Use the API
Within Solution Central, a new menu option has been added to review the API, with information about the different types of requests and their parameters and responses. To access this within the Solution Central UI, open the help menu and navigate to "Public APIs" (see the image on the right). To see a sample response, select a request type and scroll down to the "Responses" section (shown below). Examples of error responses are also provided, and it's important to ensure that whatever makes these requests can properly report or log any errors for troubleshooting and maintenance of the DevOps process.

The general steps for making use of this API are as follows:
1. Create the solution resource with a POST.
2. Add some files to that solution with PUTs:
   - First, create the solution archive with the right Solution Identifiers; this should contain at least one project XML and all of the entities belonging to that ThingWorx project.
   - Next, compute MD5 using a tool like DigestUtils on the contents of the archive; this checksum is required for Solution Central.
   - Compute the SHA hash on the archive and save it; this will need to be provided along with the archive in the PUT requests.
   - Compute MD5 on the hash file also.
   - Finally, make the two PUT requests, one for the archive and one for its hash; for example cURL requests, see the Help Center.
3. Publish the solution using a PUT.

So, with a little more work it is now possible to make use of SC in a more custom DevOps process.
It is now possible to build JSON or XML solutions using development tools outside of ThingWorx and still publish these customizations to Solution Central. The process of DevOps Management just became more versatile, and with the ease of deployment of DPM and other PTC building blocks as well, ThingWorx is now more accessible and easy to use than ever before.
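To complement the cURL examples in the Help Center, here is a condensed, hedged sketch of steps 2 and 3 above in bash, using openssl for the checksums (certutil or the Java approach shown in the SC API article work just as well); the file names, GAV identifier, username, and $ACCESS_TOKEN are placeholders:

ZIP=MySolution.zip
SHA=MySolution.sha

# SHA-256 of the archive, saved to the .sha file that is uploaded alongside it
openssl dgst -sha256 -r "$ZIP" | cut -d' ' -f1 > "$SHA"

# MD5 checksums (hex, not base64) for the Content-MD5 headers of the two file PUT requests
openssl dgst -md5 -r "$ZIP" | cut -d' ' -f1
openssl dgst -md5 -r "$SHA" | cut -d' ' -f1

# Final publish call once both files have been PUT successfully
curl -s -X PUT "https://<your_sc_url>/sc/api/solutions/org.ptc:mysolution:1.0.0/publish" \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"publishedBy": "your.username"}'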
View full tip
MachNation Podcast Replay: Enterprise-Specific Implementation Testing
a podcast, by Mike Jasperson, VP of the IoT EDC

MachNation, a company exclusively dedicated to testing and benchmarking Internet of Things (IoT) platforms, end-to-end solutions, and services, has conducted a recent podcast series featuring our very own Mike Jasperson, Vice President of the IoT Enterprise Deployment Center here at PTC. Performance IoT is a podcast series that brings together experts who make IoT performance testing and high-resiliency IoT part of their IoT journey. Mike Jasperson's podcast is episode 5 in the series, titled: Enterprise-Specific Implementation Testing. Enjoy!
View full tip
Introduction to Digital Performance Management (DPM) Written by: Tori Firewind, IoT EDC   “Digital Performance Management (DPM) is a closed-loop, problem solving solution that helps manufacturers identify, prioritize, and solve their biggest loss challenges, resulting in reduced cost, increased revenue, and improved service levels.” – DPM Help Center What is DPM? Digital Performance Manager (DPM) is an application which improves factory efficiency across a variety of different areas, namely “the four P’s” of Digital Transformation: products, processes, places, and people. Each performance issue in a factory can be mapped to at least one of these improvement categories in a new strategy for Continuous Improvement (CI) founded by PTC.   Figure 1 – Each performance issue in a factory can be mapped to at least one of 4 fundamental improvement categories: products, processes, places, and people. PTC’s new, industry-leading strategy for continuous improvement (CI) in factories is a “best practice” approach, taking the collective knowledge of many customers to form a focused, prescriptive path for success. 11 Closing the Loop Across Products, Processes, People, and Places, Manufacturing Leadership Journal   At PTC, CI in factories is driven by a “best practice”  approach, with years of experience in manufacturing solutions combining with the collective knowledge of the many diverse use cases PTC has encountered, to generate a focused, prescriptive path for improvement in any individual factory. Figure 2 – DPM is a closed loop for continuous improvement, a strategy built around industry standard best practices and years of experience.  PTC is also defining new industry standards for OEE analysis by using time as a currency within DPM. This standardization technique improves intuitive impact assessment and allows for direct comparison of metrics (see the Help Center for details on how each metric is calculated).   DPM creates a closed loop for CI, from the monitoring phase performed both automatically and through manual operator input, to the prioritization and analyzation phases performed by plant managers. DPM helps plant managers by tracking metrics of factory performance that often go overlooked by other systems. With Analytics, DPM can also do much of the analysis automatically, finding the root causes much more rapidly. Figure 3 – All levels of the company are involved in solving the same problems effectively and efficiently with DPM. Instead of 100 people working on 100 different problems, some of which might not significantly improve OEE anyway, these same 100 people can tackle the top few problems one at a time, knocking out barriers to continuous improvement together. Production supervisors who manage the entire production line then know which less-than-effective components on the line need help. They can quickly design and redesign solutions for specific production issues. Task management within DPM helps both the production manager and the maintenance engineer to complete the improvement process. Using other PTC tools like Creo and Vuforia make the path to improvement even faster and easier, requiring less expert knowledge from the front-line workers and empowering every level of participation in the digital transformation process to make a direct, measurable impact on physical production.       How Does DPM Work? DPM as an IoT application sits on top of the ThingWorx Foundation server, a platform for IoT development that is extensible and customizable. 
Manufacturers therefore find they rarely have to rip and replace existing systems and assets to reap the benefits of DPM, which gathers, aggregates, and stores production data (both automatically and through manual input on the Production Dashboard) so that it can be analyzed using time as a currency (a short worked example appears at the end of this article). DPM also manages the process of implementing improvements (using the Action Tracker) based on the collected data, and it provides an easy way to confirm that the improvements make a real difference in overall OEE (through the Performance Analysis Dashboard). Because the analysis occurs both before and after the improvement steps are taken, manufacturers can rest assured that resources invested in improvements are not wasted; DPM is a predictive and prescriptive analysis process.

DPM makes use of an external SQL Server to run queries against collected data and perform aggregation and analysis tasks in the background, on a separate server from the thing model and ingestion database. This ensures that use cases involving real-time alerts and events, high-capacity ingestion, and the like remain possible on the ThingWorx Foundation server.

The IoT EDC is dedicating a series of technical briefs to DPM, providing insight and expert-level recommendations on DPM usage and configuration. Stay tuned to the PTC Community for more updates to come.
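To illustrate why time works well as a common currency, here is a generic sketch of OEE expressed entirely in time units. This is not necessarily the exact calculation DPM performs (the authoritative metric definitions are in the Help Center); it simply shows that when every factor is a ratio of times, every loss category subtracts hours directly from planned production time:

```latex
\mathrm{OEE}
  = \underbrace{\frac{\text{run time}}{\text{planned time}}}_{\text{availability}}
    \times
    \underbrace{\frac{\text{net run time}}{\text{run time}}}_{\text{performance}}
    \times
    \underbrace{\frac{\text{fully productive time}}{\text{net run time}}}_{\text{quality}}
  = \frac{\text{fully productive time}}{\text{planned time}}
```

For example, with 80 planned hours in a week, 10 hours of downtime, 6 hours of speed loss, and 4 hours spent producing scrap, fully productive time is 60 hours and OEE is 60 / 80 = 75%. Because each loss is simply a number of hours, a 10-hour downtime problem can be compared directly against a 6-hour throughput problem when deciding what to fix first.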
ThingWorx DevOps with Azure: The Comprehensive DevOps Guide
Written by Tori Firewind, IoT EDC

As promised in a previous post, attached here is a comprehensive guide to DevOps in ThingWorx, including tutorials and instructions for creating a continuous integration, continuous deployment (CI/CD) process for application development. Updated scripts and entities are also attached, including an entire sample application for importing, exporting, and testing an application in ThingWorx. From Docker and GitHub to Azure DevOps and Solution Central, this guide has it all. Learn how to perform your role in the DevOps process, whether as an administrator or a developer, automate your deployments and testing, and create a more efficient process for publishing changes to production. A complete DevOps process like this really does deliver faster and easier updates with fewer risks, fewer delays, and a better pathway to success.
ThingWorx DevOps
By Victoria Firewind, IoT EDC

This presentation accompanies a recent Expert Session, found here, with video content including demos of the topics covered below.

DevOps is a process for taking planned changes through development, through testing, and into production, where they can be accessed by end users. One test instance typically runs automated tests (integration testing) to ensure application logic is preserved in spite of whatever changes the developers are making, and often there is another test instance to ensure the application is usable (UAT testing) and able to handle a production load (load testing).

A DevOps pipeline starts with a task manager tasking out planned changes, where each task becomes a branch in the repository. Each time a new branch is created, a new pipe is needed, which in this case is produced by Docker Hub. Developers then make changes within that pipe, and those changes flow along the pipe into testing. In this diagram, testing is shown as a valve which, when open (i.e. when all tests pass), allows the changes to flow along the pipe into production. A good DevOps process has good flow along the various pipelines, with as much automated or scripted as possible to reduce the chances for errors in deployments.

In order to create a seamless pipeline, whether or not it winds up automated, several third-party tools are useful. Container software is a very good way to improve the maintainability and updatability of a ThingWorx instance, while minimizing the amount of resources needed to host each component.

1. Create a Docker image. Follow the instructions from Docker for installing the necessary tools: Docker itself (docker) and Docker Compose (docker-compose). Consult the Help Center if need be, and update your YML file with everything you need before starting the image (see the example in the PTC Community). License the instance using the license management website.

2. Save the Docker image in a Docker repository. Docker Hub has some free options, and if a license is purchased it can host more than a single Docker image and tag. It is also possible to set up your own Docker registry.

3. Access the image in Docker Desktop. Download Docker Desktop and sign in to the Docker account that hosts the repository. Create folders for storing the h2.env file and the mounted ThingworxStorage and ThingWorxPlatform folders. Remember to license these containers as well: developers log in to the license management site themselves and place the resulting file ("license_capability_response.bin") into the mounted ThingWorxPlatform folder. (A command-line sketch of these three steps appears below, after the sample application description.)

Git is a very versatile tool that can be used through many different mediums, like Azure DevOps or GitHub Desktop. To get started as a totally new Git user, try downloading GitHub Desktop on your local machine and creating a local repository with the provided sample code. This can then be cloned on a Linux machine (presumably whichever machine hosts the integration ThingWorx instance) using the provided scripts, once they are configured. Remember to install Git on the Linux machine if necessary (sudo apt-get install git).

A sample ThingWorx application (not officially supported, and provided only as an example of how to perform DevOps-related tasks in ThingWorx) is attached to this post as a zip file containing two directories, one for scripts and one for ThingWorx entities.
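Returning to the Docker steps above, here is a minimal command-line sketch of what steps 1 through 3 can look like from a Linux shell. The repository name, tag, port, and host paths are placeholders, and the container mount points should be adjusted to match whatever your own Dockerfile and YML file define; only the license file name comes from the steps above.

```bash
# Log in to the Docker Hub account (or private registry) that hosts the ThingWorx image
docker login

# Pull the image built and pushed in steps 1 and 2
# (repository name and tag are placeholders; use your own)
docker pull mycompany/thingworx-foundation:9.3

# Create host folders for the env file and the mounted ThingWorx folders
mkdir -p /opt/twx/ThingworxPlatform /opt/twx/ThingworxStorage

# Place the license response file where the platform expects to find it
cp license_capability_response.bin /opt/twx/ThingworxPlatform/

# Start the container with the h2.env file and the two mounted folders
# (port and mount points should match your YML/Dockerfile)
docker run -d --name twx-foundation \
  --env-file ./h2.env \
  -p 8080:8080 \
  -v /opt/twx/ThingworxPlatform:/ThingworxPlatform \
  -v /opt/twx/ThingworxStorage:/ThingworxStorage \
  mycompany/thingworx-foundation:9.3
```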
To put the sample application to use, copy the Git scripts and config file into the top level, above the repository folder, and update the GitConfig.sh file with the URL for your Git repository and your login credentials. These scripts can then be used to sync your Linux server with your Git repository, which any developer can easily update from their local machine. This keeps changes secure and enables other DevOps practices, such as tasks, epics, and corresponding branches of code.

Steps to DevOps using the provided code as an example (a sketch of how these steps might be strung together into a scripted pipeline stage follows the conclusion below):
1. Clone the repository into the SystemRepository, or any other created repository, using the provided scripts in a Linux environment.
2. Import the DeploymentUtilities entity, which again is scripted for Linux or for use with a development IDE with bash support.
3. Import the ThingWorx application from source control, or use the script (which itself makes use of the DeploymentUtilities entity).
4. Make some local changes, add things, and so on, then try out the UpdateApplication script, or export to source control and push to the Git repository. Data and localization table exports are also possible.
5. Run the tests using the provided IntegrationTester thing, create your own by overriding the IntegrationTestTS thing shape, or use the TestTwxApplication script from a Linux terminal.
6. Design a process for your application that allows for easy application exports and updates to and from a repository, so that developers can easily send in their changes, which can then be loaded and tested in another environment.

In Conclusion: DevOps is a complex topic, and every PTC customer will have their own process based around their unique requirements and applications. In the future, more mature pipeline solutions will be covered, including ones that also publish to Solution Central for easier deployment between the various testing instances and production.
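As a closing illustration, here is a rough sketch of how the provided scripts could be chained into a single automated stage on the integration server. The script names echo those mentioned in the steps above, but the exact file extensions, paths, and arguments are assumptions; check the attached sample for the real interfaces before wiring anything up.

```bash
#!/bin/bash
# Hypothetical integration-stage script built around the sample DevOps scripts
# described above. Paths, extensions, and arguments are placeholders.
set -e

cd /opt/devops                 # top level containing GitConfig.sh and the repository folder
source ./GitConfig.sh          # assumed to provide the repository URL and credentials

# Pull the latest entities and data pushed by developers
(cd ./twx-app-repository && git pull)

# Load the updated application into the integration ThingWorx instance
./UpdateApplication.sh

# Run the integration tests (backed by the IntegrationTester thing);
# a non-zero exit code fails the pipeline stage
./TestTwxApplication.sh

echo "Integration stage passed; changes are ready for UAT and load testing."
```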
Is your team operating an effective DevOps pipeline? DevOps is an important part of a mature, enterprise-ready application, but the process isn't simple. This Expert Session will focus on what a full DevOps pipeline looks like and how PTC can help you build a seamless one. Join us to learn how to create a Docker image, integrate Azure with Docker and Git, and set up a seamless DevOps pipeline.

When? Thursday, September 30th, 2021 | 11 AM EST
Host: Tori Firewind, Senior Engineer in the PTC IoT Enterprise Deployment Center
Registration link: https://www.ptc.com/en/resources/iiot/webcast/devops-pipeline-thingworx