IoT & Connectivity Tips

Concepts of Anomaly Detection used in ThingWatcher

ThingWatcher is based on anomaly detection with the normal distribution. What does that mean? Normally distributed metrics follow a set of probabilistic rules. Incoming values that follow those rules are recognized as "normal" or "usual", whereas values that break those rules are recognized as unusual.

What is a normal distribution?
A normal distribution is a very common probability distribution. In real life, the normal distribution approximates many natural phenomena. A data set is said to be "normally distributed" when most of the data aggregates around its mean in a symmetric way, and its extreme values become less and less likely to appear.

Example
When a factory produces 1 kg sugar bags it does not always produce exactly 1 kg. In reality, the weight is around 1 kg: most of the time very close to 1 kg and very rarely far from 1 kg. The production of 1 kg sugar bags follows a normal distribution.

Mathematical rules
When a metric is normally distributed it follows some interesting laws, as the sugar bag example does:
- The mean and the median are the same; both are equal to 1000 g, because of the perfectly symmetric "bell shape".
- The standard deviation, called sigma (σ), defines how the normal distribution is spread around the mean. In this example σ = 20 g.
- 68% of all values fall within [mean − σ; mean + σ]; for the sugar bags, [980; 1020].
- 95% of all values fall within [mean − 2σ; mean + 2σ]; for the sugar bags, [960; 1040].
- 99.7% of all values fall within [mean − 3σ; mean + 3σ]; for the sugar bags, [940; 1060].
The last three rules are known as the 68–95–99.7 rule, also called the three-sigma rule of thumb.

When the rules get broken: it's an anomaly
As previously stated, when a metric has been shown to be normally distributed, it follows a set of rules. Those rules become the model representing the normal behavior of the metric. Under normal conditions, incoming values will match the normal distribution and the model will be followed. But what happens when the rules get broken? That is when things turn out differently, because something unusual is happening. In theory, in a normal distribution no value is impossible: if the weights of the sugar bags were truly normally distributed, we would find a bag of about 860 g roughly once every billion bags. In practice, we approximate the sugar bag example as normally distributed, and such nearly impossible values are treated as impossible.

Techniques of Anomaly Detection

Technique 1: outlier values
A nearly impossible value can be considered an anomaly. When a value deviates too much from the mean, say by ±4σ, we can consider this nearly impossible value an anomaly (this limit can also be calculated using percentiles). Sugar bags that weigh less than 920 g or more than 1080 g are considered anomalous; chances are there is a problem in the production chain. This provides a simple way to define maximum and minimum thresholds.

Technique 2: detecting a change in the normal distribution
Technique 1 can detect unusual values quickly, using only a few points, but it cannot detect anomalies whose values move from one sigma band to another in an otherwise usual manner. To detect this kind of anomaly we use a "window" of the n last elements. If the mean and standard deviation of this window change too much from the usual values, we can deduce an anomaly. The bigger the window, the more stable the detection becomes, but it requires more time to detect the anomaly, because more values need to be aggregated for the detection.
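To make the two techniques concrete, here is a minimal Python sketch (not ThingWatcher's actual implementation) that applies a ±4σ outlier check and a sliding-window check to simulated sugar-bag weights; the tolerance values are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(loc=1000, scale=20, size=10_000)   # simulated 1 kg sugar bag weights in grams

mean, sigma = weights.mean(), weights.std()

# Technique 1: flag values that deviate more than 4 sigma from the mean
def is_outlier(value, mean, sigma, k=4):
    return abs(value - mean) > k * sigma

print(is_outlier(1085, mean, sigma))   # True  -> anomalous bag
print(is_outlier(1015, mean, sigma))   # False -> normal bag

# Technique 2: compare a window of the n last values against the learned model
def window_anomaly(window, mean, sigma, mean_tol=0.5, std_tol=0.5):
    # Flag the window if its mean or standard deviation drifts too far from the
    # baseline; tolerances are fractions of sigma (illustrative values only)
    w = np.asarray(window)
    mean_shift = abs(w.mean() - mean) > mean_tol * sigma
    std_shift = abs(w.std() - sigma) > std_tol * sigma
    return mean_shift or std_shift

print(window_anomaly(rng.normal(1000, 20, size=50), mean, sigma))   # False: usual behavior
print(window_anomaly(rng.normal(1030, 20, size=50), mean, sigma))   # True: the process drifted by 30 g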
View full tip
This blog is about Decision Trees and is aimed at providing the Analytics user with additional information about our default algorithm, the Decision Tree. More specifically, we will clarify which structures build the Decision Tree, understand the purpose of those structures, and finally look at a few examples of the pros and cons of applying Decision Trees. A Decision Tree is a great tool to help us make good decisions based on a huge amount of data. The algorithm maps information provided by the dataset and constructs a tree to predict our goal.

Classification and regression trees are the structures behind the Decision Tree – therefore, when we refer to the Decision Tree we collectively include classification and regression as being part of it. But what is the difference between classification and regression?
1) Classification is used for predicting dependent categorical variables. For example, if you need to predict what type of failure occurs with a machine, or what type of car a person would buy, it would be a classification tree.
2) Regression is used for dependent continuous numerical variables. For example, if you want to predict the amount of sugar in a person's blood, or the price of oil per gallon in 2020, regression is used for the prediction.
Regression addresses predictions where the value can be continuous, while a classification tree predicts the correct label/type for the class. Example of a classification tree:
Keep in mind that it is the goal variable that determines the type of decision tree needed.

Using a Decision Tree is a powerful tool for prediction:
- Easy to understand and interpret.
- Helps us make the best decisions on the basis of existing information.
- Can handle missing values without needing to resort to imputation.

Considerations: As with all analytics models, there are also limitations of the decision tree that users must be aware of. Decision trees can be subject to overfitting and underfitting, particularly when using a small dataset. High correlation between different variables may cause very high model accuracy.
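To make the classification/regression distinction concrete, here is a small scikit-learn sketch; the feature names and values are made up for illustration, and ThingWorx Analytics' internal decision tree implementation may differ.

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: the goal is categorical (e.g. type of machine failure)
X_cls = [[80, 1200], [95, 1500], [60, 900], [99, 1600]]   # hypothetical temperature, rpm
y_cls = ["bearing", "overheat", "none", "overheat"]
clf = DecisionTreeClassifier(max_depth=3).fit(X_cls, y_cls)
print(clf.predict([[97, 1550]]))        # -> a predicted failure type (a label)

# Regression: the goal is a continuous number (e.g. blood sugar level)
X_reg = [[25, 70], [40, 85], [55, 95], [70, 110]]          # hypothetical age, weight
y_reg = [4.8, 5.4, 6.1, 7.2]
reg = DecisionTreeRegressor(max_depth=3).fit(X_reg, y_reg)
print(reg.predict([[60, 100]]))         # -> a predicted numeric value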
View full tip
Steps
1. Get the IP address of the ThingWorx Analytics Server: type ip a.
2. Enter that IP address in the desired web browser (your IP address may be different from the one shown in the picture).
3. Add the port number of the server to the end of the IP address. The default port number is 8080. Make sure to put a colon ":" between the end of the IP address and the start of the port number. The port number could be different in some cases, depending on whether it was configured differently during installation.
4. Hit Enter and the main page will load.
View full tip
Scoring is the process of making a prediction on the basis of the available data. Scoring assigns a predicted outcome to an individual record by running that record's conditions through the trained model. It allows you to request and retrieve individual record-level prediction scores for a defined data set for a set of prediction topics. The accuracy of the score will likely be a direct reflection of the error rate produced by the trained model.

Why the score value can exceed the min or max value range of a feature
There are a few concepts to address with regard to this:

Scoring outputs: It is important to note that when training an analytics model, the method is to create a generalizable model from a relatively small training dataset. By its nature, we expect the training process to see a limited subset and not an exhaustive list of all possible values, due to many constraints, especially time and practicality. As such, these generalized models are expected to handle unseen data in the form of new combinations or values outside of previously observed ranges (more on this below). One common way to see scores that exceed the ranges observed in training, assuming the goals are continuous, is to use prescriptive scoring. Prescriptive scoring attempts to find optimal values for lever (i.e. tunable) features in order to maximize or minimize score values.

Min/Max constraints: these are constraints that are placed upon the inputs for training and the expected inputs for scoring.
For training: If these ranges were provided as part of the upload process, then training will raise exceptions regarding invalid data. However, if the ranges are not provided, they will be inferred from the data and, as such, training will not see values outside of the observed ranges.
For scoring: validation of the ranges will only be performed on the inputs, not the outputs. It is very important to note that the handling of these "constraints" depends on the data type. For categorical data (e.g. colors) and ordinal data (e.g. shirt sizes), the constraints are strict, and data that was not observed in training will raise exceptions during scoring. However, for continuous values (e.g. temperature ranges) these constraints are more informational in nature. For predictive scoring, our code will accept records with values outside of those ranges. The rule of thumb is that values slightly outside these ranges are acceptable, and that as the values stray farther from the ranges, the accuracy of the model degrades very quickly. For prescriptive scoring, these constraints are used to determine the acceptable ranges of values to try when determining the optimal values; values outside of these constraints will NOT be tried.

How to handle goal values while scoring
What should the value of the goal (objective TRUE) column be in new data that will be scored using an existing prediction model?

Dataset used to build the prediction model:
Independent value | goal field
-0.65 | 0
-0.75 | 0
-0.85 | 0
0.85 | 1
0.45 | 1
~~~ | ~~~

New data to be scored:
Independent value | goal field
-0.25 | ??
0.35 | ??
-0.45 | ??
0.95 | ??
0.15 | ??
~~~ | ~~~

Scoring, by its definition, does not take the goal column into consideration when being run. Seeing as the goal column above is a Boolean, we can populate the yet-to-be-scored records with either a 0 or a 1 and it won't matter when it comes to scoring. The sketch below illustrates this.
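As an illustration of why predictive scoring can accept continuous inputs outside the observed ranges, here is a minimal scikit-learn sketch (not ThingWorx Analytics code) using data shaped like the tables above; the extra record at 1.30 is an assumed out-of-range value.

from sklearn.linear_model import LogisticRegression

# Training data similar to the table above: one continuous feature, Boolean goal
X_train = [[-0.85], [-0.75], [-0.65], [0.45], [0.85]]
y_train = [0, 0, 0, 1, 1]
model = LogisticRegression().fit(X_train, y_train)

# Predictive scoring: the goal column of new records is irrelevant and can be
# filled with any placeholder (0 or 1) - only the features are used
new_records = [[-0.25], [0.35], [0.95], [1.30]]   # 1.30 lies outside the observed range
for row in new_records:
    # A prediction is still returned for 1.30, but the farther a value strays
    # from the training range, the less the prediction should be trusted
    print(row, "->", model.predict([row])[0], model.predict_proba([row])[0].round(3))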
View full tip
The ThingWorx Analytics Interactive API Guide is a great way for users to familiarize themselves with ThingWorx Analytics API calls. It even gives users the ability to run jobs through its interface. This blog post covers how to access the ThingWorx Analytics Interactive API Guide installed on a Virtual Machine or standalone server.

Steps
1. Get the IP address of the ThingWorx Analytics Server: type ip a.
2. Enter that IP address in the desired web browser (your IP address may be different from the one shown in the picture).
3. Add the port number of the server to the end of the IP address. The default port number is 8080. Make sure to put a colon ":" between the end of the IP address and the start of the port number. The port number could be different in some cases, depending on whether it was configured differently during installation.
4. Hit Enter and the main page will load. A quick way to verify from a script that the server responds is shown below.
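The Python sketch below assumes the about/versioninfo REST endpoint mentioned in the Linux command tips later in this list, plus a placeholder host; adjust the host, port, and path to your own installation.

import requests

host = "192.168.1.50"   # replace with the IP address returned by "ip a"
port = 8080             # default port; may differ if changed during installation

url = f"http://{host}:{port}/1.0/about/versioninfo"
try:
    response = requests.get(url, timeout=5)
    print(response.status_code, response.text)
except requests.exceptions.ConnectionError:
    print("Server not reachable - check the IP address, port, and firewall settings")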
View full tip
Metrics for model evaluation used in ThingWorx Analytics
In ThingWorx Analytics, we consider different kinds of metrics to evaluate our models. The choice of metric depends entirely on the type of model and the implementation plan for the model. After you are finished building your model, these three metrics will help you evaluate your model's accuracy. Further explanations of the three metrics are given below.

1 - The ROC Curve
To understand what the ROC (Receiver Operating Characteristic) curve is, let's look at the confusion matrix below. We observe that for a probabilistic model, we get a different value for each metric; hence, for each sensitivity, we get a different specificity. The two vary as follows: the ROC curve is the plot of sensitivity against (1 − specificity). (1 − specificity) is also known as the false positive rate, and sensitivity is also known as the true positive rate. Following is the ROC curve for the case in hand. Let's take an example of threshold = 0.5 (refer to the confusion matrix). As you can see, the sensitivity at this threshold is 99.6% and (1 − specificity) is ~60%. This coordinate becomes one point on our ROC curve. To bring this curve down to a single number, we find the area under the curve (AUC). Note that the area of the entire square is 1 × 1 = 1; hence AUC is the ratio of the area under the curve to the total area. For the case in hand, we get an AUC ROC of 96.4%. A few rules of thumb:
.90–1 = excellent (A)
.80–.90 = good (B)
.70–.80 = fair (C)
.60–.70 = poor (D)
.50–.60 = fail (F)
We see that we fall under the excellent band for the current model, but this might simply be overfitting. In such cases, it becomes very important to have in-time and out-of-time validations.
Points to remember: For a model that gives a class as an output, it will be represented as a single point on the ROC plot. Such models cannot be compared with each other, as the judgment needs to be made on a single metric and not using multiple metrics. For instance, a model with parameters (0.2, 0.8) and a model with parameters (0.8, 0.2) can come out of the same model; hence these metrics should not be compared directly.

2 - Root Mean Squared Error (RMSE)
RMSE is the most popular evaluation metric used in regression problems. It follows the assumption that errors are unbiased and follow a normal distribution. Key points to consider about RMSE:
- The 'square root' allows this metric to show large deviations.
- The 'squared' nature of this metric helps deliver more robust results, preventing positive and negative error values from canceling out. In other words, this metric aptly displays the plausible magnitude of the error term.
- It avoids the use of absolute error values, which is highly undesirable in mathematical calculations.
- When we have more samples, reconstructing the error distribution using RMSE is considered more reliable.
- RMSE is highly affected by outlier values. Hence, make sure you have removed outliers from your data set prior to using this metric.
- Compared to mean absolute error, RMSE gives higher weight to, and punishes, large errors.

3 - Pearson Correlation Coefficient
This metric measures how highly correlated two variables are and ranges from −1 to +1. A Pearson Correlation Coefficient of +1 indicates that the data objects are perfectly positively correlated, −1 indicates a perfect negative correlation, and a score of 0 means the data objects are not correlated. In other words, the Pearson Correlation score quantifies how well two data objects fit a line.
There are several benefits to using this type of metric. The first is that the score remains accurate even when the data is not normalized; as a result, this metric can be used when quantities (i.e. scores) vary. Another benefit is that the Pearson Correlation score can correct for any scaling within an attribute while the final score is being tabulated. Thus, objects that describe the same data but use different values can still be compared. The figure below demonstrates how the Pearson Correlation score may appear when graphed: the axes are the scores given by the labeled critics, and the chart shows the similarity of the scores given by both critics for certain items. In essence, the Pearson Correlation score finds the ratio between the covariance and the standard deviations of both objects. In mathematical form, the score can be described as:
r = (N·Σxy − Σx·Σy) / √[(N·Σx² − (Σx)²) · (N·Σy² − (Σy)²)]
In this equation, x and y refer to the data objects and N is the total number of attributes.
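For reference, here is how the three metrics can be computed with common Python libraries, for example to sanity-check a model's validation results; the numbers are toy values, and this is not how ThingWorx Analytics computes them internally.

import numpy as np
from sklearn.metrics import roc_auc_score, mean_squared_error
from scipy.stats import pearsonr

# ROC AUC for a Boolean goal: actual classes vs. predicted probabilities
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.4, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2]
print("AUC:", roc_auc_score(y_true, y_prob))          # 1.0 here; 0.90-1 = "excellent"

# RMSE for a continuous goal: penalizes large errors more than small ones
actual    = np.array([10.0, 12.5, 14.0, 20.0])
predicted = np.array([11.0, 12.0, 15.5, 18.0])
print("RMSE:", np.sqrt(mean_squared_error(actual, predicted)))

# Pearson correlation between predictions and actuals (ranges from -1 to +1)
r, _ = pearsonr(actual, predicted)
print("Pearson r:", r)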
View full tip
This is part of the continuing series of blog posts regarding troubleshooting the application; this article will discuss more advanced issues that some clients and customers have encountered while building or using ThingWorx Analytics.

Packer Script Error – Unable to Download CentOS Image
As the application is developed and built inside a CentOS image, the ThingWorx Analytics Packer Script tool for Virtual Machine Appliance creation utilizes the CentOS mirror repository in the creation process. When the end user is attempting to build the Virtual Machine Appliance with the Packer Script media creation tool, part of the process is to download the CentOS 7 ISO image file as the basis for the operating system that the ThingWorx Analytics Server software will be installed to. If CentOS updates or changes their mirror links for the source file ISO, you may encounter the following error:

==> virtualbox-iso: Downloading or copying Guest additions
virtualbox-iso: Downloading or copying: file:///C:/Program%20Files/Oracle/VirtualBox/VBoxGuestAdditions.iso
==> virtualbox-iso: Downloading or copying ISO
virtualbox-iso: Downloading or copying: file:///local-file-repo/CentOS-7-x86_64-Minimal-1511.iso
virtualbox-iso: Error downloading: open local-file-repo/CentOS-7-x86_64-Minimal-1511.iso: The system cannot find the path specified.
virtualbox-iso: Downloading or copying: http://mirror.spro.net/centos/7/isos/x86_64/CentOS-7-x86_64-Minimal-1511.iso
virtualbox-iso: Error downloading: checksums didn't match expected: 88c0437f0a14c6e2c94426df9d43cd67
==> virtualbox-iso: ISO download failed.
Build 'virtualbox-iso' errored: ISO download failed.
==> Some builds didn't complete successfully and had errors:
--> virtualbox-iso: ISO download failed.
==> Builds finished but no artifacts were created.

Solution
Method 1: Configuration File Replacement
We have created a custom JSON configuration file that resolves the mirror issue for CentOS 7 v1611. You can download the JSON file here; you may have to right-click and "save link as" a JSON extension file. Also note, you will have to save/rename this JSON file as neuron-solo-variables.json. Using this file, navigate to your Packer Script builder directory, usually found in the following path: <PATH>\ThingWorx-Analytics-Server-Standalone\components\vm-builder\neuron-vm-builder
Copy the new JSON file into this directory and replace the current existing copy. You can now re-run the Packer Script for your desired Virtual Machine Appliance output.

Method 2: Manual Configuration File Adjustment
You will have to locate an active mirror for CentOS 7. A list of current active mirrors can be found here. When selecting a mirror, you will need to select the Minimal ISO install, as this is the base image used for the VM creation. Next, open the current neuron-solo-variables.json configuration file located in the <PATH>\ThingWorx-Analytics-Server-Standalone\components\vm-builder\neuron-vm-builder directory. You will have to replace the os_image_download_url value with an active mirror URL from the list above. Next, for the os_iso_md5_checksum variable, you will need to replace the entry with the new SHA256 checksum from CentOS, which can be located here. Default Settings: New Settings: Save changes and close the neuron-solo-variables.json configuration file. CentOS has switched over from MD5 to SHA256 checksums. Even though the variable name contains "MD5", we will be modifying a second JSON configuration file to address this.
In the same directory that we are currently working in, open the neuron-solo.json configuration file. You will need to modify the attribute iso_checksum_type to sha256. Default Settings: New Settings: Save changes and close the neuron-solo.json configuration file. You can now re-run the Packer Script for your desired Virtual Machine Appliance output.
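If you prefer to script the two JSON edits from Method 2 instead of editing the files by hand, the Python sketch below shows the idea. It assumes the attributes mentioned above are top-level keys in their respective files, and the mirror URL and checksum are placeholders you must replace with real values.

import json

vm_builder_dir = "ThingWorx-Analytics-Server-Standalone/components/vm-builder/neuron-vm-builder"

# 1. Point the variables file at an active CentOS 7 mirror and its SHA256 checksum
with open(f"{vm_builder_dir}/neuron-solo-variables.json") as f:
    variables = json.load(f)
variables["os_image_download_url"] = "http://<active-mirror>/centos/7/isos/x86_64/CentOS-7-x86_64-Minimal-1611.iso"
variables["os_iso_md5_checksum"] = "<sha256-checksum-from-centos>"
with open(f"{vm_builder_dir}/neuron-solo-variables.json", "w") as f:
    json.dump(variables, f, indent=2)

# 2. Switch the checksum type from md5 to sha256 in neuron-solo.json
with open(f"{vm_builder_dir}/neuron-solo.json") as f:
    solo = json.load(f)
solo["iso_checksum_type"] = "sha256"
with open(f"{vm_builder_dir}/neuron-solo.json", "w") as f:
    json.dump(solo, f, indent=2)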
View full tip
ThingWorx Analytics Builder - Upload Data   This video walks you through how to upload data and shows the configuration settings. Please be aware that the configuration settings page shown is different in version 8.1.   Updated Link for access to this video:  ThingWorx Analytics Builder: Upload Data
View full tip
This is the second part of Getting Started with ThingWorx Analytics. In this video, we will be using Postman.
During this video you will learn:
- Creating a Dataset
- Entering the Dataset configuration
- Uploading the CSV data file to the TWA Server
Updated Link for access to this video:  Getting Started with ThingWorx Analytics Part-2
View full tip
Signals indicate the predictive strength or weakness of specific features on the goal variable. Use Signals to explore which features are important to predicting outcomes, and which are not. Note: Please be aware that the video states that a model has to be created before Signals can run, but this is no longer the case for version 8.1.   Updated Link for access to this video:  Create Signals In ThingWorx Analytics Builder
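ThingWorx Analytics computes Signals internally, but the underlying idea — ranking each feature by how much it tells you about the goal — can be approximated outside the product. The Python sketch below uses mutual information on made-up data; the feature names are hypothetical and this is not the product's actual algorithm.

import pandas as pd
from sklearn.feature_selection import mutual_info_regression

df = pd.DataFrame({
    "temperature": [61, 74, 68, 90, 83, 77, 95, 66],
    "vibration":   [0.2, 0.9, 0.4, 1.5, 1.1, 0.8, 1.7, 0.3],
    "shift":       [1, 2, 1, 3, 2, 2, 3, 1],
    "hours_to_failure": [120, 60, 95, 20, 40, 55, 10, 110],   # goal variable
})

X = df.drop(columns="hours_to_failure")
y = df["hours_to_failure"]

# Higher score = the feature carries more information about the goal
scores = mutual_info_regression(X, y, random_state=0)
for feature, score in sorted(zip(X.columns, scores), key=lambda s: -s[1]):
    print(f"{feature}: {score:.3f}")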
View full tip
In this video you will see how to start using your already created virtual image of ThingWorx Analytics using Oracle VirtualBox. This video is Part 1 of the series Getting Started with ThingWorx Analytics.   Updated Link for access to this video:  Getting Started with ThingWorx Analytics: Part 1 of 2
View full tip
In this video we review the prerequisites needed prior to installing ThingWorx Analytics Server 52.1.   Updated Link for access to this video:  Installing ThingWorx Analytics Server: Part 1 - Prerequisites
View full tip
In this video we cover the different configuration steps required for the ThingWorx Analytics Builder extension. This video applies to ThingWorx Analytics 52.1 through 8.1.
Note:
- This video uses Classic Composer; the same operations can be done using the New Composer starting with version 8.0, as illustrated in the Help Center.
- For release 8.1, the Settings menu differs from previous versions; see Video Link: 2079 between times 00:12 and 00:40 for the up-to-date menu selection.
Updated Link for access to this video:  Installing ThingWorx Analytics Builder: Part 2 of 3
View full tip
Attached is a description of Ensemble Learning Techniques.
View full tip
Best Practices in Data Preparation for ThingWorx Analytics

Data preparation is an important phase in the process of data analysis when using ThingWorx Analytics. Essentially, it means getting your data from the raw data you might have gathered through your operational system or from your data warehouse into the kind of data that is ready to be analyzed. In this document we will use "Talend Data Preparation Free Desktop" as a tool to illustrate some examples of the data preparation process. This tool can be downloaded from the following link: https://www.talend.com/products/data-preparation (you could also choose to use another tool). We will also use the BeanPro dataset in our examples and illustrations.

Checking data formats
The analysis starts with a raw data file. The user needs to make sure that the data files can be read. Raw data files come in many different shapes and sizes. For example, spreadsheet data is formatted differently from web data or data collected from sensors, and so forth. In ThingWorx Analytics the accepted data format is CSV, so the retrieved data needs to be converted to that format before it can be uploaded to TWA. Data example (BeanPro dataset used): After that is done, the user needs to actually look at what each field contains. For example, a field listed as a character field could actually contain non-character data.

Verify data types
Verify the data types for each feature or field in the dataset used. All data falls into one of four categories that affect what sort of analytics can be applied to it:
- Nominal data is essentially just a name or an identifier.
- Ordinal data puts records into order from lowest to highest.
- Interval data represents values where the differences between them are comparable.
- Ratio data is like interval data except that it also allows for a value of 0.
It is important to understand which categories your data falls into before you feed it into ThingWorx Analytics. For example, when doing predictive analytics, TWA will not accept a nominal data field as the goal; the goal feature data has to be of a numerical, non-nominal type, so this needs to be confirmed at an early stage.

Creating a data dictionary
A data dictionary is a metadata description of the features included in the dataset. When displayed it consists of a table with 3 columns:
- The first column represents a label: that is, the name of a feature, or a combination of multiple (up to 3) features which are fields in the dataset used. It points to "fieldname" in the configuration json file.
- The second column is the data type value attached to the label (Integer, String, Datetime, ...). It points to "dataType" in the configuration json file.
- The third column is a description of the feature related to the label used in the first column. It points to "description" in the configuration json file.
In the context of TWA this metadata is represented by a data configuration "json" file that is uploaded before the dataset itself is uploaded. A sample of the BeanPro dataset configuration file is below:

Verify data accuracy
Once it is confirmed that the data is formatted in a way that is acceptable to TWA, the user still needs to make sure it is accurate and that it makes sense. This step requires some knowledge of the subject area the dataset relates to. There isn't really a cut-and-dried approach to verifying data accuracy. The basic idea is to formulate some properties that you think the data should exhibit and test the data to see whether those properties hold. Are stock prices always positive? Do all the product codes match the list of valid ones? Essentially, you're trying to figure out whether the data really is what you've been told it is.

Identifying outliers
Outliers are data points that are distant from the rest of the distribution. They are either very large or very small values compared with the rest of the dataset. Outliers are problematic because they can seriously compromise the training models that TWA generates. A single outlier can have a huge impact on the value of the mean. Because the mean is supposed to represent the center of the data, in a sense this one outlier renders the mean useless. When faced with outliers, the most common strategy is to delete them. Example of the effect of an outlier in the feature "AVG Technician Tenure" in the BeanPro dataset: Dataset with no outlier: Dataset with outlier:

Deal with missing values
Missing values are one of the most common (and annoying) data problems you will encounter. In TWA, dealing with null values is done by one of the following methods:
- Dropping records with missing values from your dataset. The problem with this is that missing values are frequently not just random little data glitches, so this should be considered the last option.
- Replacing the null values with the average of the values from the other records of the same field to fill in the missing value.

Transforming the dataset
- Selecting only certain columns to load, or ignoring records where a required value is not present (e.g. salary = null).
- Translating coded values (e.g., if the source system codes male as "1" and female as "2", but the warehouse codes male as "M" and female as "F").
- Deriving a new calculated value (e.g., sale_amount = qty * unit_price).
- Transposing or pivoting (turning multiple columns into multiple rows or vice versa).
- Splitting a column into multiple columns (e.g., converting a comma-separated list, specified as a string in one column, into individual values in different columns).
A few of these preparation steps are sketched in pandas at the end of this post.

Please note that: Issues with Talend should be reported to the Talend team. Data preparation is outside the scope of PTC Technical Support, so please use this article as an advisory best-practices document.
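Here is a short pandas sketch of a few of the preparation steps described above (outlier removal, filling missing values, deriving a calculated column, translating coded values). The column names are hypothetical stand-ins, not the actual BeanPro fields, and the thresholds are illustrative.

import pandas as pd

df = pd.read_csv("raw_data.csv")

# Check data formats and types before anything else
print(df.dtypes)

# Identify and drop outliers: e.g. keep rows within +/- 3 standard deviations
col = "avg_technician_tenure"          # hypothetical column name
mean, sigma = df[col].mean(), df[col].std()
df = df[(df[col] - mean).abs() <= 3 * sigma]

# Deal with missing values: fill nulls with the column average
df[col] = df[col].fillna(df[col].mean())

# Derive a new calculated value, as in the sale_amount example above
df["sale_amount"] = df["qty"] * df["unit_price"]

# Translate coded values (1/2 -> M/F, as in the example above)
df["gender"] = df["gender"].map({1: "M", 2: "F"})

# Save in the CSV format expected by ThingWorx Analytics
df.to_csv("prepared_data.csv", index=False)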
View full tip
This guide contains all the Linux commands that you may have to use for ThingWorx Analytics installation or day-to-day use.

Network/port
- ip a — List the IPs of all of the network interfaces
- ssh — Jump from one machine to another
- ping — Send packets to a remote machine; useful for testing connectivity
- netstat -anp — Check active ports
- cat < /dev/tcp/localhost/8080 — Test connection to a port. Replace localhost with the desired hostname or IP, and 8080 with the desired port number (/dev/tcp/host/port)
- exit — Exit the current sign-in; lets one disconnect from remote ssh sessions, or from a changed user (e.g. after switching to root)
- scp — Retrieve something via ssh

Resource Usage
- free -m — Check memory (-m is for output in MB)
- mpstat -P ALL — CPU usage
- top — Process usage
- jvmtop — Collect CPU usage of the JVM and its threads; https://github.com/patric-r/jvmtop (requires a JDK to be installed)

File Interaction
- cp / mv — Copy and move respectively (mv just deletes the source file). Usage: cp /source/location/file /output/location/file
- cat — Mostly used to print the contents of a file to the command line; can also print multiple files at once: cat /var/log/gridworker/warning.log /var/log/gridworker/error.log
- vim / vi — A command-line text editor. Not the most user-friendly (none of them are) but really useful. Cheat sheet: https://www.fprintf.net/vimCheatSheet.html
- rm — Remove files
- chmod — Change the access permissions of files
- chown — Change the user or group ownership of files
- grep — Text-based filtering. Useful for making a larger list smaller and more targeted. Almost always used after a pipe (see pipe below)
- less — Generally used to view the contents of a file with more friendly scrolling
- locate — Find a file by name

Directory
- ls — List what's in the directory. Lists the current directory by default, but you can also pass a directory, e.g. ls /var/log/tomcat. Black writing is files, blue writing is directories, red writing is compressed files
- pwd — Show which directory you are currently in
- cd — Change directory. Provide the directory to change to, or just use cd to return to the user's home directory
- clear — Clears the screen; the terminal clears all provided commands
- mkdir — Creates directories

Running Processes
- ps — Query what services are running. Usually use ps -aux to get a full, sorted list; using grep with this is helpful
- systemctl — The correct way to interact with services that are running

Package installation
- yum install <packageName> — Install a package. More useful commands: https://www.centos.org/docs/5/html/5.1/Deployment_Guide/s1-yum-useful-commands.html
- yum list installed — List installed packages
- yum list <package> — List available packages
- yum --showduplicates list java-1.7.0-openjdk-devel — Use --showduplicates to see all versions; can use * for the package name: *openjdk*
- rpm -ql <packagename> — Find where packages are installed (works if the package was installed with yum)
- yumdownloader --urls <packageName> — Find the URL a package is downloaded from (requires the yum-utils package)
- repoquery --requires <packageName> — Find dependencies of a package (requires the yum-utils package)
- repoquery --qf=%{name} -g --list --grouppkgs=all [groups] | xargs repotrack -a x86_64 -p /repos/Packages — Download a package with all its dependencies (requires the yum-utils package). From <http://unix.stackexchange.com/questions/50642/download-all-dependencies-with-yumdownloader-even-if-already-installed>

Other Commands
- curl http://localhost:8080/1.0/about/versioninfo — Send a REST call via the command line. Use -X POST (the default is GET) for a POST (see the man page https://curl.haxx.se/docs/manual.html for an example). See also http://www.codingpedia.org/ama/how-to-test-a-rest-api-from-command-line-with-curl/
- find / -type f -exec grep -I mystring {} \; — Search for a string in files
- sudo -u user command — Execute a command as a different user

The below helpers are not commands themselves, but can be used in conjunction with the above commands.
- 'pipe' — The | character; lets one chain commands, e.g. ps -aux | grep java
- ./ — The shorthand way to refer to this directory explicitly
- ../ — The shorthand way to refer to the parent directory
- 'tab completion' — Pressing Tab will let Linux guess what command/option best fits what is currently written; very useful for navigating directories and long-named files (note: not necessarily Tab, depending on one's keyboard layout/language)
- 'ctrl-r' — Look up the most likely command that matches what one is typing; so if one earlier used ps -aux | grep java | less and then hits ctrl-r and types -aux, it would likely pull that command, or at least the most recent one that matches
View full tip
Best Practices in Data Preparation for ThingWorx Analytics
View full tip
In ThingWorx Analytics, you have the possibility to use an external model for scoring. In this written tutorial, I would like to provide an overview of how you can use a model developed in Python, using the scikit-learn library, in ThingWorx Analytics. The provided attachment contains an archive with the following files:
- iris_data.csv: A dataset for pattern recognition that has a categorical goal. You can click here to read more about this dataset.
- TestRFToPmml.ipynb: A Jupyter notebook file with the source code for the Python model as well as the steps to export it to PMML.
- RF_Iris.pmml: The PMML file with the model that you can directly upload in Analytics without going through the steps of training the model in Python.
The tutorial assumes you already have some knowledge of ThingWorx and ThingWorx Analytics. Also, if you plan to run the Python code and train the model yourself, you need to have Jupyter Notebook installed (I used the one from the Anaconda distribution).

For demonstration purposes, I have created a very simple random forest model in Python. To convert the model to PMML, I have used the sklearn2pmml library. Because ThingWorx Analytics supports PMML format 4.3, you need to install sklearn2pmml version 0.56.2 (the highest version that supports PMML 4.3). To read more about this library, please click here. Furthermore, to use your model with the older version of sklearn2pmml, I have installed scikit-learn version 0.23.2. You will find the commands to install the two libraries in the first two cells of the notebook.

Code Walkthrough
The first step is to import the required libraries (please note that the pandas library is also required to transform the .csv to a DataFrame object):

import pandas
from sklearn.ensemble import RandomForestClassifier
from sklearn2pmml import sklearn2pmml
from sklearn.model_selection import GridSearchCV
from sklearn2pmml.pipeline import PMMLPipeline

After importing the required libraries, we convert iris_data.csv to a pandas DataFrame and then create the features (X) as well as the goal (y) vectors:

iris_df = pandas.read_csv("iris_data.csv")
iris_X = iris_df[iris_df.columns.difference(["class"])]
iris_y = iris_df["class"]

To best tune the random forest, we will use GridSearchCV and cross-validation. We want to test which parameters have the best validation metrics and, for this, we will use a utility function that prints the results:

def print_results(results):
    print('BEST PARAMS: {}\n'.format(results.best_params_))
    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))

We create the random forest model and train it with different numbers of estimators and maximum depths. We then call the previous function to compare the results for the different parameters:

rf = RandomForestClassifier()
parameters = {
    'n_estimators': [5, 50, 250],
    'max_depth': [2, 4, 8, 16, 32, None]
}
cv = GridSearchCV(rf, parameters, cv=5)
cv.fit(iris_X, iris_y)
print_results(cv)

To convert the model to a PMML file, we need to create a PMMLPipeline object, in which we pass the RandomForestClassifier with the tuning parameters we identified in the previous step (please note that in your case the parameters can be different from those in my example).
You can check the sklearn2pmml documentation to see other examples of creating this PMMLPipeline object:

pipeline = PMMLPipeline([
    ("classifier", RandomForestClassifier(max_depth=4, n_estimators=5))
])
pipeline.fit(iris_X, iris_y)

Then we perform the export:

sklearn2pmml(pipeline, "RF_Iris.pmml", with_repr = True)

The model has now been exported as a PMML file in the same folder as the Jupyter Notebook file, and we can upload it to ThingWorx Analytics.

Uploading and Exploring the PMML in Analytics
To upload and use the model for scoring, there are two steps you need to perform:
- First, the PMML file needs to be uploaded to a ThingWorx File Repository.
- Then, go to your Analytics Results Thing (the name should be YourAnalyticsGateway_ResultsThing) and execute the service UploadModelFromRepository. Here you will need to specify the repository name and path for your PMML file, as well as a name for your model (and optionally a description).
If everything goes well, the result of the service will be an id. Save this id because you will use it later on. You can verify the status of this model and whether it is ready to use by executing the service GetDetails.
Assuming you want to use the PMML for scoring, but you were not the one who developed the model, you may not know what the expected inputs and the output of the model are. There are two services that can help you with this:
- QueryInputFields – to verify the fields expected as input parameters for a scoring job
- QueryOutputFields – to verify the expected output of the model
The resultType input parameter can be either MODELS or CLUSTERS, depending on the type of model.

Using the PMML for Scoring
With all this information at hand, we are now ready to use this PMML for real-time scoring. In a Thing of your choice, define a service to test out the scoring for the PMML we have just uploaded. Create a new service with an infotable as the output (don't add a datashape). The input data for scoring will be hardcoded in the service, but you can also add it as service input parameters and pass them via a Mashup or from another source. The script will be as follows:

// Values: INFOTABLE dataShape: ""
let datasetRef = DataShapes["AnalyticsDatasetRef"].CreateValues();
// Values: INFOTABLE dataShape: ""
let data = DataShapes["IrisData"].CreateValues();
data.AddRow({
    sepal_length: 2.7,
    sepal_width: 3.1,
    petal_length: 2.1,
    petal_width: 0.4
});
datasetRef.AddRow({ data: data });
// predictiveScores: INFOTABLE dataShape: ""
let result = Things["AnalyticsServer_PredictionThing"].RealtimeScore({
    modelUri: "results:/models/" + "97471e07-137a-41bb-9f29-f43f107bf9ca", // replace with your own id
    datasetRef: datasetRef /* INFOTABLE */
});

Once you execute the service, the output should look like this (as we would have expected, according to the output fields in the PMML model). As you have seen, it is easy to use a model built in Python in ThingWorx Analytics. Please note that you may use it only for scoring, and the model will not appear in Analytics Builder since you created it on a different platform. If you have any questions about this brief written tutorial, let me know.
View full tip
With ThingWorx, we can already use univariate anomaly alerts (on a single sensor value). However, in many situations, the readings from an individual sensor may not tell you much about the overall issue, and a multivariate anomaly detector can be more useful. This post is intended to provide an overview of the Azure Anomaly Detector and how it can be integrated with ThingWorx. The attachment contains:
- A document with detailed instructions about the setup;
- A .csv file with the multivariate timeseries dataset;
- A .twx file with some entities that need to be imported in ThingWorx, as well as the CSVParser extension that needs to be installed;
- A .zip file that will need to be uploaded to an Azure Blob Container at some point in the setup.
View full tip
Hi, I have attached a Postman collection; it can be used as a template and modified. Steps to upload the collection to Postman:
1. In your Postman window, click Import.
2. Once you have clicked Import, you can choose your file.
3. The collection is now visible on the left side of the window.
View full tip
Analytics projects typically involve using the Analytics API rather than Analytics Builder to accomplish different tasks. The attached documentation provides examples of code snippets that can be used to automate the most common analytics tasks on a project, such as:
- Creating a dataset
- Training a model
- Real-time scoring (predictive and prescriptive)
- Retrieving the validation metrics for a model
- Appending additional data to a dataset
- Retraining the model
The documentation also provides examples that are specific to time series datasets. The attached .zip file contains both the document and some entities that you need to import in ThingWorx to access the services provided in the examples. A rough outline of what such automation can look like is sketched below.
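For orientation only, the sketch below shows the shape such automation can take with Python's requests library. The endpoint paths and payload fields are placeholders, not the documented ThingWorx Analytics API; refer to the attached document and your server's Interactive API Guide for the real routes and parameters.

import requests

# All routes below are hypothetical placeholders - look up the real routes in the
# attached document or in the server's Interactive API Guide.
base = "http://analytics-server:8080"   # adjust host and port to your installation

def create_dataset(name, metadata_file, data_file):
    # Hypothetical helper: register a dataset, then upload its CSV data
    with open(metadata_file, "rb") as m, open(data_file, "rb") as d:
        resp = requests.post(f"{base}/datasets",            # placeholder route
                             data={"datasetName": name},
                             files={"metadata": m, "data": d})
    resp.raise_for_status()
    return resp.json()                                      # typically a dataset or job identifier

def train_model(dataset_id, goal_field):
    # Hypothetical helper: submit a training job and return its identifier
    resp = requests.post(f"{base}/training",                # placeholder route
                         json={"datasetRef": dataset_id, "goalField": goal_field})
    resp.raise_for_status()
    return resp.json()

dataset_id = create_dataset("beanpro", "beanpro_config.json", "beanpro_data.csv")
job_id = train_model(dataset_id, "goal")
print(job_id)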
View full tip