ThingWorx data hygiene

nquirindongo
10-Marble

Is there any ThingWorx built-in API or services for data hygiene?

 

6 REPLIES
PaiChung
22-Sapphire I
(To:nquirindongo)

Could you define what you mean by 'data hygiene', please? Thanks!

LH_9794858
5-Regular Member
(To:PaiChung)

Hi, I'm working with Nelson who originally posed the question about data hygiene.

 

Commonly, raw data contains errors or is incomplete, duplicated, or incorrect. Having a data hygiene process to clean the data is common in machine learning. Data hygiene covers error handling, standardization, normalization, missing data, and duplicate data. It's also important to suppress data that doesn't provide value. In Python, libraries such as pandas and NumPy help with this: handling missing values (NaN), removing whitespace, and checking the unique values of columns are just a few basic techniques for cleaning data.
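
For readers who have not used these libraries, here is a minimal pandas sketch of the basic techniques mentioned above. The column names and sample values are made up for illustration and are not taken from any real ThingWorx export.

```python
import io
import pandas as pd

# Hypothetical sensor export; column names and values are invented for this sketch.
raw_csv = io.StringIO(
    "machine,temperature,status\n"
    " pump-1 ,71.5,OK\n"
    "pump-1,71.5,OK\n"
    "pump-2,,ok \n"
    "pump-3,68.0,FAULT\n"
)
df = pd.read_csv(raw_csv)

# Remove surrounding whitespace and standardize casing in the text columns.
df["machine"] = df["machine"].str.strip()
df["status"] = df["status"].str.strip().str.upper()

# Count missing values (NaN) per column.
missing_per_column = df.isna().sum()

# Check the unique values of a column to spot inconsistent labels.
unique_statuses = sorted(df["status"].unique())

# Drop exact duplicate rows, then drop rows with a missing temperature.
clean = (
    df.drop_duplicates()
    .dropna(subset=["temperature"])
    .reset_index(drop=True)
)

print(missing_per_column)
print(unique_statuses)
print(clean)
```

Whether to drop rows with missing values or fill them in depends on the dataset; dropping is used here only because it is the simplest option to show.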

 

The ThingWorx documentation describes how to handle missing data. What other data hygiene features does ThingWorx have?

PaiChung
22-Sapphire I
(To:LH_9794858)

Similar to what you describe doing with Python, you would do the same kinds of things within ThingWorx, or even at the Edge before transmitting data: use Services to detect those issues and resolve them before sending the data on to Analytics or elsewhere.

You can use the 'InfoTable' Services and JavaScript to do this.

VladimirRosu

In addition to Pai's reply, I want to add that, generally speaking, ThingWorx is not designed to be an ETL tool, which is where you'd typically find all these libraries. If the services you build to clean data are not fast enough (or would require building extensions to cover functionality like NumPy's; an equivalent in Java is https://github.com/mikera/vectorz ), then I suggest exporting the data straight from ThingWorx in CSV format and then applying the tools you know to produce a clean output.
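
As a rough illustration of that CSV route, here is a minimal sketch using only the Python standard library. The column names and sample rows are assumptions, and an in-memory string stands in for the exported file; with a real export you would open the file with `open(path, newline="")` instead.

```python
import csv
import io


def clean_rows(reader):
    """Yield rows with trimmed values, skipping incomplete and duplicate rows."""
    seen = set()
    for row in reader:
        # Trim whitespace; treat absent fields (None) as empty strings.
        trimmed = {key: (value or "").strip() for key, value in row.items()}
        # Skip rows where any field is missing.
        if any(value == "" for value in trimmed.values()):
            continue
        # Skip rows already seen (exact duplicates after trimming).
        fingerprint = tuple(trimmed.items())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        yield trimmed


# Stand-in for a CSV file exported from ThingWorx (hypothetical columns).
exported = io.StringIO(
    "timestamp,sensor,value\n"
    "2023-07-13T10:00:00, temp_1 ,21.4\n"
    "2023-07-13T10:00:00,temp_1,21.4\n"
    "2023-07-13T10:01:00,temp_1,\n"
    "2023-07-13T10:02:00,temp_1,21.9\n"
)
reader = csv.DictReader(exported)
cleaned = list(clean_rows(reader))

# Write the cleaned rows back out as CSV, ready to load as training data.
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=reader.fieldnames)
writer.writeheader()
writer.writerows(cleaned)
print(output.getvalue())
```

The sketch assumes every row has the same number of fields as the header; rows with extra fields would need separate handling.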

Are you using ThingWorx Analytics after this step?

LH_9794858
5-Regular Member
(To:VladimirRosu)

Yes, we are using ThingWorx Analytics after data hygiene (cleaning).

 

What process/steps have you used in the past to prepare the data for ThingWorx ML? Does PTC have any JavaScript or extensions that you're able to provide?

 

Is there a way to integrate Python with ThingWorx?

VladimirRosu

Clear. As stated before, we usually use the external tools we're familiar with (Jupyter Notebook, etc.) before adding training data to ThingWorx Analytics.

You can use the many snippets available in ThingWorx services (see below) to perform this type of cleaning, but they don't reach the speed that tools like those offer. That does not mean they don't work, just that people in ML land are far more familiar with tools like Jupyter.

Snippets available in any service editor:

[Screenshot: VladimirRosu_0-1689242061915.png]

One last thing: typically the ETL process is what takes the most time. In what I have seen, it takes 60-80% of the total time spent on Analytics.

Actually using ThingWorx Analytics is usually a very easy process that takes much less time (just load the training dataset, set your goal, and let the system crunch; rinse and repeat whenever needed).

Don't be shy about using external ETL tools to process/clean your data. As I said, ThingWorx Analytics itself is not intended to be a replacement for an ETL tool.
