Solved: Re: Entity ID in time series scoring

precisionlabs · ‎Apr 12, 2024

This article defines the entity id as follows: https://community.ptc.com/t5/IoT-Tips/Considerations-for-Handling-Time-Series-Data/ta-p/818763

"ENTITY_ID [is] the identifier for an entity, such as a machine serial number. The ENTITY_ID field should remain the same as long as there are no missing timestamps and it is within the same asset but should be different for different assets or asset runs in order to accurately assign history during model training and scoring."

"If there are gaps in the time series data, it is recommended to restart the series after the gap as a new entity."

This makes perfect sense to me: to avoid mixing training data from different machines or different runs, you separate the dataset with the entity id label. In my case, I have only one machine/system, but several different runs spanning a large time window. I would therefore assign a different entity id to each of these runs.

My doubt comes when asking for predictions. The scoring dataset needs to include an entity id. This makes total sense when the entity id separates different assets; it's basically another feature/label. But in my case, which entity id should I pass for scoring?

For example, say I have data from 3 runs on 3 different days, with a big time gap between them. In the training dataset I need to assign an entity id to each one, let's say run1, run2 and run3. When scoring in the future, which entity id should I use: run1, run2 or run3? Why would I choose one over the other if they were only separated to avoid mixing runs?
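For reference, the labelling described in this post (a new entity id after every gap) can be sketched in plain Python. This is only an illustration of the data preparation step, not anything ThingWorx Analytics does for you; the function name `assign_entity_ids`, the `run` prefix and the one-minute sampling interval are assumptions made up for the example.

```python
from datetime import datetime, timedelta


def assign_entity_ids(timestamps, expected_step, prefix="run"):
    """Label each timestamp with an entity id, starting a new id after any gap.

    `timestamps` must be sorted ascending. A gap is any step between
    consecutive timestamps that is larger than `expected_step`.
    """
    entity_ids = []
    run_number = 0
    previous = None
    for ts in timestamps:
        # The first row, or a row that follows a gap, opens a new run.
        if previous is None or ts - previous > expected_step:
            run_number += 1
        entity_ids.append(f"{prefix}{run_number}")
        previous = ts
    return entity_ids


# Hypothetical data: two runs sampled every minute, one day apart.
step = timedelta(minutes=1)
day1 = [datetime(2024, 4, 1, 8, 0) + i * step for i in range(3)]
day2 = [datetime(2024, 4, 2, 8, 0) + i * step for i in range(2)]

labels = assign_entity_ids(day1 + day2, expected_step=step)
print(labels)  # ['run1', 'run1', 'run1', 'run2', 'run2']
```

The resulting labels would go into the ENTITY_ID column of the training dataset before import.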

Rocko · ‎Apr 15, 2024

I haven't tried it, but I think that in this case the Entity_ID can be the same for all sets, because you won't have timestamp collisions. ENTITY_ID makes the records unique in case multiple machines are being logged at the same time. Maybe you can try using a constant entity_id in training (or leaving it out in the first place).

If that doesn't work because the gaps are too large, there should be no difference in which Entity ID you pass in for scoring: You trained one model which should be valid for all the machines/series, not one model per machine/series, all packed into one. Hence it shouldn't make a difference - would be my assumption. But that's also quick to try out if you already have the model.

precisionlabs · ‎Apr 15, 2024

Hi @Rocko, thanks for your answer. I just tried a couple of things. The entity id must be present and unique for every run if there are gaps in time. If I set a constant entity id, training fails with a warning that the time sampling is not consistent with what was configured in the dataset import.

"there should be no difference in which Entity ID you pass in for scoring" I would think so too, but confirmation would be great. You also say it'd be quick to try; how do you suggest I go about that? I could score the same data with the same model using different entity ids, but how will I know if it makes a difference or not? I must be missing something.

Rocko · ‎Apr 16, 2024

Well, maybe I'm missing something, but I would have scored the same sample data with two different entity ids and checked whether the goal value differs. If it doesn't, the entity_id didn't make a difference.
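If the two sets of scoring results are exported, that comparison can be done mechanically. A rough sketch, assuming each result set is a list of rows (dicts) in the same order; the field names `entity_id`, `timestamp` and `goal_prediction` and the numbers are made up for illustration and are not the actual ThingWorx Analytics output format.

```python
def predictions_match(results_a, results_b, id_field="entity_id", tolerance=1e-9):
    """Return True if two result sets agree on every field except the entity id."""
    if len(results_a) != len(results_b):
        return False
    for row_a, row_b in zip(results_a, results_b):
        # Compare every field present in either row, ignoring the entity id.
        keys = (set(row_a) | set(row_b)) - {id_field}
        for key in keys:
            a, b = row_a.get(key), row_b.get(key)
            if isinstance(a, float) and isinstance(b, float):
                if abs(a - b) > tolerance:
                    return False
            elif a != b:
                return False
    return True


# Made-up rows standing in for two scoring jobs on the same sample data.
scored_as_run1 = [
    {"entity_id": "run1", "timestamp": "2024-04-16T08:00:00", "goal_prediction": 0.75},
    {"entity_id": "run1", "timestamp": "2024-04-16T08:01:00", "goal_prediction": 0.5},
]
scored_as_other = [
    {"entity_id": "anything", "timestamp": "2024-04-16T08:00:00", "goal_prediction": 0.75},
    {"entity_id": "anything", "timestamp": "2024-04-16T08:01:00", "goal_prediction": 0.5},
]

same = predictions_match(scored_as_run1, scored_as_other)
print(same)  # True: only the entity id differs between the two result sets
```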

precisionlabs · ‎Apr 16, 2024

Came back here to report exactly that. I tried to score the same data with 2 different entity ids and it gives me the exact same results (actually the entity id passed for scoring doesn't even need to be one from the training dataset, it could be anything, but needs to be present in the scoring dataset). It just uses the scoring entity id as an identifier in the prediction results. Screenshots of both tests attached.

"You trained one model which should be valid for all the machines/series, not one model per machine/series, all packed into one." This was the key insight and after testing I can confirm that's accurate, thank you @Rocko.

d_kessler · ‎Apr 19, 2024

Hi all,

To add a little more information and context, I will expand on what Entity ID is intended to do. It helps ThingWorx Analytics understand which data belongs to which machine, and which chunk of data is continuous for each machine. This is important because ThingWorx Analytics uses the look back window to automatically engineer features that are used in training the model, as well as for making predictions. Without Entity ID, all data would just be pushed together, and the look back window calculations would end up using mixed data. It is quite possible that the beginning of a single run behaves differently from the end of that run, so it is important to keep the runs segregated.
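To make the "mixed data" point concrete, here is a toy illustration in Python. It is not the feature engineering ThingWorx Analytics actually performs; it just uses a simple mean over the previous `window` values as a stand-in look back feature, with made-up run names and values, to show what changes when rows are grouped by entity id.

```python
from collections import defaultdict, deque


def lookback_means(rows, window, use_entity_id=True):
    """Mean of the previous `window` values for each row.

    `rows` is a list of (entity_id, value) pairs in time order. The feature
    is None until `window` earlier values are available in the same group.
    """
    histories = defaultdict(lambda: deque(maxlen=window))
    features = []
    for entity_id, value in rows:
        # With use_entity_id=False every row shares one history (mixed data).
        key = entity_id if use_entity_id else "all"
        history = histories[key]
        features.append(sum(history) / window if len(history) == window else None)
        history.append(value)
    return features


# Made-up values: two runs that operate at very different levels.
rows = [("run1", 10.0), ("run1", 12.0), ("run1", 14.0),
        ("run2", 100.0), ("run2", 102.0), ("run2", 104.0)]

separated = lookback_means(rows, window=2)
mixed = lookback_means(rows, window=2, use_entity_id=False)

print(separated)  # [None, None, 11.0, None, None, 101.0]
print(mixed)      # [None, None, 11.0, 13.0, 57.0, 101.0]
```

In the mixed case, the first rows of run2 get features computed from the tail of run1 (13.0 and 57.0), which is exactly the contamination that separate entity ids prevent.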

It is important that each run used for scoring has a unique entity ID. It doesn't matter what you call the new entity IDs, but they need to be unique. This way, when scoring, ThingWorx Analytics performs the look back window calculations using a single run from a single machine. This ensures that only relevant data is used for scoring, not data from other machines or from other times in the past.

It is true that multiple historical runs from multiple machines can be used together to train a ThingWorx Analytics time series model. This single model is then used to make predictions for multiple runs and/or multiple machines in the future. Entity ID ensures that the data is properly segregated and that the calculations, and therefore the predictions, are correct.