This blog post covers the 4 sampling strategies that are available in ThingWorx Analytics. For each strategy, it explains how the sampling runs behind the scenes, when you may want to use it, and what its pros and cons are.
Sampling with replacement is not often used by professionals, but it can still be useful in certain circumstances. When you sample with replacement, each value that you randomly select is returned to the sample pool before the next selection. So there is a chance that the same record appears multiple times in your sample.
Let’s say you have a hat that contains 3 cards, each with a different person’s name on it: Tom, Sarah, and John.
Let’s say you make 2 random selections. On the first selection, you pull out the name Tom. When you sample with replacement, you put the Tom card back into the hat and then randomly select a card again. On your second selection, it is possible to get another name, like Sarah, or the same one you selected before, Tom.
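The hat example for sampling with replacement can be sketched in a few lines of Python. This is only an illustration of the idea using the standard library, not how ThingWorx Analytics implements it internally; the function name `sample_with_replacement` and the `seed` parameter are assumptions made for this example.

```python
import random

def sample_with_replacement(pool, k, seed=None):
    """Draw k items; every drawn item goes back into the pool before the next draw."""
    rng = random.Random(seed)  # optional seed makes the draws repeatable
    # random.choices picks each item independently, so repeats are possible
    return rng.choices(pool, k=k)

hat = ["Tom", "Sarah", "John"]

# Two draws: the second draw can return the same name as the first
print(sample_with_replacement(hat, 2))

# Ten draws from three cards must contain repeated names
print(sample_with_replacement(hat, 10))
```

Because every card is returned to the hat, the sample can even be larger than the pool itself.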
Sampling without replacement is the default setting in ThingWorx Analytics and the sampling strategy most commonly used by professionals. With this strategy, a value that is randomly selected from the sample pool is not returned to it. This ensures that all the values selected for the sample are unique.
Let’s say you have a hat that contains 3 cards, each with a different person’s name on it: Tom, Sarah, and John.
Let’s say you make 2 random selections. On the first selection, you pull out the name Tom. When you sample without replacement, you randomly select a card from the hat again without putting the Tom card back. On your second selection, you could only get the Sarah or John card.
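The same hat example without replacement might look like the sketch below. Again, this only illustrates the concept with Python's standard library; `sample_without_replacement` and `seed` are hypothetical names chosen for the example.

```python
import random

def sample_without_replacement(pool, k, seed=None):
    """Draw k distinct items; a drawn item is never returned to the pool."""
    rng = random.Random(seed)  # optional seed makes the draws repeatable
    # random.sample never picks the same position in the pool twice
    return rng.sample(pool, k)

hat = ["Tom", "Sarah", "John"]

# Two draws: the two names are always different
print(sample_without_replacement(hat, 2))

# Asking for more cards than the hat contains is an error
try:
    sample_without_replacement(hat, 4)
except ValueError as err:
    print("Cannot draw 4 cards from 3:", err)
```

Unlike sampling with replacement, the sample can never be larger than the pool.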
Upsampling is useful when the desired goal outcome is underrepresented in the dataset. The records that represent the desired outcome of the goal are copied multiple times so that they make up a larger share of the total dataset.
Let’s say you are trying to discover if a patient is at risk of developing a rare condition, like chronic kidney failure, that affects around 0.5% of the US population. In this case, the most accurate model that could be generated would say that no one will get this condition, and according to the numbers, it would be right 99.5% of the time. But in reality, this is not helpful at all for the use case, since you want to know if the patient is at risk of developing the condition.
To prevent this, copies are made of the records where the patient did develop the condition, so that they represent a larger share of the dataset. Doing this gives ThingWorx Analytics more examples to help it generate a more accurate model.
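To make the arithmetic behind this example concrete, here is a small sketch that scores a model which always predicts "no condition". The record counts are made up to match the 0.5% rate in the example, and the function name `always_negative_metrics` is hypothetical.

```python
def always_negative_metrics(n_positive, n_negative):
    """Score a model that predicts 'no condition' for every patient."""
    total = n_positive + n_negative
    # Every negative record is predicted correctly, every positive one is missed
    accuracy = n_negative / total
    # Share of patients with the condition that the model catches
    recall = 0 / n_positive if n_positive else float("nan")
    return accuracy, recall

# Illustrative counts: 500 of 100,000 patients (0.5%) develop the condition
accuracy, recall = always_negative_metrics(n_positive=500, n_negative=99_500)
print(f"accuracy: {accuracy:.1%}")
print(f"patients at risk that were identified: {recall:.1%}")
```

The accuracy looks excellent, yet the model does not identify a single patient at risk, which is why accuracy alone is misleading for rare outcomes.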
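One way to picture upsampling is the sketch below, which adds random copies of the rare records until they reach a chosen share of the dataset. This is an assumption-laden illustration: the post does not say how many copies ThingWorx Analytics makes or how it picks them, and the names `upsample`, `target_share`, `patient_id`, and `developed_condition` are all hypothetical.

```python
import math
import random

def upsample(records, is_positive, target_share, seed=None):
    """Add random copies of positive records until they reach target_share."""
    if not 0 < target_share < 1:
        raise ValueError("target_share must be between 0 and 1")
    rng = random.Random(seed)
    positives = [r for r in records if is_positive(r)]
    negatives = [r for r in records if not is_positive(r)]
    if not positives:
        raise ValueError("no positive records to copy")
    # Smallest positive count p with p / (p + negatives) >= target_share
    needed = math.ceil(target_share * len(negatives) / (1 - target_share))
    extra = max(0, needed - len(positives))
    # Copies are drawn with replacement from the existing positive records
    return records + rng.choices(positives, k=extra)

# Hypothetical dataset: 1,000 patients, 5 of whom (0.5%) developed the condition
patients = [{"patient_id": i, "developed_condition": i < 5} for i in range(1000)]
bigger = upsample(patients, lambda r: r["developed_condition"], target_share=0.2)

n_pos = sum(r["developed_condition"] for r in bigger)
print(len(bigger), n_pos, round(n_pos / len(bigger), 3))
```

No records are discarded here, but the copies add no new information: the model simply sees the same few positive patients more often.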
Downsample and sample without replacement is also useful when the desired goal outcome is underrepresented in the dataset. With this strategy, some records that do not represent the desired goal outcome are removed. This increases the percentage of the dataset made up of the desired records.
Let’s continue using the medical example from above. Instead of creating copies of the desired records, undesired records are removed from the dataset. This causes the records where patients did develop the condition to occupy a larger percentage of the dataset.
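Downsampling can be sketched the same way: keep every record where the patient developed the condition and keep only a random subset of the others. As before, this is a sketch under stated assumptions rather than the ThingWorx Analytics implementation; `downsample`, `target_share`, and the record fields are hypothetical names.

```python
import math
import random

def downsample(records, is_positive, target_share, seed=None):
    """Drop random negative records until positives reach target_share."""
    if not 0 < target_share < 1:
        raise ValueError("target_share must be between 0 and 1")
    rng = random.Random(seed)
    positives = [r for r in records if is_positive(r)]
    negatives = [r for r in records if not is_positive(r)]
    # Largest negative count n with positives / (positives + n) >= target_share
    max_negatives = math.floor(len(positives) * (1 - target_share) / target_share)
    keep = min(len(negatives), max_negatives)
    # Negatives are drawn without replacement, so no record is duplicated
    return positives + rng.sample(negatives, keep)

# Hypothetical dataset: 1,000 patients, 5 of whom (0.5%) developed the condition
patients = [{"patient_id": i, "developed_condition": i < 5} for i in range(1000)]
smaller = downsample(patients, lambda r: r["developed_condition"], target_share=0.2)

n_pos = sum(r["developed_condition"] for r in smaller)
print(len(smaller), n_pos, round(n_pos / len(smaller), 3))
```

The trade-off is visible in the output size: the rare records now carry more weight, but most of the original dataset has been thrown away.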
Hi John Greiner, it is also possible to select "None" as sampling value. When would this be applied and what would be the effect?
BR
Roman
Hi Roman,
Yes, None is an option. It causes the model to be trained on the entire dataset instead of a sample. It is only recommended with a smaller dataset (only a few thousand rows), because applying it to a larger dataset will add a significant amount of time to the training process.
Warm Regards,
John