Sampling Strategy This Blog Post will cover the 4 sampling Strategies that are available in ThingWorx Analytics. It will tell you how the sampling strategy runs behind the scenes, when you may want to use that strategy, and will give you the pros and cons of each strategy. SAMPLE_WITH_REPLACEMENT This strategy is not often used by professionals but still may be useful in certain circumstances. When you sample with replacement, the value that you randomly selected is then returned to the sample pool. So there is a chance that you can have the same record multiple times in your sample. Example Let’s say you have a hat that contain 3 cards with different people’s names on them. John Sarah Tom Let’s say you make 2 random selections. The first selection you pull out the name Tom. When you sample with replacement, you would put the name Tom back into the hat and then randomly select a card again. For your second selection, it is possible to get another name like Sarah, or the same one you selected, Tom. Pros May find improved models in smaller datasets with low row counts Cons The Accuracy of the model may be artificially inflated due to duplicates in the sample SAMPLE_WITHOUT_REPLACEMENT This is the default setting in ThingWorx Analytics and the most commonly used sampling strategy by professionals. The way this strategy works is after the value is randomly selected from the sample pool, it is not returned. This ensures that all the values that are selected for the sample, are unique. Example Let’s say you have a hat that contain 3 cards with different people’s names on them. John Sarah Tom Let’s say you make 2 random selections. The first selection you pull out the name Tom. When you sample without replacement, you would randomly select a card from the hat again without adding the card Tom. For your second selection, you could only get the Sarah or John card. Pros This is the sampling strategy that is most commonly used It will deliver the best results in most cases Cons May not be the best choice if the desired goal is underrepresented in the dataset UPSAMPLE_AND_SAMPLE_WITHOUT_REPLACEMENT This is useful when the desired goal is underrepresented in the dataset. The features that represent the desired outcome of the goal are copied multiple times so they represent a larger share of the total dataset. Example Let’s say you are trying to discover if a patient is at risk for developing a rare condition, like chronic kidney failure, that affects around .5% of the US population. In this case, the most accurate model that would be generated would say that no one will get this condition, and according to the numbers, it would be right 99.5% of the time. But in reality, this is not helpful at all to the use case since you want to know if the patient is at risk of developing the condition. To avoid this from happening, copies are made of the records where the patient did develop the condition so it represents a larger share of the dataset. Doing this will give ThingWorx Analytics more examples to help it generate a more accurate model. Pros Patterns from the original dataset remain intact Cons Longer training time DOWNSAMPLE_AND_SAMPLE_WITHOUT_REPLACEMENT This is also useful when the desired goal is underrepresented in the dataset. In downsample and sample without replacement, some features that do not represent the desired goal outcome are removed. This is done to increase the desired features percentage of the dataset. Example Let’s continue using the medical example from above. Instead of creating copies of the desired records, undesired record are removed from the dataset. This causes the records where patients did develop the condition to occupy a larger percentage of the dataset. Pros Shorter training time Cons Patterns from the original dataset may be lost
View full tip