23-Emerald I
August 10, 2023
Question

What do you do when you suspect that some data is wrong?


I spent a fair portion of my career taking experimental measurements. When I first started, the senior engineer would review the data and discard measurements that he deemed false. Discarding data simply because you thought it was wrong struck me as very questionable, and I spent a fair amount of effort developing a method (using t statistics) to identify "bad" data.

It turns out that there is a statistically rigorous calculation for doing just that, known as Peirce's criterion, developed (and published in 1852) by Benjamin Peirce. That method is developed and demonstrated in the attached Prime 4 Express file.

Thoughts and suggestions?
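
For readers without Mathcad, here is a minimal Python sketch of Peirce's criterion. It is not a reproduction of the attached worksheet: it follows the iterative formulation commonly attributed to B. A. Gould, and it assumes a single fitted parameter (m = 1, the mean) and the sample standard deviation. The names `peirce_ratio` and `peirce_outliers` and the sample data are made up for illustration.

```python
import math
import statistics


def peirce_ratio(N, n=1, m=1, tol=1e-12, max_iter=1000):
    """Ratio (max allowable |deviation|) / (standard deviation) for
    N observations, n suspected outliers and m fitted parameters."""
    if not (0 < n < N and N - m - n > 0):
        raise ValueError("need 0 < n < N and N - m - n > 0")
    Q = (n ** (n / N) * (N - n) ** ((N - n) / N)) / N
    r = 1.0   # starting guess for the fixed-point iteration
    x2 = 0.0  # squared ratio
    for _ in range(max_iter):
        lam = (Q ** N / r ** n) ** (1.0 / (N - n))
        x2 = 1.0 + (N - m - n) / n * (1.0 - lam ** 2)
        if x2 < 0.0:
            x2 = 0.0
            break
        r_next = math.exp((x2 - 1.0) / 2.0) * math.erfc(math.sqrt(x2 / 2.0))
        if abs(r_next - r) < tol:
            break
        r = r_next
    return math.sqrt(x2)


def peirce_outliers(data, m=1):
    """Flag values whose deviation from the mean exceeds Peirce's limit.
    Mean and standard deviation of the full data set are kept throughout;
    the assumed number of outliers n grows while rejections keep pace."""
    N = len(data)
    mean = statistics.fmean(data)
    s = statistics.stdev(data)  # assumption: sample standard deviation
    flagged = []
    n = 1
    while n < N - m:
        limit = peirce_ratio(N, n, m) * s
        candidates = [x for x in data if abs(x - mean) > limit]
        if len(candidates) < n:
            break
        flagged = candidates
        n = len(candidates) + 1
    return flagged


# Illustrative data only: nine plausible readings and one suspicious one.
sample = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 15.0]
suspect = peirce_outliers(sample)
```

For ten observations and one suspected outlier, the ratio should come out close to the tabulated value of about 1.878, so only the reading of 15.0 is flagged in this example.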

3 replies

Derbigdog
15-Moonstone
August 10, 2023

W. Edwards Deming wrote a book on Statistical Process Control that addressed this subject as well. His ideas led to the Japanese manufacturing industry going from one of the worst to one of the best in quality control.

13-Aquamarine
August 11, 2023

Dear Fred, would it be possible to attach a PDF of your worksheet? Have a nice weekend.

23-Emerald I
August 11, 2023
13-Aquamarine
August 12, 2023

Dear Fred, thank you, highly appreciated!

25-Diamond I
August 11, 2023

Would your method also work for the data provided in this thread?
Solved: Remove specific regions from a graph - PTC Community

The built-in functions (Grubbs, GrubbsClassic, ThreeSigma) won't.

23-Emerald I
August 11, 2023

Man! That's a lot of data!

First, note that Terry has successfully trimmed this.

Second, Peirce takes the log of the inequality I'm using; I haven't been able to do that successfully. The large data set is (I think) keeping my solution, which uses the root function, from working.

Third, my first pass simply treated the entire set as measurements to be averaged and analyzed. Clearly there's an underlying sinusoidal function; accounting for it might reduce the scatter and standard deviation.
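
One way to act on the third point is to fit the sinusoid first and then look at the residuals rather than the raw values. The sketch below is only an illustration under a strong assumption, namely that the period is known in advance, which turns the fit into ordinary linear least squares on sine, cosine and a constant. The function names and the synthetic data are hypothetical and are not taken from the thread's data set.

```python
import math


def fit_sine_known_period(t, y, period):
    """Least-squares fit of y ~ a*sin(w*t) + b*cos(w*t) + c,
    with w = 2*pi/period assumed known."""
    w = 2.0 * math.pi / period
    rows = [(math.sin(w * ti), math.cos(w * ti), 1.0) for ti in t]
    # Normal equations (A^T A) p = A^T y as an augmented 3x4 matrix.
    m = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(m[k][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for k in range(3):
            if k != col:
                f = m[k][col] / m[col][col]
                m[k] = [mk - f * mc for mk, mc in zip(m[k], m[col])]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))
    return a, b, c


def residuals(t, y, period):
    """Data minus the fitted sinusoid."""
    a, b, c = fit_sine_known_period(t, y, period)
    w = 2.0 * math.pi / period
    return [yi - (a * math.sin(w * ti) + b * math.cos(w * ti) + c)
            for ti, yi in zip(t, y)]


# Synthetic illustration: a clean sinusoid with one injected spike.
t = list(range(200))
y = [2.0 * math.sin(2.0 * math.pi * ti / 50 + 0.3) + 5.0 for ti in t]
y[77] += 4.0
res = residuals(t, y, period=50)
worst = max(range(len(res)), key=lambda i: abs(res[i]))
```

The residuals, not the raw measurements, would then be the input to whatever outlier test is used, since their scatter no longer includes the sinusoidal swing.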

25-Diamond I
August 11, 2023

@Fred_Kohlhepp wrote:

Man! That's a lot of data!

First, note that Terry has successfully trimmed this.

Second, Peirce takes the log of the inequality I'm using; I haven't been able to do that successfully. The large data set is (I think) keeping my solution, which uses the root function, from working.

Third, my first pass simply treated the entire set as measurements to be averaged and analyzed. Clearly there's an underlying sinusoidal function; accounting for it might reduce the scatter and standard deviation.


Yes, it certainly is a huge amount of data: slightly fewer than 175,000 data points.
As far as I understand, Terry guessed(!) a sine as the upper limit and eliminated all data above it.

I thought about automating that process with an outlier function, but had no success.

If we zoom in (see picture below) we can clearly(?) see which data should be considered an outlier/spike. I thought about some kind of windowing, applying an outlier function bit by bit instead of treating all the data in one go ....?
I did not pursue that idea any further, as the OP in that thread seemed to be happy with Terry's solution anyway.
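
To make the windowing idea concrete, here is a rough Python sketch of one common variant, a Hampel-style filter: each point is compared with the median of its neighbours, and the spread is estimated from the median absolute deviation (MAD) of the same window. This is a swapped-in technique, not the Grubbs or ThreeSigma functions mentioned earlier and not anything tried in the other thread. The window half-width of 5, the threshold of 3 and the synthetic signal are arbitrary assumptions.

```python
import math
import statistics


def hampel_flags(values, half_width=5, n_sigmas=3.0):
    """Return indices whose value is far from the median of a sliding window.
    The window is truncated at both ends of the data."""
    flagged = []
    for i, x in enumerate(values):
        lo = max(0, i - half_width)
        hi = min(len(values), i + half_width + 1)
        window = values[lo:hi]
        med = statistics.median(window)
        mad = statistics.median([abs(v - med) for v in window])
        scale = 1.4826 * mad  # MAD rescaled to match sigma for normal data
        if scale > 0 and abs(x - med) > n_sigmas * scale:
            flagged.append(i)
    return flagged


# Synthetic illustration: a sine wave with two injected spikes.
signal = [math.sin(2.0 * math.pi * i / 50) for i in range(200)]
for spike_index in (30, 120):
    signal[spike_index] += 3.0
spikes = hampel_flags(signal)
```

Because the median and MAD are computed locally, the slow sinusoidal swing does not inflate the spread estimate the way it does when all points are treated in one go. The window has to be short compared with the period but long enough to hold mostly good points.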

The thread just came to mind when I read your post.

[Image: zoomed-in view of the data, Werner_E_0-1691771266421.png]