Benford's Law

PhilipLeitch

A while ago I posted on outlier detection and removal. I come back to the topic whenever I get time.

One of the things I have turned up recently is "Benford's law". It describes how often each digit appears as the first digit of a number. That is - in base 10, the first digits of most measured (non-categorical) numbers show up in specific ratios: 1 leads roughly 30.1% of the time, while 9 leads only about 4.6% of the time.

This works for anything from street numbers to bank account transactions. Think about it - a street with numbers 10-19 doesn't need to have numbers 90-99, but a street with numbers 90-99 must have numbers 10-19. As for transactions, if a product costs $100 and gets 5% interest, it becomes $105 (a $5 increase), while a $900 product gets a $45 increase. A value in the 100s has to double before its first digit changes, whereas a value in the 900s only has to grow by about 11%. So for any fixed interest rate, the share of time the value spends in the 100s compared to the 900s is 30.1% to 4.6%.
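The interest argument can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function name `digit_shares`, the 5% rate and the number of periods are all illustrative assumptions. It compounds a value period after period and records which first digit it shows in each period.

```python
import math
from collections import Counter

def digit_shares(rate, periods):
    """Share of periods in which a compounding value shows each first digit.

    Only the fractional part of log10(value) matters for the first digit,
    so the value itself is never computed (this avoids overflow).
    """
    step = math.log10(1.0 + rate)          # growth per period on a log10 scale
    counts = Counter()
    for n in range(periods):
        frac = (n * step) % 1.0            # position within the current decade
        digit = min(9, int(10 ** frac))    # mantissa lies in [1, 10)
        counts[digit] += 1
    return {d: counts[d] / periods for d in range(1, 10)}

# Hypothetical example: 5% growth per period over 100,000 periods.
shares = digit_shares(0.05, 100_000)
for d in (1, 9):
    print(f"first digit {d}: {shares[d]:.1%} of the time")
```

With enough periods the shares for 1 and 9 should settle close to the 30.1% and 4.6% quoted above, and changing the rate should make little difference.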

This is from Wikipedia:
"Benford's law, also called the first-digit law, states that in lists of numbers from many real-life sources of data, the leading digit is distributed in a specific, non-uniform way."

"More precisely, Benford's law states that the leading digit $d$ ($d \in \{1, \ldots, b - 1\}$) in base $b$ ($b \geq 2$) occurs with probability $P(d) = \log_b(d + 1) - \log_b d = \log_b\left(\frac{d + 1}{d}\right)$. This quantity is exactly the space between d and d + 1...."

So - I applied it to my company's data and found that it works very well for most of it. BUT - this is to be expected because, as others (Jean??) pointed out, most of my data did form a log distribution.
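For reference, the predicted first-digit shares follow directly from the quoted formula, P(d) = log_b((d + 1)/d). A minimal Python sketch (the function name `benford_probability` is just an illustrative choice):

```python
import math

def benford_probability(d, base=10):
    """Benford's law: probability that the leading digit is d in the given base."""
    return math.log(1 + 1 / d, base)

# Predicted share for each leading digit 1..9 in base 10.
expected = {d: benford_probability(d) for d in range(1, 10)}

for d, p in expected.items():
    print(f"{d}: {p:.1%}")
```

The nine shares add up to 1, since the logarithms telescope to log10(10).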

What I was surprised at was the following data:

First Digit    Number of Transactions
1              7364
2              3048
3              3113
4              5055
5              2680
6              4893
7              4559
8              7177
9              1959

Expressed as percentages:

Digit    Actual %    Predicted %    Difference
1        18.5%       30.1%          -11.6%
2        7.6%        17.6%          -10.0%
3        7.8%        12.5%          -4.7%
4        12.7%       9.7%           3.0%
5        6.7%        7.9%           -1.2%
6        12.3%       6.7%           5.6%
7        11.4%       5.8%           5.6%
8        18.0%       5.1%           12.9%
9        4.9%        4.6%           0.3%
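The percentage comparison can be reproduced from the transaction counts listed earlier. This is only a sketch of the arithmetic in Python, not the tool originally used; the variable names are illustrative.

```python
import math

# First-digit counts as posted above.
counts = {1: 7364, 2: 3048, 3: 3113, 4: 5055, 5: 2680,
          6: 4893, 7: 4559, 8: 7177, 9: 1959}
total = sum(counts.values())

rows = []
for d in range(1, 10):
    actual = counts[d] / total              # observed share of first digit d
    predicted = math.log10(1 + 1 / d)       # Benford's law in base 10
    rows.append((d, actual, predicted, actual - predicted))

print("Digit  Actual  Predicted  Difference")
for d, actual, predicted, diff in rows:
    print(f"{d:<6} {actual:>6.1%} {predicted:>10.1%} {diff:>+11.1%}")
```

Because the differences here are taken before rounding, an individual entry may differ from the posted figures in the last decimal place.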

Someone (Jean??) said that I wouldn't be able to find issues unless they were "the size of the moon". Well - this particular issue is that size. He was right: I still can't point to specific transactions and say "that record is wrong". So this is NOT picking up on "outlier" values, but rather on a process that is, as a whole, producing inaccurate data.

The parts of my data set that do meet Benford's law are almost always values measured by instruments (tonnes, kg, etc.), while the above data set was a volume ESTIMATE. There had been an assumption that the estimates were relatively accurate, but this analysis shows that they behave very differently from the measured values. This analysis forced upper management to change the business processes.

In a previous post I was accused of "making up" data, which was a reasonable statement given how the data looked, but was in fact false. However, falsification of data is exactly what this tool detects. If people make numbers up, they are more likely to depart from Benford's law than to stick to it (see example above). If anyone ever thinks they are being given made-up data, try applying this test to it. This is exactly what tax auditors do to detect "made up" tax returns.
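Applying the test to a new data set starts with tallying first digits. A minimal sketch in Python, assuming the data is a list of ordinary numbers; `leading_digit`, `first_digit_counts` and the sample values are hypothetical and only for illustration.

```python
from collections import Counter

def leading_digit(x):
    """Return the first non-zero digit of x, or None if there is none (e.g. 0)."""
    # repr() puts the mantissa first, so the first non-zero digit character
    # is the leading digit for both plain and exponent notation.
    for ch in repr(abs(x)):
        if ch in "123456789":
            return int(ch)
    return None

def first_digit_counts(values):
    """Tally leading digits, skipping values that have no leading digit."""
    digits = (leading_digit(v) for v in values)
    return Counter(d for d in digits if d is not None)

# Made-up sample values, only to show the call.
sample = [105, 0.0042, 900, 17.5, 0, 1e-05, 23]
counts = first_digit_counts(sample)
print(sorted(counts.items()))
```

The resulting counts can then be compared with the predicted shares in the same way as the table above.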

So this is another "tool in the pack".

By the way - I merely did a visual comparison, but a goodness-of-fit test such as K-S or chi-square could be applied if required.
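As a rough illustration of the chi-square option, the sketch below computes Pearson's statistic for the posted counts by hand and compares it with the standard 5% critical value for 8 degrees of freedom (15.507). It is a sketch under those assumptions, not a test that was actually run for this post.

```python
import math

# First-digit counts as posted above.
observed = {1: 7364, 2: 3048, 3: 3113, 4: 5055, 5: 2680,
            6: 4893, 7: 4559, 8: 7177, 9: 1959}
total = sum(observed.values())

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = 0.0
for d, count in observed.items():
    expected = total * math.log10(1 + 1 / d)   # Benford-expected count
    chi_square += (count - expected) ** 2 / expected

CRITICAL_5PCT_DF8 = 15.507   # chi-square critical value, 9 categories - 1 = 8 df
rejects_benford = chi_square > CRITICAL_5PCT_DF8

print(f"chi-square = {chi_square:.1f}, reject at 5%: {rejects_benford}")
```

With tens of thousands of records even small departures come out as statistically significant, so the size of the differences is still worth looking at alongside the test result.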

Cheers,
Philip
___________________
Correct answers don't require correct spelling.
33 REPLIES

On 1/27/2009 6:39:36 PM, pleitch wrote:

>This from the university of
>Wikipedia:
>Poisson is the "probability of
>a number of events occurring
>in a fixed period of time if
>these events occur with a
>known average rate and
>independently of the time
>since the last event."
>
>Yep - I agree with Jean. That
>describes most events that you
>would find in nature.

It describes the rate of events occurring, but not the magnitude of events. So earthquake frequency would be Poisson, but what about earthquake magnitude?

I agree though - another thread is in order.

Richard

Philip,

I was right to point you to Simon Plouffe and Neil Sloane.

Read more:

http://mathworld.wolfram.com/BenfordsLaw.html

One striking example of Benford's law is given by the 54 million real constants in Plouffe's "Inverse Symbolic Calculator" database, 30% of which begin with the digit 1.

jmG

jmG

More about "Benford's Law" here: "Explaining Benford’s Law".
