Benford's Law

PhilipLeitch · ‎Jan 20, 2009

A while ago I posted on outlier detection and removal. I come back to this topic whenever I get time.

One of the things I have turned up recently is "Benford's law". It describes how often each digit appears as the first digit of the numbers in a data set. That is - in base 10, most measured (non-categorical) numbers will have first digits occurring in specific ratios. 1 occurs roughly 30.1% of the time, while 9 occurs about 4.6% of the time.

This works for anything from street numbers to bank account transactions. Think about it - a street with numbers 10-19 doesn't need to have numbers 90-99, but a street with numbers 90-99 must have numbers 10-19. As for transactions, if a product is $100 and gets 5% interest, it will become $105, while if the product was $900 it would get a $45 increase. So the value has to double to leave the 100's, but needs only an 11% increase to leave the 900's. Given any interest rate, the percent of time the product remains in the 100's compared to the 900's is 30.1 to 4.6.
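
The compounding argument can be checked numerically. The following is a minimal Python sketch, not part of the original post. It assumes a balance that grows by a fixed 5% per period for 10,000 periods (both numbers are illustrative), and the function name leading_digit_shares is made up for this example. It works on a log10 scale so the balance never overflows, and counts how many periods are spent on each leading digit.

```python
import math

def leading_digit_shares(rate=0.05, periods=10_000):
    """Share of periods a compounding balance spends on each leading digit."""
    step = math.log10(1.0 + rate)        # growth per period on a log10 scale
    counts = [0] * 10
    for k in range(periods):
        frac = (k * step) % 1.0          # position within the current decade
        digit = min(9, int(10 ** frac))  # leading digit, 1..9
        counts[digit] += 1
    return {d: counts[d] / periods for d in range(1, 10)}

shares = leading_digit_shares()
for d, share in shares.items():
    # simulated share next to the Benford share log10(1 + 1/d)
    print(d, f"{share:.1%}", f"{math.log10(1 + 1 / d):.1%}")
```

Under these assumptions the simulated shares should land close to the Benford column. Other modest growth rates should give a similar picture.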

This from Wikipedia:
"Benford's law, also called the first-digit law, states that in lists of numbers from many real-life sources of data, the leading digit is distributed in a specific, non-uniform way."

"More precisely, Benford's law states that the leading digit d (d ∈ {1, ..., b - 1}) in base b (b ≥ 2) occurs with probability P(d) = log_b(d + 1) - log_b(d) = log_b((d + 1)/d). This quantity is exactly the space between d and d + 1...."

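The quoted formula is easy to evaluate directly. A minimal Python sketch, not part of the original post (the function name benford_probs is made up for this example):

```python
import math

def benford_probs(base=10):
    """Expected share of each leading digit d = 1..base-1 under Benford's law."""
    return {d: math.log(1 + 1 / d, base) for d in range(1, base)}

probs = benford_probs(10)
for d, p in probs.items():
    print(d, f"{p:.1%}")
```
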
So - I applied it to my company's data and found that it works very well for most of it. BUT - this is to be expected because, as others (Jean??) pointed out, most of my data did form a log distribution.

What I was surprised at was the following data:

First Digit   Number of Transactions
1             7364
2             3048
3             3113
4             5055
5             2680
6             4893
7             4559
8             7177
9             1959

Expressed as percentages:

First Digit   Actual %   Predicted %   Difference
1             18.5%      30.1%         -11.6%
2             7.6%       17.6%         -10.0%
3             7.8%       12.5%         -4.7%
4             12.7%      9.7%          3.0%
5             6.7%       7.9%          -1.2%
6             12.3%      6.7%          5.6%
7             11.4%      5.8%          5.6%
8             18.0%      5.1%          12.9%
9             4.9%       4.6%          0.3%
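
The percentages can be reproduced from the transaction counts. A minimal Python sketch, not part of the original post; the counts are copied from the first table, and the variable names are made up for this example:

```python
import math

# Transaction counts by first digit, copied from the table of counts above.
counts = {1: 7364, 2: 3048, 3: 3113, 4: 5055, 5: 2680,
          6: 4893, 7: 4559, 8: 7177, 9: 1959}
total = sum(counts.values())

rows = []
for d, n in counts.items():
    actual = 100 * n / total                   # observed share, in percent
    predicted = 100 * math.log10(1 + 1 / d)    # Benford share, in percent
    rows.append((d, actual, predicted, actual - predicted))
    print(f"{d}  {actual:5.1f}%  {predicted:5.1f}%  {actual - predicted:+6.1f}%")
```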

Someone (Jean??) said that I wouldn't be able to find issues unless they were "the size of the moon". Well - this particular issue is that size. He was right, I still can't point to specific transactions and say "that record is wrong". So this is NOT picking up on "outlier" values, but rather a process that is, as a whole, producing inaccurate data.

The parts of my data set that do meet Benford's law are almost always values measured by instruments (tonnes, kg, etc.), while the above data set was a volume ESTIMATE. There had been an assumption that the estimates were relatively accurate, but this analysis shows that they clearly are not. This analysis forced upper management to change the business processes.

In a previous post I was accused of "making up" data, which was a reasonable statement given how the data looked, but was in fact false. However, falsifying of data is exactly what this tool detects. If people make numbers up, they are more likely to depart from Benford's law than to stick to it (see example above). If anyone ever thinks they are being given made-up data, try applying this test to it. This is exactly what the tax auditors do to detect "made up" tax returns.

So this is another "tool in the pack".

By the way - I merely did a visual comparison, but a goodness-of-fit test could be applied if required, such as K-S or Chi-Square.
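
As a rough illustration of the Chi-Square option, here is a minimal Python sketch, not part of the original post. It computes the Pearson chi-square statistic by hand for the counts in the table above and compares it with 15.507, the 5% critical value for 8 degrees of freedom. The variable names are made up for this example.

```python
import math

# Observed transaction counts by first digit, copied from the table above.
observed = {1: 7364, 2: 3048, 3: 3113, 4: 5055, 5: 2680,
            6: 4893, 7: 4559, 8: 7177, 9: 1959}
total = sum(observed.values())

chi_square = 0.0
for d, obs in observed.items():
    expected = total * math.log10(1 + 1 / d)   # Benford expected count
    chi_square += (obs - expected) ** 2 / expected

CRITICAL_5PCT_8DF = 15.507  # chi-square critical value, 8 degrees of freedom
print(f"chi-square = {chi_square:.0f}, 5% critical value = {CRITICAL_5PCT_8DF}")
```

With roughly 40,000 records even small departures become statistically significant, so the size of the statistic is more informative than merely crossing the threshold.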

Cheers,
Philip
___________________
Correct answers don't require correct spelling.

ptc-1368288 · ‎Jan 21, 2009

I know nothing about "Benford's law", nor how old it is, but it is as old as Fermat. It has been used by the French "fisc" [taxation collector], secretly. It is how they detect false declarations, i.e. from the occurrence of the first digit.
There was an article by Steven Finch [former Mathsoft].

jmG

PhilipLeitch · ‎Jan 21, 2009

On 1/21/2009 12:54:32 AM, jmG wrote:
>I know nothing about
>"Benford's law", nor how
>old it is, but it is as old as
>Fermat.
Recent - relatively speaking.
http://en.wikipedia.org/wiki/Benford%27s_law
According to Wikipedia it dates from 1938, although the first observations date back to the late 19th century. The first people to note it had realised that the pages of their log tables starting with "1" were much more used/dirty than the "9" pages.

The Wikipedia link is actually very good - better than the book I read this in.

>It has been used by
>the French "fisc" [taxation
>collector], secretly. It is
>how they detect false
>declarations, i.e. from the
>occurrence of the first digit.

Yes - and look at the link, under "Applications and limitations". There are sources for how this application was developed.

>There was an article by Steven
>Finch [former Mathsoft].

I would be interested in that.

Philip
___________________
Correct answers don't require correct spelling.

PhilipOakley · ‎Jan 21, 2009

The theory part assumes a uniform distribution of numbers from zero to N [0:N].

I would expect that your data would have a different statistic.

If you apply your statistic you would get the local Benford law...

Given the size of the data set you may be able to partition the data according to customer type to get a better set of probabilities and consequent discrimination of errors.

Philip Oakley

PhilipLeitch · ‎Jan 21, 2009

Yes - that's right about different results by segmenting data. That's very right. Also - right about uniform distribution between 0 and N, given that N is variable, and therefore the observed relationship arises. The numbers are uniform, but the first digit is not. For that matter, none of the digit places is purely uniform, although the rightmost ones are close. Therefore this is an artifact of the "base" of the number system given uniform distributions.
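
The "uniform between 0 and N, with N variable" idea can be tried out with a small simulation. This is a minimal Python sketch, not part of the original post, under assumptions chosen purely for illustration: N is drawn uniformly from 1 to 9,999, and then a whole number is drawn uniformly from 1 to N. The function name first_digit_shares and all parameter values are made up.

```python
import random
from collections import Counter

def first_digit_shares(samples=100_000, max_upper=9_999, seed=1):
    """Leading-digit shares for x uniform on 1..N, with N itself drawn at random."""
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(samples):
        upper = rng.randint(1, max_upper)   # the variable upper limit N
        x = rng.randint(1, upper)           # uniform value between 1 and N
        tally[int(str(x)[0])] += 1          # first decimal digit of x
    return {d: tally[d] / samples for d in range(1, 10)}

shares = first_digit_shares()
for d, share in shares.items():
    print(d, f"{share:.1%}")
```

With these particular assumptions the first digits come out skewed toward the low digits, though not necessarily in the exact Benford proportions.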

So - specifically looking at a particular customer, let's say the average transaction is $150, with a deviation of only a couple of dollars plus or minus from one transaction to the next.

Then the results will be strongly distorted towards 1's, with maybe an elevation of "2" and "9". Whatever the case, the non-uniform nature will distort this "law".

But, across multiple customers - EVEN IF each one displays their own distribution - the values DO match this relationship. I'm assuming that the multiple overlapping distributions approximate a uniform overall distribution.
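
The contrast between one customer and many can be sketched with simulated data. This is a minimal Python sketch, not part of the original post and not the real customer data. It assumes one customer whose transactions are normally distributed around $150 with a standard deviation of $2, and 2,000 hypothetical customers whose typical amounts are spread log-uniformly between $1 and $100,000, each varying about 1% around its own typical amount. All of those numbers, and the helper names, are assumptions for illustration.

```python
import math
import random
from collections import Counter

def leading_digit(x):
    """First significant digit of a positive number."""
    return int(f"{x:.15e}"[0])

def digit_shares(values):
    """Share of values starting with each digit 1..9."""
    tally = Counter(leading_digit(v) for v in values)
    return {d: tally[d] / len(values) for d in range(1, 10)}

rng = random.Random(42)

# One customer: transactions of about $150, give or take a couple of dollars.
single = [rng.gauss(150, 2) for _ in range(2_000)]

# Many customers: typical amounts spread over five orders of magnitude
# (hypothetical, log-uniform), each varying about 1% around its own amount.
pooled = []
for _ in range(2_000):
    typical = 10 ** rng.uniform(0, 5)
    pooled.extend(rng.gauss(typical, 0.01 * typical) for _ in range(20))

single_shares = digit_shares(single)
pooled_shares = digit_shares(pooled)
for d in range(1, 10):
    print(d, f"{single_shares[d]:.1%}", f"{pooled_shares[d]:.1%}",
          f"{math.log10(1 + 1 / d):.1%}")
```

The single customer is almost entirely 1's, while the pooled data should sit near the Benford column. How close real pooled data gets depends on how widely the customers are actually spread.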

Anyway, I found this counter-intuitive and interesting. I applied it to my bank account, to customer weights (in KG and Tonnes), to customer transaction dollar amounts, to customer transaction unit amounts (i.e. per bin rate) and they ALL come through with this distribution.

It does not match at a customer level, or even particular segments (that conform to a specific distribution).

But - in the example I included last time, there was no distorting segment/customer - all data matched. The reason? The values the user estimated/guessed all follow a relatively distorted distribution. I looked at the dollar value too, and it was just as distorted.

I'll do up some examples in a Mathcad sheet and post here when I get a chance.

Philip
___________________
Correct answers don't require correct spelling.

PhilipOakley · ‎Jan 21, 2009

Looking back at the original numbers and the differences, it looks like the customers prefer even numbers to odd numbers. This is most obvious for 6 & 8.
Clearly the percentages must add to 100 so any 'up' will be matched by a 'down', possibly explaining some of the negative difference values for 1-5.

Philip Oakley

ptc-1368288 · ‎Jan 21, 2009

>According to Wikipedia it dates from 1938< [Philip]

==>Probably got it from the "french connexion"

>...log tables starting with "1" were much more used/dirty than the "9" log pages.< [Philip]

==> True, as with my log booklet, and this is why it is as old as the 16th century.
The first log tables are more ancient than Neper.
It applies to "Mathematical Constants" too. See two pages starting "1" in the attached.

>>There was an article by Steven Finch [former Mathsoft].<<

>I would be interested in that< [Philip]

==> Not sure if collected; the Mathsoft special site vanished into the blue with PTC.

It would be interesting to isolate the first digit of each constant in the attached and plot the statistical distribution.
I have no expertise in that kind of surgery !

Thanks Philip, for your collaboration.

jmG

PhilipLeitch · ‎Jan 21, 2009

Here is the sheet I said I would do up.

It uses real-world data.

This demonstrates the processes, and also validates Philip Oakley's statements. That is - the "law" holds true on the aggregate information, but if a specific customer, or specific "distribution", is affecting the data, then the data deviates from the law.

I also included an example of "finding" an issue in data.

As per the comments on "even" numbers, my suspicion is that they see the volume of a "skip", see the skip is roughly full and just enter that in. Skips are normally an even size in cubic metres. This isn't what they are meant to do - but it sure looks like what is occurring.

As for the French connection - I think a lot of these concepts are discovered a number of times over, but it would be interesting to know how far back it goes.

Philip
___________________
Correct answers don't require correct spelling.

PhilipLeitch · ‎Jan 21, 2009

Is this what you were after?

Philip
___________________
Correct answers don't require correct spelling.

ptc-1368288 · ‎Jan 22, 2009

On 1/21/2009 10:41:30 PM, pleitch wrote:
>Is this what you were after?
>
>Philip
_____________________________

Your shifting is incorrect as indicated by the red. The Quickplot is very forgiving ! That kind of correlation for such an horror fit ... no way ! Benford has surely forgotten many constants in some bins and the table does not include enough of the "world" of constants. It looks like counting the school when kids are sick by age category.

Fitting two points is like fitting a shoe to an elephant.

jmG

PhilipLeitch · ‎Jan 22, 2009

>Your shifting is incorrect as indicated
>by the red.

I saw that after I posted - missing the 9. I couldn't be bothered to re-post - but thanks for pointing it out.

>That kind of correlation for
>such an horror fit ... no way !
>Fitting two points is like fitting a
>shoe to an elephant.

Elephants don't wear shoes.... to me fitting two points is like... connecting the dots.

I think I get what you mean though. Under this system almost any sets of 9 points are going to show a strong correlation to the law, even though the fit is obviously horrific.

In fact, correlation itself is inappropriate. Both data sets should be transformed to a "uniform" scale, otherwise the variation likely to occur for digit 1 will be expected to be higher than the variation of digit 9, which breaches the applicability of parametric statistics. But even a conversion to a uniform scale is likely to maintain a strong correlation because... there are ONLY 9 points.

So.... I'm going to ramp it up.

This law goes across multiple bases, so I'm going to convert the values to a much higher base, which will give more points and therefore better divergence from the predicted path. I'll still have to transform so that the data points are uniformly variable, but I'll give it a go anyway.
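
Finding the leading "digit" in another base only needs repeated integer division. This is a minimal Python sketch, not part of the original post and not necessarily how the Mathcad sheet does it. It assumes the values are positive whole numbers (for example amounts in whole dollars or cents); the function names and sample amounts are made up.

```python
from collections import Counter

def leading_digit(n, base=10):
    """Leading digit of a positive integer n written in the given base."""
    while n >= base:
        n //= base      # drop the lowest-order digit
    return n

def leading_digit_counts(values, base=10):
    """Tally of leading digits for positive integers, in the given base."""
    return Counter(leading_digit(v, base) for v in values)

amounts = [150, 4321, 97, 64, 4095]   # made-up whole-dollar amounts
print(leading_digit_counts(amounts, 10))
print(leading_digit_counts(amounts, 64))
```

In base 64 there are 63 possible leading digits instead of 9, which is what gives more points to compare against the expected share log_b(1 + 1/d).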

I'll keep you posted.

Philip
___________________
Correct answers don't require correct spelling.

PhilipLeitch · ‎Jan 22, 2009

Here is some data converted into base64.

It works really well, and highlights deviation from the expected relationship, even with values that are "close" to following the relationship. I have shown the standard relationship and also the values on the "benford scale".

In the second section you can now see that the correlation in the "guessed" values is very low. However, I'm still not happy with using a correlation. I'm not 100% sure what test should be used to identify how different the values are from 1 before we consider them significantly different - but the K-S test springs to mind.

How big could you take this? You would need more and more transactions for every base size you increased to. Base 10 requires only a couple of hundred transactions, base 256 might need thousands.
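
One hedged way to put numbers on this: a rule of thumb sometimes used with chi-square tests asks for at least about 5 expected records in every bin. That rule is an assumption brought in for this example, not something from the thread. The minimal Python sketch below applies it to the rarest Benford bin, the largest leading digit, whose share is log_b(b/(b - 1)). The function name min_transactions is made up.

```python
import math

def min_transactions(base, min_expected=5):
    """Smallest sample size giving min_expected records in the rarest Benford bin."""
    rarest = math.log(base / (base - 1), base)   # share of the largest leading digit
    return math.ceil(min_expected / rarest)

for base in (3, 10, 64, 256):
    print(base, min_transactions(base))
```

This only covers data that actually follows the law. Data that departs from it can leave bins empty at much larger sample sizes.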

Already I was hitting blanks with 40,000 transactions at base 64 (although this data didn't match the law). But larger bases show the relationships better, and therefore would be useful in identifying "where" anomalies exist.

But the opposite should be true too. In base 3, about 63% of values would start with a 1 (log base 3 of 2 is about 0.63). It stands to reason that base 3 would work as a good general test for faked/non-uniform data on only a few transactions.

Similarly, you can measure the degree of surprise that a value would engender based on where it sits on a larger base range. For instance, in base 64, 50% of values occur in the first 7 ordinals, about 70% in the first 17, about 85% in the first 33, 90% in the first 41 and 99% in the first 60.

That is - the sum of the probabilities to that point in the scale.
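
The running sum has a simple closed form, because the terms log_b((d + 1)/d) telescope: the probability of a leading digit of k or lower is log_b(k + 1). A minimal Python sketch, not part of the original post (the function name cumulative_share is made up):

```python
import math

def cumulative_share(k, base):
    """Benford probability that the leading digit is k or lower in the given base."""
    return math.log(k + 1, base)

for k in (7, 17, 33, 41, 60):
    print(k, f"{cumulative_share(k, 64):.1%}")
```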

Cranking that up a couple of notches - let's assume the limit of my credit card is $1000 (base 1000), that I only deal in whole dollars (for simplicity of argument) and that the balance meets Benford's law at any point in time (i.e. never zero). What's the chance that I'll see a charge between $985 and $999 on my credit card? Just over 0.2%. What about 1-25? Just under 47.2%.

What's the chance I will see $185 or less on my credit card? About 75.65%

So if I saw a fee for $993, there is a very strong likelihood that it is erroneous.
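
The credit card figures follow from treating each whole-dollar amount from 1 to 999 as a single "digit" in base 1000, so a range from low to high has probability log_1000((high + 1)/low). A minimal Python sketch of that arithmetic, not part of the original post (the function name range_probability is made up):

```python
import math

def range_probability(low, high, base):
    """Benford probability that the leading 'digit' lies in low..high inclusive."""
    return math.log((high + 1) / low, base)

print(f"{range_probability(985, 999, 1000):.2%}")   # charge between $985 and $999
print(f"{range_probability(1, 25, 1000):.2%}")      # charge between $1 and $25
print(f"{range_probability(1, 185, 1000):.2%}")     # charge of $185 or less
```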

Certainly food for thought, and a very interesting little topic.

Philip
___________________
Correct answers don't require correct spelling.

ptc-1368288 · ‎Jan 23, 2009

Philip,

There is a gross typo in the data set, maybe more than 1. That is enough to perturb the blue plot. "correlation" is just one out of many statistical lies, a pure invention. I don't think much about Benford law ... log, log of what. It would be a lot closer if related to sunflower Fibonacci. My point is that log or exp are incremental, but life (creation) and all the "worlds of life" are as they come, i.e. noisy and not incremental, arriving therefore more like a Poisson than log.

Anything on that is too biased and incomplete, still interesting.

jmG

PhilipLeitch · ‎Jan 23, 2009

>There is a gross typo in the
>data set, maybe more than 1.
Really - where? I didn't see it. Probably true - I make typos all the time.

>"correlation" is
>just one out of many
>statistical lies, a pure
>invention.

I don't see how a measure of association can lie. All it can do is measure. It's like looking at a tape measure and saying that it is lying. Actually... that's probably more often said when someone stands on bathroom scales and looks at the weight...

It is the inferences, assumptions and conclusions you draw from the association that may be "lies".

It just seems extreme to me to consider that all statistics are lies when statistics only speak to the probability of events occurring. That is, statistics make no firm statements/decisions - only people do that.

>I don't think much
>about Benford law ... log, log
>of what. It would be a lot
>closer if related to
>sunflower Fibonacci. My point is
>that log or exp are
>incremental, but life
>(creation) and all the
>"worlds of life" are as they
>come, i.e. noisy and not
>incremental, arriving therefore
>more like a Poisson than log.

I think you might be missing the point. Benford isn't picking up on a distribution of numbers, only of the leading digit.

So the law makes no assumption about the underlying distributions, which may well be Poisson, it is making an assumption on the base system itself.

Philip
___________________
Correct answers don't require correct spelling.

ptc-1368288 · ‎Jan 23, 2009

On 1/23/2009 1:27:36 AM, pleitch wrote:
>>There is a gross typo in the
>>data set, maybe more than 1.
>Really - where? I didn't see
>it. Probably true - I make
>typos all the time.
>
...
>Philip
____________________________

@ index 20, you have 2 instead of 21
corrected in the attached *.gif.

jmG

ptc-1368288 · ‎Jan 23, 2009

Now, you can check the Benford fit and my fit !

What do you think, Philip ?

Jean

PhilipLeitch · ‎Jan 24, 2009

That does appear to be a much closer fit.

I have access to vast amounts of data from multiple sources.

I will attempt to compare the two distributions, but based on what you have shown so far, Poisson looks superior. I will be happy to eat my words if I am wrong.

Philip
___________________
Correct answers don't require correct spelling.

RichardJ · ‎Jan 24, 2009

When Jean put his stuff in to the sheet he trashed the indexing for yours. So that needs to be put back. Then use the distribution for base 64, and it looks a lot better.

I don't buy Jean's distribution, because it drops to a constant value. With this data that happens to fit well, but in general that's not going to happen.

Richard

ptc-1368288 · ‎Jan 24, 2009

On 1/24/2009 10:50:11 AM, rijackson wrote:
>When Jean put his stuff in to
>the sheet he trashed the
>indexing for yours. So that
>needs to be put back. Then use
>the distribution for base 64,
>and it looks a lot better.

I
>don't buy Jean's distribution,
>because it drops to a constant
>value. With this data that
>happens to fit well, but in
>general that's not going to
>happen.

Richard
____________________________

I don't buy anything yet, Richard.

Constants are all mystic, at least as much as Pi. So, the matter is to imagine a "mystic distribution", interesting but more constants need to be collected. I quickly get an even better fit than the b = 64 you propose. An exponentiated function has puzzled me for at least 30 years. It is parent to the De Moivre, old but solved and applied problem. But there is more in the extended sense of the universe.

Interesting even if it goes nowhere.

The base today is -30° C.
I watched Al Gore last weekend, superb.
I'm too poor, otherwise would hire this man.

Jean

RichardJ · ‎Jan 25, 2009

On 1/24/2009 1:03:06 PM, jmG wrote:

>Constants are all mystic at least as
>much as Pi. So, the matter is to imagine
>a "mystic distribution"

There's nothing "mystic" about Benford's law. A little surprising when you first see it perhaps, but not "mystic". Philip explained why it works the way it does. Philip was showing that, to within the error in the data (which is quite large when you have a limited data set and are working in base 64; something else Philip also pointed out), the data obeys Benford's law. Fitting an arbitrary function to the data proves nothing. Except perhaps that since there are an infinite number of arbitrary functions it's no great surprise you can find one that fits one particular data set better than Benford's law does.

Richard

ptc-1368288 · ‎Jan 26, 2009

On 1/24/2009 1:03:06 PM, jmG wrote:

>Constants are all mystic at least as
>much as Pi. So, the matter is to imagine
>a "mystic distribution"

There's nothing "mystic" about Benford's law.
...

Richard
______________________

You are right on all points, except that the Benford law does not make sense to me, yet. I didn't read to the point of understanding the demonstration. If based on statistics only, it can't be a fixed law. Constants in general are too dependent. What is base 64 doing in the decimal system? Just bin the entire collection in 10 bins, or 100, 1000 ... Base 64 is just out of the blue to complicate matters. The project should be binning the collection I passed. A good source for more of the constants is surely Simon Plouffe, also Neil Sloane. Their repertoire is over 100,000 (if my understanding and recollection are correct, back to 2001).

I will trust Simon and/or Neil better than Benford.

jmG

PhilipLeitch · ‎Jan 26, 2009

Just to interject here...

You both have valid points.

Benford's law DOES have assumptions. It is somewhat more of a general "observation" than a "fits all law".

So it is likely that a specific set of data that forms its own overall distribution will not necessarily fit Benford's law very well.

It is my understanding that constants, or more specifically the constants we are currently aware of, probably do have a skewed distribution, and by their nature are likely to be at the lower end of the spectrum.

So in some instances, Jean makes a valid point - a set of data points like this is not "growing" in a compound way and isn't generated "from" a uniform (or pseudo-uniform) distribution. Thus the first digit isn't uniform.

Even so, if the data were generated by a Poisson distribution, I still wouldn't expect the first digits to form a Poisson distribution.

Philip
___________________
Correct answers don't require correct spelling.

ptc-1368288 · ‎Jan 24, 2009

If you like those things, Philip

I have a cocktail of them ! Hard to beat that one.

jmG

LouP · ‎Jan 26, 2009

After working on the "Impossible problem?" in puzzles and games, I got the book it came from at my local library (good lib!). The book, "Impossible?" by Julian Havil, Princeton University Press, 2008, has, among other interesting things, a chapter on Benford's law and an informal derivation of it.

Lou

PhilipLeitch · ‎Jan 26, 2009

Thanks for the tip.

I'll note that book down.

Philip
___________________
Correct answers don't require correct spelling.

ptc-1368288 · ‎Jan 26, 2009

On 1/26/2009 10:42:23 AM, lpoulo wrote:
>After working on the
>"Impossible problem?" in
>puzzles and games, I got the
>book from which it came from
>my local library (good lib!).
>The book, "Impossible?" by
>Julian Havil, Princeton
>University Press, 2008, has,
>among other interesting
>things, a chapter on Benford's
>law and an informal derivation
>of it.
>
>Lou
____________________________

Then, collabs can expect a work sheet sometimes ?

jmG

LouP · ‎Jan 27, 2009

No, but the book is worth reading if you can find it and enjoy math results that are not intuitive and likely new to you.

Lou

ptc-1368288 · ‎Jan 27, 2009

The best way to appreciate the Benford law is to collect all Canadian bank accounts and sort them. Oh ! you can sort others than Canadians. Mathematical constants are invalid because there are too few "independent" ones. I maintain my opinion that life in general is Poisson. True enough, my log booklet is all brown in the first pages, but that proves only the law of using logs. It proves nothing about the numbers that govern our lives, because there are too many "lives" and numbers attached to them.

jmG

RichardJ · ‎Jan 27, 2009

On 1/27/2009 2:47:38 PM, jmG wrote:
>The best way to appreciate
>the Benford law is to collect
>all Canadian bank accounts and
>sort them.

I would rather collect all Canadian bank accounts and sum them into my own. No need to sort then, because summation is commutative 🙂

Richard

PhilipLeitch · ‎Jan 27, 2009

I think the whole "Poisson" distribution discussion is interesting, but almost needs to be pushed off into another thread to do it justice.

This from the university of Wikipedia:
Poisson is the "probability of a number of events occurring in a fixed period of time if these events occur with a known average rate and independently of the time since the last event."

Yep - I agree with Jean. That describes most events that you would find in nature.

Philip
___________________
Correct answers don't require correct spelling.