
Fitting Statistical Distributions


Since the subject of fitting a statistical distribution to a data sample has come up a couple of times recently, I thought I'd post a worksheet that does such fits.

It currently has four distributions available (normal, log normal, gamma, and Weibull), but additional distributions can easily be added by defining them. It has three different fitting strategies -- maximum likelihood, fitting the PDF to a histogram, and fitting the CDF.
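As an illustration of the maximum likelihood strategy (this is not taken from the worksheet), here is a minimal Python sketch for the two-parameter Weibull distribution. The function name `fit_weibull_ml`, the bisection bounds, and the synthetic sample are all assumptions made for the example. It solves the standard profile-likelihood equation for the shape by bisection and then computes the scale in closed form.

```python
import math
import random


def fit_weibull_ml(data, lo=0.01, hi=50.0, iterations=200):
    """Maximum-likelihood fit of a two-parameter Weibull (shape k, scale lam).

    The shape k is the root of
        sum(x^k * ln x) / sum(x^k) - 1/k - mean(ln x) = 0,
    which is increasing in k, so plain bisection is enough.
    All data values must be strictly positive.
    """
    logs = [math.log(x) for x in data]
    mean_log = sum(logs) / len(data)

    def score(k):
        powers = [x ** k for x in data]
        weighted = sum(p * lx for p, lx in zip(powers, logs))
        return weighted / sum(powers) - 1.0 / k - mean_log

    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            hi = mid  # root lies below mid
        else:
            lo = mid  # root lies at or above mid
    k = 0.5 * (lo + hi)
    # Given k, the ML scale is the k-th root of the mean of x^k.
    lam = (sum(x ** k for x in data) / len(data)) ** (1.0 / k)
    return k, lam


# Synthetic check: weibullvariate(alpha, beta) takes scale first, then shape.
random.seed(0)
sample = [random.weibullvariate(2.0, 1.5) for _ in range(5000)]
shape, scale = fit_weibull_ml(sample)
print(shape, scale)
```

The fixed bracket of 0.01 to 50 for the shape is a convenience for this sketch; data with a shape outside that range, or with zero values, would need different handling.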

Tom Gutman

Fitting Statistical Distributions

For the normal and log-normal distributions the unbiased estimates of the mean and variance parameters, mu and sigma squared, are the sample mean and variance with a prior log transformation for the latter distribution.

For the maximum likelihood estimates, replace 'n-1' in the variance calculation with 'n', where 'n' is the sample size. There. No muss. No fuss.
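The difference between the two estimates can be sketched in a few lines of Python. The helper names and the sample values below are made up for illustration; `statistics.variance` divides by n-1 and `statistics.pvariance` divides by n.

```python
import math
import statistics


def normal_estimates(data):
    """Return (mean, unbiased variance, maximum-likelihood variance)."""
    mean = statistics.fmean(data)
    unbiased_var = statistics.variance(data)  # divides by n - 1
    ml_var = statistics.pvariance(data)       # divides by n
    return mean, unbiased_var, ml_var


def lognormal_estimates(data):
    """Same estimates for a log-normal: log-transform first."""
    return normal_estimates([math.log(x) for x in data])


# Hypothetical positive-valued sample, for illustration only.
sample = [1.2, 0.8, 2.5, 1.9, 3.1, 0.6]
n = len(sample)
mu, s2_unbiased, s2_ml = lognormal_estimates(sample)
print(mu, s2_unbiased, s2_ml)
```

The two variance estimates differ only by the factor (n-1)/n, so the distinction matters mainly for small samples.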

Fitting Statistical Distributions

Hi,

how about this approach.

Luc

Fitting Statistical Distributions

Your approach checks not only how well the distribution fits but also (and perhaps more so) how good your parameter estimates are; those estimates appear to be based on matching one or two of the descriptive statistics.

Also, in some cases you use the Mathcad-provided one-parameter distribution rather than the more usual (and appropriate) two-parameter distribution. In several cases Mathcad omits the scale parameter (compare the descriptions of the gamma and Weibull distributions in Mathcad and on Mathworld), presumably on the basis that it's easy enough to add it back in externally. Not the choice I would have made, but of little practical import.

Your criterion of the correlation between a linear function and the observed CDF is supportable. There is a minor problem that the effective dynamic range is limited: almost any distribution results in a value near one, so one has to make distinctions on small differences. It is also unclear how this criterion relates to the three criteria I have used, or to other commonly used criteria.

Looking at the Weibull distribution, we can see some of the effects. Using the parameters that best fit the CDF (as approximated by a histogram) I get a correlation of .9994, vs. your value of .981. If I tweak the parameters a bit more, maximizing this correlation, I get a correlation of .9996. The numbers are not exactly comparable, as you have kept the zero values, which are impossible for several of the distributions.
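One possible reading of this correlation criterion, sketched in Python: correlate the empirical CDF of the sorted sample with the fitted CDF evaluated at the same points. This is an assumption about the method, not a reproduction of either worksheet; the function names, the plotting positions (i + 0.5)/n, and the synthetic data are all hypothetical.

```python
import math
import random


def weibull_cdf(x, shape, scale):
    """CDF of the two-parameter Weibull distribution."""
    return 1.0 - math.exp(-((x / scale) ** shape))


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    mu_u = sum(u) / len(u)
    mu_v = sum(v) / len(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def cdf_correlation(data, cdf):
    """Correlate the empirical CDF with a candidate model CDF."""
    xs = sorted(data)
    n = len(xs)
    empirical = [(i + 0.5) / n for i in range(n)]  # assumed plotting positions
    model = [cdf(x) for x in xs]
    return pearson(empirical, model)


random.seed(1)
sample = [random.weibullvariate(2.0, 1.5) for _ in range(500)]
good = cdf_correlation(sample, lambda x: weibull_cdf(x, 1.5, 2.0))
poor = cdf_correlation(sample, lambda x: weibull_cdf(x, 0.5, 2.0))
print(good, poor)
```

Because both CDFs increase monotonically from 0 to 1, even a badly chosen shape parameter still gives a fairly high correlation, which is the limited dynamic range mentioned above.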

Also, see Paul about your parameter estimates for the log-normal distribution.

Tom Gutman

Fitting Statistical Distributions

Hi Tom --

I just saw your excellent worksheet for fitting probability distributions. I am able to open it in v2001 (on my desktop) but couldn't get it to work. So I tried v12 (on someone else's desktop) and got the error shown in the attached GIF. It's not an error I've ever seen before.

I know you're working with v11, which I don't have access to. Any ideas about the error?

Thanks.

Matt
Graduate Student in Civil Engineering
Georgia Institute of Technology

Fitting Statistical Distributions

Unfortunately, you've hit one of the restrictions imposed by M12 static type checking.

M12 ensures that all uses of user-defined functions have a consistent number of parameters. If they don't, it flags it as an error.

For example, if you define f(a,b):= ... , then M12 will not allow you to use f(a) or f(a,b,c). All this is pretty standard pre-M12 stuff. Where M12 shoots the user in the foot is that it also notes occurrences of calls to indirect functions, as in the example you gave.

So whereas M11, etc, would allow

f(a) if cond1
f(a,b) otherwise

M12 notes that f is either a one parameter function or a two parameter function - it can't be both, so you've introduced an error .... 😕

Hopefully, this "Feature" will be fixed.

Stuart

Fitting Statistical Distributions

MC12 is severely brain damaged. Besides being very restrictive (by design) the error messages are extremely poor and often buggy.

The real problem here is that this function is illegal under MC12 (did I mention that I don't use MC12 much as hardly anything works?), as the parameter f is a function that might need to take 2, 3, 4, or 5 parameters. That is invalid for MC12, which insists that a function argument must have a fixed and known (to Mathcad's analyzer) number of arguments. How it gets from that issue to the particular error message it produces completely escapes me.

Looking over the sheet I don't see anything that I know to be inherently incompatible with MC2001. Certainly the NaN's in some of the utility functions need to be replaced. Other things might need tweaking; I don't really know what the limitations of 2001 are. But there are several things that I know are inherently incompatible with MC12. The use of a function argument that may take functions with different numbers of parameters is one. The construction of the distribution specification vectors with both numeric and function components is another. So I think there is a much better chance of getting this sheet to run in MC2001 than in MC12.

Tom Gutman

Fitting Statistical Distributions

Thanks, fellas. I'm going to play around with this a bit. I don't know if anyone besides me is still stuck in the Stone Ages, but if I get a working sheet in 2001, I'll post it (with credit to Tom, of course).

A side note -- I had finally decided to upgrade my Mathcad version and was going to buy v11. Lo and behold, all I could get (at the student price) was v12. Given that it will render all of my extension packs obsolete and that not many folks have nice things to say about it, I decided to hold off. I think I could live with it if I could make it live happily with v2001 on the same box, but I understand that is only possible for v11 and v12.

Any word on v13?

Matt
Graduate Student in Civil Engineering
Georgia Institute of Technology

Fitting Statistical Distributions

OK, so I am a bit dense....another question:

Tom -- it looks like the function LLikelihood(p,D,X) is providing a measure of the goodness-of-fit for each distribution and fitting method. Is this correct? Why do you take the natural log of each data point as you generate the sum?

Thanks.

Matt
Graduate Student in Civil Engineering
Georgia Institute of Technology

Fitting Statistical Distributions

LLikelihood is the log of the likelihood. Maximizing the likelihood (practically, due to range considerations, the log of the likelihood) is one form of parameter estimation.

The likelihood of an event is simply the probability of that event. The likelihood of a compound event made up of independent component events is just the product of the probabilities of those component events. For continuous distributions, the probability density function is used instead of an actual probability (which is zero for any point value).

But actual likelihoods tend to be extremely small numbers. Hence one usually uses the log of the likelihood. Since the log function is monotonic, the location of the maximum is the same. So instead of the product of all the probabilities (or probability densities) one takes the sum of the logs of the probabilities (or probability densities).
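A short Python sketch of the range problem, using a normal density and synthetic data chosen only for illustration (the function names below are hypothetical and are not the worksheet's LLikelihood): the direct product of 2000 densities underflows to zero in double precision, while the sum of logs stays finite and can still be compared across parameter choices.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Probability density of the normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def likelihood(pdf, params, data):
    """Product of the densities: the likelihood itself."""
    result = 1.0
    for x in data:
        result *= pdf(x, *params)
    return result


def log_likelihood(pdf, params, data):
    """Sum of the log densities: the log of the likelihood."""
    return sum(math.log(pdf(x, *params)) for x in data)


random.seed(2)
sample = [random.gauss(10.0, 2.0) for _ in range(2000)]

direct = likelihood(normal_pdf, (10.0, 2.0), sample)       # underflows
logged = log_likelihood(normal_pdf, (10.0, 2.0), sample)   # stays finite
shifted = log_likelihood(normal_pdf, (12.0, 2.0), sample)  # worse mean
print(direct, logged, shifted)
```

Each density here is at most about 0.2, so the product of 2000 of them is far below the smallest representable double. The log-likelihood is still usable for ranking: parameters nearer the ones that generated the data give the larger (less negative) value.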

Tom Gutman