probability or statistics - Fitting data to an empirical distribution, finding best fit

So I have a non-standard (non-normal, non-exponential, etc.) distribution, so I created a distribution empirically:

skdData=SmoothKernelDistribution[MyData,Automatic,{"Bounded",{20,104},"Gaussian"}];

Here's a plot of that PDF over the data:

Now I want to find out how well another set of data points fits the distribution of the first. It does not match exactly, so I want to get a value for how well it fits or not. However, I can't even get data from the PDF to fit itself...

So consider this: I create the PDF from MyData, and then generate 1,000,000 points randomly from this PDF, and compare it to the PDF itself:

test1 = RandomVariate[skdData, 1000000];
KolmogorovSmirnovTest[test1, skdData]
DistributionFitTest[test1, skdData]

These return P-values anywhere from 0.1 to 0.9, which blows my mind, especially for the second result. So... Obviously (to me) I'm doing something wrong, because this is clearly a nearly perfect fit:

Show[Histogram[test1,Automatic,"PDF"],Plot[PDF[skdData, x],{x,20,110},PlotStyle->Thick]]

Which shows:

Can anyone help me get the correct information from these fits? Chi-squared or P-Values are what I'm really after here... or whatever you think is a better estimator of fit.

My guess is that these functions are looking for normal distributions and are failing to use the one I give them, OR I have no idea what I'm doing...

Answer

I like your last 7 words.

I think what you're looking for is the two-sample Kolmogorov-Smirnov test, but it appears that Mathematica currently only implements the one-sample version. (But I could be very wrong about that.) The two-sample test works for any continuous distributions and tests the null hypothesis that the two sets of samples come from the same distribution. The downside is that it is a "pure significance test" and doesn't help much in characterizing what the differences might be. (And as with any such test, with large enough sample sizes you're fairly certain to reject the null hypothesis of equality even if the differences are very small in a practical sense.)
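For reference, the quantity the two-sample test is built on can be written out as follows. The notation is introduced here for illustration and is not from the original post: F_{1,n} and F_{2,m} stand for the empirical CDFs of the two samples, of sizes n and m, and α is the significance level. The rejection rule shown is the usual large-sample approximation, so it should be read as a sketch and not as an exact small-sample result.

```latex
D_{n,m} = \sup_x \left| F_{1,n}(x) - F_{2,m}(x) \right|,
\qquad
\text{reject } H_0 \text{ at level } \alpha \text{ when }
D_{n,m} > \sqrt{-\tfrac{1}{2}\ln\!\left(\tfrac{\alpha}{2}\right)}\,\sqrt{\frac{n+m}{n\,m}}
```

Here H_0 is the hypothesis that both samples are drawn from the same continuous distribution. Because the threshold shrinks like 1/sqrt(n) as the samples grow, even a tiny true difference between the distributions eventually exceeds it.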

Here's how to perform the two-sample Kolmogorov-Smirnov test with an approximate P-value:

(* Generate some data *)
SeedRandom[12345];

(* Sample sizes *)
n = 50;
m = 75;
data1 = RandomVariate[NormalDistribution[0, 1], n];
data2 = RandomVariate[NormalDistribution[0, 1], m];
all = Join[data1, data2];

(* Create empirical distributions *)
ed1 = EmpiricalDistribution[data1];
ed2 = EmpiricalDistribution[data2];

(* Plot the two empirical CDFs over the range of the pooled data *)
Plot[{CDF[ed1, x], CDF[ed2, x]}, {x, Min[all], Max[all]},
 PlotLegends -> {"Sample 1", "Sample 2"}]

(* Observed test statistic *)
d = Max[Abs[CDF[ed1, all] - CDF[ed2, all]]]
(* 0.19333333333333247` *)

(* Critical value at 5% level of significance *)
(* An observed test statistic larger than this value is statistically significant *)
ks = Sqrt[((n + m)/(n m)) (-Log[0.05/2]/2)]
(* 0.24795427851769825` *)

(* Approximate P-value, equal to 2 Exp[-2 d^2 m n/(m + n)] *)
(* The approximation can exceed 1 when d is small, so it is capped at 1 *)
pValue = Min[1, E^((-2 d^2 m n + m Log[2] + n Log[2])/(m + n))]
(* 0.21234998643426609` *)
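The steps above can be collected into one reusable function. This is only a minimal sketch: `twoSampleKS` is a hypothetical name, not a built-in, and the association keys and the default level of 0.05 are choices made here for illustration. It uses the same large-sample approximations as the code above, so the P-value is approximate.

```mathematica
(* Hypothetical helper, not a built-in: two-sample Kolmogorov-Smirnov test *)
(* using the same large-sample approximations as the step-by-step code *)
twoSampleKS[data1_?VectorQ, data2_?VectorQ, alpha_ : 0.05] :=
 Module[{n = Length[data1], m = Length[data2], all, d, crit, p},
  all = Join[data1, data2];
  (* Largest gap between the two empirical CDFs, checked at every observed point *)
  d = Max[Abs[CDF[EmpiricalDistribution[data1], all] -
      CDF[EmpiricalDistribution[data2], all]]];
  (* Approximate critical value at significance level alpha *)
  crit = Sqrt[((n + m)/(n m)) (-Log[alpha/2]/2)];
  (* Approximate P-value, capped at 1 *)
  p = Min[1, 2 Exp[-2 d^2 n m/(n + m)]];
  <|"Statistic" -> d, "CriticalValue" -> crit, "PValue" -> p,
   "Reject" -> (d > crit)|>]

(* Usage: a small shift in the mean, tested with large samples *)
SeedRandom[1];
a = RandomVariate[NormalDistribution[0, 1], 200000];
b = RandomVariate[NormalDistribution[0.05, 1], 200000];
twoSampleKS[a, b]
```

Calling `twoSampleKS[data1, data2]` on the samples generated earlier should give the same statistic and critical value as the step-by-step code. The usage lines illustrate the caveat about large samples: a shift of 0.05 standard deviations is small in practical terms, yet with 200,000 points per sample the critical value is small enough that such a shift would be expected to be flagged as significant.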
