
probability or statistics - Fitting data to an empirical distribution, finding best fit



I have data from a non-standard distribution (not normal, not exponential, etc.), so I built a distribution empirically:


skdData=SmoothKernelDistribution[MyData,Automatic,{"Bounded",{20,104},"Gaussian"}];


Here's a plot of that PDF over the data: [figure: PDF of skdData plotted over MyData]


Now I want to find out how well another set of data points fits the distribution of the first. It does not match exactly, so I want a number that measures how well it fits. However, I can't even get data drawn from the PDF to fit the PDF itself...


So consider this: I create the PDF from MyData, generate 1,000,000 random points from this PDF, and then compare those points to the PDF itself:


test1 = RandomVariate[skdData, 1000000];
KolmogorovSmirnovTest[test1, skdData]
DistributionFitTest[test1, skdData]

Repeated runs return P-values anywhere from 0.1 to 0.9, which blows my mind, especially for the second result. So obviously (to me) I'm doing something wrong, because this is clearly a nearly perfect fit:



Show[Histogram[test1,Automatic,"PDF"],Plot[PDF[skdData, x],{x,20,110},PlotStyle->Thick]]

Which shows: [figure: histogram of test1 with the PDF of skdData overlaid]


Can anyone help me get the correct information from these fits? Chi-squared or P-values are what I'm really after here, or whatever you think is a better measure of fit.


My guess is that these functions are looking for normal distributions and are failing to use the one I give them, OR I have no idea what I'm doing...



Answer



I like your last 7 words.


I think what you're looking for is the two-sample Kolmogorov-Smirnov test, but it appears that Mathematica currently only implements the one-sample Kolmogorov-Smirnov test. (But I could be very wrong about that.) The two-sample test works for any continuous distributions and tests the null hypothesis that both sets of samples come from the same distribution. The downside is that it is a "pure significance test" and doesn't help much in characterizing what the differences might be. (And as with any such test, with large enough sample sizes you're fairly certain to reject the null hypothesis of equality whenever the two distributions differ at all, even if the differences are very small in a practical sense.)
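As background for reading the P-value of any such significance test: when the null hypothesis is true and the reference distribution is continuous and fully specified, the P-value is itself uniformly distributed between 0 and 1. Repeated runs that scatter from 0.1 to 0.9, like those in the question, are therefore what one would expect from a correct fit. Below is a minimal sketch of that behaviour. The standard normal distribution, the sample size of 1000 and the 500 repetitions are assumptions chosen only for illustration; they stand in for the fitted skdData.

```mathematica
(* Assumption: NormalDistribution[0, 1] stands in for the fitted distribution *)
SeedRandom[2024];
dist = NormalDistribution[0, 1];

(* Repeat the one-sample test on fresh samples drawn from dist itself; *)
(* KolmogorovSmirnovTest[data, dist] returns the P-value by default *)
pValues = Table[
   KolmogorovSmirnovTest[RandomVariate[dist, 1000], dist],
   {500}];

(* Under the null hypothesis this histogram should look roughly flat *)
Histogram[pValues, {0, 1, 0.1}]

(* Fraction of runs rejected at the 5% level; expected to be near 0.05 *)
N[Count[pValues, p_ /; p < 0.05]/Length[pValues]]
```

A single P-value of, say, 0.3 is then no evidence of a poor fit; only consistently small values would be.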


Here's how to perform the two-sample Kolmogorov-Smirnov test with an approximate P-value:


(* Generate some data *)
SeedRandom[12345];

(* Sample sizes *)
n = 50;
m = 75;
data1 = RandomVariate[NormalDistribution[0, 1], n];
data2 = RandomVariate[NormalDistribution[0, 1], m];
all = Join[data1, data2];

(* Create empirical distributions *)
ed1 = EmpiricalDistribution[data1];
ed2 = EmpiricalDistribution[data2];

(* Plot both empirical CDFs over the range of the pooled data *)
Plot[{CDF[ed1, x], CDF[ed2, x]}, {x, Min[all], Max[all]},
 PlotLegends -> {"Sample 1", "Sample 2"}]

[figure: cumulative distribution functions of Sample 1 and Sample 2]


(* Observed test statistic *)
d = Max[Abs[CDF[ed1, all] - CDF[ed2, all]]]
(* 0.19333333333333247` *)

(* Critical value at the 5% level of significance; *)
(* observed test statistics larger than this indicate statistical significance *)
ks = Sqrt[((n + m)/(n m)) (-Log[0.05/2]/2)]
(* 0.24795427851769825` *)

(* Approximate P-value *)
pValue = E^((-2 d^2 m n + m Log[2] + n Log[2])/(m + n))
(* 0.21234998643426609` *)
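The critical value and the approximate P-value above are two views of the same large-sample approximation, which is easier to see once the exponent in the pValue expression is simplified. In the sketch below, the symbols D (the observed statistic d), F_{1,n} and F_{2,m} (the two empirical CDFs) and alpha (the significance level, 0.05 above) are notation introduced here rather than taken from the code:

```latex
\begin{align*}
D &= \max_x \left| F_{1,n}(x) - F_{2,m}(x) \right| \\[4pt]
p &\approx \exp\!\left( \frac{-2 D^2 m n + (m + n)\ln 2}{m + n} \right)
   = 2 \exp\!\left( -\frac{2 D^2 m n}{m + n} \right) \\[4pt]
p = \alpha \;&\Longrightarrow\;
D_\alpha = \sqrt{\frac{n + m}{n m}} \, \sqrt{-\tfrac{1}{2} \ln\frac{\alpha}{2}}
\end{align*}
```

With n = 50 and m = 75, mn/(m + n) = 30, so the P-value is 2 exp(-60 d^2), which for d = 0.1933 gives about 0.2123, in agreement with the value computed above. Because this is only the leading term of the asymptotic series, the expression can exceed 1 when d is small; capping it at 1 is a reasonable guard in that case.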
