
probability or statistics - Fitting data to an empirical distribution, finding best fit



I have data from a non-standard distribution (not normal, not exponential, etc.), so I built a distribution from it empirically:


skdData=SmoothKernelDistribution[MyData,Automatic,{"Bounded",{20,104},"Gaussian"}];


Here's a plot of that PDF over the data: [Figure: PDF of skdData plotted over the data]


Now I want to find out how well another set of data points fits the distribution of the first. It does not match exactly, so I want to get a value for how well it fits or not. However, I can't even get data from the PDF to fit itself...


So consider this: I create the PDF from MyData, and then generate 1,000,000 points randomly from this PDF, and compare it to the PDF itself:


test1 = RandomVariate[skdData, 1000000];
KolmogorovSmirnovTest[test1, skdData]
DistributionFitTest[test1, skdData]

This returns anything from 0.1 to 0.9, which blows my mind, especially for the second result. So... Obviously (to me) I'm doing something wrong, because this is clearly a nearly perfect fit:



Show[Histogram[test1,Automatic,"PDF"],Plot[PDF[skdData, x],{x,20,110},PlotStyle->Thick]]

Which shows: [Figure: histogram of test1 with the PDF of skdData overlaid]


Can anyone help me get the correct information from these fits? Chi-squared or P-Values are what I'm really after here... or whatever you think is a better estimator of fit.


My guess is that these functions are looking for normal distributions and are failing to use the one I give them, OR I have no idea what I'm doing...



Answer



I like your last 7 words.


I think what you're looking for is the two-sample Kolmogorov-Smirnov test, but it appears that Mathematica currently only implements the one-sample Kolmogorov-Smirnov test. (But I could be very wrong about that.) The two-sample test works for any continuous distributions and tests whether the two sets of samples come from different distributions. The downside is that it is a "pure significance test" and doesn't help much in characterizing what the differences might be. (And as with any such test, with large enough sample sizes you're fairly certain to reject the null hypothesis of equality even if the differences are very small in a practical sense.)
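A side note on the self-test in the question: when the sample really is drawn from the reference distribution, the null hypothesis is true, and for a continuous test like this the P-value is then approximately uniformly distributed on [0, 1]. So values anywhere from 0.1 to 0.9 are what one would expect and not a sign that the test is being misused. Here is a minimal sketch of that behaviour; it assumes NormalDistribution[0, 1] as a stand-in for skdData, and the names refDist and pValues are only illustrative:

```mathematica
(* Assumption: NormalDistribution[0, 1] stands in for skdData *)
SeedRandom[2024];
refDist = NormalDistribution[0, 1];

(* Repeat the self-test 200 times: draw a sample from refDist,
   then test that sample against refDist itself *)
pValues = Table[
   KolmogorovSmirnovTest[RandomVariate[refDist, 1000], refDist],
   {200}];

(* With a true null hypothesis the P-values should look roughly uniform on [0, 1] *)
Histogram[pValues, {0, 1, 0.1}]

(* Fraction of runs with a P-value below 0.05; expected to be near 0.05 *)
N[Count[pValues, p_ /; p < 0.05]/Length[pValues]]
```

A single P-value of, say, 0.3 therefore says only that the data are consistent with the reference distribution. It is not a score of how good the fit is.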


Here's how to perform the two-sample Kolmogorov-Smirnov test with an approximate P-value:


(* Generate some data *)
SeedRandom[12345];

(* Sample sizes *)
n = 50;
m = 75;
data1 = RandomVariate[NormalDistribution[0, 1], n];
data2 = RandomVariate[NormalDistribution[0, 1], m];
all = Join[data1, data2];

(* Create empirical distributions *)
ed1 = EmpiricalDistribution[data1];
ed2 = EmpiricalDistribution[data2];

(* Plot both empirical CDFs over the range of the pooled data *)
Plot[{CDF[ed1, x], CDF[ed2, x]}, {x, Min[all], Max[all]},
 PlotLegends -> {"Sample 1", "Sample 2"}]

[Figure: cumulative distribution functions of Sample 1 and Sample 2]


(* Observed test statistic *)
d = Max[Abs[CDF[ed1, all] - CDF[ed2, all]]]
(* 0.19333333333333247` *)

(* Critical value at 5% level of significance *)

(* Larger values of the observed test statistic indicate statistical significance *)
ks = Sqrt[((n + m)/(n m)) (-Log[0.05/2]/2)]
(* 0.24795427851769825` *)

(* Approximate P-value *)
pValue = E^((-2 d^2 m n + m Log[2] + n Log[2])/(m + n))
(* 0.21234998643426609` *)
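The exponent in pValue is an expanded form of the usual large-sample approximation, which keeps only the first term of the Kolmogorov distribution's series. Written out with d the observed statistic, n and m the sample sizes, and α the significance level (the symbols p and d_α are introduced here only for illustration), the two formulas above appear to be the same relation solved in opposite directions:

```latex
% Approximate P-value: the exponent (-2 d^2 m n + (m + n) \ln 2)/(m + n)
% equals \ln 2 - 2 d^2 \frac{nm}{n+m}, so
p \;\approx\; 2 \exp\!\left( -2\, d^{2}\, \frac{n m}{n + m} \right)

% Setting p = \alpha and solving for d gives the critical value (ks with \alpha = 0.05):
d_{\alpha} \;=\; \sqrt{ \frac{n + m}{n m} \cdot \frac{-\ln(\alpha / 2)}{2} }
```

With n = 50 and m = 75, nm/(n + m) = 30, so p ≈ 2 exp(-60 d^2); plugging in d ≈ 0.1933 gives about 0.212, in agreement with the output above. Because this is an asymptotic approximation, it can exceed 1 for very small d and is rough for small samples.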
