
probability or statistics - Fitting data to an empirical distribution, finding best fit



So I have a non-standard (non-normal, non-exponential, etc.) distribution, so I created a distribution empirically:


skdData=SmoothKernelDistribution[MyData,Automatic,{"Bounded",{20,104},"Gaussian"}];


Here's a plot of that PDF over the data: (plot: smooth kernel PDF overlaid on a histogram of MyData)


Now I want to find out how well another set of data points fits the distribution of the first. It does not match exactly, so I want to get a value for how well it fits or not. However, I can't even get data from the PDF to fit itself...


So consider this: I create the PDF from MyData, and then generate 1,000,000 points randomly from this PDF, and compare it to the PDF itself:


test1 = RandomVariate[skdData, 1000000];
KolmogorovSmirnovTest[test1, skdData]
DistributionFitTest[test1, skdData]

These return p-values anywhere from 0.1 to 0.9 from run to run, which blows my mind, especially for the second result. So... obviously (to me) I'm doing something wrong, because this is clearly a nearly perfect fit:



Show[Histogram[test1, Automatic, "PDF"],
 Plot[PDF[skdData, x], {x, 20, 110}, PlotStyle -> Thick]]

Which shows: (plot: histogram of test1 with the PDF curve overlaid, a near-perfect match)
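As an aside, the spread of p-values is actually expected behavior: when a sample genuinely comes from the tested distribution, p-values are uniformly distributed on [0, 1], so getting anything from 0.1 to 0.9 is exactly what a good fit looks like. A quick sketch of this in Python/SciPy (a stand-in for the Mathematica setup; the normal distribution and sample sizes here are arbitrary choices, not the original data):

```python
import numpy as np
from scipy import stats

# Draw many samples from N(0,1) and KS-test each one against the *true*
# distribution it came from: the resulting p-values scatter over [0, 1].
rng = np.random.default_rng(0)
pvals = [stats.kstest(rng.normal(0, 1, 500), stats.norm.cdf).pvalue
         for _ in range(200)]

print(min(pvals), max(pvals))  # spans most of the unit interval
```

So a single p-value of, say, 0.4 is not evidence of a bad fit; only consistently small p-values indicate a mismatch.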


Can anyone help me get the correct information from these fits? Chi-squared or p-values are what I'm really after here... or whatever you think is a better estimator of fit.


My guess is that these functions are looking for normal distributions and are failing to use the one I give it, OR I have no idea what I'm doing...



Answer



I like your last 7 words.


I think what you're looking for is the two-sample Kolmogorov-Smirnov test, but it appears that Mathematica currently only implements the one-sample version. (I could be wrong about that.) The two-sample test works for any continuous distributions and tests whether the two sets of samples come from different distributions. The downside is that it is a "pure significance test" and doesn't help much in characterizing what the differences might be. (And with any such test, given large enough sample sizes you're fairly certain to reject the null hypothesis of equality even if the differences are very small in a practical sense.)


Here's how to perform the two-sample Kolmogorov-Smirnov test with an approximate P-value:


(* Generate some data *)
SeedRandom[12345];

(* Sample sizes *)
n = 50;
m = 75;
data1 = RandomVariate[NormalDistribution[0, 1], n];
data2 = RandomVariate[NormalDistribution[0, 1], m];
all = Join[data1, data2];

(* Create empirical distributions *)
ed1 = EmpiricalDistribution[data1];
ed2 = EmpiricalDistribution[data2];
Plot[{CDF[ed1, x], CDF[ed2, x]}, {x, Min[all], Max[all]},
 PlotLegends -> {"Sample 1", "Sample 2"}]

(plot: empirical cumulative distribution functions of the two samples)


(* Observed test statistic *)
d = Max[Abs[CDF[ed1, all] - CDF[ed2, all]]]
(* 0.19333333333333247` *)

(* Critical value at 5% level of significance *)

(* Larger values of the observed test statistic indicate statistical significance *)
ks = Sqrt[((n + m)/(n m)) (-Log[0.05/2]/2)]
(* 0.24795427851769825` *)

(* Approximate P-value: this simplifies to 2 Exp[-2 d^2 n m/(n + m)],
   the leading term of the asymptotic Kolmogorov distribution *)
pValue = E^((-2 d^2 m n + m Log[2] + n Log[2])/(m + n))
(* 0.21234998643426609` *)
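As a cross-check of the same computation (in Python/SciPy rather than Mathematica, and with freshly generated data, so the exact numbers differ from those above): scipy.stats.ks_2samp produces the identical statistic, and its asymptotic p-value is bounded above by the one-term approximation used for pValue, since the full Kolmogorov series is alternating.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(12345)
n, m = 50, 75
data1 = rng.normal(0, 1, n)
data2 = rng.normal(0, 1, m)
pts = np.concatenate([data1, data2])

# Empirical CDFs evaluated at every observed point (right-continuous)
cdf1 = np.searchsorted(np.sort(data1), pts, side="right") / n
cdf2 = np.searchsorted(np.sort(data2), pts, side="right") / m
d = np.max(np.abs(cdf1 - cdf2))  # observed two-sample KS statistic

# One-term asymptotic approximation, same formula as pValue above
p_approx = 2.0 * np.exp(-2.0 * d * d * n * m / (n + m))

res = ks_2samp(data1, data2, method="asymp")
print(d, res.statistic, p_approx, res.pvalue)
```

The hand-computed statistic matches res.statistic exactly, and res.pvalue sits just below p_approx (the truncation slightly overestimates the p-value).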
