
performance tuning - Better way to get Fisher Exact?


I want to perform a Pearson's $\chi^2$ test to analyse contingency tables; but because my counts are small, it is recommended to instead perform what is called Fisher's Exact Test.


This requires generating all integer matrices with the same row and column totals as the one given, computing the probability of each under the corresponding hypergeometric distribution, and summing the probabilities of those that are no more probable than the observed table.


See Wikipedia and MathWorld for relevant context.
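
For reference, the probability of a single table $a=(a_{ij})$ with row sums $R_i$, column sums $C_j$ and grand total $N$ is the hypergeometric expression given in those references,

$$ P(a) = \frac{\left(\prod_i R_i!\right)\left(\prod_j C_j!\right)}{N!\,\prod_{i,j} a_{ij}!} $$

and the exact p-value is the sum of $P(a)$ over all tables with these margins that are no more probable than the observed one.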


Apparently R offers this, but I couldn't find it in Mathematica, and after extensive searching I couldn't find an existing implementation, so I wrote my own.


The examples in the links use 2x2 matrices, but I wrote an n x m implementation and, at least for the MathWorld example, the numbers match.


I have one question: the code I wrote uses Reduce, although generating all the matrices seemed to me more of a combinatorial problem. I considered using FrobeniusSolve, but it still seemed far from what's needed. Am I missing something, or is Reduce the way to go?



The essential part of the code, which I made available on GitHub here, is that for a matrix like


$$ \left( \begin{array}{ccc} 1 & 0 & 2 \\ 0 & 1 & 2 \\ \end{array} \right)$$


with row sums 3, 3 and column sums 1, 1, 4, it creates a system of linear equations like:


$$ \begin{array}{c} x_{1,1}+x_{1,2}+x_{1,3}=3 \\ x_{2,1}+x_{2,2}+x_{2,3}=3 \\ \end{array} $$ $$ \begin{array}{c} x_{1,1}+x_{2,1}=1 \\ x_{1,2}+x_{2,2}=1 \\ x_{1,3}+x_{2,3}=4 \\ \end{array} $$


subject to the constraints $ x_{1,1}\geq 0$, $x_{1,2}\geq 0$, $x_{1,3}\geq 0$, $x_{2,1}\geq 0$, $x_{2,2}\geq 0$, $ x_{2,3}\geq 0 $, and feeds this system into Reduce to solve it over the Integers. Reduce returns all the solutions, which is what we need to compute Fisher's exact p-value.
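
To make this concrete, here is a minimal sketch of that construction (not the exact code from the repository; the helper name marginTables is made up for illustration):


(* sketch: enumerate all non-negative integer tables with the margins of mat *)
marginTables[mat_] :=
  Module[{r, c, vars, eqs, red},
   {r, c} = Dimensions[mat];
   vars = Array[x, {r, c}];                        (* one variable per cell *)
   eqs = Join[
     Thread[Total[vars, {2}] == Total[mat, {2}]],  (* row-sum equations *)
     Thread[Total[vars, {1}] == Total[mat, {1}]],  (* column-sum equations *)
     Thread[Flatten[vars] >= 0]];                  (* non-negativity constraints *)
   red = Reduce[eqs, Flatten[vars], Integers];     (* all integer solutions *)
   vars /. {ToRules[red]}]                         (* list of candidate tables *)

(* all 2x3 tables with row sums {3, 3} and column sums {1, 1, 4} *)
marginTables[{{1, 0, 2}, {0, 1, 2}}]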


Note: I just found this advice on how to make better use of GitHub for Mathematica projects. For the time being, I leave the repository as-is; I hope it is easy to use and test.


You can test the above-mentioned code like this:


FisherExact[{{1, 0, 2}, {0, 0, 2}, {2, 1, 0}, {0, 2, 1}}]

It has some debugging output via Print which shows all the generated matrices and their p-values. The last part (the use of Select to process all the found matrices) didn't seem very Mathematica-like to me, but it was late and I was tired; feedback is welcome.
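
For what it's worth, the last step could also be written along these lines, reusing marginTables from the sketch above and a placeholder tableProb for the hypergeometric probability of a single table (neither name is from my repository):


(* probability of one table under the hypergeometric model *)
tableProb[t_] :=
  (Times @@ (Total[t, {2}]!))*(Times @@ (Total[t, {1}]!))/
   (Total[t, 2]!*(Times @@ (Flatten[t]!)))

(* sum the probabilities of all tables no more probable than the observed one *)
fisherP[obs_] :=
  With[{pObs = tableProb[obs]},
   Total[Select[tableProb /@ marginTables[obs], # <= pObs &]]]

fisherP[{{1, 0, 2}, {0, 1, 2}}]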



I'll give my tick to the answer with the most votes after a couple of days, if anyone bothers to write me a couple of lines :)


Thanks in advance!



Answer



Maybe you are willing to consider a Bayesian approach to this perennial problem. Beware though: Bayesians have no random variables, no p-values, no null hypotheses, etc. They have probabilities, or ratios thereof.


The (out of print) book "Rational Descriptions, Decisions and Designs" by Miron Tribus (1969!) has an excellent chapter on contingency tables. I have copied the solutions below from this book. His solutions are exact and work for small counts as well as for non-square tables. He considers two mutually exclusive hypotheses, "the rows and columns are independent" vs. "the rows and columns are dependent", under a variety of different types of knowledge.


Here I give only two cases:


-- Knowledge type 1A, with no specific prior knowledge on the (in-)dependence and no controls,


-- Knowledge type 1B, also with no specific prior knowledge but with a control on the counts of the rows (see examples below).


Tribus computes the "evidence in favor of the hypothesis of independence of rows and columns" for these types. (The references in the code are to chapters and pages in his book.)
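
As I read Tribus, the "evidence" is simply the natural-log odds in favor of independence, so evidence and probability are related by (his Eq. III-13, which is also used in the code below):

$$ \mathrm{ev} = \ln\frac{P(\text{independent}\mid D)}{P(\text{dependent}\mid D)}, \qquad P(\text{independent}\mid D) = \frac{1}{1+e^{-\mathrm{ev}}} $$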


The evidence for type 1A is:



(* Evidence rows-cols independent: 1A, Eq. VI-38, p. 194 *)
evidence1A[table_] :=
  Module[{r, s, nidot, ndotj, ntot, ev, prob},

   (* table dimensions: r = nr of rows, s = nr of cols *)
   {r, s} = Dimensions[table];

   (* margin and total counts *)
   nidot = Total[table, {2}]; (* sum in r-direction: row totals *)
   ndotj = Total[table, {1}]; (* sum in s-direction: column totals *)
   ntot = Total[table, 2];    (* overall total *)

   (* evidence, Eq. VI-38, p. 194 *)
   ev = Log[((ntot + r*s - 1)!*ntot!)/((ntot + r - 1)!*(ntot + s - 1)!)] -
        Log[(r*s - 1)!/((r - 1)!*(s - 1)!)] +
        (Total[Log[nidot!]] - Log[ntot!]) +
        (Total[Log[ndotj!]] - Log[ntot!]) -
        (Total[Log[table!], 2] - Log[ntot!]);

   (* probability from evidence: Eq. III-13, p. 84 *)
   prob = (1 + Exp[-ev])^-1;

   {ev // N, prob // N} (* output *)
   ] (* End of Module *)

Tribus tests this using an interesting example of eye-color vs hair-color correlation of soldiers in conscription military service (!). Note that this is a 3x4 table.


(* Soldier table VI-1, p. 183: eye color vs. hair color *)
soldier = {
  (*            blonde, brown, black, red *)
  (* blue  *)   {1768,   807,   189,  47},
  (* green *)   { 946,  1387,   786,  53},
  (* brown *)   { 115,   438,   288,  16}};

(* Tribus p. 197 gives 560 Napiers *)
(* prob that the table is row-col independent *)
evidence1A[soldier]
(* output: {-560.661, 3.22157*10^-244} *)


The probability of independence of rows and columns is 3.22*10^-244, and thus virtually zero. As expected.




Case 1B applies to tests with a pre-set count for the columns. In Tribus' tobacco-flavor test example: 250 packages with mixed cigarette + pipe tobacco vs. 150 packages with cigarettes only.


(* Tobacco problem, p. 198: solution is independent of s *)
tobacco = {
  (* cigarette + pipe tobacco: mixed, not mixed *)
  (* no change    *) {72, 119},
  (* change aroma *) {178, 31}
  (* fixed column counts: {250, 150} *)
  };

The evidence for this problem is:


(* Evidence rows-cols independent: 1B, Eq. VI-54, p. 200 *)
(* solution is independent of s *)
evidence1B[table_] :=
  Module[{r, s, nidot, ndotj, ntot, ev, prob},

   (* table dimensions: r = nr of rows, s = nr of cols *)
   {r, s} = Dimensions[table];

   (* margin and total counts *)
   nidot = Total[table, {2}]; (* sum in r-direction: row totals *)
   ndotj = Total[table, {1}]; (* sum in s-direction: column totals *)
   ntot = Total[table, 2];    (* overall total *)

   (* evidence, Eq. VI-54, p. 200 *)
   ev = Log[(r - 1)!/(ntot + r - 1)!] +
        Total[Log[(ndotj + r - 1)!/(r - 1)!]] +
        (Total[Log[nidot!]] - Log[ntot!]) -
        (Total[Log[table!], 2] - Log[ntot!]);

   (* probability from evidence: Eq. III-13, p. 84 *)
   prob = (1 + Exp[-ev])^-1;

   {ev // N, prob // N} (* output *)
   ] (* End of Module *)


Tribus' solution:


(* Tribus p.200 : 1.45 10^-21 *)
evidence1B[tobacco]

(* output: {-47.9818, 1.45138*10^-21} *)

Here too, the probability that rows and columns are independent is pretty small: 1.45*10^-21.




Your example, a 4x3 table:


caya = {{1, 0, 2}, {0, 0, 2}, {2, 1, 0}, {0, 2, 1}};


evidence1A[caya]
(* output: {-2.62276, 0.0676881} *)

evidence1B[caya]
(* output: {-1.7158, 0.152413} *)

The probabilities for independence of rows and columns are smallish, but they are not very small. Depending on the details of your problem, such probability values may simply signal that the result is inconclusive.


Suppose we have the following simple RegionPlot: f[x_] := 1 - x^2 g[x_] := 1 - 0.5 x^2 RegionPlot[{y < f[x], f[x] < y < g[x], y > g[x]}, {x, 0, 2}, {y, 0, 2}] Now I'm trying to change the curve defined by $y=g[x]$ into a thick black curve, while leaving all other boundaries in the plot unchanged. I've tried adding the region $y=g[x]$ and playing with the plotstyle, which didn't work, and I've tried BoundaryStyle, which changed all the boundaries in the plot. Now I'm kinda out of ideas... Any help would be appreciated! Answer With f[x_] := 1 - x^2 g[x_] := 1 - 0.5 x^2 You can use Epilog to add the thick line: RegionPlot[{y < f[x], f[x] < y < g[x], y > g[x]}, {x, 0, 2}, {y, 0, 2}, PlotPoints -> 50, Epilog -> (Plot[g[x], {x, 0, 2}, PlotStyle -> {Black, Thick}][[1]]), PlotStyle -> {Directive[Yellow, Opacity[0.4]], Directive[Pink, Opacity[0.4]],