Skip to main content

fitting - Mathematica Implementations of the Random Forest algorithm


Is anyone aware of Mathematica use/implementation of Random Forest algorithm?



Answer



Here I will attempt to provide a basic implementation of the random forest algorithm for classification. This is by no means fast and doesn't scale very well but otherwise is a nice classifier. I recommend reading Breiman and Cutler's page for information about random forests.



The following are some helper functions that allow us to compute entropy and ultimately information gain easily.


condEnt = Statistics`Library`NConditionalEntropy;
ent = Statistics`Library`NEntropy;
maxI[v_] := Ordering[v, -1][[1]]

Now for the decision trees that will be used in the forest. This is the biggest bottleneck in the whole process and I would love to see a much faster implementation in Mathematica! The trees I'm working with are called CART trees.


The basic idea is to select at random m of the possible k variables to separate the response y into classes. Once the best variable is chosen the value of that variable which bests separates the classes is chosen to create a split in the tree. This process continues until all responses have been classified or we aren't able to split things based on the remaining data.


cart[x_, y_ /; ent[y] == 0, m_] := First[y]

cart[x_, y_, m_] :=

Block[{k = Length[x], h, drop, bestVar, ub, best, mask},
h = ent[y];
drop = RandomSample[Range[k], k - m];
bestVar =
maxI[Table[If[FreeQ[drop, i], h - condEnt[x[[i]], y], 0], {i, k}]];
ub = Union[x[[bestVar]]];
mask = UnitStep[x[[bestVar]] - #] & /@ ub;
best = maxI[(h - condEnt[#, y] & /@ mask)];
If[Min[#] == Max[#],
RandomChoice[y],

{bestVar, ub[[best]],
cart[Pick[x\[Transpose], #, 1]\[Transpose], Pick[y, #, 1], m],
cart[Pick[x\[Transpose], #, 0]\[Transpose], Pick[y, #, 0], m]
}
] &[mask[[best]]]
]

To demo these things as I go lets use the famous iris data built in to Mathematica selecting about 80% for training and 20% for testing.


data = ExampleData[{"Statistics", "FisherIris"}];


rs = RandomSample[Range[Length[data]]];
train = rs[[1 ;; 120]];
test = rs[[121 ;;]];

Now lets create a CART tree from this data letting m be 3. We can read the result easily. The first element is the variable that bests splits the data, in this case variable 3. The next value 3.3 is the critical value of variable 3 to use. If a value is below that it branches to the right and if it is above it branches to the left. The leaves are the next two elements.


tree = cart[Transpose@data[[train, 1 ;; -2]], data[[train, -1]], 3]

(* {3, 3.3, {3, 4.9, {4, 1.8, "virginica", {3, 5.1, {4, 1.6, "versicolor","virginica"},
"versicolor"}}, {4, 1.7, {4, 1.8, "versicolor", "virginica"},"versicolor"}}, "setosa"} *)


So far so good. Now we need a classifier that can take a new input and a tree to make a classification.


classify[x_, Except[_List, d_]] := d
classify[x_, {best_, value_, d1_, d2_}] := classify[x, If[x[[best]] < value, d2, d1]]

Lets try it out with the first element from our training data. It correctly classifies the iris as species 2 (virginica).


classify[{6.4, 2.8, 5.6, 2.1}, tree]

(* "virginica *)

A random forest is nothing but an ensemble of such trees, created from a bootstrap sample of the data, that allows each one to vote. The function rf takes some input data x, a response vector y, the number of variables to select for splitting m and the number of trees to grow ntree. The function rfClassify takes a new input and a trained forest and makes a classification.



rf[x_, y_, m_, ntree_] := 
Table[boot = RandomChoice[Range[Length[y]], Length[y]];
cart[x[[All, boot]], y[[boot]], m], {ntree}]

rfClassify[input_, forest_] :=
First[#][[maxI[Last[#]]]] &[
Transpose[Tally[classify[input, #] & /@ forest]]]

Lets try it on the iris data. First we fit a forest with our training set. And test it to make sure it works well on its own training data. In this case we get perfect classification of our training data so it looks good.


f = rf[data[[train, 1 ;; -2]]\[Transpose], data[[train, -1]], 3, 500];


N@
Mean[Boole[
First[#] === Last[#] & /@
Transpose[{rfClassify[#, f] & /@ data[[train, 1 ;; -2]],
data[[train, -1]]}]]]

(* 1. *)

Now lets try it out on the test data it has never seen before.



 N@
Mean[Boole[
First[#] === Last[#] & /@
Transpose[{rfClassify[#, f] & /@ data[[test, 1 ;; -2]],
data[[test, -1]]}]]]

(* 0.966667 *)

I'd say 97% correct classification isn't bad for a relatively small data set and forest.


EDIT:



It is worth showing how one might use RLink and the randomForest package in Mathematica with the data I have here. The following code will fit a random forest to the iris data and return the prediction for a particular input newX.


(*Enable RLink*)
<< RLink`
InstallR[]

(*Set the training data and response*)
RSet["x", data[[train, 1 ;; -2]]];
RSet["y", data[[train, -1]]];
RSet["newX", data[[test, 1 ;; -2]][[1]]];


(*install the package. Note: you only need to do this once*)
REvaluate["install.packages(\"randomForest\")"];

(*fit a random forest and make one classification*)
REvaluate["{
library(randomForest)
f = randomForest(x,as.factor(y))
predict(f,newX)
}"
]


(* RFactor[{1}, RFactorLevels["setosa", "versicolor", "virginica"],
RAttributes["names" :> {"1"}]] *)

Comments

Popular posts from this blog

plotting - Mathematica: 3D plot based on combined 2D graphs

I have several sigmoidal fits to 3 different datasets, with mean fit predictions plus the 95% confidence limits (not symmetrical around the mean) and the actual data. I would now like to show these different 2D plots projected in 3D as in but then using proper perspective. In the link here they give some solutions to combine the plots using isometric perspective, but I would like to use proper 3 point perspective. Any thoughts? Also any way to show the mean points per time point for each series plus or minus the standard error on the mean would be cool too, either using points+vertical bars, or using spheres plus tubes. Below are some test data and the fit function I am using. Note that I am working on a logit(proportion) scale and that the final vertical scale is Log10(percentage). (* some test data *) data = Table[Null, {i, 4}]; data[[1]] = {{1, -5.8}, {2, -5.4}, {3, -0.8}, {4, -0.2}, {5, 4.6}, {1, -6.4}, {2, -5.6}, {3, -0.7}, {4, 0.04}, {5, 1.0}, {1, -6.8}, {2, -4.7}, {3, -1....

plotting - Plot 4D data with color as 4th dimension

I have a list of 4D data (x position, y position, amplitude, wavelength). I want to plot x, y, and amplitude on a 3D plot and have the color of the points correspond to the wavelength. I have seen many examples using functions to define color but my wavelength cannot be expressed by an analytic function. Is there a simple way to do this? Answer Here a another possible way to visualize 4D data: data = Flatten[Table[{x, y, x^2 + y^2, Sin[x - y]}, {x, -Pi, Pi,Pi/10}, {y,-Pi,Pi, Pi/10}], 1]; You can use the function Point along with VertexColors . Now the points are places using the first three elements and the color is determined by the fourth. In this case I used Hue, but you can use whatever you prefer. Graphics3D[ Point[data[[All, 1 ;; 3]], VertexColors -> Hue /@ data[[All, 4]]], Axes -> True, BoxRatios -> {1, 1, 1/GoldenRatio}]

functions - Get leading series expansion term?

Given a function f[x] , I would like to have a function leadingSeries that returns just the leading term in the series around x=0 . For example: leadingSeries[(1/x + 2)/(4 + 1/x^2 + x)] x and leadingSeries[(1/x + 2 + (1 - 1/x^3)/4)/(4 + x)] -(1/(16 x^3)) Is there such a function in Mathematica? Or maybe one can implement it efficiently? EDIT I finally went with the following implementation, based on Carl Woll 's answer: lds[ex_,x_]:=( (ex/.x->(x+O[x]^2))/.SeriesData[U_,Z_,L_List,Mi_,Ma_,De_]:>SeriesData[U,Z,{L[[1]]},Mi,Mi+1,De]//Quiet//Normal) The advantage is, that this one also properly works with functions whose leading term is a constant: lds[Exp[x],x] 1 Answer Update 1 Updated to eliminate SeriesData and to not return additional terms Perhaps you could use: leadingSeries[expr_, x_] := Normal[expr /. x->(x+O[x]^2) /. a_List :> Take[a, 1]] Then for your examples: leadingSeries[(1/x + 2)/(4 + 1/x^2 + x), x] leadingSeries[Exp[x], x] leadingSeries[(1/x + 2 + (1 - 1/x...