
fitting - Mathematica Implementations of the Random Forest algorithm


Is anyone aware of a Mathematica use or implementation of the Random Forest algorithm?



Answer



Here I will attempt to provide a basic implementation of the random forest algorithm for classification. It is by no means fast and doesn't scale very well, but it is otherwise a nice classifier. I recommend reading Breiman and Cutler's page for background on random forests.



The following are some helper functions that allow us to compute entropy and ultimately information gain easily.


(* conditional entropy H(Y | X) of the second argument given the first *)
condEnt = Statistics`Library`NConditionalEntropy;
(* entropy H(Y) of a list of (possibly non-numeric) values *)
ent = Statistics`Library`NEntropy;
(* position of the largest element of a vector *)
maxI[v_] := Ordering[v, -1][[1]]
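
As a quick sanity check (this snippet is illustrative and not from the original answer), the information gain h - condEnt[x, y] is largest when a feature x separates the labels y cleanly and zero when the feature is uninformative:

(* hypothetical toy labels and two candidate features *)
labels = {"a", "a", "b", "b"};
goodFeature = {0, 0, 1, 1};  (* separates the labels perfectly *)
badFeature = {7, 7, 7, 7};   (* constant, tells us nothing *)

(* information gain = entropy of the labels minus conditional entropy *)
{ent[labels] - condEnt[goodFeature, labels],
 ent[labels] - condEnt[badFeature, labels]}
(* the first gain should equal ent[labels] (about 0.69 if natural
   logarithms are used), the second should be 0 *)

maxI[{0.1, 0.9, 0.3}]  (* position of the largest entry: 2 *)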

Now for the decision trees that will be used in the forest. This is the biggest bottleneck in the whole process and I would love to see a much faster implementation in Mathematica! The trees I'm working with are called CART trees.


The basic idea is to select, at random, m of the k available variables as candidates for separating the response y into classes. Once the best of these variables is chosen, the value of that variable which best separates the classes is used to create a split in the tree. This process continues recursively until all responses have been classified or we can no longer split the remaining data. (A tiny worked example follows the definition below.)


(* base case: all responses in y agree, so return that class as a leaf *)
cart[x_, y_ /; ent[y] == 0, m_] := First[y]

cart[x_, y_, m_] :=
 Block[{k = Length[x], h, drop, bestVar, ub, best, mask},
  h = ent[y];
  (* keep a random subset of m variables by dropping the other k - m *)
  drop = RandomSample[Range[k], k - m];
  (* among the kept variables, pick the one with the largest information gain *)
  bestVar =
   maxI[Table[If[FreeQ[drop, i], h - condEnt[x[[i]], y], 0], {i, k}]];
  (* candidate thresholds and the 0/1 mask each one induces *)
  ub = Union[x[[bestVar]]];
  mask = UnitStep[x[[bestVar]] - #] & /@ ub;
  best = maxI[(h - condEnt[#, y] &) /@ mask];
  (* if the best mask fails to split the data, stop and pick a class at
     random; otherwise record {variable, threshold, >= branch, < branch}
     and recurse on the two halves *)
  If[Min[#] == Max[#],
    RandomChoice[y],
    {bestVar, ub[[best]],
     cart[Pick[x\[Transpose], #, 1]\[Transpose], Pick[y, #, 1], m],
     cart[Pick[x\[Transpose], #, 0]\[Transpose], Pick[y, #, 0], m]}
   ] &[mask[[best]]]
  ]
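
To make the split selection described above concrete, here is a tiny worked example (toyX and toyY are hypothetical and not part of the original post). Variable 1 is constant and therefore uninformative, while variable 2 separates the two classes perfectly, so with m = 2 the tree should split on variable 2:

(* toy data: variables as rows, observations as columns *)
toyX = {{5, 5, 5, 5},   (* variable 1: constant, zero information gain *)
        {0, 0, 1, 1}};  (* variable 2: cleanly separates the classes *)
toyY = {"a", "a", "b", "b"};

cart[toyX, toyY, 2]
(* expected: {2, 1, "b", "a"}, i.e. split on variable 2 at threshold 1;
   values >= 1 classify as "b", values < 1 as "a" *)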

To demo these things as I go, let's use the famous iris data built into Mathematica, selecting about 80% for training and 20% for testing.


data = ExampleData[{"Statistics", "FisherIris"}];


rs = RandomSample[Range[Length[data]]];
train = rs[[1 ;; 120]];
test = rs[[121 ;;]];
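
Since the Fisher iris set has 150 observations, this gives a 120/30 (80%/20%) split:

Length /@ {train, test}

(* {120, 30} *)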

Now let's create a CART tree from this data, letting m be 3. We can read the result easily. The first element is the variable that best splits the data, in this case variable 3. The next value, 3.3, is the critical value of variable 3 to use: a value below it branches to the right, and a value at or above it branches to the left. The next two elements are those branches (subtrees or leaf labels), the first for values at or above the threshold and the second for values below it.


tree = cart[Transpose@data[[train, 1 ;; -2]], data[[train, -1]], 3]

(* {3, 3.3, {3, 4.9, {4, 1.8, "virginica", {3, 5.1, {4, 1.6, "versicolor", "virginica"},
     "versicolor"}}, {4, 1.7, {4, 1.8, "versicolor", "virginica"}, "versicolor"}}, "setosa"} *)


So far so good. Now we need a classifier that can take a new input and a tree to make a classification.


(* a leaf (anything that is not a list) is itself the classification *)
classify[x_, Except[_List, d_]] := d
(* otherwise descend: below the threshold go to d2, at or above go to d1 *)
classify[x_, {best_, value_, d1_, d2_}] := classify[x, If[x[[best]] < value, d2, d1]]

Let's try it out with the first element from our training data. Tracing through the tree above: 5.6 is not below 3.3, so we take the first branch; 5.6 is not below 4.9, so the first branch again; and 2.1 is not below 1.8, which lands on the leaf. It correctly classifies the iris as virginica.


classify[{6.4, 2.8, 5.6, 2.1}, tree]

(* "virginica *)

A random forest is nothing but an ensemble of such trees, each grown from a bootstrap sample of the data, with every tree getting a vote. The function rf takes some input data x (variables as rows, observations as columns), a response vector y, the number of variables m to consider at each split, and the number of trees ntree to grow. The function rfClassify takes a new input and a trained forest and returns the majority-vote classification.



(* grow ntree CART trees, each on a bootstrap resample of the observations
   (x has variables as rows and observations as columns) *)
rf[x_, y_, m_, ntree_] :=
 Table[
  Block[{boot = RandomChoice[Range[Length[y]], Length[y]]},
   cart[x[[All, boot]], y[[boot]], m]], {ntree}]

(* classify the input with every tree and return the majority vote *)
rfClassify[input_, forest_] :=
 First[#][[maxI[Last[#]]]] &[
  Transpose[Tally[classify[input, #] & /@ forest]]]

Let's try it on the iris data. First we fit a forest with our training set and check that it classifies its own training data well. In this case we get perfect classification of the training data, so it looks good.


f = rf[data[[train, 1 ;; -2]]\[Transpose], data[[train, -1]], 3, 500];


N@Mean[Boole[
   First[#] === Last[#] & /@
    Transpose[{rfClassify[#, f] & /@ data[[train, 1 ;; -2]],
      data[[train, -1]]}]]]

(* 1. *)
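
Since rfClassify simply takes the majority vote of the trees, it can be instructive to look at the raw votes for a single input (the exact counts will vary from run to run):

(* tally how the 500 trees vote on the first test observation *)
Tally[classify[data[[test, 1 ;; -2]][[1]], #] & /@ f]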

Now let's try it out on the test data, which it has never seen before.



N@Mean[Boole[
   First[#] === Last[#] & /@
    Transpose[{rfClassify[#, f] & /@ data[[test, 1 ;; -2]],
      data[[test, -1]]}]]]

(* 0.966667 *)

I'd say 97% correct classification isn't bad for a relatively small data set and forest.
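
To see where the few mistakes occur, one can tabulate {actual, predicted} pairs on the test set (a quick sketch; the counts depend on the random split and forest):

Tally[Transpose[{data[[test, -1]],
   rfClassify[#, f] & /@ data[[test, 1 ;; -2]]}]]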


EDIT:



It is worth showing how one might use RLink and R's randomForest package from Mathematica with the same data. The following code fits a random forest to the iris training data and returns the prediction for a particular input newX.


(*Enable RLink*)
<< RLink`
InstallR[]

(*Set the training data and response*)
RSet["x", data[[train, 1 ;; -2]]];
RSet["y", data[[train, -1]]];
RSet["newX", data[[test, 1 ;; -2]][[1]]];


(*install the package. Note: you only need to do this once*)
REvaluate["install.packages(\"randomForest\")"];

(*fit a random forest and make one classification*)
REvaluate["{
library(randomForest)
f = randomForest(x,as.factor(y))
predict(f,newX)
}"
]


(* RFactor[{1}, RFactorLevels["setosa", "versicolor", "virginica"],
RAttributes["names" :> {"1"}]] *)
