Answer
Here I will attempt to provide a basic implementation of the random forest algorithm for classification. This is by no means fast and doesn't scale very well but otherwise is a nice classifier. I recommend reading Breiman and Cutler's page for information about random forests.
The following are some helper functions that allow us to compute entropy and ultimately information gain easily.
condEnt = Statistics`Library`NConditionalEntropy;
ent = Statistics`Library`NEntropy;
maxI[v_] := Ordering[v, -1][[1]]
Now for the decision trees that will be used in the forest. This is the biggest bottleneck in the whole process and I would love to see a much faster implementation in Mathematica! The trees I'm working with are called CART trees.
The basic idea is to select at random m
of the possible k
variables to separate the response y
into classes. Once the best variable is chosen the value of that variable which bests separates the classes is chosen to create a split in the tree. This process continues until all responses have been classified or we aren't able to split things based on the remaining data.
cart[x_, y_ /; ent[y] == 0, m_] := First[y]
cart[x_, y_, m_] :=
Block[{k = Length[x], h, drop, bestVar, ub, best, mask},
h = ent[y];
drop = RandomSample[Range[k], k - m];
bestVar =
maxI[Table[If[FreeQ[drop, i], h - condEnt[x[[i]], y], 0], {i, k}]];
ub = Union[x[[bestVar]]];
mask = UnitStep[x[[bestVar]] - #] & /@ ub;
best = maxI[(h - condEnt[#, y] & /@ mask)];
If[Min[#] == Max[#],
RandomChoice[y],
{bestVar, ub[[best]],
cart[Pick[x\[Transpose], #, 1]\[Transpose], Pick[y, #, 1], m],
cart[Pick[x\[Transpose], #, 0]\[Transpose], Pick[y, #, 0], m]
}
] &[mask[[best]]]
]
To demo these things as I go lets use the famous iris data built in to Mathematica selecting about 80% for training and 20% for testing.
data = ExampleData[{"Statistics", "FisherIris"}];
rs = RandomSample[Range[Length[data]]];
train = rs[[1 ;; 120]];
test = rs[[121 ;;]];
Now lets create a CART tree from this data letting m
be 3. We can read the result easily. The first element is the variable that bests splits the data, in this case variable 3. The next value 3.3 is the critical value of variable 3 to use. If a value is below that it branches to the right and if it is above it branches to the left. The leaves are the next two elements.
tree = cart[Transpose@data[[train, 1 ;; -2]], data[[train, -1]], 3]
(* {3, 3.3, {3, 4.9, {4, 1.8, "virginica", {3, 5.1, {4, 1.6, "versicolor","virginica"},
"versicolor"}}, {4, 1.7, {4, 1.8, "versicolor", "virginica"},"versicolor"}}, "setosa"} *)
So far so good. Now we need a classifier that can take a new input and a tree to make a classification.
classify[x_, Except[_List, d_]] := d
classify[x_, {best_, value_, d1_, d2_}] := classify[x, If[x[[best]] < value, d2, d1]]
Lets try it out with the first element from our training data. It correctly classifies the iris as species 2 (virginica).
classify[{6.4, 2.8, 5.6, 2.1}, tree]
(* "virginica *)
A random forest is nothing but an ensemble of such trees, created from a bootstrap sample of the data, that allows each one to vote. The function rf
takes some input data x
, a response vector y
, the number of variables to select for splitting m
and the number of trees to grow ntree
. The function rfClassify
takes a new input and a trained forest and makes a classification.
rf[x_, y_, m_, ntree_] :=
Table[boot = RandomChoice[Range[Length[y]], Length[y]];
cart[x[[All, boot]], y[[boot]], m], {ntree}]
rfClassify[input_, forest_] :=
First[#][[maxI[Last[#]]]] &[
Transpose[Tally[classify[input, #] & /@ forest]]]
Lets try it on the iris data. First we fit a forest with our training set. And test it to make sure it works well on its own training data. In this case we get perfect classification of our training data so it looks good.
f = rf[data[[train, 1 ;; -2]]\[Transpose], data[[train, -1]], 3, 500];
N@
Mean[Boole[
First[#] === Last[#] & /@
Transpose[{rfClassify[#, f] & /@ data[[train, 1 ;; -2]],
data[[train, -1]]}]]]
(* 1. *)
Now lets try it out on the test data it has never seen before.
N@
Mean[Boole[
First[#] === Last[#] & /@
Transpose[{rfClassify[#, f] & /@ data[[test, 1 ;; -2]],
data[[test, -1]]}]]]
(* 0.966667 *)
I'd say 97% correct classification isn't bad for a relatively small data set and forest.
EDIT:
It is worth showing how one might use RLink
and the randomForest
package in Mathematica with the data I have here. The following code will fit a random forest to the iris data and return the prediction for a particular input newX
.
(*Enable RLink*)
<< RLink`
InstallR[]
(*Set the training data and response*)
RSet["x", data[[train, 1 ;; -2]]];
RSet["y", data[[train, -1]]];
RSet["newX", data[[test, 1 ;; -2]][[1]]];
(*install the package. Note: you only need to do this once*)
REvaluate["install.packages(\"randomForest\")"];
(*fit a random forest and make one classification*)
REvaluate["{
library(randomForest)
f = randomForest(x,as.factor(y))
predict(f,newX)
}"
]
(* RFactor[{1}, RFactorLevels["setosa", "versicolor", "virginica"],
RAttributes["names" :> {"1"}]] *)
Comments
Post a Comment