Introduction
This quick start page shows how to run the Breiman example. It implements the test procedure described in Breiman's paper [1].
The basic algorithm is as follows:
- repeat for I iterations; in each iteration:
  - keep 10% of the dataset apart as a testing set and use the rest as the training set
  - build two forests using the training set, one with m=int(log2(M)+1), where M is the number of input attributes (called Random-Input), and one with m=1 (called Single-Input)
  - select the forest that gave the lowest out-of-bag (oob) error estimate and compute its test set error
  - compute the test set error of the Single-Input forest (test error). This shows that even with m=1, Decision Forests give results comparable to those obtained with greater values of m
  - compute the mean test set error over every tree of the chosen forest (tree error). This indicates how well a single Decision Tree performs
- compute the mean test error over all iterations
- compute the mean tree error over all iterations
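The bookkeeping in the procedure above can be sketched in plain Java. This is only an illustration under stated assumptions, not Mahout's code: the class `BreimanProcedureSketch`, the holder `IterationResult`, and all method names are hypothetical, the error values in `main` are made-up inputs, and breaking oob ties in favor of the Random-Input forest is an arbitrary choice. The formula m=int(log2(M)+1) is evaluated with integer arithmetic so that exact powers of two are not affected by floating-point rounding.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: none of these names come from Mahout.
class BreimanProcedureSketch {

    /** Error estimates gathered during one iteration (hypothetical holder). */
    static final class IterationResult {
        final double oobRandomInput;   // oob error estimate, Random-Input forest
        final double oobSingleInput;   // oob error estimate, Single-Input forest (m=1)
        final double testRandomInput;  // test set error, Random-Input forest
        final double testSingleInput;  // test set error, Single-Input forest

        IterationResult(double oobRandomInput, double oobSingleInput,
                        double testRandomInput, double testSingleInput) {
            this.oobRandomInput = oobRandomInput;
            this.oobSingleInput = oobSingleInput;
            this.testRandomInput = testRandomInput;
            this.testSingleInput = testSingleInput;
        }

        /** Test error of the forest with the lower oob estimate (ties go to Random-Input: an assumption). */
        double selectionError() {
            return oobRandomInput <= oobSingleInput ? testRandomInput : testSingleInput;
        }
    }

    /** m = int(log2(M) + 1), i.e. floor(log2(M)) + 1, which equals the bit length of M. */
    static int randomInputM(int numAttributes) {
        if (numAttributes < 1) {
            throw new IllegalArgumentException("need at least one attribute");
        }
        return 32 - Integer.numberOfLeadingZeros(numAttributes);
    }

    /** Mean test error of the selected forest over all iterations. */
    static double meanSelectionError(List<IterationResult> results) {
        double sum = 0.0;
        for (IterationResult r : results) {
            sum += r.selectionError();
        }
        return sum / results.size();
    }

    /** Mean test error of the Single-Input forest over all iterations. */
    static double meanSingleInputError(List<IterationResult> results) {
        double sum = 0.0;
        for (IterationResult r : results) {
            sum += r.testSingleInput;
        }
        return sum / results.size();
    }

    public static void main(String[] args) {
        // Made-up error values for two iterations, for illustration only.
        List<IterationResult> results = Arrays.asList(
            new IterationResult(0.20, 0.25, 0.25, 0.50),
            new IterationResult(0.30, 0.25, 0.50, 0.75));
        System.out.println(randomInputM(9));               // prints 4
        System.out.println(meanSelectionError(results));   // prints 0.5
        System.out.println(meanSingleInputError(results)); // prints 0.625
    }
}
```

With 9 input attributes, the Random-Input forest in this sketch uses m=4. In the first made-up iteration the Random-Input forest has the lower oob estimate, in the second the Single-Input forest does, so the mean of the selected test errors is (0.25 + 0.75) / 2 = 0.5.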
Steps
Download the data
- The current implementation is compatible with the UCI repository file format. Here are links to some of the datasets used in Breiman's paper:
- Put the data in HDFS:
$HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
Build the Job files
Generate a file descriptor for the dataset:
For the glass dataset (glass.data), run:
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-<VERSION>.job org.apache.mahout.df.tools.Describe -p testdata/glass.data -f testdata/glass.info -d I 9 N L
The "I 9 N L" string describes the nature of the attributes: 1 ignored (I) attribute, followed by 9 numerical (N) attributes, followed by the label (L).
- you can also use C for categorical (nominal) attributes
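To make the descriptor notation concrete, the following sketch expands a descriptor string into one entry per attribute. It is not Mahout's parser: the class `DescriptorSketch` and the method `expand` are hypothetical, and the rule that a number repeats only the token that follows it is an assumption read off the "I 9 N L" example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: this is not the parser used by the Describe tool.
class DescriptorSketch {

    /** Expands e.g. "I 9 N L" into one type letter per attribute. */
    static List<String> expand(String descriptor) {
        List<String> attributes = new ArrayList<>();
        int repeat = 1; // how many times the next type letter applies
        for (String token : descriptor.trim().split("\\s+")) {
            if (token.matches("\\d+")) {
                repeat = Integer.parseInt(token);
            } else if (token.matches("[INCL]")) {
                // I = ignored, N = numerical, C = categorical, L = label
                for (int i = 0; i < repeat; i++) {
                    attributes.add(token);
                }
                repeat = 1;
            } else {
                throw new IllegalArgumentException("Unknown token: " + token);
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        // prints [I, N, N, N, N, N, N, N, N, N, L]
        System.out.println(expand("I 9 N L"));
    }
}
```

For glass.data this yields 11 entries, one per column of the file: the ignored first column, the 9 numerical attributes, and the label.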
Run the example
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<VERSION>.job org.apache.mahout.df.BreimanExample -d testdata/glass.data -ds testdata/glass.info -i 10 -t 100
This builds 100 trees (-t argument) and repeats the test for 10 iterations (-i argument).
The example outputs the following results:
- Selection error: mean test error for the selected forest over all iterations
- Single Input error: mean test error for the single input forest over all iterations
- One Tree error: mean single tree error over all iterations
- Mean Random Input Time: mean build time for random input forests over all iterations
- Mean Single Input Time: mean build time for single input forests over all iterations