Breiman Example
Added by abdelHakim Deneche, last edited by abdelHakim Deneche on Sep 28, 2009

Introduction

This quick start page shows how to run the Breiman example, which implements the test procedure described in Breiman's paper [1].
The basic algorithm is as follows:

  • repeat for I iterations
  • in each iteration:
    • keep 10% of the dataset apart as a testing set
    • build two forests using the training set, one with m=int(log2(M)+1), where M is the number of attributes (called Random-Input), and one with m=1 (called Single-Input)
    • choose the forest that gave the lowest out-of-bag (oob) error estimate and use it to compute the test set error
    • compute the test set error using the Single-Input forest (test error). This demonstrates that, even with m=1, Decision Forests give results comparable to those obtained with greater values of m
    • compute the mean test set error over every tree of the chosen forest (tree error). This indicates how well a single Decision Tree performs
  • compute the mean test error over all iterations
  • compute the mean tree error over all iterations

Steps

Download the data

Build the Job files

  • In $MAHOUT_HOME/ run:
    mvn install -DskipTests

Generate a file descriptor for the dataset:

For the glass dataset (glass.data), run:

$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-<VERSION>.job org.apache.mahout.df.tools.Describe -p testdata/glass.data -f testdata/glass.info -d I 9 N L

The "I 9 N L" string describes the type of each attribute: 1 ignored (I) attribute, followed by 9 numerical (N) attributes, followed by the label (L).

  • you can also use C for categorical (nominal) attributes
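To make the descriptor notation concrete, the following Python sketch expands such a string into one type code per column. It assumes that a number applies only to the token that immediately follows it, which matches the "I 9 N L" example; `expand_descriptor` is a hypothetical helper and not part of Mahout.

```python
def expand_descriptor(tokens):
    # Expand a descriptor such as ["I", "9", "N", "L"] into one code per column.
    expanded = []
    repeat = 1
    for token in tokens:
        if token.isdigit():
            # A number sets how many times the next type code is repeated.
            repeat = int(token)
        else:
            expanded.extend([token] * repeat)
            repeat = 1
    return expanded


columns = expand_descriptor("I 9 N L".split())
print(len(columns), columns)
```

For "I 9 N L" this yields 11 codes: one I, nine N and one L.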

Run the example

$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<VERSION>.job org.apache.mahout.df.BreimanExample -d testdata/glass.data -ds testdata/glass.info -i 10 -t 100

This builds 100 trees (-t argument) and repeats the test for 10 iterations (-i argument).

  • The example outputs the following results:
    • Selection error : mean test error for the selected forest on all iterations
    • Single Input error : mean test error for the single input forest on all iterations
    • One Tree error : mean single tree error on all iterations
    • Mean Random Input Time : mean build time for random input forests on all iterations
    • Mean Single Input Time : mean build time for single input forests on all iterations