Added by Grant Ingersoll, last edited by Robert Burrell Donkin on Jun 17, 2009

Introduction

This quick start page shows how to run the Synthetic Control Data clustering example. The data is described on the UCI page linked in the first step below.

Steps

  • Download the data at http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series.
  • In $MAHOUT_HOME/, build the Job file
    • The same job is used for all examples so this only needs to be done once
    • mvn install
    • The job will be generated in $MAHOUT_HOME/examples/target/ and its name will contain the Mahout version number. For example, with the Mahout 0.1 release, the job will be mahout-examples-0.1.job
  • (Optional) 1 Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
  • Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
  • Run the Job: $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<MAHOUT VERSION>.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job 2
    • For kmeans : $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<MAHOUT VERSION>.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
    • For canopy : $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<MAHOUT VERSION>.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job
    • For dirichlet : $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<MAHOUT VERSION>.job org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job
    • For meanshift : $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<MAHOUT VERSION>.job org.apache.mahout.clustering.syntheticcontrol.meanshift.Job
  • Get the data out of HDFS 3 4 and have a look 5
    • All example jobs use testdata as input and output to directory output
    • Use $HADOOP_HOME/bin/hadoop fs -lsr output to list all outputs
    • Output:
      • KMeans results are placed into output/points
      • Canopy and MeanShift results are placed into output/clustered-points
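
The four run commands above differ only in the package segment of the driver class, so the command line can be assembled from an algorithm name and a Mahout version. The following is a minimal sketch of that idea, not part of Mahout itself: the function names job_class and job_command are hypothetical helpers introduced here, and the script only prints the command rather than running it.

```shell
#!/bin/sh
# Sketch: build the "hadoop jar" command line for one of the example jobs.
# The class names and job file location come from the steps above;
# job_class and job_command are illustrative helper names, not Mahout tools.

# Map an algorithm name to its driver class.
job_class() {
  case "$1" in
    kmeans|canopy|dirichlet|meanshift)
      echo "org.apache.mahout.clustering.syntheticcontrol.$1.Job" ;;
    *)
      echo "unknown algorithm: $1 (expected kmeans, canopy, dirichlet, or meanshift)" >&2
      return 1 ;;
  esac
}

# Print the full command for an algorithm ($1) and a Mahout version ($2).
# Fails with a message if HADOOP_HOME or MAHOUT_HOME is not set.
job_command() {
  class=$(job_class "$1") || return 1
  job_file="${MAHOUT_HOME:?MAHOUT_HOME is not set}/examples/target/mahout-examples-$2.job"
  echo "${HADOOP_HOME:?HADOOP_HOME is not set}/bin/hadoop jar $job_file $class"
}
```

With both environment variables set, job_command canopy 0.1 prints the canopy command line for the 0.1 job file. Printing instead of executing makes it easy to check the command before starting a run.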

Footnotes
1 This step should be skipped when using standalone Hadoop
2 Substitute whichever clustering job you want here: KMeans, Canopy, etc. See subdirectories of $MAHOUT_HOME/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/.
3 See HDFS Shell
4 The output directory is cleared when a new run starts, so the results must be retrieved before starting the next run.
5 Dirichlet also prints data to console
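
Because the output directory is cleared at the start of each run (note 4), it can help to copy the results to a local directory named after the algorithm before launching the next job. The sketch below assumes the standard hadoop fs -get command; save_output is a hypothetical helper name, and the HADOOP variable is an assumed override (not something Hadoop or Mahout defines) that makes it possible to substitute another launcher, for example echo for a dry run.

```shell
#!/bin/sh
# Sketch: copy the HDFS "output" directory to <base>/<algorithm> on the
# local file system so the next run does not erase the results.
# save_output and the HADOOP override variable are illustrative assumptions.

save_output() {
  algo="$1"              # label for this run, e.g. kmeans
  base="${2:-results}"   # local parent directory, defaults to ./results
  # Use $HADOOP if set, otherwise the launcher from the steps above.
  hadoop_cmd="${HADOOP:-${HADOOP_HOME}/bin/hadoop}"
  mkdir -p "$base" || return 1
  "$hadoop_cmd" fs -get output "$base/$algo"
}
```

For example, calling save_output kmeans after the KMeans job finishes copies output to results/kmeans. Setting HADOOP=echo first prints the fs arguments instead of contacting HDFS.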