WikipediaBayesExample
Added by Grant Ingersoll, last edited by Geoff Thilo on Jul 16, 2009

Intro

The Mahout examples source includes tools for classifying a Wikipedia data dump with either the Naive Bayes or the Complementary Naive Bayes implementation in Mahout. The example described below fetches a Wikipedia dump and splits it into chunks. The chunks are then split further by country, and a classifier is trained on these splits to predict which country an unseen article belongs to.
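
As a toy illustration of what "split by country" means, the sketch below labels a piece of text with the first country name from a list that occurs in it. The helper label_by_country, the sample country list, and the substring matching rule are all assumptions made for illustration; this page does not describe how WikipediaDatasetCreator actually matches articles to countries.

```shell
#!/bin/sh
# Hypothetical helper, not part of Mahout: print the first country in
# COUNTRIES_FILE (one name per line) that occurs in TEXT, or "unknown".
# Plain case-sensitive substring matching is an assumption for this sketch.
label_by_country() {
  countries_file=$1
  text=$2
  while IFS= read -r country || [ -n "$country" ]; do
    [ -n "$country" ] || continue   # skip blank lines
    case "$text" in
      *"$country"*) printf '%s\n' "$country"; return 0 ;;
    esac
  done < "$countries_file"
  printf 'unknown\n'
  return 1
}

# Usage with a throwaway country list (sample data, not the real country.txt).
list=$(mktemp)
printf 'Canada\nFrance\nJapan\n' > "$list"
label_by_country "$list" "Paris is the capital of France."        # prints: France
label_by_country "$list" "Mahout is a machine learning library."  # prints: unknown
rm -f "$list"
```

In the real example the label becomes the class the classifier is trained to predict; here it is only printed.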

Running the example

NOTE: Substitute the appropriate version of Mahout in the commands below (i.e. replace 0.1-dev with the appropriate value).

  1. cd <MAHOUT_HOME>/examples
  2. ant -f build-deprecated.xml enwiki-files
  3. Chunk the data into pieces:
    <HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-0.1-dev-ex.jar org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d <MAHOUT_HOME>/examples/temp/enwiki-latest-pages-articles.xml -o <MAHOUT_HOME>/examples/work/wikipedia/chunks/ -c 64

    We strongly suggest you back up the results to some other place so that you don't have to repeat this step if they are accidentally erased.

  4. Move the chunks to HDFS:
    <HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/work/wikipedia/chunks/ wikipediadump
  5. Create the country-based split of the Wikipedia dataset:
    <HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.job org.apache.mahout.classifier.bayes.WikipediaDatasetCreator -i wikipediadump -o wikipediainput -c <MAHOUT_HOME>/examples/src/test/resources/country.txt
  6. Train the classifier:
    <HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.job org.apache.mahout.classifier.bayes.TrainClassifier -i wikipediainput -o wikipediamodel --gramSize 3 --classifierType bayes
  7. Fetch the input files for testing:
    <HADOOP_HOME>/bin/hadoop dfs -get wikipediainput wikipediainput 
  8. Test the classifier:
    <HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.jar org.apache.mahout.classifier.bayes.TestClassifier -p wikipediamodel -t wikipediainput
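
To make the version substitution from the NOTE less error-prone, the Hadoop commands in steps 3 to 8 can be generated from a few shell variables. The following is a minimal dry-run sketch: it only prints the commands and does not run them. The function name wikipedia_step and the default paths are assumptions for illustration; the artifact names, class names, and options are copied from the steps above, with the long options in step 6 written with two dashes.

```shell
#!/bin/sh
# Placeholder locations and version (assumptions; adjust to your installation).
HADOOP_HOME=${HADOOP_HOME:-/opt/hadoop}
MAHOUT_HOME=${MAHOUT_HOME:-/opt/mahout}
MAHOUT_VERSION=${MAHOUT_VERSION:-0.1-dev}

# Hypothetical helper: print the command for step N (3..8) instead of running it.
wikipedia_step() {
  hadoop="$HADOOP_HOME/bin/hadoop"
  target="$MAHOUT_HOME/examples/target"
  job="$target/apache-mahout-examples-$MAHOUT_VERSION.job"
  pkg=org.apache.mahout.classifier.bayes
  case "$1" in
    3) printf '%s\n' "$hadoop jar $target/apache-mahout-$MAHOUT_VERSION-ex.jar $pkg.WikipediaXmlSplitter -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml -o $MAHOUT_HOME/examples/work/wikipedia/chunks/ -c 64" ;;
    4) printf '%s\n' "$hadoop dfs -put $MAHOUT_HOME/examples/work/wikipedia/chunks/ wikipediadump" ;;
    5) printf '%s\n' "$hadoop jar $job $pkg.WikipediaDatasetCreator -i wikipediadump -o wikipediainput -c $MAHOUT_HOME/examples/src/test/resources/country.txt" ;;
    # Long options written with two dashes, matching --gramSize.
    6) printf '%s\n' "$hadoop jar $job $pkg.TrainClassifier -i wikipediainput -o wikipediamodel --gramSize 3 --classifierType bayes" ;;
    7) printf '%s\n' "$hadoop dfs -get wikipediainput wikipediainput" ;;
    8) printf '%s\n' "$hadoop jar $target/apache-mahout-examples-$MAHOUT_VERSION.jar $pkg.TestClassifier -p wikipediamodel -t wikipediainput" ;;
    *) printf '%s\n' "unknown step: $1" >&2; return 1 ;;
  esac
}

# Print steps 3 through 8 in order so they can be reviewed before running.
for step in 3 4 5 6 7 8; do
  wikipedia_step "$step"
done
```

Printing rather than executing keeps the sketch safe to try; once the output looks right for your installation, the individual lines can be pasted into a shell one step at a time.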