The Mahout Examples source comes with tools for classifying a Wikipedia data dump using either the Naive Bayes or Complementary Naive Bayes implementations in Mahout. The example (described below) takes a Wikipedia dump and splits it up into chunks. These chunks are then further split by country. From these splits, a classifier is trained to predict which country an unseen article should be categorized under.
NOTE: Substitute the appropriate version of Mahout in the commands below (i.e. replace 0.1-dev with the appropriate value).
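One way to avoid editing the version in every command by hand is to keep it in a shell variable. The following is only a minimal sketch: it assumes the examples job file follows the apache-mahout-examples-<version>.job naming used by most of the commands on this page, and the names MAHOUT_VERSION and examples_job are hypothetical helpers, not part of Mahout.

```shell
#!/bin/sh
# Hypothetical helper: the variable and function names are illustrative only.
# Default to the version used on this page unless MAHOUT_VERSION is already set.
MAHOUT_VERSION="${MAHOUT_VERSION:-0.1-dev}"

# Print the path of the examples job file.
#   $1 = Mahout home directory
#   $2 = version (optional, defaults to $MAHOUT_VERSION)
examples_job() {
    printf '%s/examples/target/apache-mahout-examples-%s.job\n' \
        "$1" "${2:-$MAHOUT_VERSION}"
}
```

Assuming the Mahout directory is also held in a shell variable, the result of examples_job "$MAHOUT_HOME" could then be passed to hadoop jar in place of the hard-coded job file path. Check that the printed path matches the file actually produced by your build before relying on it.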
<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-0.1-dev-ex.jar org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d <MAHOUT_HOME>/examples/temp/enwiki-latest-pages-articles.xml -o <MAHOUT_HOME>/examples/work/wikipedia/chunks/ -c 64
We strongly suggest you back up the resulting chunks to some other location so that you don't have to repeat this step if they are accidentally erased.
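The backup can be as simple as a recursive copy of the chunk directory. This is a sketch under stated assumptions: backup_chunks is a hypothetical helper, and the backup location is a placeholder you choose yourself.

```shell
#!/bin/sh
# Hypothetical helper: copy the chunk directory to a backup location.
#   $1 = chunk directory, e.g. <MAHOUT_HOME>/examples/work/wikipedia/chunks
#   $2 = backup directory (placeholder, created if it does not exist)
backup_chunks() {
    src="$1"
    dest="$2"
    # Refuse to run if the chunk directory is missing.
    [ -d "$src" ] || { echo "no chunk directory at $src" >&2; return 1; }
    # Copy the directory contents, including any subdirectories.
    mkdir -p "$dest" && cp -R "$src"/. "$dest"/
}
```

For example, backup_chunks <MAHOUT_HOME>/examples/work/wikipedia/chunks followed by a directory on another disk keeps a copy that survives cleaning the work directory.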
<HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/work/wikipedia/chunks/ wikipediadump
<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.job org.apache.mahout.classifier.bayes.WikipediaDatasetCreator -i wikipediadump -o wikipediainput -c <MAHOUT_HOME>/examples/src/test/resources/country.txt
<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.job org.apache.mahout.classifier.bayes.TrainClassifier -i wikipediainput -o wikipediamodel --gramSize 3 -classifierType bayes
<HADOOP_HOME>/bin/hadoop dfs -get wikipediainput wikipediainput
<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/target/apache-mahout-examples-0.1-dev.jar org.apache.mahout.classifier.bayes.TestClassifier -p wikipediamodel -t wikipediainput