Mahout currently has two implementations of Bayesian classifiers. One is the traditional Naive Bayes approach, and the other is called Complementary Naive Bayes.
The Naive Bayes implementations in Mahout follow this paper: http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
Before we get to the actual algorithm, let's discuss the terminology.
Given an input set of classified documents:
- j = 0 to N features
- k = 0 to L labels
- Normalized Frequency for a term (feature) in a document is the term frequency divided by the root mean square of the term frequencies in that document.
- Weight Normalized Tf for a given feature in a given label is the sum of the Normalized Frequency of that feature across all documents in the label.
- Weight Normalized Tf-Idf for a given feature in a given label is the Weight Normalized Tf multiplied by the standard idf.
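The three definitions above can be sketched in a few lines of Python. This is an illustration only, not Mahout's code: the function names and the toy `docs` data are hypothetical, the root mean square is assumed to be taken over the terms that occur in the document, and "standard idf" is assumed to mean log(number of documents / document frequency).

```python
import math
from collections import defaultdict


def normalized_frequencies(term_counts):
    """Divide each term frequency by the RMS of the term frequencies in the document.

    Assumption: the RMS is taken over the terms present in the document.
    """
    if not term_counts:
        return {}
    rms = math.sqrt(sum(c * c for c in term_counts.values()) / len(term_counts))
    if rms == 0:
        return {term: 0.0 for term in term_counts}
    return {term: count / rms for term, count in term_counts.items()}


def weight_normalized_tf(docs):
    """Sum the normalized frequency of each feature over all documents in a label.

    docs is a list of (label, {term: raw count}) pairs.
    """
    wn_tf = defaultdict(lambda: defaultdict(float))
    for label, counts in docs:
        for term, nf in normalized_frequencies(counts).items():
            wn_tf[label][term] += nf
    return wn_tf


def weight_normalized_tfidf(docs):
    """Multiply the Weight Normalized Tf by idf.

    Assumption: idf = log(n_docs / document frequency of the term).
    """
    n_docs = len(docs)
    doc_freq = defaultdict(int)
    for _, counts in docs:
        for term in counts:
            doc_freq[term] += 1
    wn_tf = weight_normalized_tf(docs)
    return {
        label: {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in terms.items()}
        for label, terms in wn_tf.items()
    }


# Hypothetical toy corpus: (label, term counts) per document.
docs = [
    ("sports", {"ball": 3, "goal": 4}),
    ("sports", {"ball": 1}),
    ("tech", {"code": 2, "ball": 2}),
]
tfidf = weight_normalized_tfidf(docs)
```

In this toy corpus "ball" occurs in every document, so its idf, and therefore its W-N-Tf-Idf in every label, is zero under the idf assumed here.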
Once the Weight Normalized Tf-Idf (W-N-Tf-Idf) is calculated, the final weight matrices for Bayes and CBayes are calculated as follows:
We calculate the sum of W-N-Tf-Idf over all the features in a label, called Sigma_k or sumLabelWeight.
We calculate the sum of W-N-Tf-Idf across all labels for a given feature, called Sigma_j or sumFeatureWeight.
We also sum the W-N-Tf-Idf weights over all (feature, label) pairs in the training set, called Sigma_jSigma_k.
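The three sums can be sketched as follows. The function `weight_sums`, the nested-dictionary layout, and the toy weights are assumptions made for illustration and are not taken from Mahout's source.

```python
from collections import defaultdict


def weight_sums(weights):
    """Compute Sigma_k, Sigma_j and Sigma_jSigma_k.

    weights maps label -> {feature: W-N-Tf-Idf} (hypothetical layout).
    """
    # Sigma_k (sumLabelWeight): total weight of all features within each label.
    sigma_k = {label: sum(fw.values()) for label, fw in weights.items()}

    # Sigma_j (sumFeatureWeight): total weight of each feature across all labels.
    sigma_j = defaultdict(float)
    for fw in weights.values():
        for feature, w in fw.items():
            sigma_j[feature] += w

    # Sigma_jSigma_k: total weight over every (feature, label) pair.
    sigma_jk = sum(sigma_k.values())
    return sigma_k, dict(sigma_j), sigma_jk


# Hypothetical W-N-Tf-Idf values for two labels.
weights = {
    "sports": {"ball": 2.0, "goal": 1.5},
    "tech": {"ball": 0.5, "code": 3.0},
}
sigma_k, sigma_j, sigma_jk = weight_sums(weights)
```

Summing Sigma_k over labels and summing Sigma_j over features give the same grand total, which is a convenient sanity check on a trained model.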
Final Weight is calculated as
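A sketch of that final step, based on the smoothed estimates in the paper cited above rather than on Mahout's source: write W_{jk} for the W-N-Tf-Idf of feature j in label k, and assume per-feature smoothing parameters \alpha_j with \alpha = \sum_j \alpha_j (these are the paper's smoothing terms and are not defined in this section). Under those assumptions the Bayes and CBayes weights would take the form:

```latex
% Bayes: smoothed share of feature j within label k
w^{\mathrm{Bayes}}_{jk} = \log \frac{W_{jk} + \alpha_j}{\Sigma_k + \alpha}

% CBayes: the same estimate computed over every label except k
w^{\mathrm{CBayes}}_{jk} = \log \frac{\Sigma_j - W_{jk} + \alpha_j}{\Sigma_j\Sigma_k - \Sigma_k + \alpha}
```

In the complementary form, \Sigma_j - W_{jk} is the weight of feature j outside label k and \Sigma_j\Sigma_k - \Sigma_k is the total weight outside label k. In the paper, the complement weights enter the classification rule with a negative sign, so a document is assigned to the label whose complement matches it least.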
In Mahout's example code, there are two samples that can be used:
- Wikipedia Bayes Example - Classify Wikipedia data.
- Twenty Newsgroups - Classify the classic Twenty Newsgroups data.