This is a concise quick tour of using the Mahout command line to generate text analysis data. It follows examples from the Mahout in Action book and uses the Reuters-21578 data set. It covers one simple path through vectorizing text, creating clusters, and calculating similar documents. The examples will work locally or distributed on a Hadoop cluster; with the small data set provided, a local installation is probably fast enough.
This walkthrough was originally written for the Mahout 0.6 CLI and has been updated for Mahout 0.7 and later. When in doubt, executing any command without parameters will print its help text.
Generate Mahout vectors from text
Get the Reuters-21578 (http://www.daviddlewis.com/resources/testcollections/reuters21578/) files and extract them in “./reuters”. They are in SGML format. Mahout can also create sequence files from raw text and other formats. At the end of this section you will have turned the text files into vectors, which are basically lists of weighted tokens. The weights are calculated to indicate the importance of each token.
- Convert from SGML to text: If you plan to run this example on a hadoop cluster you will need to copy the files to HDFS, which is not covered here.
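The conversion step is presumably done with the ExtractReuters class from the Lucene benchmark module bundled with Mahout's examples; the directory names here are illustrative, following the “./reuters” download location above:

```shell
# Unpack the SGML files into one plain-text file per article.
# ExtractReuters ships in the Lucene benchmark jar included with Mahout.
mahout org.apache.lucene.benchmark.utils.ExtractReuters reuters/ reuters-out/
```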
- Now turn raw text in a directory into mahout sequence files:
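A typical seqdirectory invocation would look something like this (directory names assumed from the extraction step above; check `mahout seqdirectory` for your version's flags):

```shell
# Convert a directory of plain-text files into Hadoop SequenceFiles.
# -c sets the input character encoding; -chunk sets the output chunk size in MB.
mahout seqdirectory -i reuters-out -o reuters-seqfiles -c UTF-8 -chunk 5
```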
- Examine the sequence files with seqdumper:
mahout seqdumper -i reuters-seqfiles/chunk-0 | more
You should see something like this:
- Create tfidf weighted vectors. This uses the default analyzer and default TFIDF weighting. -n 2 is good for cosine distance, which we are using for clustering and similarity; -x 90 means that a token appearing in more than 90% of the docs is treated as a stop word; -nv produces named vectors, making further data files easier to inspect.
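Putting those flags together, the seq2sparse call would look something like this (the output directory name is an assumption carried through the rest of this walkthrough):

```shell
# Create TF-IDF weighted, L2-normalized (-n 2), named (-nv) vectors;
# tokens appearing in more than 90% of docs (-x 90) are dropped.
mahout seq2sparse -i reuters-seqfiles -o reuters-vectors -ow -nv -n 2 -x 90
```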
- Examine the vectors if you like but they are not really human readable...
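For instance, dumping the tfidf-vectors output with seqdumper (the part-file name may differ in your run):

```shell
mahout seqdumper -i reuters-vectors/tfidf-vectors/part-r-00000 | more
```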
- Examine the tokenized docs to make sure the analyzer is filtering out enough (note that the rest of this example uses a more restrictive Lucene analyzer, not the default, so your results may vary): This should show each doc with nice clean tokenized text.
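A sketch of the command, following the seq2sparse output layout (part-file name may vary):

```shell
mahout seqdumper -i reuters-vectors/tokenized-documents/part-m-00000 | more
```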
- Examine the dictionary. It maps token id to token text.
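The dictionary is also a sequence file and can be dumped the same way (file name assumed from the seq2sparse output layout):

```shell
# Each entry maps a token's text to its integer id.
mahout seqdumper -i reuters-vectors/dictionary.file-0 | more
```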
Cluster documents using kmeans
Clustering documents can be done with one of several clustering algorithms in Mahout. Perhaps the best known is kmeans, which will drop documents into k categories. You have to supply k as input along with the vectors. The output is a centroid (vector) for each of the k clusters and, optionally, the documents assigned to each cluster.
- Create clusters and assign documents to the clusters. If -c and -k are specified, kmeans will put random seed vectors into the -c directory; if -c is provided without -k, then the -c directory is assumed to be input and kmeans will use each vector in it to seed the clustering. -cl tells kmeans to also assign the input doc vectors to clusters at the end of the process and put them in reuters-kmeans-clusters/clusteredPoints. If -cl is not specified, the documents will not be assigned to clusters.
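A plausible kmeans invocation matching the options discussed. k=20, the convergence delta (-cd), and the iteration cap (-x 10) are illustrative choices, not values from the original walkthrough:

```shell
mahout kmeans \
  -i reuters-vectors/tfidf-vectors \
  -c reuters-kmeans-centroids \
  -o reuters-kmeans-clusters \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -cd 0.1 -x 10 -k 20 -cl -ow
```

Because -k is given, the 20 random seed centroids are written into the -c directory before clustering begins.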
- Examine the clusters and perhaps even do some analysis of how good the clusters are: Note: clusterdump can also report on the quality of clusters; that option is not shown here.
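The clusterdump step would look something like the following; the clusters-*-final directory name depends on how many iterations kmeans actually ran, and the output file name is an assumption:

```shell
mahout clusterdump \
  -i reuters-kmeans-clusters/clusters-*-final \
  -d reuters-vectors/dictionary.file-0 \
  -dt sequencefile \
  -n 10 -o clusterdump-output.txt
```

-n 10 prints the top ten terms for each cluster centroid, which is handy for the naming trick described below.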
- The clusteredPoints dir has the docs mapped into clusters, and if you created vectors with names (seq2sparse -nv) you’ll see file names. You also have the distance from the centroid using the distance measure supplied to the clustering driver. To look at this use seqdumper:
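For example (part-file name may differ in your run):

```shell
mahout seqdumper -i reuters-kmeans-clusters/clusteredPoints/part-m-00000 | more
```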
You will see that the file contains key: clusterid; value: wt (the likelihood that the vector is in the cluster), the distance from the centroid, the named vector belonging to the cluster, and the vector data.
For kmeans the likelihood will be either 1.0 or 0. Clusters, of course, do not have names. A simple solution is to construct a name from the top terms in the centroid as they are output from clusterdump.
Calculate several similar docs to each doc in the data
This will take all docs in the data set and, for each, calculate the 10 most similar docs. This can be used for a "give me more like this" feature. The algorithm is fairly fast and requires only three mapreduce passes.
- First create a matrix from the vectors:
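The matrix is built with the rowid job (output directory name is an assumption):

```shell
# Assign sequential integer row ids to the named vectors;
# this also writes the reuters-matrix/docIndex mapping mentioned below.
mahout rowid -i reuters-vectors/tfidf-vectors -o reuters-matrix
```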
You’ll get output announcing the number of rows (documents) and columns (total number of tokens in the dictionary) in the matrix. It will look something like this:
The rowsimilarity job can infer the number of columns (the -r arg) from the input matrix if it is not explicitly specified by the user. Also note that this creates a reuters-matrix/docIndex file where the rowids are mapped to docids. In the case of this example it will be rowid->file name, since we created named vectors in seq2sparse.
Note: This does not create a Mahout Matrix class but a sequence file so use seqdumper to examine the results.
- Create a collection of similar docs for each row of the matrix above: This will generate the 10 most similar docs to each doc in the collection.
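A sketch of the rowsimilarity invocation (directory names follow the rowid step above; -r is omitted since it can be inferred from the input matrix, as noted earlier):

```shell
mahout rowsimilarity \
  -i reuters-matrix/matrix \
  -o reuters-similarity \
  --similarityClassname SIMILARITY_COSINE \
  -m 10 \
  -ess
```

-m 10 caps the list at the ten most similar docs per row, and -ess excludes each doc's trivial similarity to itself.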
- Examine the similarity list: For each rowid there is a list of ten rowids and distances. These correspond to documents and distances created by the --similarityClassname; in this case they are cosines of the angle between the doc and the similar doc. Look in reuters-matrix/docIndex to find the rowid-to-docid mapping. It should look something like this:
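Both the similarity list and the docIndex are sequence files, so seqdumper works on each (part-file names may differ in your run):

```shell
mahout seqdumper -i reuters-similarity/part-r-00000 | more
mahout seqdumper -i reuters-matrix/docIndex | more
```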
A wide variety of tasks can be performed from the Mahout command line. Many parameters available in the Java API are supported, so the CLI is a good way to get an idea of how Mahout works and gives a basis for tuning your own use.