ClusteringYourData
Added by Grant Ingersoll, last edited by Drew Farris on Nov 07, 2009

Mahout_0.2

After you've done the QuickStart and are familiar with the basics of Mahout, it is time to cluster your own data.

The following pieces may be useful in getting started:

Input

To start, you will need your data in an appropriate Vector format (the format has changed since Mahout 0.1).

Text Preparation

Running the Process

Canopy

Background: canopy

Documentation of running canopy from the command line: canopy-commandline

kMeans

Background: k-Means

Documentation of running kMeans from the command line: k-means-commandline

Documentation of running fuzzy kMeans from the command line: fuzzy-k-means-commandline

Dirichlet

Background: dirichlet

Documentation of running dirichlet from the command line: dirichlet-commandline

Mean-shift

Background: meanshift

Documentation of running mean shift from the command line: mean-shift-commandline

Latent Dirichlet Allocation

Background and documentation: LDA

Retrieving the Output

TODO

Validating the Output

From Ted Dunning's response (see http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output):

A principled approach to cluster evaluation is to measure how well the cluster membership captures the structure of unseen data. A natural measure for this is how much of the entropy of the data is captured by cluster membership. For k-means and its natural L_2 metric, the natural cluster quality metric is the squared distance from the nearest centroid adjusted by the log_2 of the number of clusters. This can be compared to the squared magnitude of the original data or the squared deviation from the centroid for all of the data. The idea is that you are changing the representation of the data by allocating some of the bits in your original representation to represent which cluster each point is in. If those bits aren't made up for by the residue being small, then your clustering is making a bad trade-off.
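The trade-off described above can be sketched in a few lines of Python. This is only one possible reading of the metric, not Mahout code: it assumes each residual is coded under a spherical Gaussian with a fixed, hypothetical scale sigma, so a squared distance d2 costs d2 / (2 * sigma^2 * ln 2) bits (up to a constant shared by both representations), and naming one of k clusters costs log_2(k) bits. All function names and the toy data are made up for illustration.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean (L_2) distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def bits_per_point(points, centroids, sigma=1.0):
    """Average cost in bits per point: log2(k) to name the cluster plus
    the cost of the squared residual to the nearest centroid.
    ASSUMPTION: Gaussian residual coding with a fixed sigma."""
    scale = 1.0 / (2.0 * sigma ** 2 * math.log(2))
    residual = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return math.log2(len(centroids)) + scale * residual / len(points)

# Toy data: centroids and the global mean come from training data,
# and the score is computed on held-out points.
train = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 9.0)]
centroids = [mean_point(train[:2]), mean_point(train[2:])]
heldout = [(0.5, 0.5), (9.5, 9.5)]

clustered = bits_per_point(heldout, centroids)
# Baseline: a single "cluster" at the global mean, so log2(1) = 0 bits.
baseline = bits_per_point(heldout, [mean_point(train)])
print("clustered:", clustered, "baseline:", baseline)
```

If the clustered cost is not clearly below the baseline on held-out data, the bits spent on cluster membership are not being repaid by smaller residuals. The result depends on the assumed sigma, so it is best used to compare clusterings of the same data rather than as an absolute score.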

In the past, I have used other more heuristic measures as well. One of the key characteristics that I would like to see out of a clustering is a degree of stability. Thus, I look at the fractions of points that are assigned to each cluster or the distribution of distances from the cluster centroid. These values should be relatively stable when applied to held-out data.
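As a rough illustration of the stability check, the sketch below compares the fraction of points assigned to each cluster on training data and on held-out data. The use of total variation distance to summarize the difference is an assumption made here for illustration; the original text does not prescribe a particular comparison. Function names and data are hypothetical.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(point, centroids):
    """Index of the centroid nearest to the point."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))

def cluster_fractions(points, centroids):
    """Fraction of the points assigned to each cluster, in centroid order."""
    counts = Counter(assign(p, centroids) for p in points)
    return [counts[i] / len(points) for i in range(len(centroids))]

def total_variation(p, q):
    """Total variation distance between two discrete distributions
    (0 means identical, 1 means disjoint). ASSUMPTION: chosen here as
    a simple summary of how far the fractions drift."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

centroids = [(0.0, 0.0), (10.0, 10.0)]
train = [(0.0, 1.0), (1.0, 0.0), (9.0, 10.0), (10.0, 9.0)]
heldout = [(1.0, 1.0), (0.0, 2.0), (11.0, 10.0), (2.0, 0.0)]

train_frac = cluster_fractions(train, centroids)
heldout_frac = cluster_fractions(heldout, centroids)
drift = total_variation(train_frac, heldout_frac)
print(train_frac, heldout_frac, drift)
```

A small drift suggests the cluster sizes are stable on unseen data. The same comparison could be applied to a histogram of distances from the assigned centroid.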

For text, you can actually compute perplexity, which measures how well cluster membership predicts what words are used. This is nice because you don't have to worry about the entropy of real-valued numbers.
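A minimal sketch of a perplexity calculation follows. It assumes a very simple model that the original text does not specify: each cluster gets a unigram word distribution with add-alpha smoothing, and every held-out document already carries a cluster label. It also assumes every held-out word is in the training vocabulary. The function names and the tiny corpus are hypothetical.

```python
import math
from collections import Counter

def train_cluster_models(docs, labels, vocab, alpha=1.0):
    """Per-cluster unigram distributions with add-alpha smoothing.
    ASSUMPTION: a unigram model is used purely for illustration."""
    counts = {}
    for doc, label in zip(docs, labels):
        counts.setdefault(label, Counter()).update(doc)
    models = {}
    for label, c in counts.items():
        total = sum(c.values()) + alpha * len(vocab)
        models[label] = {w: (c[w] + alpha) / total for w in vocab}
    return models

def perplexity(docs, labels, models):
    """2 ** (average negative log2 probability per word), where each word
    is scored by the model of its document's cluster."""
    log_sum, n_words = 0.0, 0
    for doc, label in zip(docs, labels):
        for w in doc:
            log_sum += math.log2(models[label][w])
            n_words += 1
    return 2 ** (-log_sum / n_words)

train_docs = [["goal", "team", "goal"], ["team", "match"],
              ["stock", "market", "stock"], ["market", "price"]]
train_labels = [0, 0, 1, 1]
heldout_docs = [["goal", "match", "team"], ["price", "stock", "market"]]
heldout_labels = [0, 1]
vocab = sorted({w for d in train_docs for w in d})

clustered = perplexity(heldout_docs, heldout_labels,
                       train_cluster_models(train_docs, train_labels, vocab))
# Baseline: ignore the clustering and put every document in one cluster.
baseline = perplexity(heldout_docs, [0, 0],
                      train_cluster_models(train_docs, [0, 0, 0, 0], vocab))
print("clustered:", clustered, "baseline:", baseline)
```

Lower perplexity on held-out text means cluster membership predicts word usage better. Comparing against the single-cluster baseline shows how much the clustering itself contributes.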

Manual inspection and the so-called laugh test are also important. The idea is that the results should not be so ludicrous as to make you laugh. Unfortunately, it is pretty easy to kid yourself into thinking your system is working using this kind of inspection. The problem is that we are too good at seeing (making up) patterns.

References