Apache Mahout … Mahout Wiki DeletionCandidates Algorithms Collaborative Filtering with ALS-WR

Go to start of metadata

Problem setting

In Collaborative Filtering we want to predict which items that a user has not yet seen might be highly preferred by the user. We assume that users give explicit ratings on a fixed scale (say from 1 to 5). We collect all these ratings in a matrix A. Only a minority of the cells of A is filled. In order to produce recommendations we need to find a way to predict the rating of a user to an unseen item. This corresponds to filling the blanks in A. We will demonstrate the approach on a toy example.

rating matrix A (users x items)

 5.00 5.00 2.00 - 2.00 - 3.00 5.00 - 5.00 - 3.00 3.00 - - 5.00

The latent factor approach

A very successful approach to this problem are the so called latent factor models. These model the users and items as points in a k-dimensional feature space. An unknown rating can than simply be estimated by taking the dot product between the corresponding user and item feature vectors.

Mathematically spoken, we decompose A into two other matrices U and M whose combination is a good approximation of A

user feature matrix U (users x features)

 1.12 1.49 0.48 1.31 -0.52 0.59 1.13 0.67 -0.52 1.39 0.05 0.45

item feature matrix M (items x features)

 1.81 1.62 0.74 2.66 1.71 -1.08 1.73 -0.23 0.78 3.16 -0.24 0.9

We can than compute an approximation A_k of A which has all cells filled. Each previously empty cell now contains a predicted rating.

prediction matrix A_k (users x items)

A_k = UM'

 4.78 4.98 1.97 3.61 1.98 1.97 2.85 4.81 2.75 4.71 1.4 2.94 2.94 3.32 2.75 4.79

Finding the decomposition

Mahout uses Alternating Least Squares with Weighted Lambda-Regularization to find the decomposition. Please refer to Large-scale Parallel Collaborative Filtering for the Netflix Prize (PDF) for details of the algorithm.

Taking this approach into production

In order to successfully apply this algorithm, a regularization parameter lambda and the number of iterations to run have to be determined. Currently you have to figure these out manually using holdout tests.

Mahout includes a toy example in examples/bin/factorize-movielens-1M.sh that shows how to do a simple holdout test on the Movielens 1 million ratings dataset. It also shows how to compute top-N recommendations from the resulting factorization:

Hints

Please be aware that ALS-WR is an iterative algorithm. Iterative algorithms currently show poor performance on Hadoop caused by the enormous overhead of scheduling and checkpointing each single iteration.

Labels
• No labels