Building a Recommender System in Spark with ALS

Summary: Spark has an implementation of Alternating Least Squares (ALS) along with a set of very simple functions to create recommendations based on past data.

Recommender Systems with Apache Spark's ALS Function from Will Johnson

Building a recommendation engine has never been easier than now. My #1 goal is to get more people to build data tools like recommender systems. The more it becomes common place, the bigger and better things we as analysts can do. So I hope you enjoy the slides and the code is down below.

Recommender System Basics

Nearest Neighbors Approach
Matrix Factorization Models

Imagine you’re trying to predict how much a user (Guido) will like a new programming language (spark).

A Nearest Neighbors approach uses k Nearest Neighbors to find the most similar user(s) to Guido and then takes a weighted average to estimate how much he will like the new programming language.

User-based recommendations are very volatile. A single new rating or new user may dramatically affect the recommendations given. A more robust method is item-based collaborative filtering.

Very similar concept, but compares items to items.
Adding one more user won’t dramatically change anything.
Item-to-Item builds a table of similarities for every item to every item.
This lets you build your model and then simply do some multiplication to find the next best item.

A Matrix Factorization approach focuses on finding latent factors that are not necessarily apparent in the data.

In Linear Algebra, a matrix can be decomposed into multiple parts.
Those parts can be recombined and this creates recommended “scores” for products not yet purchased.
The main function in Spark for MatrixFactorization is ALS: Alternating Least Squared

Matrix Factorization (MF) is the cutting edge of recommender systems. The Netflix Prize Competition resulted in a bunch of research on MF methods. The winners combined 500 different recommendation systems to increase Netflix’s recommender accuracy by 10%.

However, Netflix, by the end of the competition, was no longer looking to minimize their Root Mean Squared Error. Instead, their business had changed and they were looking to evaluate more contextual recommendations (which are much harder to measure with a straightforward metric like RMSE).

Get to the Code Already!

Spark does not include an implementation for user-based or item-based collaborative filtering. Instead, it has the Alternating Least Squares (ALS) Matrix Factorization method.

ALS works by trying to find the optimal representation of a user and a product matrix – which when combined, should accurately represent the original dataset. The genius part of ALS is that it alternates between finding the optimal values for the user matrix and the product matrix.

The ALS function is in the MLlib package and is exceedingly easy to use! The example code uses the MovieLens 100K dataset. Make sure you download it to code along!

Procedural Code for ALS in Spark

Additional Resources

Research Papers to help understand Recommender Systems

Alternating Least Squares for Low-Rank Matrix Reconstruction
Amazon.com Recommendations Item-to-Item Collaborative Filtering
Matrix Factorization Techniques for Recommender Systems
Collaborative Filtering for Implicit Datasets
Contextual Recommendation (long-term and short-term memory for recos)

Some more pragmatic resources include Intro to Recommender Systems on Coursera and Recommender Systems: An Introduction. I really liked the Coursera course but the book is a very nice, easy introduction (and great reference book) to recommendation systems.

Learn by Marketing

Data Mining + Marketing in Plain English