Lately, I’ve written a few iterations of pyspark to develop a recommender system (I’ve had some practice creating recommender systems in pyspark). I ran into a situation where I needed to generate some recommendations on some different datasets. My problem was that I had to decipher some of the prediction documentation. Because of my struggles, I decided to provide a little more detailed documentation on what each prediction function does.
- I’ve written a basic pyspark tutorial for those wanting an introduction.
- This tutorial is using the MovieLens 100K dataset
- These outputs are up-to-date through Spark 2.0.2.
- All of these functions DO NOT remove users who have already purchased / rated a product. You have to remove them in post-processing.
Setting up the data
Before we can get to creating any predictions, we need to load the data and create a basic model.
from pyspark import SparkConf, SparkContext from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating movielens = sc.textFile("../in/ml-100k/u.data") data = movielens.map(lambda x: x.split('\t'))\ .map(lambda x:(int(x), int(x), float(x))) ratings = data.map(lambda l: Rating(int(l), int(l), float(l))) rank = 5 #Creates 5 latent factors, should experiment with this value numIterations = 10 #How many times will be iterate over the data #Create the model on the training data model = ALS.train(ratings, rank, numIterations)
Now we can use the various prediction functions on the
Predict Function – Return a single floating point value
The predict function takes a user id and a product id and produces a single floating point value.
This is a useful tool for interactive exploration of predictions. For example, take your best customer (or most important rater?) and check to see what the prediction would be for your most popular product.
model.predict(196, 242) type(model.predict(196, 242)) #<type 'float'>
predictAll Function – Returns RDD of Rating Objects
The predictAll function takes an RDD of user id and product id pairs and returns a prediction *(as floating point) for each pair. Essentially, it’s the vector version of the predict function.
In order to get the example to work, I create some example pairs. For user 196, I want predictions for product id 242, 242, and 244.
user_product_pairs = sc.parallelize([(196, 242), (196, 243), (196, 244)]) predictAllOutput = model.predictAll(user_product_pairs) type(predictAllOutput) #<class 'pyspark.rdd.RDD'> predictAllOutput.first() #Rating(user=196, product=242, rating=3.5317132284294104)
recommendUsers – Returns a List of Ratings in Descending Order by Rating
This is another tool that is great for exploration purposes. Given a product id and a number of users (N), recommendUsers will find the top N users for the product id. This is a great tool for companies who are product-first focused. If you’re developing an email campaign for a product and need a list of customers who would appreciate that product, this is your function.
The output is a regular python list.
Personally, I found that to be a little annoying since it would be more convenient to make it the same output (an RDD) as the other mass recommendation generating functions (like recommendProductsForUsers).
id242RecoUsers100 = model.recommendUsers(242, 100) type(id242RecoUsers100) #<type 'list'> id242RecoUsers100[0:2] #[Rating(user=219, product=242, rating=5.611818784586312), #Rating(user=68, product=242, rating=5.586003851543971)]
recommendProducts – Returns a List of Ratings
Similar to the above recommendUsers but product focused. If you want to generate the top recommendations for one user, this is it. This could be very useful in a web service that hooks into your spark instance and generates the recommendations on the fly.
user219RecoItems100 = model.recommendProducts(219, 100) type(user219RecoItems100) #<type 'list'> user219RecoItems100[0:2] #[Rating(user=219, product=851, rating=9.600323153015365), #Rating(user=219, product=1175, rating=9.36959344778193)]
recommendProductsForUsers – Returns RDD with(UserID, (RatingObj, RatingObj, …) ) where RatingObj is sorted descending by rating
Now we’re getting into the most applicable tool for generating recommendations in batch. This is by far the most important method for me.
You simply say how many recommendations (N) you want and it will output an RDD where…
- The RDD has the same number of rows as number of users.
- Each row is a tuple with
(user id, (Rating, Rating, ..., Rating)).
- The second element of the tuple is another tuple full of N rating objects.
- The rating objects are sorted descending by rating.
That last part makes it a bit harder to work with if you’re just creating a csv or some file for further processing.
TopProductsAllUsers = model.recommendProductsForUsers(10) type(TopProductsAllUsers) #<class 'pyspark.rdd.RDD'> TopProductsAllUsers.first() #(720, (Rating(user=720, product=1205, rating=8.57902951605234), Rating(user=720, product=1664, rating=8.050483208046831), Rating(user=720, product=1598, rating=7.583552623567954), Rating(user=720, product=913, rating=7.43263122066972), Rating(user=720, product=1233, rating=6.822753524823774), Rating(user=720, product=1631, rating=6.67932957489265), Rating(user=720, product=1639, rating=6.647824532323629), Rating(user=720, product=1427, rating=6.437768325335273), Rating(user=720, product=1466, rating=6.433707466558552), Rating(user=720, product=1114, rating=6.413562253329957)))
recommendUsersForProducts – Returns RDD with(ProductID, (RatingObj, RatingObj, …) ) where RatingObj is sorted descending by rating
Same as above but product focused instead of users. Give a required number of recommendations (N) and the function will produce N users for every product in the training data.
TopUsersAllProducts = model.recommendUsersForProducts(10) type(TopUsersAllProducts) #<class 'pyspark.rdd.RDD'> TopUsersAllProducts.first() #(1440, (Rating(user=61, product=1440, rating=9.073874707471804), Rating(user=462, product=1440, rating=8.22083024996575), Rating(user=300, product=1440, rating=8.122701346040417), Rating(user=261, product=1440, rating=7.845205624742849), Rating(user=797, product=1440, rating=7.14356460859009), Rating(user=78, product=1440, rating=7.015719961331772), Rating(user=170, product=1440, rating=6.954448894950653), Rating(user=575, product=1440, rating=6.84725744220184), Rating(user=340, product=1440, rating=6.74445531663849), Rating(user=791, product=1440, rating=6.517175876371361)))
That ends all of the pre-built prediction / recommendation functions in pyspark for the MatrixFactorization models!
Bottom Line: Personally, I use recommendProductsForUsers the most but the other functions are useful for exploratory analysis and potentially web services for your website.