Collaborative filters work with an interaction matrix, also called a rating matrix. The aim of this algorithm is to learn a function that can predict whether a user will benefit from an item, meaning the user will likely buy, listen to, or watch that item.
Among collaborative-based systems, we can encounter two types: user-item filtering and item-item filtering. We are going to apply a matrix factorization approach to user data to generate a music recommender system.
Matrix Factorization is a powerful way to implement a recommendation system. The idea behind it is to represent users and items in a lower-dimensional latent space.
So, in other words, matrix factorization methods decompose the original sparse user-item matrix into lower-dimensional, denser rectangular matrices with latent features. This not only mitigates the sparsity issue but also makes the method scalable: no matter how big the matrix is, you can always find lower-dimensional matrices whose product closely approximates the original one.
Among the different matrix factorization techniques, we find the popular singular value decomposition (SVD).
This can become an abstract concept as we dig into the mathematical foundations, but we'll try to keep it as simple as possible. Imagine we have a matrix A that contains the data for n users x m songs. This matrix can be decomposed uniquely into three matrices; let's call them U, S, and V.
In terms of our song recommender:
U is an n users x r user-latent feature matrix.
V is an m songs x r song-latent feature matrix.
S is an r x r non-negative diagonal matrix containing the singular values of the original matrix, so that A = U S V^T.
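To make the decomposition concrete, here is a minimal sketch using NumPy on a toy interaction matrix (the matrix and its dimensions are invented purely for illustration):

import numpy as np

# A toy 4 users x 3 songs interaction matrix (values invented for illustration)
A = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0],
              [0.0, 1.0, 4.0]])

# Decompose A into the three factors described above
U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)  # r x r diagonal matrix of singular values

# Multiplying the three factors back together recovers the original matrix
print(np.allclose(A, U @ S @ Vt))  # True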
Instead of working with the implicit rating as it is, we'll apply a binning technique. We'll define 10 categories: original data values that fall into the interval from 0 to 1 will be replaced by the representative rating of 1; if they fall into the interval from 1 to 2, they will be replaced by 2; and so on. The last category will be assigned to original values ranging from 9 to 2213.
Figure 3.7 Categories rating diagram (count of ratings per category)
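One possible way to implement this binning with pandas is sketched below. The df_song_reduced DataFrame and its listen_count column are assumed from the earlier data preparation steps, and the bin edges follow the intervals described above:

import pandas as pd

# Replace each raw listen_count with its representative rating from 1 to 10
bins = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 2213]
df_song_reduced['listen_count'] = pd.cut(
    df_song_reduced['listen_count'], bins=bins, labels=range(1, 11)
).astype(int)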
from surprise import Dataset, Reader

reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(df_song_reduced[['user_id', 'song_id', 'listen_count']], reader)
For this task, we are going to use a fun package called Surprise. Surprise is an easy-to-use Python library specially designed for recommender systems.
To load a dataset from our DataFrame, we will use the load_from_df() method.
We will need to pass the following parameters:
- df: The DataFrame containing the ratings. It must have three columns, corresponding to the user ids, the song ids, and the ratings.
- reader (Reader): A reader to read the data. Only the rating_scale field needs to be specified.
Finally, we will split the dataset into training and testing sets.
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(data, test_size=.25)
Rather than computing this classical decomposition directly, the SVD algorithm factorizes the original matrix as the product of two lower-dimensional matrices: the first has a row containing latent features associated with each user, and the second has a column containing latent features associated with each song.
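As a rough illustration of how a prediction is read off these two factor matrices, consider the toy sketch below (sizes and values are invented; Surprise's SVD additionally learns bias terms, which we omit here):

import numpy as np

rng = np.random.default_rng(0)
n_users, n_songs, r = 5, 7, 3       # toy sizes with r latent features
P = rng.normal(size=(n_users, r))   # one row of latent features per user
Q = rng.normal(size=(n_songs, r))   # one row of latent features per song

# The predicted rating for a (user, song) pair is the dot product
# of the corresponding latent feature vectors
predicted_rating = P[2] @ Q[4]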
First of all, we need to find the best parameters for our model given the data we have.
The GridSearchCV class will compute accuracy metrics for the SVD algorithm over the selected combinations of parameters, using a cross-validation procedure. This is useful for finding the best set of parameters for a prediction algorithm.
cross_validate will run a cross-validation procedure for our best estimator found during the grid search, and it will report accuracy measures and computation times.
This method uses KFold as the cross-validation technique.
from surprise import SVD
from surprise.model_selection import GridSearchCV, cross_validate

param_grid = {'n_factors': [120, 160], 'n_epochs': [100, 110], 'lr_all': [0.001, 0.005], 'reg_all': [0.08, 0.12]}
grid_search_svd = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3, joblib_verbose=4, n_jobs=-2)
grid_search_svd.fit(data)
find_algo = grid_search_svd.best_estimator['rmse']
print(grid_search_svd.best_score['rmse'])
print(grid_search_svd.best_params['rmse'])

# Perform the cross validation
cross_validate(find_algo, data, measures=['RMSE'], cv=5, verbose=True)
Evaluating RMSE of algorithm SVD on 5 split(s).

                 Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std
RMSE (testset)   2.1704  2.1807  2.1786  2.1790  2.1766  2.1771  0.0036
Fit time         246.07  247.41  247.19  246.36  247.11  246.83  0.52
Test time        2.35    1.99    2.37    2.31    2.01    2.21    0.17

{'test_rmse': array([2.17037624, 2.18068088, 2.17857971, 2.17901004, 2.17660557]),
 'fit_time': (246.07249093055725,
  247.4095058441162,
  247.18708205223083,
  246.3611478805542,
  247.10727787017822),
 'test_time': (2.347102403640747,
  1.9904148578643799,
  2.365492105484009,
  2.314052104949951,
  2.009442090988159)}
Figure 3.8 Evaluating RMSE of algorithm SVD
After finding the best parameters for the model, we can create our final model. We use the method fit() to train the algorithm on the training set, and then the method test() to return the predictions obtained from the test set.
from surprise import accuracy

# After getting the best parameters, we fit the model again
final_algorithm = SVD(n_factors=160, n_epochs=100, lr_all=0.005, reg_all=0.1)
final_algorithm.fit(trainset)

# And we test it
test_predictions = final_algorithm.test(testset)

# Get the accuracy
print(f"The RMSE is {accuracy.rmse(test_predictions, verbose=True)}")
The RMSE is 2.186
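Figure 3.9 below shows a list of recommendations produced for one user. One possible way to build such a top-N list from the test predictions is a small helper like the following (this function is hypothetical and shown only for illustration; each Surprise prediction exposes the user id, the item id, and the estimated rating):

from collections import defaultdict

def top_n_recommendations(predictions, n=10):
    """Group predictions by user and keep the n highest estimated ratings."""
    top_n = defaultdict(list)
    for pred in predictions:
        top_n[pred.uid].append((pred.iid, pred.est))
    for uid, ratings in top_n.items():
        ratings.sort(key=lambda item: item[1], reverse=True)
        top_n[uid] = ratings[:n]
    return top_n

recommendations = top_n_recommendations(test_predictions, n=20)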
Result
['Greatest Hits',
 'Gypsy',
 'Still Crazy After All These Years',
 '50 Ways to Leave Your Lover',
 'Tracy Chapman',
 'Fast Car',
 'Billy Joel',
 'Glass Houses',
 'You May Be Right',
 'L.A. Woman',
 'Love Her Madly',
 'Tango In the Night',
 'Caroline',
 'Idlewild South',
 'Midnight Rider',
 'Revolver',
 'Eleanor Rigby',
 'Seven Wonders',
 'Cloud Nine',
 'Hello, I Love You']
Figure 3.9 Result of Collaborative Filtering