Hybrid Similarity Matrix in Neighborhood-based Recommendation System


2021 8th NAFOSTED Conference on Information and Computer Science (NICS)

Tan Nghia Duong, Truong Giang Do, Nguyen Nam Doan, Tuan Nghia Cao, Tien Dat Mai
School of Electronics and Telecommunications, Hanoi University of Science and Technology, Hanoi, Viet Nam
E-mail: nghia.duongtan@hust.edu.vn

Abstract—Modern hybrid recommendation methods have successfully mitigated the data sparsity and cold-start problems. Existing hybrid neighborhood-based models adopt both the transaction history and the profiles of users and items, although each is used separately in different phases of learning the similarity scores and giving recommendations. This paper proposes utilizing both types of information to measure the similarity scores between items, creating a more robust hybrid similarity matrix which helps improve the accuracy of neighborhood-based models. Comprehensive experiments show that our proposed hybrid similarity matrix can boost the accuracy of neighborhood-based systems by 0.77-4.46% compared to earlier related hybrid methods.

Index Terms—Collaborative filtering, Similarity measure, Ensemble model, Recommendation system

I. INTRODUCTION

In the current rapid shift to online markets, the recommendation system (RS) has become an essential component of e-commerce platforms, automatically providing personalized suggestions to consumers by analyzing transaction history as well as other information from item and user profiles [1]. Generally, there are three approaches in RS [2]: content-based, collaborative filtering, and hybrid models. Content-based techniques [3] give recommendations based on descriptions of items or profiles of users. By contrast, collaborative filtering (CF) approaches [4] do not require content information but rely on the analogy between users with similar tastes, determined from their past transactions. Despite its successful application in recommendation systems, the existence of
sparse rating matrices and the cold-start problem significantly degrade the performance of CF. Therefore, hybrid methods [5]-[8] have been proposed to tackle this problem by combining content-based and CF models, and have yielded promising results. Several optimization techniques have been adopted to enhance the performance of traditional methods [9]-[11]; however, their performance is still restricted by linearity. In real-world applications, deep learning has proven to be an effective approach to exploring the non-linear correlations between user-item interactions and data features [12]-[18].

Due to the advantages of hybrid models compared to traditional methods, our study focuses solely on the similarity matrix, incorporating existing similarity measures to improve neighborhood-based systems. Our main contribution in this paper is experimenting with a variety of methods to combine the similarity matrices calculated from rating-based and content-based information, which creates hybrid similarity matrices more robust than either stand-alone matrix.

978-1-6654-1001-4/21/$31.00 ©2021 IEEE

The remainder of the paper is organized as follows. Section II presents the formalized problem and discusses related work on the rating prediction problem. Our proposed methods are presented in Section III. Section IV describes our experimental results and in-depth analysis. Finally, we conclude with a summary of this work in Section V.

II. PRELIMINARIES

In this paper, u, v denote users and i, j denote items. r_ui denotes the rating by user u for item i, and all the (u, i) pairs are stored in the set K = {(u, i) | r_ui is known}. Meanwhile, R(u) denotes the set of all items rated by user u. In the rating prediction task, the objective is to predict the unknown rating r̂_ui where user u has not rated item i yet. The state-of-the-art CF techniques for the rating prediction task and existing hybrid variants are briefly reviewed as follows.

A. Collaborative filtering models

The neighborhood-based
approach in CF gives recommendations based either on the similarity between users (user-oriented), which predicts a user's preference from similar users, or on the similarity between items (item-oriented), which finds items similar to those a user liked and recommends them to her. Of the two methods, the latter, introduced in [19], is more successful due to its superior accuracy and its capability of providing a rational explanation for recommendations [1]. In this work, the item-item approach is adopted in our implementations, not only for the reasons discussed but also to utilize the more abundant content-based data of items.

Fig. 1: The flow-graph of a neighborhood-based CF system.

Figure 1 illustrates a simplified flow-graph of a neighborhood-based model, whose fundamental component is the similarity measure. By computing the similarity degree s_ij between all pairs of items i and j using popular similarity measures such as Cosine similarity (Cos) or the Pearson Correlation Coefficient (PCC), we can identify the set of k neighbors S^k(i, u), which consists of the k items most similar to i rated by user u. Then, r̂_ui can be predicted as a weighted average of the ratings of similar items [19]:

    \hat{r}_{ui} = \frac{\sum_{j \in S^k(i,u)} s_{ij} r_{uj}}{\sum_{j \in S^k(i,u)} s_{ij}}    (1)

Even though Equation (1) can capture the user-item interactions, much of the observed rating signal is due to bias effects associated with either users or items, independently of their interactions. In detail, some items usually receive higher ratings than others, and some users tend to give higher ratings than others. The kNNBaseline model proposed in [20] adjusts the above formula through a baseline estimate which accounts for the user and item effects as follows:

    \hat{r}_{ui}^{kNNBaseline} = b_{ui} + \frac{\sum_{j \in S^k(i,u)} s_{ij} (r_{uj} - b_{uj})}{\sum_{j \in S^k(i,u)} s_{ij}}    (2)

where b_ui = μ + b_u + b_i denotes the baseline estimate, μ denotes the mean of all ratings, and b_u and b_i correspond to the biases of user u and item i, respectively, which can be trained using popular optimization algorithms such as Stochastic Gradient Descent (SGD) or Alternating Least Squares (ALS). Besides Cos and PCC, advanced similarity measures such as PCCBaseline [20] or cubedPCC [21] can effectively improve the performance of neighborhood-based models.

Matrix factorization (MF) is another typical model-based CF technique, which has proven its superior accuracy and flexible scalability in the Netflix Prize [22]. By using Singular Value Decomposition (SVD), both users and items are mapped into a latent space of dimension k to uncover latent features that explain the observed ratings [22]. An extended version of SVD, named SVD++ [23], was proposed to improve the accuracy by taking implicit feedback into account as an additional indication of user preferences.

B. Integrating content-based information into CF models

Generally, hybrid methods integrate user-item ratings and auxiliary information to generate unified systems. In [5], movie genres, user profiles, and their past interactions are integrated into a generalized linear framework. One restriction of this work is the data privacy concern, which limits the shared user profiles. In another example, the authors established an integrated model between neighborhood-based CF and SVD++ to gain a better result by generating a new representation for a user from the items rated by that user instead of using an explicit parameterization [23]. A model named Factorization Machines, combining MF and Support Vector Machines [6], also uses both ratings and auxiliary information for predictions.

Fig. 2: The flow-graph of a kNNContent model.

In [7], the similarity scores between users are computed based on users' side information obtained from item profiles. The problems of calculating similarity scores using rating information are discussed in more detail in [8], [15], which showed that the sparsity of the rating matrix can yield an inaccurate similarity score between two items that share only a few common users.
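For concreteness, the prediction rules of Equations (1) and (2) can be sketched in a few lines of NumPy. The similarity scores, ratings, and baseline estimates below are illustrative toy values, not quantities trained by the paper's models:

```python
import numpy as np

def predict_knn_baseline(s, r, b_uj, b_ui):
    """Equation (2): baseline-adjusted weighted average over the k
    neighbors of item i that user u has rated.

    s    -- similarity scores s_ij for the neighbors j in S^k(i, u)
    r    -- the user's observed ratings r_uj for those neighbors
    b_uj -- baseline estimates b_uj = mu + b_u + b_j for the neighbors
    b_ui -- baseline estimate for the target (u, i) pair

    With all baselines set to zero this reduces to the plain
    weighted average of Equation (1).
    """
    s, r, b_uj = map(np.asarray, (s, r, b_uj))
    return b_ui + np.sum(s * (r - b_uj)) / np.sum(s)

# Toy example: three neighbor items of i rated by user u.
s = [0.9, 0.6, 0.3]   # similarities s_ij
r = [4.0, 3.5, 5.0]   # the user's ratings r_uj
b = [3.6, 3.4, 3.9]   # baselines b_uj
print(predict_knn_baseline(s, r, b, b_ui=3.7))
```

The deviations (r_uj - b_uj) rather than the raw ratings are averaged, so an item that is rated 4.0 by a generous user contributes less than the same 4.0 from a strict one.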
Furthermore, filtering the common users who rated both items in order to calculate their similarity score is a time-consuming task due to the large number of users. To address these problems, novel similarity measures were proposed using content-based information (see Figure 2). Let q_i = {q_i1, q_i2, ..., q_if} ∈ R^f denote the feature vector of item i, where f is the number of features. The similarity score s_ij between two items i and j is calculated as follows:

    s_{ij}^{Cos\text{-}content} = \frac{\sum_{k=1}^{f} q_{ik} q_{jk}}{\sqrt{\sum_{k=1}^{f} q_{ik}^2} \sqrt{\sum_{k=1}^{f} q_{jk}^2}}    (3)

or

    s_{ij}^{PCC\text{-}content} = \frac{\sum_{k=1}^{f} (q_{ik} - \bar{q}_i)(q_{jk} - \bar{q}_j)}{\sqrt{\sum_{k=1}^{f} (q_{ik} - \bar{q}_i)^2} \sqrt{\sum_{k=1}^{f} (q_{jk} - \bar{q}_j)^2}}    (4)

where q̄_i and q̄_j are the means of the feature vectors q_i and q_j, respectively. Hereafter, a kNNBaseline model using one of these similarity measures is referred to as kNNContent.

III. PROPOSED SYSTEM

As described in Section II, even though the two distinct neighborhood-based CF techniques, kNNBaseline and kNNContent, share the same prediction algorithm, each has a different way of calculating the similarity matrix between items. In the kNNBaseline model, the similarity between items i and j is measured by applying either the Cos or the PCC measure to the available rating information. As for the kNNContent model, the similarity values between items depend entirely on the content-based vector of each item, producing more accurate similarity scores and lowering the running time of this stage.

Although these techniques have their own benefits, several noticeable problems remain. On one hand, the similarity scores between items in the kNNBaseline model are calculated only from the available ratings without any other data, which is why the model requires many ratings to produce precise scores. In practice, however, customers tend not to give feedback about used items, which causes the utility matrix to become highly sparse (for example, 99.47% of the ratings in the MovieLens 20M dataset are missing [24]).
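The content-based measures of Equations (3) and (4) are ordinary cosine similarity and Pearson correlation applied to item feature vectors; PCC is simply cosine similarity after mean-centering each vector. A minimal NumPy sketch, with made-up Tag-Genome-style feature vectors:

```python
import numpy as np

def cos_content(qi, qj):
    """Equation (3): cosine similarity between two feature vectors."""
    qi, qj = np.asarray(qi, float), np.asarray(qj, float)
    return qi @ qj / (np.linalg.norm(qi) * np.linalg.norm(qj))

def pcc_content(qi, qj):
    """Equation (4): Pearson correlation, i.e. the cosine of the
    mean-centered feature vectors."""
    qi, qj = np.asarray(qi, float), np.asarray(qj, float)
    return cos_content(qi - qi.mean(), qj - qj.mean())

# Toy vectors of per-tag relevance scores (illustrative values only).
qi = [0.9, 0.1, 0.4]
qj = [0.8, 0.2, 0.5]
print(cos_content(qi, qj), pcc_content(qi, qj))
```

Because only the two f-dimensional feature vectors are needed, neither measure requires scanning the rating matrix for co-rating users.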
On the other hand, although the kNNContent model focuses only on content-based information, a manually managed database storing items' genres can be expensive to maintain and scale over time. To automate this process, several studies [25]-[27] proposed machine learning systems that predict items' properties from user-provided information. Specifically, comments and reviews about movies given by viewers were used to build the Tag Genome data structure in the MovieLens 20M dataset [25], where each movie is represented by an 1128-element vector. These representations include many movie genres, but some of them are identical, which can give inaccurate information [15]. Moreover, since content-based information does not reflect item quality, a rational recommendation system must not give a high similarity score when two items are similar in terms of properties but one is much worse than the other (which heavily affects user ratings). In consequence, even though these models are simple and practical, they have not reached the expected potential of the neighborhood-based CF approach. An ensemble of kNNBaseline and kNNContent can be an interesting approach to this problem [28]; however, the core problem of identifying similar items accurately remains unsolved, since the predicting stage of each model takes place independently.

To fully utilize the advantages of both the kNNBaseline and kNNContent models, this paper analyzes several techniques that combine two similarity matrices, one calculated from rating information (denoted by S_r) and one calculated from content-based information (denoted by S_c), to gather as much information as possible. More specifically, as illustrated in Figure 3, aggregation functions are applied to group the different similarity matrices into a unified matrix.

Fig. 3: The flow-graph of the proposed system.

The purpose of this step is to ensure that the system can identify the correct similar items with more accurate similarity scores.
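The aggregation step in Figure 3 amounts to an element-wise combination of the two item-item matrices. Assuming S_r and S_c are NumPy arrays of the same shape, the candidate operators formalized in Equations (5) to (8) below each reduce to a one-liner; the matrix entries here are illustrative toy values:

```python
import numpy as np

# Toy rating-based and content-based similarity matrices for 3 items.
# (Illustrative values only; real matrices come from the measures
# discussed in Section II.)
S_r = np.array([[1.0, 0.2, 0.7],
                [0.2, 1.0, 0.4],
                [0.7, 0.4, 1.0]])
S_c = np.array([[1.0, 0.6, 0.3],
                [0.6, 1.0, 0.5],
                [0.3, 0.5, 1.0]])

aggregators = {
    "max": np.maximum,    # keep the larger of the two scores
    "min": np.minimum,    # keep the smaller of the two scores
    "add": np.add,        # element-wise sum
    "mul": np.multiply,   # element-wise (Hadamard) product
}

# One unified matrix per aggregation method.
S = {name: f(S_r, S_c) for name, f in aggregators.items()}
print(S["mul"])
```

Since each aggregator is a binary element-wise operation, adding a third similarity matrix (e.g. from another content source) only means folding one more array into the same reduction.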
In our experiments, to aggregate the two matrices S_r and S_c into a single similarity matrix, we first experiment with the element-wise maximum in Equation (5) and minimum in Equation (6), to answer the question of whether only the largest or only the smallest of the two scores is needed to form an appropriate similarity matrix:

    S^{max} = \max(S_r, S_c)    (5)

    S^{min} = \min(S_r, S_c)    (6)

Then, to investigate whether a simple element-wise addition or multiplication of S_r and S_c can outperform each individual matrix in a neighborhood-based system, the S^add and S^mul matrices are calculated as follows:

    S^{add} = S_r + S_c    (7)

    S^{mul} = S_r \odot S_c    (8)

where ⊙ denotes the element-wise product of S_r and S_c. Note that with any of Equations (5) to (8), a system with multiple types of content-based data, and hence multiple similarity matrices, can be scaled up without difficulty.

IV. EXPERIMENTAL RESULTS

A. MovieLens Dataset and Evaluation Criteria

In order to evaluate the performance of our proposed models, the MovieLens 20M dataset is used as a benchmark. It contains 20,000,263 ratings and 465,564 tag applications across 27,278 movies given by 138,493 users. The ratings are float values ranging from 0.5 to 5.0 with a step of 0.5 [24]. This dataset also includes the Tag Genome data, which encodes how strongly movies exhibit particular properties represented by tags, with scores in the range of 0 to 1 [24].

TABLE I: Summary of the original MovieLens 20M and the preprocessed dataset

    Dataset      | # Ratings  | # Users | # Movies | Sparsity
    Original     | 20,000,263 | 138,493 | 27,278   | 99.47%
    Preprocessed | 19,793,342 | 138,185 | 10,239   | 98.97%

It is necessary to first preprocess the original dataset. Any movie that does not have Tag Genome data is discarded, and only movies and users with at least 20 ratings are kept. Table I summarizes the results. Eventually, the preprocessed dataset consists of 19,793,342 ratings (approximately 98.97% sparsity, compared to 99.47% in the original dataset) given by 138,185 users for 10,239 movies.
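For reference, the two accuracy metrics used in the evaluation below, formalized in Equations (9) and (10), can be computed as follows; the predictions and held-out ratings are toy values:

```python
import numpy as np

def rmse(pred, truth):
    """Equation (9): root mean squared error over the test set."""
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return np.sqrt(np.mean((pred - truth) ** 2))

def mae(pred, truth):
    """Equation (10): mean absolute error over the test set."""
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return np.mean(np.abs(pred - truth))

# Toy predicted ratings vs. held-out observed ratings.
pred  = [3.8, 4.2, 2.9, 4.6]
truth = [4.0, 4.0, 3.0, 5.0]
print(rmse(pred, truth), mae(pred, truth))
```

RMSE squares each error before averaging, so it penalizes a few large mistakes more heavily than MAE does; for both, smaller is better.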
For performance evaluation, we split the whole dataset into two distinct parts: 80% of the data is used as the training set and the remaining 20% as the testing set. To compare performance between models, three indicators are used: RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) for accuracy evaluation, where smaller values indicate better performance, and Time [s] for timing evaluation. RMSE and MAE are computed as

    RMSE = \sqrt{\sum_{(u,i) \in \text{Test Set}} (\hat{r}_{ui} - r_{ui})^2 / |\text{Test Set}|}    (9)

    MAE = \sum_{(u,i) \in \text{Test Set}} |\hat{r}_{ui} - r_{ui}| / |\text{Test Set}|    (10)

where |Test Set| is the size of the testing set, r̂_ui denotes the predicted rating of user u for item i estimated by the model, and r_ui denotes the corresponding observed rating in the testing set. The total duration of the model's learning process on the training set and of predicting all samples in the testing set is measured as Time [s].

B. Performance Evaluation

The performance of our proposed models is compared with the following common recommendation systems:
- Neighborhood-based models, including kNNBaseline, kNNContent, and an ensemble of the two models using averaged predictions.
- MF models, including SVD [22] and SVD++ [23].
- A deep learning-based system named item-based AutoRec (I-AutoRec) [12], which adopts an autoencoder network to estimate missing ratings.

In particular, we measure the RMSE and MAE of each model and the corresponding time to make rating predictions on the MovieLens 20M dataset. Each neighborhood-based model is evaluated with the number of neighbors chosen from k ∈ {10, 15, 20, 25, 30, 35, 40}, and the similarity measures implemented are both Cos and PCC, where S_r is calculated using the rating information from the training set and S_c is calculated from the Tag Genome data. For the SVD and SVD++ models, the number of hidden factors is chosen from {20, 30, 40, 50, 60, 80, 100}. The I-AutoRec model is constructed as a 3-layer autoencoder trained with 600 hidden neurons and the activation-function combination (Identity, Sigmoid). All experiments are carried out on Google Colaboratory with 25 GB of RAM and no GPU.

Fig. 4: Error rates of kNNBaseline models when incorporating the hybrid similarity matrices, with respect to different neighborhood sizes.

Experimental results with different similarity measures are displayed in Figure 4, where PCC shows its superiority over Cos. In particular, the element-wise minimum and maximum methods give poor and unstable performance compared to kNNContent and the other aggregation methods. This can be explained by the fact that, for each pair of items i and j, these techniques keep only the minimum or maximum value from either S_r or S_c; thus, the similarity between each pair of items includes information from only one method, resulting in unreliable predictions.

In contrast, by combining the two similarity matrices using element-wise addition or multiplication, the unified matrix collects more information, which results in more accurate similarity scores and gives better results. In addition, compared to the ensemble model, our proposed hybrid similarity matrices are much more flexible in terms of aggregation methods, without requiring several independent predicting processes, whilst providing competitive performance.

A comprehensive comparison of our proposed models is provided in Table II, where the number of neighbors k for each neighborhood-based model is chosen for best performance. Specifically, the most effective model, kNNBaseline using the element-wise multiplication of S_r and S_c, both measured by PCC, gains:
- 1.88% lower RMSE and 2.48% lower MAE than SVD
- 1.53% lower RMSE and 1.67% lower MAE than SVD++
- 0.45% lower RMSE and 0.66% lower MAE than I-AutoRec
- 4.13% lower RMSE and 4.46% lower MAE than kNNBaseline
- 1.42% lower RMSE
  and 1.60% lower MAE than kNNContent
- 0.77% lower RMSE and 0.88% lower MAE than the ensemble model of kNNBaseline and kNNContent

TABLE II: Performance of the kNNBaseline model using hybrid similarity matrices compared to other baseline models

    Model                              | RMSE   | MAE    | Time [s]
    SVD (40 factors)                   | 0.7922 | 0.6042 | 292
    SVD++ (40 factors)                 | 0.7894 | 0.5992 | 27,387
    I-AutoRec                          | 0.7808 | 0.5931 | 69,860
    kNNBaseline (k = 40)               | 0.8108 | 0.6167 | 565
    kNNContent (k = 20)                | 0.7885 | 0.5988 | 293
    Ensemble model                     | 0.7833 | 0.5944 | 827
    Hybrid S_r, S_c: S^add (k = 20)    | 0.7834 | 0.5943 | 571
    Hybrid S_r, S_c: S^mul (k = 20)    | 0.7773 | 0.5892 | 575

In addition to the significant improvements over the other traditional neighborhood-based and MF models, our proposed methods demonstrate that, by properly incorporating different types of information, a simple neighborhood-based system can match and even surpass the performance of a deep learning model whilst maintaining a reasonable prediction time. These improvements from directly aggregating the content- and rating-based similarity matrices demonstrate the potential of other methods combining different sources of information for similarity measurement.

V. CONCLUSIONS

A neighborhood-based RS relies on similarity scores between items to give accurate recommendations to users. While traditional similarity measures using the transaction history suffer from the data sparsity problem, the item features are biased and unable to represent the true quality of items. This paper proposes applying both the content-based and the rating-based information to measuring the similarity scores between items. Through intensive experiments, the proposed method has outperformed other baseline models, yielding at least a 0.77% improvement in prediction accuracy compared to related hybrid neighborhood-based methods in equivalent running time.

REFERENCES

[1] F. Ricci, L. Rokach, and B. Shapira, "Recommender systems: introduction and challenges," in Recommender Systems Handbook. Springer, 2015, pp. 1-34.
[2] G. Adomavicius and A. Tuzhilin, "Toward the next
generation of recommender systems: A survey of the state-of-the-art and possible extensions," IEEE Transactions on Knowledge & Data Engineering, vol. 17, no. 6, pp. 734-749, 2005.
[3] P. Lops, M. De Gemmis, and G. Semeraro, "Content-based recommender systems: State of the art and trends," in Recommender Systems Handbook. Springer, 2011, pp. 73-105.
[4] X. Su and T. M. Khoshgoftaar, "A survey of collaborative filtering techniques," Advances in Artificial Intelligence, vol. 2009, 2009.
[5] D. Agarwal and B.-C. Chen, "Regression-based latent factor models," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009, pp. 19-28.
[6] S. Rendle, "Factorization machines," in 2010 IEEE International Conference on Data Mining. IEEE, 2010, pp. 995-1000.
[7] J. Niu, L. Wang, X. Liu, and S. Yu, "FUIR: Fusing user and item information to deal with data sparsity by using side information in recommendation systems," Journal of Network and Computer Applications, vol. 70, pp. 41-50, 2016.
[8] T. N. Duong, V. D. Than, T. A. Vuong, T. H. Tran, Q. H. Dang, D. M. Nguyen, and H. M. Pham, "A novel hybrid recommendation system integrating content-based and rating information," in International Conference on Network-Based Information Systems. Springer, 2019, pp. 325-337.
[9] X. Wang, F. Luo, Y. Qian, and G. Ranzi, "A personalized electronic movie recommendation system based on support vector machine and improved particle swarm optimization," PLoS ONE, vol. 11, no. 11, p. e0165868, 2016.
[10] D. M. Jiménez-Bravo, J. Pérez-Marcos, D. H. De la Iglesia, G. Villarrubia González, and J. F. De Paz, "Multi-agent recommendation system for electrical energy optimization and cost saving in smart homes," Energies, vol. 12, no. 7, p. 1317, 2019.
[11] H. T. Nguyen, H. Q. Ngo, N. X. Tran, E. Björnson et al., "Pilot assignment for joint uplink-downlink spectral efficiency enhancement in massive MIMO systems with spatial correlation," IEEE Transactions on Vehicular Technology, 2021.
[12] S. Sedhain, A. K. Menon, S. Sanner, and L.
Xie, "AutoRec: Autoencoders meet collaborative filtering," in Proceedings of the 24th International Conference on World Wide Web. ACM, 2015, pp. 111-112.
[13] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, "Neural collaborative filtering," in Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017, pp. 173-182.
[14] G. B. Martins, J. P. Papa, and H. Adeli, "Deep learning techniques for recommender systems based on collaborative filtering," Expert Systems, vol. 37, no. 6, p. e12647, 2020.
[15] N. Duong Tan, T. A. Vuong, D. M. Nguyen, and Q. H. Dang, "Utilizing an autoencoder-generated item representation in hybrid recommendation system," IEEE Access, vol. PP, pp. 1-1, 04 2020.
[16] A. Le Ha, T. Van Chien, T. H. Nguyen, W. Choi et al., "Deep learning-aided 5G channel estimation," in 2021 15th International Conference on Ubiquitous Information Management and Communication (IMCOM). IEEE, 2021, pp. 1-7.
[17] P. N. Huu and H. N. T. Thu, "Proposal gesture recognition algorithm combining CNN for health monitoring," in 2019 6th NAFOSTED Conference on Information and Computer Science (NICS). IEEE, 2019, pp. 209-213.
[18] P. N. Huu and H. L. The, "Proposing recognition algorithms for hand gestures based on machine learning model," in 2019 19th International Symposium on Communications and Information Technologies (ISCIT). IEEE, 2019, pp. 496-501.
[19] B. M. Sarwar, G. Karypis, J. A. Konstan, J. Riedl et al., "Item-based collaborative filtering recommendation algorithms," in WWW, vol. 1, 2001, pp. 285-295.
[20] Y. Koren, "Factor in the neighbors: Scalable and accurate collaborative filtering," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 4, no. 1, p. 1, 2010.
[21] T. N. Duong, V. D. Than, T. H. Tran, Q. H. Dang, D. M. Nguyen, and H. M. Pham, "An effective similarity measure for neighborhood-based collaborative filtering," in 2018 5th NAFOSTED Conference on Information and Computer Science (NICS). IEEE, 2018, pp. 250-254.
[22] S. Funk, "Netflix update: Try this
at home," 2006.
[23] Y. Koren, "Factorization meets the neighborhood: a multifaceted collaborative filtering model," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2008, pp. 426-434.
[24] F. M. Harper and J. A. Konstan, "The MovieLens datasets: History and context," ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 5, no. 4, p. 19, 2016.
[25] J. Vig, S. Sen, and J. Riedl, "The tag genome: Encoding community knowledge to support novel interaction," ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 2, no. 3, pp. 1-44, 2012.
[26] Q. Hoang, "Predicting movie genres based on plot summaries," arXiv preprint arXiv:1801.04813, 2018.
[27] G. Barney and K. Kaya, "Predicting genre from movie posters," Stanford CS 229: Machine Learning, 2019.
[28] D. Opitz and R. Maclin, "Popular ensemble methods: An empirical study," Journal of Artificial Intelligence Research, vol. 11, pp. 169-198, 1999.
