E-commerce graph-based recommendation system

VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

VO THI KIM NGUYET

ECOMMERCE GRAPH-BASED RECOMMENDATION SYSTEM

Major: COMPUTER SCIENCE
Major code: 8480101

MASTER’S THESIS

HO CHI MINH CITY, July 2023

THIS THESIS IS COMPLETED AT HO CHI MINH UNIVERSITY OF TECHNOLOGY – VNU-HCM

Supervisor: Le Thanh Van, Ph.D
Examiner 1: Assoc Prof Dr Huynh Tuong Nguyen, Ph.D
Examiner 2: Ton Long Phuoc, Ph.D

This master’s thesis is defended at Ho Chi Minh City University of Technology (HCMUT) – VNU-HCM on 11th July 2023.

Master’s Thesis Committee:
Assoc Prof Dr Tran Ngoc Thinh, Ph.D - Chairman
Assoc Prof Dr Huynh Tuong Nguyen, Ph.D - Examiner
Ton Long Phuoc, Ph.D - Examiner
Le Thanh Van, Ph.D - Commissioner
Nguyen Tien Thinh, Ph.D - Secretary

Approval of the Chairman of the Master’s Thesis Committee and Dean of Faculty of Computer Science and Engineering after the thesis being corrected (If any)

CHAIRMAN OF THESIS COMMITTEE
DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING

VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom - Happiness

THE TASK SHEET OF MASTER’S THESIS

Full name: Vo Thi Kim Nguyet
Student code: 2270346
Date of birth: Oct 10th 1995
Place of birth: Ho Chi Minh City
Major: Computer Science
Major code: 8480101

I. THESIS TITLE: E-commerce graph-based recommendation system (Hệ thống gợi ý dựa phương pháp đồ thị thương mại điện tử)

II. TASKS AND CONTENTS:
1. Introduction:
• Introduce the research topic and its significance
• Provide an overview of the structure of the thesis
2. Literature Review:
• Conduct a comprehensive review of existing product recommendation techniques
• Analyze strengths and weaknesses of different approaches
• Identify gaps in the literature that the research aims to address
3. Problem Statement:
• Clearly state the problem being addressed in the research
• Highlight the need for improved recommendation approaches in the context of e-commerce
4. Methodology:
• Design the research approach for developing and evaluating the recommendation system
• Define the criteria and metrics for evaluating the effectiveness of the system
5. Graph-Based Recommendation System Implementation:
• Develop the recommendation system using graph embedding techniques
• Implement graph construction methods based on user behavior data
• Incorporate Node2Vec and FAISS for graph embedding and indexing
6. Experimental Evaluation:
• Conduct experiments to evaluate the performance of the developed system
• Compare the results with other existing recommendation models
• Collect and analyze data on key evaluation metrics
7. Discussion of Findings:
• Analyze and interpret the results of the experimental evaluation
• Discuss the implications of the findings in relation to the research objectives
8. Conclusion:
• Summarize the key findings and contributions of the research
• Discuss the practical implications of the research outcomes
9. Future Research Directions:
• Suggest avenues for further research and improvements in the recommendation system
• Highlight areas where the proposed approach could be extended or refined
10. References:
• List all the sources and references cited throughout the thesis
11. Appendices:
• Include any supplementary material, code snippets, graphs, or diagrams that enhance understanding

III. THESIS START DAY: Feb-06-2023
IV. THESIS COMPLETION DAY: Jun-09-2023
V. SUPERVISOR: Le Thanh Van, Ph.D

Ho Chi Minh City, Jun-09-2023

SUPERVISOR (Full name and signature)
CHAIR OF PROGRAM COMMITTEE (Full name and signature)
DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING (Full name and signature)

Note: Student must pin this task sheet as the first page of the Master’s Thesis booklet

ACKNOWLEDGEMENT

This thesis marks the culmination of my research journey into graph-based modeling and its applications in data analysis and machine learning. Graphs offer a unique perspective to understand complex relationships within vast datasets.
Throughout this work, I explore fundamental concepts of graph-based modeling, delve into graph embedding techniques, and evaluate their efficacy in solving real-world problems. I extend my gratitude to my advisors, mentors, colleagues, and family for their unwavering support and encouragement. My hope is that this thesis inspires further research and innovative applications of graph-based models in various domains. Thank you for joining me on this journey.

Sincerely,
Nguyet Vo
Ho Chi Minh City, June 2023

ABSTRACT

This thesis presents a graph-based recommendation system tailored for personalized content suggestions in e-commerce. Using the random-walk graph embedding methods DeepWalk and Node2Vec, the system captures users’ behavioural sequences and generates embeddings for items. These embeddings support pairwise similarity calculations among items, which form the basis for similarity-based recommendations. To tackle challenges such as sparsity and cold start, additional information is integrated into the graph embedding framework. Empirical evaluation on clickstream data shows that the proposed approach outperforms traditional collaborative filtering techniques in both accuracy and efficiency. The study contributes a novel graph-based recommendation system that addresses scalability, sparsity, and cold start issues and is further enriched by supplementary data. The results suggest that graph-based techniques hold potential for enhancing personalized recommendation systems across diverse domains, including e-commerce.

SUMMARY OF MASTER’S THESIS (TÓM TẮT LUẬN VĂN THẠC SĨ)

This thesis proposes a graph-based recommendation system for personalizing content suggestions in e-commerce. The project uses the graph techniques DeepWalk and Node2Vec, which belong to the Random Walks family, to capture users’ behavioural sequences and propose suitable product lists. These techniques help compute probabilities for product pairs or product lists, which provide the foundation for recommendations based on user behaviour. To address the challenges of new users and products, scalability, and sparsity, additional algorithms are integrated into the system to deliver intelligent recommendations that match customer preferences. Experimental evaluation on large-scale customer behaviour data shows that the proposed method outperforms traditional collaborative filtering in accuracy and efficiency. The research contributes a graph-based recommendation system that helps customers quickly locate the products they are interested in, so that they can make sound online shopping decisions, and that has the potential to improve the performance of personalized recommendation systems.

DECLARATION OF AUTHORSHIP

I hereby declare that this thesis was carried out by myself under the guidance and supervision of Le Thanh Van, Ph.D, and that the work and results it contains are genuine and do not violate research ethics. The data and figures presented in this thesis for analysis, comment, and evaluation were gathered from various sources through my own work and are duly acknowledged in the references. In addition, comments, reviews, and data from other authors and organizations are acknowledged and explicitly cited. I take full responsibility for any fraud detected in my thesis. Ho Chi Minh City University of Technology (HCMUT) – VNU-HCM bears no responsibility for any copyright infringement arising from my work (if any).

Ho Chi Minh City, June 2023
Author
Vo Thi Kim Nguyet

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
CHAPTER 1: INTRODUCTION
1.1 Background on recommendation systems and the importance of personalization
1.2 Research Questions
1.3 Methodology
1.4 Contributions
1.5 Scope of the research
1.6 Implications
1.7 Novelty of the topic
1.8 Outline
CHAPTER 2: OVERVIEW OF RECOMMENDATION SYSTEM
2.1 Recommendation System methods
2.2 Research Problem
2.3 Overview of existing literature on graph-based recommendation systems
2.4 Graph-based learning Approaches for Recommender System (RS)
2.5 Research results in application of Graph-based learning in Recommender
System
2.6 Comparison of different graph-based algorithms
2.7 The advantages of using UMAP and FAISS in combination with Deep Walk and Node2Vec
2.8 Session-based Recommendation System
2.9 Summary
CHAPTER 3: IMPLEMENTATION
3.1 Proposed Methodologies
3.2 Data collection and its characteristics
3.3 Data cleaning and preparation
3.4 Explanation of how the data was transformed into a graph-based representation
3.5 Random Walks algorithm
3.6 Visualization with UMAP
3.7 Embedding Vector Search with FAISS
3.8 Evaluation of the Recommendation System
CHAPTER 4: EXPERIMENTAL RESULTS
4.1 All Machine Learning (ML) Models
4.2 Association Rules
4.3 Traditional Recommendation Techniques
4.4 Sequence Models for Session-Level Data
CHAPTER 5: DISCUSSION AND CONCLUSION
REFERENCES
APPENDIX

APPENDIX

Data Preprocessing:
- Drop duplicates
- Check for missing data
- Cold start problem check
- Check for outliers

Graph construction:
The purpose of this code is to create a new table that contains filtered and selected data from the “optimised_raw_data.parquet” file, focusing on sessions with multiple viewed products. This table can then be used for further analysis and exploration of user behaviours and product views.

Running this code creates the “product_views_graph” table, which represents each product view and the next viewed product within each user session. This table can
be used for further analysis and modelling, such as building recommendation systems or studying user behaviour patterns.

a. Directed Graph Query:
b. Directed Weighted Graph Query:
c. Undirected Weighted Graph Query:

Embeddings Vector Search with FAISS:

Hyperparameters implementation for training model:
This approach can potentially result in better embeddings compared to using any of the methods in isolation.

    from enum import Enum

    class Constants(Enum):
        # Graph file produced by the undirected weighted graph query
        GRAPH_FILE_NAME = 'undirected_weighted_product_views_graph.parquet'
        WEIGHTED = True    # edges carry weights
        DIRECTED = False   # edges have no direction
        NUM_WALK = 10      # number of random walks
        WALK_LEN = 50      # nodes visited per walk

This code defines an “Enum” class called “Constants”. An enumeration is a set of named values that represent a collection of related constants. Each name is an enum member, so the underlying value is read through “.value” (for example, “Constants.NUM_WALK.value”). The constants have the following meanings:
- “GRAPH_FILE_NAME”: the name of the graph file, “undirected_weighted_product_views_graph.parquet”. It likely refers to a file containing the graph representation of the product views data.
- “WEIGHTED”: a Boolean value (“True”) indicating that the graph is weighted, that is, its edges carry weights representing the strength or importance of the connections between products.
- “DIRECTED”: a Boolean value (“False”) indicating that the graph is undirected. A value of “True” would mean that each edge has a direction, representing a one-way relationship between products.
- “NUM_WALK”: the number of random walks to be performed on the graph. It likely indicates how many times a random walker traverses the graph from each starting node to generate training data.
- “WALK_LEN”: the length of each random walk, that is, the number of steps or nodes visited in each walk.

By using an “Enum” class, these constants can be accessed
and referenced using dot notation, making the code more readable and organized.

Reasons for choosing 10 and 50 for “NUM_WALK” and “WALK_LEN”:
- Trade-off between exploration and computation time: Increasing the number of random walks (“NUM_WALK”) and the length of each walk (“WALK_LEN”) explores more of the graph and gives better coverage of the product relationships, but it also increases the time required to generate the embeddings. The chosen values balance exploration against computational cost.
- Graph structure and size: The choice of these values might be influenced by the structure and size of the graph. Larger graphs with denser connections might require more random walks and longer walks to capture the relationships between products.
- Experimentation and empirical results: The values might have been determined through experimentation, by testing different settings and evaluating their impact on downstream tasks such as recommendation accuracy.

Ultimately, “NUM_WALK” and “WALK_LEN” should be chosen based on the characteristics of the dataset, the desired quality of the embeddings, and the available computational resources. Finding suitable values for a given application may require some experimentation and fine-tuning.

We first train the DeepWalk model to generate node embeddings. These embeddings capture the structural information of the graph by representing nodes in a low-dimensional space. The code below trains the model with the required parameters.

    import os
    from gensim.models import Word2Vec
    # Assumption: SparseOTF, read_edg and simulate_walks match the PecanPy API;
    # adjust this import to the installed release.
    from pecanpy import pecanpy as node2vec

    # Read the plain values from the Constants enum.
    WEIGHTED = Constants.WEIGHTED.value
    DIRECTED = Constants.DIRECTED.value
    NUM_WALK = Constants.NUM_WALK.value
    WALK_LEN = Constants.WALK_LEN.value
    # gensim expects a positive thread count, so pass the core count instead of -1.
    N_WORKERS = os.cpu_count() or 1

    # DeepWalk: p=1 and q=1 give an unbiased random walk
    g = node2vec.SparseOTF(p=1, q=1, workers=N_WORKERS, verbose=True, extend=False)
    g.read_edg(edg_graph_path, weighted=WEIGHTED, directed=DIRECTED)
    deepwalk_walks = g.simulate_walks(num_walks=NUM_WALK, walk_length=WALK_LEN)

    # Word2Vec model for DeepWalk
    deepwalk_model = Word2Vec(
        deepwalk_walks,      # previously generated walks
        hs=1,                # use hierarchical softmax
        sg=1,                # use skip-gram
        vector_size=128,     # number of embedding dimensions
        window=5,            # max distance between current and predicted node in a walk
        min_count=1,         # keep every node that appears at least once
        workers=N_WORKERS,   # number of parallel workers
        seed=42,             # random seed
    )

The Word2Vec parameters have the following roles:
- “hs=1” tells the model to use hierarchical softmax, an alternative to negative sampling that is suitable for training on large datasets. In gensim, negative sampling stays active at its default (“negative=5”) unless “negative=0” is also passed.
- “sg=1” tells the model to use skip-gram, which predicts the context nodes given a target node and is known to capture semantic relationships well.
- “vector_size=128” is the number of dimensions of the embeddings. A higher dimensionality can capture more intricate patterns and relationships in the data, but it also increases the computational cost.
- “window=5” controls the maximum distance between the current and predicted node within a walk. Typical values go up to 10.
- “min_count=1” is the minimum number of occurrences a node needs in the walks to be included in the embedding. Small values are typical for small datasets, and higher values are often used for large datasets.
- “seed=42” fixes the random seed for reproducibility. With more than one worker thread, thread scheduling can still cause small differences between runs.

After obtaining the DeepWalk embeddings, we train a Node2Vec model. Node2Vec is a graph embedding technique that learns embeddings from biased random walks on the graph. In the code below, the Node2Vec model is trained on its own walks rather than initialised from the DeepWalk vectors. The two sets of walks are then combined to train a final skip-gram model, so that the final embeddings reflect both the unbiased and the biased walks.

    # Node2Vec: biased walks with q=0.5
    g = node2vec.SparseOTF(p=1, q=0.5, workers=N_WORKERS, verbose=True, extend=True)
    g.read_edg(edg_graph_path, weighted=WEIGHTED, directed=DIRECTED)
    node2vec_walks = g.simulate_walks(num_walks=NUM_WALK, walk_length=WALK_LEN)

    # Word2Vec model for Node2Vec
    node2vec_model = Word2Vec(node2vec_walks, hs=1, sg=1, vector_size=128, window=5,
                              min_count=1, workers=4, seed=42)

    # Fine-tuning the embeddings with Skip-Gram:
    # combine the walks (not the vocabularies) from DeepWalk and Node2Vec.
    # Adding the two key_to_index dicts would raise a TypeError.
    combined_walks = list(deepwalk_walks) + list(node2vec_walks)
    combined_model = Word2Vec(combined_walks, hs=1, sg=1, vector_size=128, window=5,
                              min_count=1, workers=N_WORKERS, seed=42)

The DeepWalk and combined models are meant to use all available CPU cores to maximise training speed. Because gensim's Word2Vec expects a positive number of worker threads, the core count is passed explicitly instead of “workers=-1”. For the Node2Vec model, “workers=4” fixes the number of training threads at four regardless of the number of available cores, which caps the computational resources allocated to that step.

In the Node2Vec algorithm, the parameters “p” and “q” control the trade-off between exploration and exploitation of the graph structure during the random walks. In both DeepWalk and Node2Vec, random walks generate node sequences that capture neighbourhood information. “p” is the return parameter: the weight for stepping back to the previous node is 1/p. “q” is the in-out parameter: the weight for moving to a node that is not adjacent to the previous node is 1/q. Together they determine whether the walk tends to stay within the local neighbourhood or to explore new nodes.

For DeepWalk:
- “p=1” indicates that stepping back to the previous node is neither favoured nor penalised relative to moving to another neighbour.
- “q=1”
indicates that the walk has no preference between neighbours that stay close to the previous node (Breadth-First Search or BFS-like behaviour) and neighbours that move away from it (Depth-First Search or DFS-like behaviour). With both parameters at 1 the walk is unbiased, which is the DeepWalk setting.

For Node2Vec:
- “p=1” leaves the probability of returning to the previous node unweighted, as in DeepWalk.
- “q=0.5” biases the walk towards nodes that are not adjacent to the previous node, because their weight 1/q equals 2. The walk therefore behaves more like DFS and tends to move outward to more distant nodes instead of staying in the local neighbourhood. A value of “q” above 1 would have the opposite, BFS-like effect.

The parameter “extend” controls whether the extended form of the transition probabilities is used. If the library is PecanPy, as the “SparseOTF” API suggests, this flag enables the node2vec+ extension, which takes edge weights into account when biasing the walk on weighted graphs. Here “extend=True” is used for the Node2Vec walks with the aim of capturing community structure and longer-range dependencies in the graph, while “extend=False” is used for the DeepWalk walks.

Model Evaluation:

VITA

Full name: Vo Thi Kim Nguyet
DOB: 10/10/1995
Place of birth: Ho Chi Minh City
Address: Ho Chi Minh City
Email: vothikimnguyet1010@gmail.com

EDUCATION
Bachelor of Science (BSc.) - Computing, FPT Greenwich University Vietnam, Sep 2018 - Oct 2020
Advanced Diploma - Mobile/Computer Programming, FPT Polytechnic, Sep 2016 - Jun 2018
Bachelor of Arts (B.A.) - International Relations, University of Foreign Languages and Information Technology (HUFLIT), Sep 2013 - Jun 2018

EXPERIENCE
Senior Business Analysis, MoMo (M_Service), Present
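
Supplementary note on “p” and “q”: the role of the two walk parameters discussed in the appendix can be checked on a small example. The following is a minimal sketch, not the thesis implementation. It assumes the standard node2vec weighting (1/p for returning to the previous node, 1 for a neighbour shared with the previous node, 1/q otherwise). The function name “transition_probs” and the toy graph are hypothetical and introduced only for illustration.

```python
from typing import Dict, Hashable

# Hypothetical graph type: node -> {neighbour: edge weight}
Graph = Dict[Hashable, Dict[Hashable, float]]


def transition_probs(graph: Graph, prev, cur, p: float = 1.0, q: float = 1.0) -> Dict[Hashable, float]:
    """Normalised node2vec probabilities for the step that follows prev -> cur."""
    weights = {}
    for nxt, w in graph[cur].items():
        if nxt == prev:             # step back to the previous node
            bias = 1.0 / p
        elif nxt in graph[prev]:    # stays adjacent to the previous node (BFS-like)
            bias = 1.0
        else:                       # moves away from the previous node (DFS-like)
            bias = 1.0 / q
        weights[nxt] = w * bias
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}


# Toy undirected graph with unit weights (illustrative only): a-b, a-c, b-c, b-d
toy = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0, "d": 1.0},
    "c": {"a": 1.0, "b": 1.0},
    "d": {"b": 1.0},
}

# The walk has just moved a -> b; compare the unbiased and the q=0.5 settings.
deepwalk_like = transition_probs(toy, prev="a", cur="b", p=1.0, q=1.0)
node2vec_q_half = transition_probs(toy, prev="a", cur="b", p=1.0, q=0.5)
print(deepwalk_like)
print(node2vec_q_half)
```

In this toy graph, node “d” is the only neighbour of “b” that is not adjacent to “a”. Lowering “q” from 1 to 0.5 raises its share of the probability mass, which is the outward, DFS-like bias described for the Node2Vec setting.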