INTRODUCTION
Background on recommendation systems and the importance of personalized content recommendation in ecommerce
This thesis focuses on recommender systems in ecommerce, where personalized content recommendation is crucial due to the vast number of products available on numerous websites. Traditional recommender systems like
Content-Based Filtering and Collaborative Filtering encounter issues with scalability, sparsity, and cold start problems, reducing their effectiveness in handling large-scale and sparse transaction records. To address these challenges, the project proposes the use of graph embedding techniques, specifically Random Walks, to capture users' behavioral sequences and generate item embeddings. These embeddings can then be utilized to recommend products to users, group similar products, and classify transactions based on meta-information about clusters, items, and users' transaction use cases. The approach also employs
Facebook AI Similarity Search (FAISS) for generating recommendations through embedding vector search. By leveraging graph-based learning, this thesis offers insights into enhancing recommender systems in ecommerce platforms while tackling the limitations faced by traditional methods.
Research Questions
This thesis aims to address the effectiveness of graph-based techniques, namely Random Walks and FAISS, in capturing users' behaviour sequences and generating item embeddings for improved product recommendations in ecommerce, compared to traditional methods. To investigate these research questions thoroughly, the following methodology will be employed:
a) Data Collection: To understand users' behaviour and preferences, data will be collected from an ecommerce platform through user interactions, transaction records, and item metadata.
b) Graph-Based Recommendation System: The proposed graph-based recommendation system, incorporating Random Walks and FAISS, will be implemented and fine-tuned to ensure its applicability to real-world data.
c) Comparison with traditional models: Traditional models will be developed and configured as a baseline for comparison against the graph-based approach.
d) Data Verification: To validate the results and ensure robustness, the system will be tested on a large dataset with diverse user interactions and product attributes.
e) Performance Metrics: The accuracy and efficiency of the graph-based and collaborative filtering systems will be evaluated using standard recommendation metrics like Mean Average Precision (MAP), Recall, Precision, and Normalized Discounted Cumulative Gain (NDCG).
f) User Survey: User feedback will be collected through surveys to gauge user satisfaction and preference for recommendations generated by each approach.
g) A/B Testing: A randomized controlled trial (A/B test) will be conducted to compare the user engagement and conversion rates between the two recommendation systems.
By employing the above methodology, this thesis aims to shed light on the following main questions:
- How to survey? Instead of relying solely on user surveys, we will implement an A/B test and divide customer groups to assess the Click-Through Rate (CTR) for different recommendation strategies. This way, we can gauge the effectiveness of the new algorithm in real-world scenarios and gather valuable insights into its performance.
- What algorithm and solution, and how to ensure implementation on real data? The graph-based recommendation system, incorporating Random Walks and FAISS, will be implemented using appropriate libraries and frameworks. Rigorous testing on real-world data will ensure its suitability and effectiveness.
- How to verify the results? The results will be verified through comprehensive evaluation using standard recommendation metrics, comparing the graph-based approach against the traditional approach. Additionally, A/B testing will provide valuable insights into user engagement and conversion rates.
By addressing these questions, this thesis endeavors to contribute to the advancement of personalized recommendation systems in ecommerce platforms, providing valuable insights into the potential benefits of graph-based techniques for more accurate and efficient product recommendations.
Methodology
The main objective of this project is to build a graph-based learning recommendation system for the eCommerce platform, targeting end-users. The system will achieve two key objectives:
- Recommend similar products to users based on their preferences and browsing behavior
- Provide personalized recommendations by leveraging users’ behaviour history and search behaviour to recommend the best product for each user
Additionally, the project aims to address the challenge of converting clickstream data into various types of graphs over different event types, considering the unique characteristics and requirements of the eCommerce platform. Three types of co-occurrence graphs will be constructed (a construction sketch follows the list below):
- Co-occurrence graph over Product Page (PDP) views: This type of graph is constructed by creating a node for each product page that is viewed by a user in a single session. An edge is then created between two nodes if the two product pages are viewed together.
- Co-occurrence graph over products that are added to cart together: This type of graph is constructed by creating a node for each product that is added to the cart by a user. An edge is then created between two nodes if the two products are added to the cart together.
- Co-occurrence graph over products that are bought together: This type of graph is constructed by creating a node for each product that is purchased by a user. An edge is then created between two nodes if the two products are purchased together.
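As a minimal sketch of how such co-occurrence graphs could be built from clickstream data, the snippet below uses pandas and NetworkX; the column names (user_session, product_id, event_type) follow the REES46 schema described in Chapter 3, and the helper is a simplified assumption rather than the exact production pipeline.

```python
import itertools
import pandas as pd
import networkx as nx

def build_cooccurrence_graph(events: pd.DataFrame, event_type: str) -> nx.Graph:
    """Build an undirected co-occurrence graph for one event type
    (e.g. 'view', 'cart' or 'purchase') from clickstream events."""
    graph = nx.Graph()
    subset = events[events["event_type"] == event_type]
    # Group the products each user interacted with inside a single session.
    for _, session in subset.groupby("user_session"):
        items = session["product_id"].drop_duplicates().tolist()
        # Every unordered pair of items in the session becomes an edge;
        # repeated co-occurrences across sessions increase the edge weight.
        for a, b in itertools.combinations(items, 2):
            if graph.has_edge(a, b):
                graph[a][b]["weight"] += 1
            else:
                graph.add_edge(a, b, weight=1)
    return graph

# Example usage with a toy clickstream:
events = pd.DataFrame({
    "user_session": ["s1", "s1", "s1", "s2", "s2"],
    "product_id":   [101, 102, 103, 101, 103],
    "event_type":   ["view"] * 5,
})
pdp_graph = build_cooccurrence_graph(events, "view")
print(pdp_graph.edges(data=True))
```

The same helper can be applied to the cart and purchase event types to obtain the other two graphs.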
Figure 1.1 Example of PDP views in a session
Each type of graph provides a different representation of user interactions with products and serves as the basis for generating relevant recommendations. The methodology of this project involves using an undirected graph for a recommendation system based on clickstream data. Unlike directed graphs, which capture one-way relationships, undirected graphs represent bidirectional relationships between items and users. Recommendations are made based on node similarity, calculated using measures like cosine similarity or Jaccard similarity.
The project follows several steps to achieve its objectives:
- Graph Construction and Transformation: Clickstream data is pre-processed and transformed into three types of co-occurrence graphs, representing different product relationships based on user actions
- Graph Embedding Techniques: To handle the large graph size efficiently, the Random Walks algorithm, combining DeepWalk and Node2Vec techniques, is applied. DeepWalk captures user behavior sequences and generates embeddings, while Node2Vec captures local and global structural information to enhance the embeddings. These techniques convert the graphs into meaningful low-dimensional embeddings, which capture similarities and relationships between items.
- UMAP for Analysis and Visualization: UMAP (Uniform Manifold Approximation and Projection) is utilized for cluster visualization, projecting the embeddings into 2D or 3D space. This aids in understanding spatial distribution and relationships among data clusters.
- FAISS for Efficient Similarity Search: FAISS (Facebook AI Similarity Search) is used for efficient search of nearest neighbors in the item-item recommendation process. It indexes and organizes the embeddings, enabling quick and accurate retrieval of similar items.
- Performance Evaluation: The recommendation system's performance is evaluated using rank-aware top-N metrics, such as Precision at N, Recall at N, Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (NDCG). These metrics comprehensively assess the system's accuracy, relevance, diversity, and ranking position of the recommendations (a small computation sketch follows this list).
+ Precision at N: This metric measures the proportion of relevant items among the top-N recommended items. It evaluates the accuracy of the recommendations by considering how many of the recommended items are actually relevant to the user.
+ Recall at N: This metric calculates the proportion of relevant items that were included in the top-N recommendations. It assesses the system's ability to retrieve all relevant items, regardless of their ranking position.
+ Mean Average Precision (MAP): MAP measures the average precision of the recommendations across different users. It considers both the relevance of the recommended items and their ranking positions, providing a comprehensive evaluation of the system's performance.
+ Normalized Discounted Cumulative Gain (NDCG): NDCG takes into account the relevance of the recommended items and their ranking positions. It assigns higher weights to relevant items that are ranked higher in the recommendation list, emphasizing the importance of accurate ranking.
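As a minimal illustration of how these rank-aware metrics can be computed, the sketch below implements Precision@N, Recall@N, and a binary-relevance NDCG@N; it assumes each user has a ranked list of recommended item IDs and a set of relevant (ground-truth) items, and it is not the exact evaluation harness used later in the thesis. MAP follows the same pattern by averaging the precision at each relevant hit.

```python
import math

def precision_at_n(recommended, relevant, n):
    """Fraction of the top-N recommended items that are relevant."""
    top_n = recommended[:n]
    hits = sum(1 for item in top_n if item in relevant)
    return hits / n

def recall_at_n(recommended, relevant, n):
    """Fraction of all relevant items that appear in the top-N list."""
    top_n = recommended[:n]
    hits = sum(1 for item in top_n if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_n(recommended, relevant, n):
    """Binary-relevance NDCG@N: higher-ranked hits receive larger gains."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:n]) if item in relevant)
    ideal_hits = min(len(relevant), n)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Example: 2 of the top-5 recommendations are relevant.
recommended = [10, 42, 7, 99, 3]
relevant = {42, 3, 55}
print(precision_at_n(recommended, relevant, 5))  # 0.4
print(recall_at_n(recommended, relevant, 5))     # ~0.667
print(ndcg_at_n(recommended, relevant, 5))
```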
By utilizing Random Walks for embedding generation, UMAP for analysis, and FAISS for similarity search, the project aims to develop a graph-based recommendation system that provides accurate, interpretable, and efficient recommendations for the eCommerce platform, enhancing the user experience and driving business success
This case study focuses on enhancing personalized recommendation systems in ecommerce, with the primary goals of improving user experience, platform diversity, and long-tail product discovery, utilizing free eCommerce behavior data from REES46 [1] The project aims to develop a graph-based recommendation system to optimize personalized content recommendations in ecommerce platforms.
Contributions
This research makes several contributions to personalized content recommendation in e-commerce platforms. The main contribution is the proposal of a graph-based recommendation system that addresses the scalability, sparsity, and cold start problems faced by traditional recommender systems. By constructing an item graph and incorporating side information into the graph embedding framework, the system generates informative and effective item embeddings. Experimental results demonstrate that the proposed system outperforms traditional methods in terms of accuracy and efficiency.
Further Enhancements: To enhance the system's effectiveness, UMAP is used as a dimensionality reduction strategy to visualize clusters in the embedding space, aiding the recommendation process. Additionally, FAISS is employed for generating recommendations based on similarity measures, enabling fast and efficient nearest neighbour search on large-scale datasets for real-time recommendation systems.
Scope of the research
The thesis focuses on applying graph embedding techniques (DeepWalk and Node2Vec) for generating embeddings from category attributes to improve recommendation system performance. It explores evaluation metrics, implications, advantages, and limitations of the proposed system, along with practical applications and future research possibilities.
Implications
The findings of this research hold scientific and practical significance. Scientifically, it contributes to the field of recommendation systems by advancing our understanding of graph-based methods and their application in product recommendation. Practically, the developed recommendation system can be deployed in e-commerce platforms to provide personalized and relevant recommendations, enhancing user satisfaction and engagement.
Novelty of the topic
The topic of graph embedding techniques in product recommendation is relatively novel and promising. The research stands out by combining the DeepWalk and Node2Vec algorithms, incorporating user feedback and contextual information, and providing nearest neighbour analysis through FAISS. The comprehensive evaluation framework, qualitative and quantitative measures, and visualization techniques contribute to the uniqueness of the approach.
Outline
The following flow represents the approach taken to complete this thesis; outlining the steps serves as a helpful visual aid:
Figure 1.2 Project Flow
The project involves data collection and preparation, constructing a graph representation, and applying graph-based algorithms like DeepWalk and Node2Vec for embeddings. The model utilizes vector embeddings and FAISS indexing for efficient similarity search. The recommendation system's performance is assessed with suitable metrics like Precision@k, Recall@k, MRR@k, and nDCG@k, and experiments analyze its impact and scalability, as well as comparing it to existing approaches. The final chapter summarizes key findings, discusses strengths and limitations, and suggests future research directions.
The rest of this thesis is structured as follows. Chapter 2 provides a literature review of recommender systems and graph-based learning techniques. Chapter 3 describes the methodology used in this research, including the pre-processing of clickstream data, the construction of the item graph, and the application of the DeepWalk and Node2Vec techniques to generate item embeddings. Chapter 4 presents the experimental results, including the evaluation of the proposed system using clickstream data and the comparison with traditional methods. Chapter 5 discusses the findings and their implications for the field of personalized content recommendation in ecommerce platforms, summarizes the main contributions of this research, and provides recommendations for future research.
OVERVIEW OF RECOMMENDATION SYSTEM
Recommendation System methods
The literature review focuses on the significance of recommendation systems in the e-commerce industry, particularly on Alibaba's platform. Taobao, China's largest online consumer-to-consumer (C2C) platform, stands out with a profitable share of 1/75th of Alibaba's total e-commerce traffic [2]. Researchers have explored machine learning and data mining strategies to address the challenges of implementing effective recommender systems in the vast online economy. The review aims to examine previous research on recommendation methods, their successes, and limitations, with the goal of developing an advanced graph-based learning recommendation system for e-commerce. This approach falls under the
"Collaborative Filtering" branch, which utilizes user-item interactions to make recommendations. In this case, the graph-based technique employs the Random Walks algorithm to create a recommendation graph, and FAISS is used for efficient nearest neighbor search within this graph. The method leverages user interactions collaboratively to find relevant items based on similarities with other users or items in the graph. The ultimate objective is to enhance user satisfaction and business success in the dynamic world of e-commerce.
Figure 2.1 Taxonomy of Recommendation System
Based on the taxonomy of recommender systems in the figure above [25], we can see a general overview of the classifications of recommender system models and their characteristics:
a) Collaborative Filtering:
- Characteristics: Collaborative filtering recommends items to users based on the preferences and behaviors of similar users. It does not require item attributes or domain knowledge and can handle large datasets. However, it struggles with sparsity and scalability.
- Types:
  o User-Based Collaborative Filtering: Recommends items based on the preferences of users who are similar to the target user.
  o Item-Based Collaborative Filtering: Recommends items based on the preferences of users who have shown interest in similar items.
b) Content-Based Filtering:
- Characteristics: Content-based filtering recommends items to users based on the attributes and features of the items and the user's past preferences. It requires item attributes and domain knowledge for effective recommendations. It also usually struggles with the cold start problem.
- Types:
  o Profile-Based: Creates user profiles based on their historical preferences and recommends items with similar attributes to those in the user profile.
  o Item-Based: Recommends items similar to the ones the user has shown interest in, based on shared attributes.
c) Hybrid Recommender Systems:
- Characteristics: Hybrid systems combine multiple recommendation approaches to improve recommendation accuracy and overcome limitations of individual methods.
- Types:
  o Weighted Hybrid: Assigns different weights to individual recommendation techniques and combines their results.
  o Switching Hybrid: Switches between different recommendation techniques based on user preferences or the availability of data.
  o Feature Combination: Combines features from different recommendation methods to create a unified model.
d) Knowledge-Based Recommender Systems:
- Characteristics: Knowledge-based systems use domain-specific knowledge and user preferences to generate recommendations. They are effective for scenarios with limited user data.
- Types:
  o Rule-Based Systems: Use predefined rules and user preferences to make recommendations.
  o Case-Based Reasoning: Recommends items based on similarities to past cases where users expressed preferences.
e) Matrix Factorization:
- Characteristics: Matrix factorization methods aim to factorize the user-item interaction matrix into latent feature matrices to capture underlying patterns and make predictions (a small illustration follows this list).
- Types:
  o Singular Value Decomposition (SVD): A traditional matrix factorization technique that reduces the dimensionality of the user-item matrix.
  o Alternating Least Squares (ALS): An optimization-based matrix factorization method commonly used in collaborative filtering.
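To make the matrix factorization idea concrete, the sketch below factorizes a tiny user-item matrix with a truncated SVD and uses the low-rank reconstruction to rank unseen items; the matrix values are invented for illustration, and this is a simplified stand-in for production SVD/ALS implementations.

```python
import numpy as np

# Toy user-item interaction matrix (rows: users, columns: items, 0 = unknown).
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

# Truncated SVD keeps only k latent factors.
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Low-rank reconstruction: predicted affinity of every user for every item.
R_hat = U_k @ s_k @ Vt_k
print(np.round(R_hat, 2))

# Recommend, for user 0, the unseen item with the highest predicted score.
user = 0
unseen = np.where(R[user] == 0)[0]
best_item = unseen[np.argmax(R_hat[user, unseen])]
print("Recommend item", best_item, "to user", user)
```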
Here are some research papers related to recommender systems based on the information provided in the previous sections:
Wei et al. [10], in a paper on cold start items with a hybrid recommendation model, addressed the Complete Cold Start (CCS) and Incomplete Cold Start (ICS) problems by combining Collaborative Filtering (CF) and a deep learning neural network. Their model, SADE, incorporated content features of items to predict ratings for cold start items. The study demonstrated the effectiveness of their approach using the Netflix dataset.
Kupisz and Unold [11] developed a Collaborative Filtering (CF) recommendation system using Hadoop and Apache Spark. They utilized the MapReduce paradigm to handle large datasets but acknowledged limitations in access time and memory.
Chen et al. [12] proposed a user behavior analysis and commodity recommendation approach using Co-Clustering with Augmented Matrices (CCAM). They combined heuristic evaluation, standard classification, and machine learning to construct a recommendation system, improving its efficiency.
Zhou et al. [13] addressed the Netflix Prize problem by developing a collaborative filtering-based Alternating Least Squares (ALS) algorithm. Their study focused on resolving scalability issues with ALS and predicting user ratings for a movie recommendation system.
Dianping et al. [14] utilized Item-Based Collaborative Filtering methods to build a movie recommender system. They examined the associations between different items based on user ratings, allowing for personalized recommendations.
Dev et al. [15] proposed a recommendation system based on set similarity of user preferences. They employed the MapReduce framework to handle big data applications and utilized Extended Prefix Filtering (REF) to reduce computation overhead and provide customized item recommendations.
Zeng et al. [16] presented a Parallel Latent Group Model (PLGM) for group recommendation, using the matrix factorization algorithms ALS and Stochastic Gradient Descent (SGD). They compared the efficiency and precision of each model and explored various approaches to profile aggregation.
Jooa et al. [17] implemented a recommendation system using Collaborative Filtering and Association Rules. They utilized GPS data and Near Field Communication (NFC) to suggest items based on user preferences and address data overload.
Panigrahi et al. [18] proposed a Hybrid Distributed Collaborative Filtering Recommender Engine using Apache Spark. Their approach aimed to reduce the Cold Start problem by correlating buyers to items based on features. The study highlighted the limitations of the Scala programming language for prediction accuracy.
Research Problem
Based on the literature review, there are numerous algorithms for Recommendation Systems, and the importance of addressing the challenges they face cannot be overstated. The major technical challenges are particularly urgent and important in Recommender Systems due to the following reasons:
• Scalability: Many existing recommendation approaches perform well on smaller-scale datasets with a limited number of users and items. However, these approaches often struggle when applied to larger-scale datasets with millions of users and items. Developing scalable recommendation models that can handle the increasing volume of data is crucial for the effectiveness and efficiency of the system.
• Sparsity: In recommendation systems, it is common for users to interact with only a small subset of items from a vast catalogue. This sparsity of user-item interactions poses a challenge in training accurate recommendation models. Addressing this challenge requires advanced techniques to effectively leverage the limited available data and make meaningful recommendations.
• Cold Start: New items continuously enter the system without any prior user interactions or historical data. This cold start problem makes it difficult to understand the preferences of users for these new items and provide personalized recommendations. Overcoming the cold start challenge is crucial to ensure that users receive relevant and engaging recommendations from the start.
In addition to these challenges, there are other critical issues that need to be addressed in Recommendation Systems:
• Diversity: Ensuring diversity in recommendations is important to expose users to a wider range of options and avoid over-reliance on popular items. Developing algorithms that can provide diverse and novel recommendations is a challenge that requires innovative approaches.
• Real-time Recommendations: Meeting user expectations for real-time recommendations that adapt to their dynamic preferences and changing context poses a significant technical challenge. Designing efficient algorithms that can process and deliver timely recommendations is crucial for user satisfaction.
• Privacy and Trust: Respecting user privacy while delivering personalized recommendations is a complex challenge. Building trust with users and establishing transparent mechanisms for data handling and recommendation generation are essential considerations.
• Context-Aware Recommendations: Incorporating contextual information, such as time, location, and user behaviour, into the recommendation process can enhance the relevance and quality of suggestions. However, effectively capturing and utilizing context in real-world scenarios is a challenge that requires sophisticated algorithms and data processing techniques.
• Evaluation Metrics: Evaluating the performance and effectiveness of recommendation systems is a challenge due to the multidimensional nature of user satisfaction. Traditional evaluation metrics may not fully capture aspects like user engagement, serendipity, or long-term user satisfaction. Developing comprehensive evaluation metrics is crucial for accurately assessing recommendation system performance.
Addressing these challenges requires ongoing research and innovation in the field of recommendation systems. Researchers and practitioners need to explore new algorithms, techniques, and evaluation methods to improve scalability, handle sparsity and cold start issues, ensure diversity, provide real-time recommendations, maintain user privacy, incorporate context, and develop robust evaluation frameworks. By tackling these challenges, the field of recommendation systems can advance and deliver more accurate, personalized, and satisfactory recommendations to users.
Overview of existing literature on graph-based recommendation systems
Existing literature on graph-based recommendation systems has witnessed significant growth, particularly in the development of deep learning-based methods and graph embedding techniques. Graph convolutional networks (GCNs) have emerged as a popular approach, leveraging the structural and semantic information of nodes in a graph to make recommendations. Other common approaches include matrix factorization, random walk, and deep learning, each with their own strengths and limitations.
Graph-based recommendation systems excel in handling cold-start problems by utilizing side information and capturing complex item relationships. However, scalability remains a challenge due to the computational demands of large graphs. Future research could focus on developing more scalable systems through distributed computing or sampling methods.
Integrating complex side information, such as user reviews or social media data, into the graph embedding process holds promise for improving recommendation accuracy. Exploring multimodal embeddings and attention mechanisms could enhance model robustness.
Key areas for future research in graph-based recommendation systems include addressing the cold start problem by leveraging external data sources, improving scalability through advanced computing techniques, and integrating complex side information into the graph embedding process to enhance accuracy and robustness
2.4 Graph-based learning Approaches for Recommender System (RS)
Graph-based learning approaches provide powerful solutions for building Recommender Systems that can handle scalability, sparsity, and cold start problems. There are three main types of graph-based learning approaches: Random Walk-based, Graph Embedding-based, and Graph Neural Network-based approaches.
Random Walk-based approaches model the implicit preference or interaction propagation among users and/or items by allowing a random walker to traverse a graph with predefined transition probabilities. They are effective in capturing complex, higher-order, and indirect relations among different types of nodes on the graph. Examples include DeepWalk, Node2Vec, Large-scale Information Network Embedding (LINE), Deep Graph Infomax, and Metapath2Vec.
Graph Embedding-based approaches leverage graph embedding techniques to encode the complex relations between nodes into low-dimensional vectors. They include Graph Factorization-based RS (GFRS), Graph Distributed Representation-based RS (GDRRS), and Graph Neural Embedding-based RS (GNERS).
Graph Neural Network-based approaches apply neural network techniques on graph data. They include Graph Attention Network-based RS (GATRS), Gated Graph Neural Network-based RS (GGNNRS), and Graph Convolutional Network-based RS (GCNRS), including GraphSAGE. These approaches utilize attention mechanisms, gated recurrent units, and graph convolutional networks to capture inter-node relations and complex transitions between nodes.
In summary, graph-based learning approaches offer effective solutions for building RS by capturing complex relations and addressing scalability, sparsity, and cold start issues. The choice of approach depends on the specific characteristics of the data and the recommendation task at hand. Consequently, using graph learning to model varied relations in RS may be a natural and compelling selection. However, there are two important aspects at this stage that we need to focus on:
1. Based on the research above, an effective heuristic method should be designed to construct the item graph from users' behavior.
2. Graph embedding methods are capable of learning embeddings for millions of items and users, making them suitable for handling massive datasets such as clickstream transaction data. Random walk-based approaches are often used to efficiently explore the graph and find similarities among items. This involves starting from a given node, randomly selecting a neighbor to move to, and repeating the process to obtain a sequence of nodes. The random sequence of nodes selected this way is a random walk on the graph, which finds similarities among items quickly. On the other hand, graph neural network-based approaches, such as GNNs, are more suitable for social recommendations. They can comprehensively capture the complex inter-node relations by iteratively combining feature data from local graph neighborhoods.
Research results in application of Graph-based learning in Recommender Systems
For the first approach, Jizhe Wang, Pipei Huang, Huan Zhao, Zhibo Zhang, Binqiang Zhao, and Dik Lun Lee (2018) [2] presented an analysis of the application of graph embedding in Taobao. In this article, they introduced some real-world cases in Taobao as an example of the effectiveness of the proposed strategies. The cases are examined in three aspects: 1) visualization of the embeddings learned by Enhanced Graph Embedding with Side information (EGES), 2) cold start items, and 3) weights in EGES.
• Visualization: In this part, they visualize the embeddings of items learned by EGES using the visualization tool provided by TensorFlow. They see that shoes of various classes fall into separate clusters. This illustrates the influence of incorporating side information into the learned embeddings, i.e., items with similar side information should be closer in the embedding space.
• Cold start items: In this part, they show the quality of the embeddings of cold start items. For a newly added item in Taobao, no embedding can be learned from the item graph, and earlier CF-based strategies also fail in handling cold start items. Thus, they represent a cold start item with the average embeddings of its side information, and then retrieve the most similar items from the existing items based on the similarity of the embeddings of the two items.
• Weights in EGES: In this part, they visualize the weights of different types of side information for various items. Eight items in different categories are selected, and the weights of all side information associated with these items are extracted from the learned weight matrix A. Several observations are worth noting: 1) The weight distributions of different items are different, which aligns with the assumption that different side information contributes differently to the final representation. 2) Among all the items, the weights of "Item", representing the embeddings of the item itself, are consistently larger than those of all the other side information. This confirms the intuition that the embedding of an item itself remains the primary source of users' behaviors, whereas side information provides additional hints for inferring users' behaviors. 3) Besides "Item", the weights of "Shop" are consistently larger than those of the other side information. This aligns with users' behaviors in Taobao, that is, users tend to buy items from the same shop for convenience and lower prices.
After that, they discuss the implementation and use of the proposed graph embedding method in Taobao. The platform consists of two sub-systems: online and offline. The main components of the online subsystem are the Taobao Personality Platform (TPP) and the Ranking Service Platform (RSP). A typical workflow is shown below:
• When a user launches the Taobao Mobile App, TPP extracts the user's latest information, retrieves the candidate set of items from the offline subsystem, and sends it to the RSP. The RSP uses a fine-tuned deep neural network model to rank the candidate set of items and returns the sorted results to TPP.
• Users’ behaviors during a visit to Taobao are collected and stored as log data in the offline subsystem
The workflow of the offline subsystem, in which the graph embedding method is implemented and deployed, is described below:
• Logs containing users' behaviors are captured, and the item graphs are created based on them. In practice, they select logs from the last three months. Anti-spam processing is applied to the data before generating session-based user behavior sequences. The remaining logs contain approximately 600 billion entries. Then the item graph is created.
• Two practical solutions are employed to implement the graph embedding method: 1) The whole graph is divided into a series of sub-graphs, which can be processed in parallel on Taobao's Open Data Processing Service (ODPS) distributed platform. Each subgraph has approximately 50 million nodes. 2) An iteration-based distributed graph framework in ODPS is used to generate random walk sequences in the graphs. The total number of sequences generated by random walks is about 150 billion.
• 100 GPUs are used in the platform to run the proposed embedding algorithm. All modules of the offline subsystem, including log retrieval, anti-spam processing, item graph generation, sequence generation by random walk, embedding, item-to-item similarity calculation, and map generation, are executed on the platform with 150 billion samples in less than six hours. As a result, the recommendation service can react very quickly to users' latest behaviors.
For evaluation, they use link prediction. The link prediction task is used in the offline experiments as it is a fundamental problem in networks: given a network with a fraction of edges removed, the task is to predict the occurrence of those missing links. Following a common experimental setup, one-third of the edges are randomly selected and removed as the ground truth of the test set, and the rest of the graph is used as the training set. An equal number of node pairs with no connecting edges are randomly selected as negative samples in the test set. The Area Under the Curve (AUC) score is used as the performance metric to evaluate link prediction.
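A hedged sketch of this offline evaluation protocol is shown below: a fraction of edges is held out as positives, an equal number of unconnected node pairs is sampled as negatives, and AUC is computed from an embedding-based score (here a simple dot product). It illustrates the general setup rather than the exact Taobao implementation; in a full evaluation, the embeddings would be trained only on the remaining training graph, whereas random vectors merely make the sketch runnable.

```python
import random
import numpy as np
import networkx as nx
from sklearn.metrics import roc_auc_score

def link_prediction_auc(graph: nx.Graph, embeddings: dict, test_fraction=1/3, seed=42):
    """Hold out a fraction of edges, sample equally many non-edges,
    and score pairs by the dot product of their node embeddings."""
    rng = random.Random(seed)
    edges = list(graph.edges())
    rng.shuffle(edges)
    n_test = int(len(edges) * test_fraction)
    positives = edges[:n_test]

    nodes = list(graph.nodes())
    negatives = []
    while len(negatives) < n_test:
        u, v = rng.choice(nodes), rng.choice(nodes)
        if u != v and not graph.has_edge(u, v):
            negatives.append((u, v))

    pairs = positives + negatives
    labels = [1] * len(positives) + [0] * len(negatives)
    scores = [float(np.dot(embeddings[u], embeddings[v])) for u, v in pairs]
    return roc_auc_score(labels, scores)

# Example with random embeddings on a small built-in graph (illustration only).
g = nx.karate_club_graph()
emb = {n: np.random.rand(8) for n in g.nodes()}
print("AUC:", link_prediction_auc(g, emb))
```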
Online experimentation with an A/B testing framework is, of course, always accompanied by product tracking and analysis. The experimental goal is the click-through rate (CTR) on the homepage of the mobile Taobao app. After implementing the graph embedding steps above, a set of similar items is generated for each item as recommendations. The final recommendation result on Taobao's homepage is generated by a ranking engine based on a deep neural network model.
As mentioned earlier, the quality of similar items has a direct impact on recommendation results. Therefore, the recommendation performance, e.g., CTR, indicates the effectiveness of the different methods.
For the second approach, Shoujin Wang, Liang Hu, Yan Wang, Xiangnan He, Quan Z. Sheng, Mehmet A. Orgun, Longbing Cao, Francesco Ricci, and Philip S. Yu (2021) [21] presented a conference paper introducing Graph Learning (GL) based Recommender Systems (GLRS). In this work, they motivate why recommender systems are well supported by graph learning: such a characteristic is even more obvious in RSs, where the objects considered, including users, items, attributes, and context, are tightly connected with one another and influence one another via various relations [Hu et al., 2014], as shown in Figure 2.2. In practice, various types of graphs arise from the data employed by RSs, and they can significantly contribute to the quality of the recommendations.
Figure 2.2 The demonstration of graph learning based recommender systems
Here, they also show that some challenges still remain, along with open research directions, as follows:
1. Self-evolving RS with dynamic graph learning. In real-world RSs, users, items, and interactions among them evolve over time. Such dynamics directly impact user and requirement modeling, and can even lead to significant changes in recommended outcomes over time. However, this issue is still underestimated in existing GLRS. Therefore, designing self-evolving RSs on dynamic graphs is a promising future research direction.
2. Explainable RS with causal graph learning. Causal graph learning is an important technique for discovering causal relationships between items or actions, but current techniques are still far from fully understanding the reasons and intentions behind users' behavior.
3. Cross-domain RS with multi-graph learning. In practice, data and interactions for recommendations can come from multiple domains, including different sources, systems, and modalities. These are interrelated and should collectively contribute to recommendations. As a result, interactions in cross-domain RSs can be represented by multiplexed networks in which nodes may or may not be connected to other nodes in other layers. Consequently, a new generation of cross-domain RSs may work with multiplexed graph learning.
Comparison of different graph-based algorithms
Comparing different graph-based recommendation algorithms can be a challenging task, as different algorithms are designed to address different problems and have different strengths and weaknesses. However, some common criteria for comparison include accuracy, scalability, robustness to noise and sparsity, and the ability to handle cold start and data heterogeneity.
A comparison of several graph-based algorithms is given below:
Table 2.1 Comparison of different graph-based algorithms
| Algorithm | Description | Task | Advantages | Disadvantages |
|---|---|---|---|---|
| Matrix factorization | A type of machine learning algorithm that learns latent representations of users and items | | | Can suffer from the cold start problem of users and items |
| Random walk-based | A type of machine learning algorithm that learns latent representations of users and items by simulating random walks on a graph | | Can handle sparse and noisy data better than matrix factorization-based methods | May be less scalable and can struggle with the cold start problem |
| Deep learning-based | A type of machine learning algorithm that learns latent representations of users and items by using deep neural networks | | The most powerful and flexible; can handle complex relationships between users and items | |
| Hybrid | A type of machine learning algorithm that combines two or more of the methods above | | Can combine the advantages of different methods | Can be more complex and difficult to implement |
| Breadth-First Search (BFS) | A tree traversal algorithm that starts at the root node and explores all of the nodes at the current level before moving on to the next level | Link prediction | Simple and efficient | Can be slow for large graphs |
| Depth-First Search (DFS) | A tree traversal algorithm that starts at the root node and explores all of the nodes in one branch before moving on to another branch | Link prediction | Can be faster than BFS for large graphs | Can be difficult to implement for graphs with cycles |
| Dijkstra's algorithm | A shortest path algorithm that finds the shortest path between two nodes in a weighted graph | Link prediction | Efficient for finding shortest paths | Can be slow for large graphs |
| Floyd-Warshall | A shortest path algorithm that finds the shortest paths between all pairs of nodes in a weighted graph | Link prediction | Can handle negative weights | Can be slow for large graphs |
| (not specified) | A shortest path algorithm that finds the shortest paths between all pairs of nodes in a weighted graph, even if the graph contains negative weights | Link prediction | The most efficient shortest path algorithm | Can be slow for large graphs |
| PageRank | A ranking algorithm that computes the importance of a node in a graph | Ranking | Can be used to rank web pages, social media users, and other entities | Can be biased towards nodes with high in-degree |
| HITS | A ranking algorithm that computes the importance of a node in a graph | Ranking | Can be used to rank web pages, social media users, and other entities | Can be biased towards nodes with high in-degree and high out-degree |
| SimRank | A similarity measure that computes the similarity between two nodes in a graph | Similarity | Can be used to find similar users, products, and other entities | Can be slow for large graphs |
| Node2Vec | A random walk algorithm that learns latent representations of nodes in a graph | | Can be used for a variety of tasks, such as recommendation systems and link prediction | Can be slow for large graphs |
| DeepWalk | A random walk algorithm that learns latent representations of nodes in a graph | | Can be used for a variety of tasks, such as recommendation systems and link prediction | Can be slow for large graphs |
| LINE | A network embedding algorithm that learns latent representations of nodes in a graph | | Can be used for a variety of tasks, such as recommendation systems and link prediction | Can be computationally expensive to train |
| GCN | A deep learning model that learns latent representations of nodes in a graph | | Can be used for a variety of tasks, such as recommendation systems and node classification | Can be computationally expensive to train |
The advantages of using UMAP and FAISS in combination with DeepWalk and Node2Vec
a) Random Walks on the graph
DeepWalk and Node2Vec are two popular algorithms for learning node embeddings in graph-based recommendation systems. Both algorithms are based on random walks, where they generate sequences of nodes to represent random walks on the graph.
DeepWalk uses a simple random walk approach, randomly selecting neighbors and moving to them for a fixed number of steps. It then utilizes Skip-Gram models to learn low-dimensional representations of the nodes.
Node2Vec, on the other hand, incorporates the graph's local structure by assigning probabilities to neighbors based on their distance from the previous node in the random walk. This allows Node2Vec to explore diverse parts of the graph and capture various structural information.
By combining DeepWalk and Node2Vec, the algorithm benefits from capturing both global and local graph structures. This combination is expected to result in more informative node embeddings, enhancing the performance of downstream tasks such as product recommendations and clustering.
b) Preliminaries
In this section, we give an overview of graph embedding and one of the most popular methods, DeepWalk [15], based on which we propose our graph embedding methods in the matching stage. Given a graph 𝒢 = (𝒱, ℰ), where 𝒱 and ℰ represent the node set and the edge set, respectively, graph embedding aims to learn a low-dimensional representation for each node 𝑣 ∈ 𝒱 in the space ℝ^𝑑, where 𝑑 ≪ |𝒱|. In other words, our goal is to learn a mapping function Φ: 𝒱 → ℝ^𝑑, i.e., representing each node in 𝒱 as a 𝑑-dimensional vector.
In [13, 14], Word2Vec was proposed to learn the embedding of each word in a corpus. Inspired by Word2Vec, Perozzi et al. proposed DeepWalk to learn the embedding of each node in a graph [15]. They first generate sequences of nodes by running random walks in the graph, and then apply the Skip-Gram algorithm to learn the representation of each node. To preserve the topological structure of the graph, they need to solve the following optimization problem:

minimize_Φ ∑_{𝑣∈𝒱} ∑_{𝑐∈𝑁(𝑣)} − log Pr(𝑐 ∣ Φ(𝑣))

where 𝑁(𝑣) is the neighborhood of node 𝑣, which can be defined as nodes within one or two hops from 𝑣, and Pr(𝑐 ∣ Φ(𝑣)) defines the conditional probability of having a context node 𝑐 given a node 𝑣.
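As a minimal sketch of this step, the snippet below feeds pre-generated random walk sequences into gensim's Word2Vec in Skip-Gram mode (sg=1) to learn node embeddings; the walk sequences are toy data and the hyperparameters are illustrative assumptions rather than tuned values.

```python
from gensim.models import Word2Vec

# Toy node sequences produced by random walks (node IDs as strings).
walks = [
    ["A", "B", "C", "B", "D"],
    ["C", "B", "A", "E", "A"],
    ["D", "B", "C", "E", "C"],
]

# Skip-Gram (sg=1) maximizes log Pr(context | node), matching the
# DeepWalk objective above; vector_size is the embedding dimension d.
model = Word2Vec(
    sentences=walks,
    vector_size=16,   # d
    window=2,         # context window over the walk
    min_count=0,
    sg=1,             # use Skip-Gram rather than CBOW
    workers=1,
    epochs=50,
    seed=42,
)

embedding_of_B = model.wv["B"]
print(embedding_of_B.shape)          # (16,)
print(model.wv.most_similar("B"))    # nodes that co-occur with B in the walks
```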
Then, we will use the Node2Vec algorithm for graph embedding, enhancing the recommendation system's capability to capture complex relationships and structural properties as dense vectors in a lower-dimensional space. By preserving the initial structure of the graph, Node2Vec maps nodes to embeddings in such a way that similar nodes in the graph have similar embeddings. These embeddings serve as vectors representing each node in the network, enabling more effective and relevant recommendations for users.
Node2Vec employs a combination of breadth-first search (BFS) and depth-first search (DFS) strategies to explore the graph and generate random walks. By considering both the breadth and depth of the neighborhood around a node, Node2Vec captures both the homophily (similarity of connected nodes) and structural equivalence (similarity of nodes with similar connections) aspects of the graph.
Figure 2.3 BFS and DFS search strategies from node 𝑢 (𝑘 = 3)
A random walk can be used to explore the relationships between steps and distances. In a random walk, each step is determined probabilistically, meaning that the direction of movement at each time index is based on a probability. The algorithm analyzes the relationship between each step and its distance from the starting point.
To calculate the probabilities of moving from one node to another in a random walk, the Node2Vec algorithm introduces a formula. This formula considers the normalization constant 𝑍 and the unnormalized transition probability 𝜋_{𝑣𝑥} between nodes 𝑣 and 𝑥. If there is no connection between the nodes, the probability is 0; otherwise, a normalized probability is determined:

Pr(𝑐_𝑖 = 𝑥 ∣ 𝑐_{𝑖−1} = 𝑣) = 𝜋_{𝑣𝑥} / 𝑍 if (𝑣, 𝑥) ∈ ℰ, and 0 otherwise.
The paper acknowledges that introducing a bias to influence random walks can be done by assigning weights to edges. However, this approach does not work for unweighted networks. To address this, the authors propose a guided random walk controlled by two parameters, 𝑝 and 𝑞. Parameter 𝑝 determines the probability of returning to the previous node, while parameter 𝑞 determines the probability of exploring previously unseen parts of the graph. The search bias is defined as:

𝛼_{𝑝𝑞}(𝑡, 𝑥) = 1/𝑝 if 𝑑_{𝑡𝑥} = 0, 1 if 𝑑_{𝑡𝑥} = 1, and 1/𝑞 if 𝑑_{𝑡𝑥} = 2

The shortest path between nodes 𝑡 and 𝑥 is represented by 𝑑_{𝑡𝑥}, and it can be observed visually in the accompanying illustration.
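The sketch below shows one way to implement this biased step: given the previous node 𝑡 and the current node 𝑣, each neighbor 𝑥 is weighted by the bias 𝛼 above (1/𝑝, 1, or 1/𝑞 depending on 𝑑_{𝑡𝑥}) before sampling. It is a simplified illustration of the rule, not the optimized alias-sampling procedure used in the reference implementation.

```python
import random
import networkx as nx

def node2vec_step(graph: nx.Graph, prev, curr, p: float, q: float):
    """Sample the next node of a node2vec walk from `curr`, given the
    previous node `prev`, using the search bias alpha_pq(prev, x)."""
    neighbors = list(graph.neighbors(curr))
    weights = []
    for x in neighbors:
        if x == prev:                     # d_tx = 0: return to the previous node
            alpha = 1.0 / p
        elif graph.has_edge(prev, x):     # d_tx = 1: stay close to the previous node
            alpha = 1.0
        else:                             # d_tx = 2: move further away
            alpha = 1.0 / q
        weights.append(alpha * graph[curr][x].get("weight", 1.0))
    return random.choices(neighbors, weights=weights, k=1)[0]

# Example walk of length 5 on a small built-in graph.
g = nx.karate_club_graph()
walk = [0, 1]
for _ in range(3):
    walk.append(node2vec_step(g, walk[-2], walk[-1], p=1.0, q=0.5))
print(walk)
```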
Figure 2.4 Illustration of the random walk procedure in node2vec. The walk just transitioned from 𝑡 to 𝑣 and is now evaluating its next step out of node 𝑣. Edge labels indicate search biases 𝛼.
c) The advantages of using UMAP and FAISS in combination with DeepWalk and Node2Vec
UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique used to visualize high-dimensional data in 2D or 3D space. It can help understand relationships between nodes in a graph, making it valuable for graph-based recommendation systems. By visualizing item or user embeddings, UMAP provides insights into the data structure and identifies clusters of similar items or users. It also improves efficiency in downstream tasks like clustering or nearest neighbor search.
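A minimal sketch of this visualization step is given below, assuming a matrix of item embeddings is already available; the umap-learn parameters are illustrative defaults rather than tuned values, and the embeddings here are random placeholders.

```python
import numpy as np
import umap
import matplotlib.pyplot as plt

# Assume `item_embeddings` is an (n_items, d) matrix of learned embeddings.
item_embeddings = np.random.rand(500, 64).astype("float32")

reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
coords_2d = reducer.fit_transform(item_embeddings)   # shape: (500, 2)

# Scatter plot of the projected embeddings; clusters of similar items
# appear as dense groups of points in the 2D projection.
plt.scatter(coords_2d[:, 0], coords_2d[:, 1], s=5)
plt.title("UMAP projection of item embeddings")
plt.show()
```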
FAISS (Facebook AI Similarity Search) is a library for fast approximate nearest neighbor search and clustering of high-dimensional vectors. In recommendation systems, FAISS performs nearest neighbor search on learned embeddings of items or users, generating recommendations or finding similar items. Its scalability and efficiency suit large-scale recommendation systems.
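The sketch below indexes L2-normalized item embeddings with a flat inner-product FAISS index and retrieves the nearest neighbors of one item, which corresponds to cosine-similarity search. The embedding matrix is a random placeholder, and at larger scale an approximate index such as IVF or HNSW would typically replace IndexFlatIP.

```python
import numpy as np
import faiss

# Placeholder item embeddings: one d-dimensional vector per item.
d = 64
item_embeddings = np.random.rand(10_000, d).astype("float32")

# L2-normalize so that inner product equals cosine similarity.
faiss.normalize_L2(item_embeddings)

index = faiss.IndexFlatIP(d)        # exact inner-product search
index.add(item_embeddings)          # item i is stored at position i

# Retrieve the 10 most similar items to item 0 (the first hit is item 0 itself).
query = item_embeddings[:1]
scores, neighbors = index.search(query, 10)
print(neighbors[0])
print(scores[0])
```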
Session-based Recommendation System
Session-based recommendation systems are tailored for personalized recommendations based on users' sequential interactions within a session. They excel in capturing temporal dynamics and short-term preferences. Two notable models for session-based recommendation are Neural Attentive Recommendation Machines (NARM) and Heterogeneous Global Graph Neural Networks (HG-GNN). These models are ideal for clickstream datasets, as they effectively capture user behavior patterns and provide recommendations based on these patterns. Utilizing NARM or HG-GNN becomes advantageous when dealing with session-based problems, ensuring accurate and relevant suggestions for users.
a) NARM
NARM (Neural Attentive Recommendation Machines) [26] is a deep learning-based model specially designed for session-based recommendation systems. It effectively captures sequential patterns in user sessions using neural networks and attention mechanisms to deliver personalized recommendations based on session history. The model incorporates both global and local context vectors to consider long-term user preferences and short-term session dynamics. The global context vector aggregates historical user interactions across multiple sessions, while the local context vector emphasizes recent actions within the current session. By combining these context vectors, NARM accurately captures user preferences and session characteristics. The model architecture includes an Embedding Layer, a bidirectional GRU for temporal dependencies, a Global Encoder to compute the global context vector 𝑐_𝑡^g, a Local Encoder with an attention mechanism for the local context vector 𝑐_𝑡^l, combined as 𝑐_𝑡 = [𝑐_𝑡^g; 𝑐_𝑡^l] = [ℎ_𝑡^g; ∑_{𝑗=1}^{𝑡} 𝛼_{𝑡𝑗} ℎ_𝑗^l], and a Decoder to generate recommendation scores for all items in the recommendation space. This comprehensive approach allows NARM to handle the sparsity and short-term preferences typical of session data and deliver effective recommendations.
Figure 2.5 NARM Architecture
b) Heterogeneous Global Graph Neural Networks
HG-GNN (Heterogeneous Global Graph Neural Networks) [27] is an advanced session-based recommendation algorithm that surpasses other session-based recommenders like LESSR (Location Embedding and Sequential Session Representation), A-PGNN (Attention-based Personalized Graph Neural Network), and H-RNN (Hierarchical Recurrent Neural Network). It stands out by leveraging both item transitions and user preferences through a unified framework. The architecture of HG-GNN consists of three key components: a heterogeneous global graph, a current preference learning module, and a general preference learning module. The heterogeneous global graph incorporates user and item nodes with various types of edges, including adjacency edges between items in the same session, co-occurrence edges based on item similarity, edges between users with similar interactions, and edges between users and items. The Current Preference Learning Module encodes the sequence of items in the current session using positional encoding to capture temporal information. Meanwhile, the General Preference Learning Module encodes user preferences by comparing user embeddings with the embeddings of the current session. By effectively utilizing heterogeneous graphs to represent item transitions and user-item similarities, HG-GNN distinguishes itself as a powerful session-based recommendation algorithm.
Summary
Recommender systems predict user preferences for items and are essential for e-commerce platforms. They use collaborative filtering, content-based filtering, or hybrid approaches. Graph-based learning is a newer method where user-item interactions are represented as a graph, allowing for more accurate and generalizable recommendations. Despite advantages like capturing complex relationships, scalability and the cold start problem remain challenges. Future research could focus on improving accuracy, scalability, and addressing the cold start problem. Additionally, combining different graph-based techniques, enhancing interpretability, and addressing ethical and privacy implications are areas for further exploration.
IMPLEMENTATION
Proposed Methodologies
This project focuses on computing pairwise similarities between items based on users' behaviors. The main approach involves constructing an item graph from users' behavior history and utilizing state-of-the-art graph embedding techniques. The project aims to recommend the right products to customers, improving user experience and satisfaction. To address the challenge of obtaining embeddings for items with few interactions, the project proposes using Graph Embedding with Side Information (GES) and Enhanced Graph Embedding with Side Information (EGES). GES takes into account the idea that items belonging to the same category or brand should be closer in the embedding space. EGES incorporates a coefficient mechanism to learn embeddings with side information. However, given time constraints, the project mainly implements the Base Graph Embedding (BGE) technique. The item graph is developed based on session-based behaviors, capturing cooperative user behaviors. Popular algorithms like DeepWalk and Node2Vec are employed to learn node embeddings, which are optimized with the Skip-Gram model and encode the similarities and connections between nodes for recommendation systems and community detection. This project contributes to the improvement of personalized recommendation systems in e-commerce platforms through advanced graph embedding techniques [22, 23, 24].
In recent years, there has been a significant focus on analyzing graphs and learning node embeddings. Researchers have developed different approaches, broadly categorized into three main groups. Factorization strategies aim to solve the adjacency matrix of the graph while preserving both first- and second-order proximity, effectively capturing relationships between nodes. Deep learning strategies utilize neural networks to capture non-linear patterns and complexities in the graph, enhancing the model's ability to represent the graph structure. The DeepWalk method, on the other hand, utilizes random walk-based algorithms to generate node sequences, making it particularly efficient for large-scale networks.
Once a weighted directed item graph, denoted as 𝒢 = (𝒱, ℰ), is obtained, we can employ popular algorithms such as DeepWalk and Node2Vec to learn node embeddings. These algorithms involve generating node sequences through random walks and applying the Skip-Gram algorithm to optimize the embeddings. The transition probability during the random walk process is determined based on the adjacency matrix and edge weights, influencing the likelihood of transitioning between nodes.
The resulting node embeddings capture the unique characteristics and relationships of nodes in the graph. Both DeepWalk and Node2Vec leverage random walks and the Skip-Gram model to generate these embeddings, which encode similarities and connections between nodes. Such embeddings can be utilized for various downstream tasks, including recommendation systems and community detection.
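As a sketch of the walk-generation step on the weighted item graph, the function below starts walks from every node and chooses each next node with probability proportional to the outgoing edge weight, producing the node sequences that are later fed to the Skip-Gram model. It is a simplified, single-process illustration of the procedure rather than a distributed implementation.

```python
import random
import networkx as nx

def generate_weighted_walks(graph: nx.DiGraph, walks_per_node: int, walk_length: int, seed=42):
    """Generate random walks where the next node is sampled in proportion
    to the weight of the outgoing edge (the transition probability)."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in graph.nodes():
            walk = [start]
            while len(walk) < walk_length:
                current = walk[-1]
                successors = list(graph.successors(current))
                if not successors:
                    break
                weights = [graph[current][nxt].get("weight", 1.0) for nxt in successors]
                walk.append(rng.choices(successors, weights=weights, k=1)[0])
            walks.append([str(node) for node in walk])
    return walks

# Toy weighted directed item graph and a few walks over it.
g = nx.DiGraph()
g.add_weighted_edges_from([("A", "B", 3), ("B", "C", 1), ("B", "D", 2), ("D", "A", 1)])
print(generate_weighted_walks(g, walks_per_node=2, walk_length=4))
```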
Figure 3.1 Overview of graph embedding in Taobao: (a) Users’ behavior sequences: One session for user 𝑢1, two sessions for user 𝑢2 and 𝑢3; these sequences are used to construct the item graph; (b) The weighted directed item graph 𝒢 = (𝒱, ℰ); (c) The sequences generated by random walk in the item graph;
Data collection and its characteristics
The dataset used in this project is collected from REES46 Open CDP [1], an open-source customer data platform utilized by large eCommerce stores with over 15 million visitors per month. The dataset encompasses customer behaviour and comprises more than 55 million event entries, spanning the month of January 2020, from a large multi-category online store.
It is worth noting that this dataset closely resembles real-world data from an e-commerce company, making it an ideal choice for this research. However, it is essential to acknowledge that the results obtained from this dataset may be influenced by various other factors and targets within the e-commerce domain. The complexity and diversity of customer behaviours, coupled with the dynamic nature of the market, contribute to the multitude of events and interactions captured in the dataset.
Furthermore, the dataset is of substantial size, providing ample data for training the models. The large volume of data ensures that the models can capture a comprehensive representation of customer behaviours, enabling a more accurate analysis and evaluation of the proposed approaches.
By leveraging this dataset, which closely mirrors real-world e-commerce data, this research aims to provide insights and practical applications for understanding customer behaviour and improving various aspects of e-commerce operations
Figure 3.2 Dataset Overview
Each row in the file represents an event. All events are related to products and users. Each event is like a many-to-many relation between products and users.
| Property | Description |
|---|---|
| event_time | Time when the event happened (in UTC) |
| event_type | Kind of event |
| product_id | ID of a product |
| category_id | Product's category ID |
| category_code | Product's category taxonomy (code name), if it is possible to derive it. It is usually present for meaningful categories and skipped for different kinds of accessories |
| brand | Downcased string of the brand name; can be missing |
| price | Float price of a product |
| user_id | Permanent user ID |
| user_session | Temporary user's session ID. It is the same within each user's session and changes every time the user comes back to the online store after a long pause |
Figure 3.3 Attributes of raw dataset
Event types can be:
- view: a user viewed a product
- cart: a user added a product to the shopping cart
- purchase: a user purchased a product
For example: user user_id, during session user_session, added to the shopping cart (property event_type equals cart) the product product_id of brand brand, in category category_code, with price price, at event_time.
Multiple purchases per session: A session can have multiple purchase events, and they can represent a single order.
Assumptions and notes for this analysis:
• Each unique session is a visit
• There are no remove_from_cart events in this dataset
• A session may have just one purchase event and no related view or cart event
• A session can have multiple purchase events
• category_code is usually present for meaningful categories and skipped for the rest
• Price is assumed to be in US Dollars
Data cleaning and preparation
3.3.1 Explore faster loading and lower memory usage
We will use some techniques to load data into Pandas faster and use less memory (a loading sketch follows the list below):
• drop columns: select a subset of columns relevant for analysis
• identify categorical columns: change the dtype of category
• parse_dates: change columns with datetime to type DateTime
• set DateTime column as the index
• use smaller dtypes: we do not see any need as of now
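The following is a minimal sketch applying the loading techniques above; the file name and the exact column subset are illustrative assumptions rather than the project's verbatim code.

```python
import pandas as pd

# Subset of the raw attributes actually needed for the analysis.
USE_COLS = ["event_time", "event_type", "product_id", "category_id",
            "category_code", "brand", "price", "user_id", "user_session"]

# Low-cardinality string columns become pandas categoricals to save memory.
DTYPES = {"event_type": "category", "category_code": "category", "brand": "category"}

df = pd.read_csv(
    "2020-Jan.csv",              # illustrative file name
    usecols=USE_COLS,
    dtype=DTYPES,
    parse_dates=["event_time"],  # parse the timestamp while loading
)

# Use the timestamp as the index for fast time-based slicing and resampling.
df = df.set_index("event_time").sort_index()
```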
Having loaded the data into a Pandas dataframe, we process it as follows: drop duplicates, replace missing values, and check for outliers.
The timestamp is unique if we consider seconds, but 137,596 duplicate rows need to be dropped to avoid overcounting the number of purchases, prices, etc. Moreover, nearly 12% of brand and 9% of category_code values are missing, respectively. We therefore fill missing values with a "nocategory" value in the category_code attribute and a "nobrand" value in the brand attribute. Price is the only column with possible outliers; its maximum, minimum, and standard deviation all seem reasonable.
Next, we check whether the dataset has a cold-start problem. Recommender systems struggle to recommend items when very little data is available about a user or item; this is called the cold-start problem. Fortunately, the dataset does not appear to suffer from it: the cold_start_prob variable is 100% for all categories, i.e., no category lacks data on unique users during the first 7 days. We can therefore conclude that there is no cold-start problem in this dataset.
We are now ready to dive into exploratory data analysis. In framework design, a key lesson learnt and reinforced is to begin with the end goal in mind: early identification of the stages of data flow, high-level data summaries, and the relevant columns greatly speeds up programming against large datasets.
Figure 3.4 Daily Visits summary
Based on the statistical results, three insights can be derived:
a. Daily Visits Statistics: The average number of daily visits to the e-commerce platform is approximately 452,176, with a standard deviation of 50,256.82. The minimum and maximum values indicate that the number of visits ranges from 393,876 to 633,131. This suggests that the platform experiences a considerable amount of daily traffic, with variations across different days.
b. Visitor Statistics by Dates: Analyzing the visitor statistics by date reveals interesting patterns. For instance, Fridays and Sundays tend to have relatively higher average visit counts, around 470,561 and 464,869, respectively, while Wednesdays show the lowest average visit count of 427,382. These variations across the days of the week can provide insights into user behavior and inform marketing strategies targeted towards specific days.
c. Conversion Rates Statistics by Dates: The conversion rates, representing the percentage of visitors who make a purchase, are relatively consistent across the days of the week. The average conversion rate is approximately 0.06, with a standard deviation of 0.01 and values ranging from 0.05 to 0.08, indicating that the platform maintains a relatively stable conversion rate. Monitoring these conversion rates over time can help assess the effectiveness of marketing campaigns and identify potential areas for improvement.
These insights provide initial observations about daily visits, visitor statistics by date, and conversion rates on the e-commerce platform. Further exploration of these patterns can support data-driven decisions and optimization of the platform's performance.
Most Popular Brands During the Month:
The bar chart reveals the top 10 brands by number of views. Samsung has the highest number of views, reaching approximately 2.6 million, which suggests a significant online presence and a large audience. Apple and Xiaomi also perform strongly, with around 2.0 million and 1.4 million views, respectively. These three brands stand out as the most popular and have a substantial impact on user engagement.
Contribution of Brands to Monthly Sales:
The pie chart shows the distribution of sales among brands. Apple holds the largest market share, contributing approximately 48.2% of total monthly sales, indicating a dominant position in sales performance. Samsung and Xiaomi account for around 21.4% and 4.6% of sales, respectively, making them significant contributors to overall revenue. It is worth noting that the "others" category collectively represents 10% of sales, reflecting the presence of various smaller brands in the market.
These insights provide valuable information about brand popularity and sales performance. They allow stakeholders to identify the key players in the market, prioritize marketing efforts, and make informed decisions regarding partnerships or promotional activities. The data emphasizes the importance of brands like Samsung and Apple in driving user engagement and generating revenue for the eCommerce platform.
Views by product category and subcategories:
• The chart provides an overview of the number of views and visits for different product categories and subcategories
• It shows the distribution of views across categories, highlighting those with the highest and lowest numbers of views; the top categories and subcategories include construction-tools, appliances-kitchen, apparel-shoes and electronic-audio, which also make the highest contribution to overall sales
• This information helps identify the popularity of different product categories, reveals customer interests and preferences, and allows for quick identification of the key revenue-generating categories and subcategories
• Next, the chart shows how effectively views are converted into actual purchases for each product category
• Higher conversion rates indicate that customers are more likely to make a purchase in those categories and subcategories
• This insight helps prioritize efforts and resources towards categories with higher conversion rates to maximize revenue potential
• The chart also illustrates the turnover (in millions) for each product category. It provides an overview of the revenue generated by different categories, highlighting those with the highest turnover, which helps identify the most profitable categories and informs strategic decisions about resource allocation and marketing efforts. The top categories and subcategories by turnover contribute significantly to overall revenue, so this insight identifies the key revenue drivers and allows targeted actions to maximize sales in these high-performing subcategories. Conversely, the bottom of the ranking can be used to identify areas for improvement or potential opportunities for boosting sales in underperforming subcategories
By analyzing turnover at both the category and subcategory levels, stakeholders can gain insights into revenue distribution and identify areas of strength and weakness within the product offerings. This information supports decisions to optimize sales strategies, product assortment, and resource allocation for improved overall performance and enhanced customer engagement.
Insights from the provided results:
1 Distribution of the average spend per customer:
• The distribution of average spend per customer shows that the majority of customers have an average spend in the range of $0 to $500
• There is a peak around the $100 to $200 range, indicating that many customers have an average spend within this range
• This information can be useful for understanding customer purchasing patterns and targeting marketing efforts towards specific price ranges
2 Customer conversion funnel:
• The customer conversion funnel illustrates the progression of customers from views to carts and finally to purchases
• The majority of customers start with views, followed by a smaller proportion adding items to carts, and an even smaller proportion making purchases
• This funnel visualization highlights the drop-off at each stage and emphasizes the need to optimize conversion rates from one stage to the next
• Strategies to improve customer engagement and incentivize purchases can be implemented based on these insights
3 RFM segmentation:
• The RFM segmentation categorizes customers into distinct segments based on recency, frequency, and monetary value (a minimal computation sketch follows this list of insights)
• The segmentation reveals different customer groups such as "Absolute Treasures", "Champions", "Loyal", "Promising", and "Require Attention", among others
• Each segment represents a different level of customer value and engagement, providing opportunities for targeted marketing and personalized strategies
• By understanding the characteristics of each segment, businesses can tailor their marketing efforts and provide relevant offers to maximize customer satisfaction and loyalty
4 Summary statistics:
• The summary statistics provide key metrics on visits, shoppers, number of purchases, and total sales
• These metrics give an overview of the overall customer engagement and sales performance
• For example, there were a total of 13,754,046 visits and 8,716,484 unique shoppers during the analyzed period
• The total number of purchases was 834,945, with a total sales value of $261,034,870
• These statistics help assess the scale and impact of customer activity and sales revenue
=> These insights provide valuable information for understanding customer behavior, identifying areas for improvement, and formulating effective marketing strategies to drive customer engagement and increase sales
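As referenced above, the following is a minimal pandas sketch of an RFM computation. It assumes a dataframe `df` holding the event log with the event_time, event_type, user_id, and price columns from Figure 3.3; the quartile scoring scheme and variable names are illustrative, not the exact segmentation used in the project.

```python
import pandas as pd

# `df` is assumed to be the event log described in Figure 3.3.
purchases = df[df["event_type"] == "purchase"].copy()
snapshot = purchases["event_time"].max() + pd.Timedelta(days=1)

rfm = purchases.groupby("user_id").agg(
    recency=("event_time", lambda s: (snapshot - s.max()).days),  # days since last purchase
    frequency=("event_time", "count"),                            # number of purchase events
    monetary=("price", "sum"),                                     # total spend
)

# Score each dimension into quartiles (1 = worst, 4 = best) and combine.
rfm["r"] = pd.qcut(rfm["recency"], 4, labels=[4, 3, 2, 1]).astype(int)
rfm["f"] = pd.qcut(rfm["frequency"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
rfm["m"] = pd.qcut(rfm["monetary"], 4, labels=[1, 2, 3, 4]).astype(int)
rfm["rfm_score"] = rfm[["r", "f", "m"]].sum(axis=1)
```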
5 Group of customers by cluster:
- The top 10% of customers have a significantly higher average purchase amount compared to the regular customers The mean purchase amount for the top 10% is
- The minimum purchase amount in this group is $1,536.91, while the maximum purchase amount is as high as $294,910.73
- The total sales from these top 10% customers account for $145,045,443, which is a substantial portion of the total sales ($261,034,870)
- This indicates that a small segment of customers contributes significantly to the overall revenue of the business
- The regular customers (excluding the top 10%) have a relatively lower average purchase amount of $358.88, with a standard deviation of $352.97
- The minimum purchase amount for regular customers is $0.80, while the maximum is $1,536.87
- The total sales from regular customers amount to $115,989,427, which is still a substantial contribution to the overall sales
- The analysis includes four clusters based on customer behavior or characteristics
- The four clusters have average purchase amounts of $508.25 (Cluster 0), $1,072.44 (Cluster 1), $139.30 (Cluster 2), and $4,039.03 (Cluster 3)
- Cluster 0 represents customers with relatively higher purchase amounts, while Cluster 2 represents customers with lower purchase amounts
- Cluster 1 has a moderate average purchase amount, while Cluster 3, which corresponds to the top 10% customers, has the highest average purchase amount by a significant margin
- These clusters provide insights into different customer segments and their varying levels of engagement and spending patterns
=> These insights highlight the importance of identifying and understanding customer segments based on their purchase behavior The top 10% customers, with their higher average purchase amounts, play a crucial role in driving revenue By analyzing customer clusters and their purchase patterns, businesses can develop targeted strategies to retain and engage both the top customers and regular customers, ultimately maximizing sales and customer satisfaction
• The cohort analysis presents the customer count for different cohorts over time, where each row represents a specific cohort and each column represents a week
• The percentage values indicate the retention rate of customers from each cohort over time
- Cohort w1/2020 (Week 1, 2020) has the highest initial customer count, with a retention rate of 32.88% in the first week, which decreases to 28.71% in the second week, 25.03% in the third week, and 19.18% in the fourth week
- Cohort w2/2020 (Week 2, 2020) starts with a lower customer count but shows a similar decreasing trend in retention rate over time
- The pattern continues for subsequent cohorts, with a decline in customer count and retention rate
=> These insights provide valuable information on customer retention and sales performance over time. The cohort analysis highlights retention rates for different cohorts, showing how customer counts change over the weeks, while the weekly sales chart gives an overview of the sales performance for each week. Businesses can use these insights to evaluate customer retention strategies and identify opportunities for improving sales performance in specific weeks or cohorts.
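A minimal pandas sketch of the weekly cohort retention computation described above; it assumes a dataframe `df` holding the event log with event_time and user_id columns, and all names are illustrative.

```python
import pandas as pd

# `df` is assumed to be the event log; each user is assigned to the week of their first visit.
df = df.copy()
df["week"] = df["event_time"].dt.to_period("W")
df["cohort_week"] = df.groupby("user_id")["week"].transform("min")

# Number of active users per (cohort, week) cell.
cohorts = (df.groupby(["cohort_week", "week"])["user_id"]
             .nunique()
             .reset_index(name="users"))
cohorts["period"] = (cohorts["week"] - cohorts["cohort_week"]).apply(lambda d: d.n)

# Retention matrix: share of each cohort still active n weeks after first visit.
pivot = cohorts.pivot(index="cohort_week", columns="period", values="users")
retention = pivot.div(pivot[0], axis=0)
```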
Explanation of how the data was transformed into a graph-based representation
To construct the graphs for our analysis, we utilize the different event types, namely View, Cart, and Purchase. Each event type allows us to create a specific graph with a distinct representation of products.
In the first case, we construct a co-occurrence graph based on PDP (Product Detail Page) views. This graph captures the sequential patterns of PDPs viewed together by a user in a single session, thereby facilitating the identification of associations and correlations between products.
In the second case, we create a co-occurrence graph for products that are added to the cart together. This graph sheds light on the products that users frequently add to their carts simultaneously, revealing potential relationships and preferences among items.
In the last case, we develop a co-occurrence graph for products that are bought together. This graph showcases the products that are commonly purchased in conjunction with one another, offering insights into complementary or related items.
By constructing these graphs for each event type, we can tailor our recommendation system to suggest products that are similar or relevant to the item currently being viewed by the user, which is the target use case of this project. For instance, if a user is browsing a television, the recommendation system can leverage these graphs to suggest other television models or related electronics that align with the user's interests and preferences.
Figure 3.8 Flow of a user in a session
To optimize the handling of the large clickstream dataset, we implemented a two-pronged approach. First, we converted the data into the Parquet file format, benefiting from its efficient compression and encoding; this reduced data size and improved query performance, enabling us to store and access the data more efficiently. Second, we integrated DuckDB, a high-performance analytical database, to optimize memory usage during data processing; DuckDB's advanced memory management allowed us to handle large datasets effectively with limited memory resources. By combining these strategies, we achieved improved performance and reduced resource requirements when processing and analyzing the clickstream data.
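A minimal sketch of this two-pronged approach; the file names and the example query are illustrative assumptions rather than the project's actual code.

```python
import duckdb

con = duckdb.connect()  # in-memory DuckDB instance

# One-off conversion of the raw CSV (file name illustrative) to Parquet,
# done inside DuckDB so the full dataset never has to fit in a DataFrame.
con.execute("""
    COPY (SELECT * FROM read_csv_auto('2020-Jan.csv'))
    TO 'events.parquet' (FORMAT PARQUET)
""")

# Example analytical query executed directly against the Parquet file.
daily = con.execute("""
    SELECT CAST(event_time AS DATE) AS day,
           COUNT(DISTINCT user_session) AS visits,
           SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchases
    FROM 'events.parquet'
    GROUP BY 1
    ORDER BY 1
""").df()
```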
For the creation of tables and the execution of queries, the process can be described as follows:
- Creating the Product Views Graph Table: The first step is to create a table called "product_views_graph" from the clickstream data. This table represents the interactions between products, where each row corresponds to a view event, and includes columns such as "product_id" and "next_viewed_product_id" to capture the relationships between products.
Figure 3.9 Example for view of all products of a user in a session
- Directed Graph Query: Once the "product_views_graph" table is created, we can execute a directed graph query. This query selects pairs of product IDs from the table, where each pair represents a directed edge in the graph; a condition ensures that only distinct pairs are selected, eliminating self-loops.
- Directed Weighted Graph Query: Building on the previous query, the directed weighted graph query extends the analysis by calculating the count of occurrences for each pair of product IDs. This count represents the frequency of the edge in the graph, indicating how often one product is viewed after another.
Figure 3.11 Directed Graph with Weight
- Undirected Weighted Graph Query: In the next step, an undirected weighted graph is constructed. This query selects pairs of product IDs in both directions, ensuring that each pair is represented only once: the larger product ID is assigned to "pid_1" and the smaller product ID to "pid_2". As in the directed weighted graph query, the count of occurrences is included to capture the frequency of each edge in the graph.
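The three queries described above can be sketched in DuckDB SQL as follows. The `product_views_graph` table is rebuilt here from the Parquet file purely for illustration; table, column, and file names follow the description above, but the exact SQL is an assumption rather than the project's verbatim code.

```python
import duckdb

con = duckdb.connect()

# Illustrative construction of the view-transition table described above:
# one row per (product viewed, next product viewed) pair within a session.
con.execute("""
    CREATE TABLE product_views_graph AS
    SELECT product_id,
           LEAD(product_id) OVER (PARTITION BY user_session ORDER BY event_time)
               AS next_viewed_product_id
    FROM 'events.parquet'
    WHERE event_type = 'view'
""")

# Directed graph: one distinct edge per ordered pair, self-loops removed.
directed_graph = con.execute("""
    SELECT DISTINCT product_id AS pid_1, next_viewed_product_id AS pid_2
    FROM product_views_graph
    WHERE next_viewed_product_id IS NOT NULL
      AND product_id <> next_viewed_product_id
""").df()

# Directed weighted graph: same edges, weighted by transition frequency.
directed_weighted_graph = con.execute("""
    SELECT product_id AS pid_1, next_viewed_product_id AS pid_2,
           COUNT(*) AS occurrence_ct
    FROM product_views_graph
    WHERE next_viewed_product_id IS NOT NULL
      AND product_id <> next_viewed_product_id
    GROUP BY 1, 2
""").df()

# Undirected weighted graph: each unordered pair kept once (larger ID first),
# with the combined count of co-occurrences in both directions.
undirected_weighted_graph = con.execute("""
    SELECT GREATEST(product_id, next_viewed_product_id) AS pid_1,
           LEAST(product_id, next_viewed_product_id)    AS pid_2,
           COUNT(*) AS occurrence_ct
    FROM product_views_graph
    WHERE next_viewed_product_id IS NOT NULL
      AND product_id <> next_viewed_product_id
    GROUP BY 1, 2
""").df()
```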
Figure 3.12 Undirected Graph with Weight
Next, we can consider the following points:
- Scalability: The clickstream dataset we are working with contains more than 55 million rows. Constructing a graph with all possible pairs of product IDs would result in an extremely large graph, which may not be computationally feasible to process and analyze. By selecting only pairs in both directions and filtering out self-loops, we significantly reduce the number of edges and make the graph more manageable.
- Pair representation: By representing each pair only once, with the larger ID assigned to "pid_1" and the smaller ID assigned to "pid_2", we ensure that each pair is uniquely identified. This representation is consistent and allows for easier graph analysis and interpretation.
- Edge frequency: The inclusion of the count of occurrences (occurrence_ct) captures the frequency of each edge. This information can be used to understand the strength of the connection between products based on the number of co-occurrences in the clickstream dataset.
Figure 3.13 Middle and Remaining proportion
Excluding the middle portion, which accounts for 23.16% of the dataset, appears reasonable and acceptable for several reasons. Firstly, the proportion is relatively small, making exclusion a sensible choice if it does not significantly distort data patterns and relationships. Additionally, if the middle portion lacks distinct characteristics compared to the rest of the data, its exclusion may not greatly affect the final graph representation. Furthermore, if downstream tasks and models built on the graph are not significantly affected by this exclusion, the approach is further justified. Overall, this method of constructing an undirected weighted graph provides a practical and scalable solution for analyzing the clickstream dataset, striking a balance between efficiency and meaningful representation while capturing crucial product connections.
To enable performance evaluation, three graphs are created, "directed_graph", "directed_weighted_graph", and "undirected_weighted_graph", each representing different product relationships based on user interactions. The "directed_graph" connects products based on their order within user sessions, without considering frequency. The "directed_weighted_graph" adds edge weights representing occurrence frequency. The "undirected_weighted_graph" connects products based on co-occurrence and includes edge weights based on frequency. For item-item recommendation, the "undirected_weighted_graph" is used to identify similar products based on co-occurrence patterns, enhancing the accuracy of recommendations. This transformation of clickstream data into a meaningful graph representation allows insightful analyses of product relationships in eCommerce.
Random Walks algorithm
To generate node embeddings, random walks are applied to the constructed graph, simulating graph traversal to capture local structure and node relationships. Algorithms like DeepWalk and Node2Vec are used to perform the random walks and learn the embeddings. Combining DeepWalk and Node2Vec can improve embedding quality in certain cases: one approach uses DeepWalk to generate initial embeddings that capture local structure and then refines them with Node2Vec to incorporate more global information, by using the DeepWalk embeddings as initialization for Node2Vec and fine-tuning the model. Here is one way to do it:
1 Use DeepWalk to generate initial embeddings that capture the local structure of the graph
2 Refine the embeddings generated by DeepWalk using Node2Vec, which can incorporate more global information
3 Finally, fine-tune the embeddings using Skip-Gram to further improve their quality
By sequentially training DeepWalk and Node2Vec, we can take advantage of the strengths of both algorithms and potentially achieve better graph embeddings: DeepWalk captures the local structure of the graph, Node2Vec incorporates more global information, and the embeddings are finally fine-tuned with Skip-Gram to improve their quality. A minimal sketch of this pipeline follows.
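The sketch below uses networkx for the item graph and gensim's skip-gram Word2Vec for the embedding training. The walk generators, the toy edges, and all parameter values are illustrative; the p and q biases mimic Node2Vec's return and in-out parameters, and the second training pass stands in for the refinement step described above.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def deepwalk_walks(G, num_walks=10, walk_length=20):
    """Uniform (weight-proportional) random walks, DeepWalk-style."""
    walks = []
    nodes = list(G.nodes())
    for _ in range(num_walks):
        random.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                weights = [G[walk[-1]][n].get("weight", 1.0) for n in nbrs]
                walk.append(random.choices(nbrs, weights=weights)[0])
            walks.append([str(n) for n in walk])
    return walks

def node2vec_walks(G, num_walks=10, walk_length=20, p=1.0, q=0.5):
    """Biased second-order walks, Node2Vec-style (q < 1 favours exploration)."""
    walks = []
    for _ in range(num_walks):
        for start in G.nodes():
            walk = [start]
            while len(walk) < walk_length:
                cur = walk[-1]
                nbrs = list(G.neighbors(cur))
                if not nbrs:
                    break
                if len(walk) == 1:
                    walk.append(random.choice(nbrs))
                    continue
                prev = walk[-2]
                # Return / in-out bias relative to the previous node.
                weights = [1.0 / p if n == prev
                           else 1.0 if G.has_edge(n, prev)
                           else 1.0 / q
                           for n in nbrs]
                walk.append(random.choices(nbrs, weights=weights)[0])
            walks.append([str(n) for n in walk])
    return walks

# Toy undirected weighted item graph standing in for the co-occurrence graph.
G = nx.Graph()
G.add_weighted_edges_from([(1, 2, 5.0), (2, 3, 2.0), (1, 3, 1.0)])

# 1) Skip-Gram training on DeepWalk-style walks, 2) continued training on
# Node2Vec-style walks to fold in more global structure.
model = Word2Vec(deepwalk_walks(G), vector_size=64, window=5, sg=1, min_count=1)
n2v = node2vec_walks(G)
model.build_vocab(n2v, update=True)
model.train(n2v, total_examples=len(n2v), epochs=5)

item_embedding = model.wv["2"]  # embedding vector for item id 2
```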
Figure 3.14 Final results of embedding vectors
The specific numbers and embedding vectors obtained are determined by the chosen algorithms, DeepWalk and Node2Vec, and their respective parameter settings. Both are random-walk-based graph embedding algorithms that aim to capture structural information and node similarities in a graph.
The numbers and vectors are the result of the optimization process performed by these algorithms to learn meaningful representations of the nodes in the graph.
Determining what constitutes a good result when evaluating graph embeddings depends on the specific task or objective. Here are some factors to consider when assessing the quality of graph embeddings:
a. Nearest Neighbors:
The embeddings should capture the similarity and relationships between nodes in the graph. For a given node, the nearest neighbors based on the embeddings should be semantically similar or closely related nodes in the graph. A good result shows that the nearest neighbors are meaningful and relevant.
b. Similarity Scores:
If we evaluate the similarity between pairs of nodes, for example using cosine similarity, a good result indicates that nodes with a higher similarity score are indeed more similar in the desired aspects. The similarity scores should align with our expectations or ground truth.
Looking at a specific case, here are some insights based on the provided output:
Nearest neighbors for product ID 17301504:
- The nearest neighbors for product ID 17301504 are [17301504, 17301039,
- These products have similar embeddings, indicating that they are likely to be related or have similar characteristics based on the embedding space
- This information can be useful for tasks such as recommendation systems, where recommending similar products to users can be beneficial
Cosine similarity between product ID 17301504 and product ID 17301505:
- The cosine similarity between product ID 17301504 and product ID
- This similarity score indicates that the embeddings of these two products are relatively close in the embedding space
- The higher the cosine similarity, the more similar the products are in terms of their embedding representations
These insights can be used to understand the relationships and similarities between different products based on their embeddings.
c. Visualization:
Visualizing the embeddings using dimensionality reduction techniques like t-SNE can provide insights into the structure and clustering of nodes in the graph A good result would show clear clusters or patterns in the visualization, indicating that the embeddings capture meaningful relationships between nodes
Figure 3.16 Node2Vec Embeddings Visualization
Visualization with UMAP
After obtaining the node embeddings, we can apply techniques like UMAP and K-means to help visualize and analyze the embeddings in a lower-dimensional space while preserving the underlying structure and relationships between items.
We obtain a new dataset with an embedding vector for each item from the model above:
Figure 3.17 New dataset is generated from embedding vectors
Then, with the support of UMAP (which is very effective for visualizing clusters or groups of data points and their relative proximities), we can visualize these categories as clusters.
Figure 3.18 Category-code in level 1 visualization
In this project, K-means is utilized as a classic clustering algorithm, while UMAP serves as a dimensionality reduction technique to construct a low-dimensional representation of high-dimensional data. UMAP can be used as a preprocessing step before clustering with K-means to enhance clustering performance by reducing data dimensionality.
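A minimal sketch of using UMAP as a preprocessing step before K-means, as described above; the random matrix stands in for the item embedding matrix, and all parameter values are illustrative.

```python
import numpy as np
import umap
from sklearn.cluster import KMeans

# Stand-in for the (n_items x 64) matrix of item embedding vectors.
embeddings = np.random.rand(1000, 64).astype("float32")

# Reduce to 2D with UMAP (parameter values are illustrative).
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
coords = reducer.fit_transform(embeddings)

# Cluster in the reduced space with K-means; the number of clusters is illustrative.
labels = KMeans(n_clusters=8, n_init=10, random_state=42).fit_predict(coords)
```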
Apart from its primary purpose, UMAP is also employed for visualizing the category code information. Although the category code itself may not require dimensionality reduction, UMAP's visualization capabilities help identify hidden patterns, clusters, or relationships within the data. This visualization enables a better understanding of how different categories relate to each other, uncovering potential correlations and trends in customer behavior.
Moreover, the UMAP visualization of the category code aids effective communication with stakeholders and decision-makers. By presenting complex findings in a visually appealing manner, it facilitates better comprehension and informed decision-making in the e-commerce domain.
Although UMAP's visualization does not directly impact the underlying algorithms, it adds significant value by improving the interpretability and communication of results. This step enhances the overall project by providing meaningful insights into the dataset and its implications for e-commerce operations.
Embedding Vector Search with FAISS
Once the embeddings are available, we can use FAISS [29] to build an efficient search index for the embedding vectors. FAISS enables fast and efficient similarity search by indexing the embeddings and providing algorithms for nearest neighbor search. This step allows us to quickly find the most similar items for a given item based on their embedding vectors.
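A minimal FAISS sketch covering both metrics discussed below: an L2 (Euclidean) index, and a cosine-similarity search implemented with normalized vectors and an inner-product index. The random matrix stands in for the item embedding matrix.

```python
import numpy as np
import faiss

d = 64
embeddings = np.random.rand(10000, d).astype("float32")  # stand-in item vectors

# L2 (Euclidean) index.
index_l2 = faiss.IndexFlatL2(d)
index_l2.add(embeddings)

# Cosine-similarity search: normalize vectors and use an inner-product index.
normed = embeddings.copy()
faiss.normalize_L2(normed)
index_ip = faiss.IndexFlatIP(d)
index_ip.add(normed)

# Top-10 nearest neighbours of the first item under each metric.
dist_l2, idx_l2 = index_l2.search(embeddings[:1], 10)
sim_cos, idx_cos = index_ip.search(normed[:1], 10)
```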
Figure 3.19 Cosine Similarity and L2 scores
The performance of the FAISS-based embedding vector search is evaluated using cosine similarity and the L2 score [30]. Cosine similarity measures the similarity between vectors, with a higher score indicating greater similarity. The L2 score, also known as Euclidean distance, measures the distance between vectors; a smaller score indicates higher similarity.
Comparing the results of both metrics provides valuable insights into the algorithm's performance. Negative cosine-similarity scores arise because FAISS returns raw inner-product "distances", which can be negative. The nearest neighbors may differ between cosine similarity and L2 distance, emphasizing the importance of using multiple distance metrics.
In the dataset analysis, we observed differences in nearest neighbors between the two metrics, with some vectors being closer according to L2 distance but not necessarily according to cosine similarity. However, the nearest neighbors belong to the same category, suggesting similarity in features.
Although these results give an indication of the embedding vectors' quality, it is essential to consider multiple distance metrics for a comprehensive evaluation of vector relationships.
Evaluation of the Recommendation System
The performance of the recommendation system can be evaluated using rank-aware top-N metrics [31]. These metrics assess the system's ability to provide accurate and relevant recommendations: by comparing the recommended items with the ground-truth data (user interactions or ratings), the system's ranking of relevant items in the top-N recommendations is measured. The evaluation metrics are calculated for different recommendation scenarios (top 5, top 10, top 20, top 50, and top 100) on both the train and test sets. The metrics are defined as follows:
- Precision@k:
• Formula: number of recommended relevant items @k / k
- Recall@k:
• Formula: number of recommended relevant items @k / number of relevant items
- MRR@k (Mean Reciprocal Rank):
• Formula: average of the reciprocal ranks of the first relevant item in the top-k recommendations
- nDCG@k (Normalized Discounted Cumulative Gain):
• Formula: average of the normalized nDCG values for each user
For the performance results, we present the metrics in tabular form for each recommendation scenario (top 5, top 10, top 20, top 50, and top 100) on both the train and test sets, including the mean values for each metric.
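Under these definitions, the per-user metrics can be computed roughly as follows (a minimal sketch with binary relevance; in practice the values are averaged over all users and reported for each k).

```python
import numpy as np

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return len(set(recommended[:k]) & set(relevant)) / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    return len(set(recommended[:k]) & set(relevant)) / max(len(relevant), 1)

def mrr_at_k(recommended, relevant, k):
    """Reciprocal rank of the first relevant item in the top-k (0 if none)."""
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance nDCG over the top-k recommendations."""
    dcg = sum(1.0 / np.log2(rank + 1)
              for rank, item in enumerate(recommended[:k], start=1)
              if item in relevant)
    ideal = sum(1.0 / np.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

# Example: one user's top-5 recommendations against their held-out interactions.
recs = [101, 205, 333, 404, 505]
truth = {205, 404, 999}
scores = {fn.__name__: fn(recs, truth, 5)
          for fn in (precision_at_k, recall_at_k, mrr_at_k, ndcg_at_k)}
```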
By presenting the performance metrics in this manner, the tables below provide a comprehensive overview of the recommendation system's effectiveness across the different recommendation scenarios and on both the train and test sets.
The evaluation results provide insights into the model's performance across different recommendation levels (5, 10, 20, 50, 100). The metrics used include precision, recall, mean reciprocal rank (MRR), and normalized discounted cumulative gain (nDCG).
- Precision@k values indicate higher accuracy on the train set than on the test set, suggesting potential overfitting or an imbalance in user interactions within the dataset; the model accurately recommends relevant items in the train set
- Recall@k values show good performance on both the train and test sets, with higher values for the train set; the model recommends a significant number of relevant items
- MRR@k values are high, indicating that the model effectively places relevant items early in the recommendation list
- nDCG@k values demonstrate good performance on both sets, with higher values for the train set; the model's recommendations rank relevant items higher in the list
As we can see, larger k values (top 50, 100) lead to lower metrics, as recommending more items increases the chance of including irrelevant ones, impacting precision and MRR. The trade-off between recall and precision should be considered when selecting k.
We then examine the reason for the lower mean precision score. It may not necessarily be attributable to overfitting in this context; instead, it is likely due primarily to the imbalance in user interactions within the dataset. As the number of user interactions varies significantly across items, the precision metric is affected, since it measures the proportion of relevant recommendations among all recommendations.
Figure 3.20 An imbalance in user interactions
There is a significant imbalance in the number of interactions among users: the user with the most interactions has 805 times more interactions than the user with the fewest. A few users have a significantly higher number of interactions than the rest, which indicates a potential imbalance in user engagement or preferences within the dataset. This imbalance can affect the precision score of the recommendation system, since it leads to a skewed representation of user preferences and limited exposure to less active users' reactions.
In graph-based recommendation systems, achieving high precision and mean reciprocal rank (MRR) values together is uncommon when user interactions are imbalanced. The system may struggle to accurately predict relevance for items with limited data, leading to a lower precision score; however, MRR can still be relatively high if the system effectively ranks the most relevant items among sparse interactions. Understanding this trade-off helps optimize the system by fine-tuning graph construction, embeddings, and ranking strategies to provide more accurate recommendations despite imbalanced user interactions.
The figure below shows that this technique can recommend items similar or relevant to the item currently being viewed by the user:
Figure 3.21 List of results for a specific user
Based on the provided user data and the associated product categories, the graph-based approach appears to recommend relevant items to users.
For this user, all the recommended products belong to the category "apparel.shoes.sandals". This indicates that the graph-based approach effectively identifies and recommends items that are similar or related to the products viewed by the user: the system understands the user's preference for sandals and suggests similar products within that category. In conclusion, the fact that the recommended items align with the categories of the products the user viewed suggests that the graph-based approach is effective at recommending relevant items based on item-item similarity.
We also have a formula for calculating the percentage of true results:
% True Results = (Total Correct Recommendations / Total Interactions) * 100
- Total Correct Recommendations: total number of recommendations that were deemed relevant by the users
- Total Interactions: total number of interactions or items that the users have engaged with
The percentage of true results indicates that approximately 56.25% of the recommendations made by the model on the test set were relevant to the users. This means that more than half of the recommended items were actually items that the users found useful or interesting.
In conclusion, the graph-based recommendation system has demonstrated its effectiveness in understanding user preferences and providing relevant recommendations. By accurately identifying and recommending items within the same category as the products viewed by the user, the system shows its ability to capture similarities and relationships between products. Moreover, with approximately 56.25% of the recommendations on the test set being relevant to users, the model's performance is validated, indicating that more than half of the suggested items were useful and interesting to the users. These results affirm the system's capability to deliver personalized and meaningful recommendations, enhancing overall user experience and satisfaction.
EXPERIMENTAL RESULTS
All Machine Learning (ML) Models
The goal is to predict whether a product added to the cart is actually purchased by the customer, based on factors such as its category, the weekday of the event, and the user's activity in that session.
We added some new features in training data:
• event_weekday - weekday of the event
• activity_count – number of activities in that session
• is_purchased - This field indicates whether the item, which was added to the cart, has been purchased
The training dataset contains every non-duplicated cart transaction with the above-mentioned features. We will use these features, together with the original price and brand, to predict whether the customer will eventually purchase the item in the cart.
During the data preparation phase for the machine learning models, the dataset was divided into two sets: "is_purchase_set" and "not_purchase_set". The "is_purchase_set" contains instances where a purchase occurred, while the "not_purchase_set" contains instances where no purchase was made. Their sizes were determined using the shape attribute, with "is_purchase_set" having 676,476 rows and "not_purchase_set" having 841,517 rows.
The goal is to predict whether a product added to the cart will be purchased, based on various factors. New features were added to the training data, including category, subcategory, weekday of the event, activity count, and whether the item was purchased. For data preparation, the dataset was divided into "is_purchase_set" (50%) and "not_purchase_set" (50%); the proportions of purchase and non-purchase instances were calculated, revealing an equal distribution of both categories. This balanced distribution is crucial for training unbiased machine learning models. The dataset was further split into training and test sets using an 80/20 ratio, allowing the models to learn patterns from a large portion of the data while being evaluated on an unseen subset. This practice supports the models' generalization and provides insight into their predictive capabilities.
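A minimal sketch of this purchase-prediction setup, using XGBoost as one of the stronger baselines discussed below; the `cart_df` dataframe, the feature list, the encoding scheme, and the hyperparameters are illustrative assumptions rather than the project's exact configuration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

# `cart_df` is assumed to hold the balanced cart-level dataset described above,
# with the engineered features and the is_purchased label.
features = ["category_id", "event_weekday", "activity_count", "price", "brand"]
X = cart_df[features].copy()
y = cart_df["is_purchased"]

# Encode string-valued columns as integer codes (a sketch; a real pipeline
# might prefer target encoding or embeddings for high-cardinality categories).
for col in ["category_id", "brand"]:
    X[col] = X[col].astype("category").cat.codes

# 80/20 split, stratified to keep the 50/50 class balance in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1,
                      eval_metric="logloss")
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
```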
Figure 4.1 Correlation among categories
The figure illustrates two main points:
- Correlation between Category Attributes: Category attributes in a dataset often have strong correlations, leading to co-occurrence and similarity between certain categories. This can introduce redundancy and noise, affecting the performance of recommendation systems and data analysis tasks. However, the "is_purchased" attribute has low correlation with the others, making it an important target attribute for analysis.
- Dimensionality Reduction: To address the issue of correlated category attributes, dimensionality reduction techniques like T-SNE, UMAP, and PCA are employed. These techniques transform the categorical data into a lower-dimensional space, capturing the underlying structure and relationships between categories while reducing redundancy and noise.
Using these dimensionality reduction techniques can enhance the performance of recommendation systems and other data analysis tasks by mitigating the impact of correlated category attributes. The table below provides the performance results of all dimensionality reduction techniques on the test set.
Table 4.1 Comparison among dimensional reduction techniques
The overall insights from the scores are as follows:
a) Without dimensionality reduction:
• Model performance varies, but generally, scores are relatively low across all evaluation metrics
• Decision tree-based models (entropy and gini) and XGBoost Classifier perform better compared to other models
• Logistic Regression and Gaussian Naive Bayes show lower scores, indicating poorer performance
b) PCA dimensionality reduction:
• PCA does not significantly improve model performance
• XGBoost Classifier performs the best among models with PCA reduction, achieving higher scores in most evaluation metrics
• Gaussian Naive Bayes shows high recall but low precision, suggesting a potential bias towards identifying positive cases
c) T-SNE dimensionality reduction:
• T-SNE dimensionality reduction does not lead to notable improvement in model performance
• XGBoost Classifier performs relatively better among models with T-SNE reduction
• Gaussian Naive Bayes exhibits high recall but low precision, similar to the results without dimensionality reduction
d) UMAP dimensionality reduction:
• UMAP dimensionality reduction does not result in significant improvements in model performance
• XGBoost Classifier shows relatively better performance among models with UMAP reduction
• Gaussian naive bayes has a relatively low precision score, indicating a higher false positive rate
Overall, the results suggest that the models struggle to achieve high accuracy, precision, and recall scores. Decision-tree-based models and the XGBoost Classifier generally perform better than the other models; however, the overall performance is not exceptional, indicating the need for further improvements or alternative modeling approaches. XGBoost is well suited to clickstream datasets due to its ability to handle nonlinear relationships, its regularization techniques, gradient boosting, and its handling of imbalanced data. It can effectively capture complex patterns and make accurate predictions based on clickstream data. Additionally, XGBoost's feature importance analysis provides insights into the factors influencing user behavior, making it valuable for understanding clickstream data and enhancing recommendation systems. Its strong performance across domains and its flexibility in handling different types of data make it a popular choice for analyzing clickstream datasets.
In addition, when comparing the results of the graph-based technique with the scores obtained from the ML classifier models, we can observe the following:
Table 4.2 Comparisons among ML models and Graph-based approach
(Table 4.2 columns: Model, Recall@5, Recall@10, Recall@20, Recall@50, Recall@100; rows cover Logistic Regression and the other models discussed below.)
The table presents the evaluation results of various machine learning (ML) models and the graph-based model. Due to the imbalance in user interactions, the precision scores of the graph-based model are relatively low. The ML models compared against the graph-based model are Logistic Regression, Decision Tree Classifier, Random Forest Classifier, AdaBoost Classifier, XGBoost Classifier, Gaussian Naive Bayes, Gradient Boosting Classifier, and a Sequential model.
Overall, the ML models show better precision scores than the graph-based model. However, due to the imbalance in user interactions, the precision values are not very high for any of the models, which makes it difficult to accurately recommend the top few items to users. The graph-based model exhibits higher recall scores, suggesting that it can effectively identify relevant items, but at the cost of including more irrelevant items in the recommendations.
It is worth noting that the imbalance in user interactions contributes to the lower precision scores for all models. Addressing this imbalance could potentially improve the precision of both the ML models and the graph-based model. Additionally, fine-tuning the graph-based model and exploring different graph construction approaches may help enhance its recommendation performance.
Association Rules
Association rules are generated using the Apriori algorithm for the clickstream data. The data is first grouped by user session and user ID, resulting in a basket of unique product IDs for each session. The association rules are then generated for each chunk of the data.
Figure 4.2 Each chunk in Association Rules algorithm
The dataset is divided into chunks, and the "fpgrowth" function is used to generate association rules between product baskets and the next product. These rules provide insights into the likelihood of customers purchasing related items together.
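A minimal sketch of the basket construction and rule mining described above, using mlxtend's fpgrowth; in the project this is applied per chunk, and the dataframe `df`, the support, and the confidence thresholds are illustrative assumptions.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# `df` is assumed to be (one chunk of) the event log; each session becomes a
# basket of distinct product IDs.
baskets = (df.groupby(["user_id", "user_session"])["product_id"]
             .apply(lambda s: sorted(set(s)))
             .tolist())

# One-hot encode the baskets for fpgrowth.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)

# Frequent itemsets and the rules derived from them; thresholds are illustrative.
itemsets = fpgrowth(onehot, min_support=0.001, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.2)
rules = rules.sort_values("confidence", ascending=False)
```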
By applying these associations, the system can offer personalized recommendations based on customers' previous purchase history. For example, if a customer buys product id 4804055, there is a 44.1% probability they will also buy product id 4804056 (and similarly for other product ids), indicating a strong association. Leveraging these associations enhances the customer experience by providing tailored recommendations.
After that, we create the training and test sets: the dataset was split with a test size of 20%, and the resulting sets were stored in the variables "train_data" and "test_data", respectively.
Figure 4.3 Result in Association Rule algorithm
Association rule mining achieved a 35% true-result rate, indicating the probability of customers purchasing related items together. It also has limitations, such as being restricted to binary data and frequent item sets, facing the cold-start problem, scalability issues, and a lack of personalization. The graph-based approach outperformed it, achieving a higher true-result rate of 56.3%, which suggests that the graph-based approach is more effective at capturing complex item-item relationships and providing accurate recommendations based on item similarity. While association rule mining offers valuable insights into customer behavior, the graph-based approach surpasses it in terms of personalized and relevant recommendations. Each technique has its strengths and limitations, and the choice between them depends on the specific requirements and characteristics of the recommendation system.
Traditional Recommendation Techniques
The summary highlights the implementation of a traditional recommendation system involving various techniques: User-based Collaborative Filtering, Item-based Collaborative Filtering, Matrix Factorization (SVD++ (SVDpp) and Non-negative Matrix Factorization (NMF)), and hybrid recommenders. SVDpp extends traditional SVD by considering implicit feedback, improving recommendation accuracy, while NMF enforces non-negativity constraints, making it useful for non-negative data and providing interpretable representations. Both methods are widely used due to their effectiveness in capturing user-item interactions and generating high-quality recommendations. The dataset is loaded, and popular products are recommended based on their frequency. Collaborative filtering techniques are used to make recommendations, considering user and item similarities; matrix factorization is employed to decompose the user-item interaction matrix; and hybrid recommendations are generated by combining different techniques. Evaluation metrics, MRR@20 and Precision@20, are used to assess the performance of the models.
The evaluation metrics, MRR@20 and Precision@20, are explained in detail
MRR@20 measures the average reciprocal rank of the first relevant item in the top 20 recommendations, indicating the ranking quality of relevant items. Precision@20 calculates the proportion of relevant items among the recommended items, assessing the precision of the recommendations. The steps to calculate MRR@20 and Precision@20 are outlined, treating relevant events like "views" or "purchases" as indications of relevance.
Overall, the summary provides a comprehensive overview of the traditional recommendation system implementation, including its techniques, evaluation metrics, and the use of relevant events for rating values. The results are presented graphically to provide a clear representation of the systems' performance.
Figure 4.4 Traditional Recommendation System Models Results
The Test Precision@20 being equal to 1.0 for all models indicates that each model successfully recommended at least one relevant item within the top 20 recommendations for each user in the test set. This high precision value suggests a high level of accuracy in retrieving relevant items for users. The similarity in precision values across all algorithms might be because they use the same set of recommendations and ground-truth items for evaluation: if the recommendations from all algorithms closely overlap with the ground-truth items, similar precision values result. It is also possible that the precision values were rounded to the same value when printed. The evaluation metrics are based on predictions made by each algorithm on the same test set, created using the train_test_split() method, leading to consistent evaluation conditions for all algorithms.
Besides, the code implements a hybrid recommendation system that combines User-based Collaborative Filtering (CF) and SVD++ (SVDpp), which extends SVD with implicit feedback. It performs the following steps (a minimal sketch follows the list):
- Data Preparation: The dataset is loaded and preprocessed to extract purchase events A constant rating value of 1 is assigned to purchase events, and a sample size of 50,000 is considered for the analysis
- Popularity-based Recommendations: Popular products are recommended based on their frequency in the dataset
- User-based Collaborative Filtering: A User-based CF model using KNNBasic with cosine similarity is trained on the data and tested on the test set
- Item-based Collaborative Filtering: An Item-based CF model using KNNBasic with cosine similarity is trained on the data and tested on the test set
- SVDpp: A matrix factorization model SVDpp is trained on the data and tested on the test set
- NMF: A Non-negative Matrix Factorization model NMF is trained on the data and tested on the test set
- Evaluation Metrics: MRR@20 and Precision@20 are calculated to assess the performance of each model on the test set
- Hybrid Recommender: The hybrid model combines predictions from User-based CF and SVDpp using a weighted average approach (0.5 for each model) MRR@20 and Precision@20 are calculated for the hybrid model on the test set The goal is to provide accurate and personalized product recommendations to users
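As referenced above, the following is a minimal sketch of the collaborative-filtering and hybrid steps using the Surprise library; the `purchases` dataframe, the 0.5/0.5 weighting, and the parameter choices are illustrative assumptions, and ranking metrics such as MRR@20 and Precision@20 would be computed on top of these predictions.

```python
from surprise import Dataset, Reader, KNNBasic, SVDpp
from surprise.model_selection import train_test_split

# `purchases` is assumed to hold the sampled purchase events described above;
# a constant implicit rating of 1 is assigned to each (user, item) purchase.
ratings = purchases[["user_id", "product_id"]].drop_duplicates()
ratings["rating"] = 1.0

data = Dataset.load_from_df(ratings, Reader(rating_scale=(0, 1)))
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# User-based collaborative filtering with cosine similarity.
user_cf = KNNBasic(sim_options={"name": "cosine", "user_based": True})
user_cf.fit(trainset)

# SVD++ matrix factorisation (implicit feedback aware).
svdpp = SVDpp()
svdpp.fit(trainset)

# Hybrid score: simple 0.5/0.5 average of the two models' predicted ratings.
def hybrid_predict(uid, iid):
    return 0.5 * user_cf.predict(uid, iid).est + 0.5 * svdpp.predict(uid, iid).est
```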
A table shows the comparison among models:
Table 4.3 Comparisons among traditional and graph-based models
Based on the evaluation results of the different recommendation models, several conclusions can be drawn. The User-based CF and Item-based CF models achieved perfect performance with an MRR@20 and Precision@20 of 1.0, indicating successful ranking of relevant items for users; however, validation on diverse datasets is essential to confirm generalization. The SVDpp and NMF models showed reasonable MRR@20 values of 0.57 and 0.82, respectively, ranking relevant items well. The graph-based model achieved a high MRR@20 of 0.93, but its low Precision@20 (0.14) was affected by the imbalance in user interactions. The hybrid model combining User-based CF and SVDpp achieved a competitive MRR@20 of 0.56, providing accurate recommendations. Further enhancements could address the precision and imbalance issues; fine-tuning and advanced hybridization techniques may lead to improved results, considering factors like model compatibility, data representation, and algorithm complexity. A better understanding of user preferences can inform the design of a more effective hybrid recommendation system tailored to specific needs.
Sequence Models for Session-Level Data
Comparing graph-based recommendation models with session-based models is important for understanding their strengths and suitability for different recommendation tasks. Graph-based models excel at capturing complex patterns and item-item or user-user similarities, making them ideal for scenarios with explicit user-item interactions. Session-based models, on the other hand, focus on capturing sequential patterns and short-term preferences, making them suitable for real-time recommendations based on sequential data. By comparing these models, we can identify which approach best fits the data and the recommendation requirements. Additionally, this comparison opens up opportunities for developing hybrid systems that combine the strengths of both models for improved recommendations.
4.4.1 NARM a) Data Preparation:
- The data is split into train, validation, and test sets based on session IDs
- Session sequences are preprocessed by sorting them in ascending order of time
- Input sequences are created by removing the last item from each sorted session sequence
- Labels are created as the last item in each session sequence
- The original dataset has 55,866,594 rows, the train dataset has 43,198,369 rows, and the test dataset has 12,668,225 rows; therefore, the train dataset constitutes approximately 77.3% of the original dataset and the test dataset approximately 22.7%
b) Vocabulary and Indexing:
- A vocabulary class (VoC) is used to create dictionaries for mapping items to indices and vice versa
- The vocabulary class provides methods to add items, trim items based on a minimum count, and convert item sequences to index sequences
c) Evaluation Metrics:
- The model’s performance is evaluated using Recall@K and MRR@K metrics
- Recall@K measures the proportion of correctly predicted items within the top-K recommendations
- MRR@K (Mean Reciprocal Rank) measures the average reciprocal rank of the correctly predicted item within the top-K recommendations
d) Training the Model:
- The model is trained using the Adam optimizer and cross-entropy loss
- The learning rate is adjusted using a step scheduler
- During training, the model is evaluated on the validation set, and the best performing model is saved
e) Testing the Model:
- The trained model is loaded
- The test set is transformed into a data loader
- The model is used to make predictions on the test set, generating the top-20 item indices
- The target item is checked for its presence in the top-20 predictions
Figure 4.5 Model and Result in NARM model
The NARM model was trained for 15 epochs, and the loss gradually decreased during training, indicating effective learning from the data. On the validation set, the model achieved a Recall@20 of 0.3445 and an MRR@20 of 0.0998. These metrics reflect the model's ability to recommend relevant items within the top-20 predictions: the recall value of 34.45% implies that the model correctly suggested 34.45% of the items that users actually clicked on, while the MRR value of 0.0998 (about 10%) indicates that the model tends to rank clicked items higher in the recommendation list. However, in a single-session test, the top 20 recommendations did not include the next clicked item, suggesting room for improvement in this aspect.
4.4.2 Heterogeneous Global Graph Neural Networks
a) Code Implementation Structure:
The code implementation follows a structured approach. It can be divided into the following steps:
Step A: Data Preparation
- Break sequences into sessions with an interval of 8 hours
- Truncate long sessions and remove infrequent items and short sessions
- Split sessions into train and test sets
- Create training samples from the session sequences
- Define the Session Dataset Class and Data Loader
Step B: Heterogeneous Global Graph Construction
- Create a co-occurrence matrix of items based on adjacency in sessions
- Create item and user similarity matrices based on item co-occurrence and user-item interactions
- Build the Heterogeneous Global Graph using the four types of edges mentioned above
• Create Co-occurrence matrix of items based on adjacency of items in same session:
Figure 4.6 Co-occurrence matrix of items based on adjacency of items in same session
• Create item and user similarity matrices based on whether items appear in the same session and whether users interact with the same set of items
Figure 4.7 Item & User Similarity matrix
• Create the Heterogeneous Global Graph using the four types of edges mentioned above (a sketch of the item-side matrices follows this list):
o Edge Type A: between two items, if the items are adjacent in the item sequence of the same session
o Edge Type B: between two items, if the items co-occur in the same session (item similarity)
o Edge Type C: between two users, if the users interact with similar items
o Edge Type D: between a user and an item, if the user interacted with the item
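As referenced above, the following is a minimal sketch of the item-side matrices from Step B (edge types A and B); the toy sessions are illustrative, and the user-side matrices and the final heterogeneous graph are built analogously.

```python
from collections import defaultdict
from scipy.sparse import coo_matrix

# Toy session data: each session is a sequence of item indices.
sessions = [[0, 1, 2], [1, 2, 3], [0, 2]]
n_items = 4

# Count adjacent item pairs (edge type A) and same-session co-occurrences (edge type B).
adjacent = defaultdict(int)
co_occur = defaultdict(int)
for seq in sessions:
    for a, b in zip(seq, seq[1:]):
        adjacent[(a, b)] += 1
    uniq = sorted(set(seq))
    for i, a in enumerate(uniq):
        for b in uniq[i + 1:]:
            co_occur[(a, b)] += 1

def to_sparse(counts, shape):
    """Turn a {(row, col): count} dict into a sparse matrix."""
    rows, cols = zip(*counts.keys())
    return coo_matrix((list(counts.values()), (rows, cols)), shape=shape)

adjacency_matrix = to_sparse(adjacent, (n_items, n_items))      # edge type A
cooccurrence_matrix = to_sparse(co_occur, (n_items, n_items))   # edge type B
```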
Step C: Model Training and Evaluation
The "HG_GNN" model is a graph-based recommendation system that utilizes GraphSAGE as the Graph Neural Network (GNN) to aggregate neighbor information. It includes two key learning modules: the Current Preference Learning Module, which generates embeddings based on the sequence of items in the current session, and the General Preference Learning Module, which generates embeddings based on user and session embeddings. The model is implemented as a class inheriting from the "nn.Module" class in PyTorch.
Key functionalities of the model include the initialization of parameters and layers, the forward pass definition, the training loop, and the evaluation of top-K performance metrics such as accuracy, MRR, and NDCG on the test set. The model uses graph convolutions, positional encoding, sequence encoding, and linear layers to predict item scores based on the encoded representations.
During training, the loss gradually decreases, indicating effective learning from the data. The model's performance on the validation set shows promising results, with a Recall@20 of 0.3445 and an MRR@20 of 0.0998, signifying its ability to recommend relevant items accurately. However, there is room for improvement, as the top 20 recommended items in a single session did not include the next clicked item; further optimization and fine-tuning could enhance this aspect of the model.
Figure 4.9 Model Training and Evaluation
As we can see, the training process continued for two more epochs (12 and 13), but there was no improvement in the model's performance on the test set. The training loss at the beginning of epoch 11 was 4.56, indicating the initial loss at that stage of training. Training terminated early after epoch 13, since no further improvement was observed in the model's performance.
b) Performance:
Table 4.4 Comparison among metrics in HG-GNN model
(Table 4.4 columns: Metrics, Top 5, Top 10, Top 20, Top 50.)
Figure 4.10 Results in HG-GNN model
The model's performance on the test set for different top-K recommendations (top-50, top-20, top-10, top-5) is summarized as follows:
- Accuracy (acc): The model achieves an accuracy of 46.76% for the top-20 recommendations, indicating that approximately 46.76% of the recommended items are relevant to the users
- Mean Reciprocal Rank (MRR): The model achieves an MRR of 0.1278 for the top-20 recommendations, suggesting that the relevant items are relatively well-ranked in the recommendations
- Normalized Discounted Cumulative Gain (NDCG): The model achieves an NDCG of 0.2034 for the top-20 recommendations, indicating that the recommended items are ranked relatively well and have higher relevance
- Precision (prec): The precision values range from 0.0234 for the top-20 recommendations to 0.0460 for the top-5 recommendations, suggesting room for improvement in recommending highly relevant items
Overall, the model shows decent performance in terms of accuracy, MRR, and NDCG for the top-K recommendations; however, improving precision could enhance the relevance of the recommendations. Among the top-K settings, the top-20 recommendations appear to perform best, striking a balance between accuracy, MRR, and NDCG. The choice of the "best" top-K value ultimately depends on the specific requirements and goals of the recommendation system and the desired user experience.
In addition, considering the NARM and HG-GNN models in the session-based recommendation context, we can compare their performance based on the reported results:
- NARM:
o Top-20: MRR@20 = 0.0998, Recall@20 = 0.3445
- HG-GNN:
o Top-20: accuracy = 0.4676, MRR = 0.1278, NDCG = 0.2034
o Top-10: accuracy = 0.3476, MRR = 0.1195, NDCG = 0.1730
o Top-5: accuracy = 0.2300, MRR = 0.1038, NDCG = 0.1350
These results indicate that NARM achieved a recall of 34.45% and an MRR of 9.98% for the top-20 recommendations on the validation set. Compared with the HG-GNN results above, HG-GNN outperforms NARM at the same cut-off in terms of both recall (hit rate) and MRR.
Table 4.5 Comparison between Sequence Models and Graph-based model
Model Graph-based Model NARM HG-GNN
The graph-based model consistently outperforms the session-based models (NARM and HG-GNN) across the evaluation metrics, achieving higher precision, recall, MRR, and nDCG values for the different recommendation list lengths (5, 10, 20, 50, 100). Between the two session-based models, HG-GNN in turn performs better than NARM on the Recall@20 and MRR@20 metrics. In summary, the graph-based model shows superior performance compared to the session-based models in terms of precision, recall, MRR, and nDCG across the various recommendation list lengths.
Graph-based machine learning and session-based recommender systems are both valid approaches that aim to improve recommendation performance. The former leverages graph structures to capture relationships and interactions between users and items, while the latter focuses on modelling user behaviour within individual sessions using sequential patterns and temporal dynamics.
The choice between these approaches depends on dataset characteristics, recommendation problem nature, and available resources Graph-based models excel in capturing long-range dependencies and considering the global context, suitable for complex user-item relationships On the other hand, session-based recommender systems are effective in capturing short-term preferences and exploiting sequential patterns.
DISCUSSION AND CONCLUSION
In conclusion, this research has made significant contributions to the field of graph-based recommendation systems. The results demonstrate successful graph construction from user behaviour data, enabling the representation of item-user relationships. Additionally, the application of graph embedding techniques, including DeepWalk and Node2Vec, together with FAISS, has proven effective in generating item embeddings and enabling efficient nearest-neighbour vector search. The developed recommendation system leverages these embeddings to offer personalized recommendations based on user preferences and behaviour.
While the research has achieved promising outcomes, there are areas that require further improvement and exploration. Addressing the limitations of the user-interaction dataset, in particular the modest precision observed in the evaluation metrics, is paramount. The scalability of the recommendation system must be thoroughly evaluated so that it can handle real-world scenarios with large numbers of users and items. Moreover, tackling the cold-start problem for new items with limited behaviour data may involve exploring content-based methods or leveraging metadata and item characteristics.
To provide valuable insights and practical implications, real-world deployment and user studies are essential. These efforts will shed light on user satisfaction and the overall effectiveness of the recommendation system, informing iterative improvements and optimizations.
The contributions of this research lie in the proposal of a novel graph-based approach that harnesses user behaviour data and graph embedding techniques. The evaluation and comparison of the proposed approach have provided valuable insights into its effectiveness and its advantages over the baseline methods. The consideration of scalability supports the development of recommendation systems capable of handling large-scale datasets with many users and items. Practical insights are expected from real-world deployment and user studies, which will provide feedback on system performance and user satisfaction.
Looking to the future, further research should focus on incorporating user feedback and implicit signals to improve accuracy and personalization. Exploring advanced embedding methods, such as deep representation learning and Graph Neural Networks, and combining multiple recommendation techniques can lead to enhanced performance. Reducing computational complexity while maintaining embedding quality is also important for efficiency.
To gain a comprehensive understanding of system effectiveness and usability, it is crucial to evaluate user satisfaction and real-world impact beyond traditional offline metrics. User studies, A/B testing, and feedback analysis will provide valuable insights into the system's performance and user experience.
It is essential to acknowledge the limitations of this research. Node embeddings learned via matrix factorization and random walks have inherent limitations, particularly the inability to produce embeddings for nodes not present in the training set. Deep representation learning and Graph Neural Networks offer viable solutions to this problem. Additionally, DeepWalk and Node2Vec embeddings may not capture structural similarity or incorporate node, edge, and graph features. Overcoming these limitations with such techniques will further enhance the effectiveness and capabilities of the recommendation system.
In addition, to store large-scale user behaviour data efficiently, the following solutions can be applied:
- Use distributed database systems like Apache Cassandra or Amazon DynamoDB for horizontal scaling and fault tolerance
- Implement data sharding to partition data across multiple nodes for parallel processing and improved performance
- Apply data compression algorithms to reduce storage requirements without compromising data integrity
- Archive historical data in cost-effective storage systems to free up space in the live database
- Consider cloud storage services like Amazon S3 or Google Cloud Storage for scalable and cost-efficient solutions
- Employ data indexing and partitioning techniques to enhance data retrieval speed (see the Parquet partitioning sketch after this list)
- Establish a data lifecycle management strategy to manage data growth effectively
- Implement load balancing to evenly distribute data across nodes and optimize resource utilization
- Use data replication for improved data availability and fault tolerance
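As one concrete illustration of the partitioning point above, behaviour events stored in Parquet (the format already used in this work) can be written partitioned by date, so that time-bounded queries only scan the matching files. The column names below are assumptions, and the snippet requires the pyarrow engine.

    import pandas as pd

    # Hedged sketch: write user-behaviour events partitioned by event date.
    events = pd.DataFrame({
        "user_id": [1, 1, 2],
        "product_id": [10, 11, 10],
        "event_time": pd.to_datetime(["2020-01-01 10:00", "2020-01-01 10:05", "2020-01-02 09:00"]),
    })
    events["event_date"] = events["event_time"].dt.date.astype(str)
    # Creates user_events/event_date=2020-01-01/..., user_events/event_date=2020-01-02/...
    events.to_parquet("user_events", partition_cols=["event_date"])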
By adopting these measures, the recommendation system can handle large volumes of user behaviour data while ensuring scalability, performance, and cost-effectiveness.
In conclusion, this research has laid a strong foundation for the advancement of graph-based recommendation systems By addressing challenges, offering practical insights, and suggesting future research directions, this work contributes to the growing body of knowledge in the field and paves the way for more accurate, personalized, and efficient recommendation systems.
REFERENCES
[1] M. Kechinov, “eCommerce behavior data from multi category stores,” Kaggle: eCommerce behavior data from multi category store & REES46 Marketing Platform, 2020. [Online]. Available: https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store
[2] J. Wang, P. Huang, H. Zhao, Z. Zhang, B. Zhao, and D. L. Lee, “Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba,” arXiv preprint arXiv:1803.02349, 2018.
[3] J. Herlocker, J. Konstan, A. Borchers, and J. Riedl, “An algorithmic framework for performing collaborative filtering,” in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, 1999.
[4] G. Linden, B. Smith, and J. York, “Amazon.com recommendations: Item-to-item collaborative filtering,” IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, 2003.
[5] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Item-based collaborative filtering recommendation algorithms,” in Proceedings of the 10th International Conference on World Wide Web, pp. 285-295, WWW, 2001.
[6] M. Balabanović and Y. Shoham, “Fab: Content-based, collaborative recommendation,” Communications of the ACM, vol. 40, no. 3, pp. 66-72, 1997.
[7] H. T. Cheng et al., “Wide & deep learning for recommender systems,” Technical report, 2016.
[8] P. Covington, J. Adams, and E. Sargin, “Deep neural networks for YouTube recommendations,” in Proceedings of the 10th ACM Conference on Recommender Systems, RecSys, 2016.
[9] H. Wang, N. Wang, and D. Y. Yeung, “Collaborative deep learning for recommender systems,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1235-1244, KDD, 2015.
[10] J. Wei et al., “Collaborative filtering and deep learning-based hybrid recommendation for cold start problem,” in 2016 IEEE 14th Intl. Conf. on Dependable, Autonomic and Secure Computing, 14th Intl. Conf. on Pervasive Intelligence and Computing, 2nd Intl. Conf. on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech), pp. 874-877, IEEE, 2016.
[11] B. Kupisz and O. Unold, “Collaborative filtering recommendation algorithm based on Hadoop and Spark,” in 2015 IEEE International Conference on Industrial Technology (ICIT), IEEE, 2015.
[12] Y. C. Chen, “User behavior analysis and commodity recommendation for point earning apps,” in 2016 Conference on Technologies and Applications of Artificial Intelligence (TAAI), IEEE, 2016.
[13] Y. H. Zhou, D. Wilkinson, and R. Schreiber, “Large scale parallel collaborative filtering for the Netflix prize,” in Proceedings of the 4th International Conference on Algorithmic Aspects in Information and Management, pp. 337-348, Shanghai: Springer, 2008.
[14] L. T. Ponnam et al., “Movie recommender system using item based collaborative filtering technique,” in International Conference on Emerging Trends in Engineering, Technology, and Science (ICETETS), IEEE, 2016.
[15] V. Dev and A. Mohan, “Recommendation system for big data applications based on set similarity of user preferences,” in International Conference on Next Generation Intelligent Systems (ICNGIS), IEEE, 2016.
[16] X. Zeng et al., “Parallelization of latent group model for group recommendation algorithm,” in IEEE International Conference on Data Science in Cyberspace (DSC), IEEE, 2016.
[17] J. Joo, S. Bang, and G. Park, “Implementation of a recommendation system using association rules and collaborative filtering,” Procedia Computer Science, vol. 91, pp. 944-952, 2016.
[18] S. Panigrahi, R. K. Lenka, and A. Stitipragyan, “A Hybrid Distributed Collaborative Filtering Recommender Engine Using Apache Spark,” Procedia Computer Science, vol. 83 (ANT/SEIT 2016), pp. 1000-1006, 2016.
[19] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, no. 8, pp. 30-37, 2009.
[20] Z. Liu et al., “Monolith: Real Time Recommendation System with Collisionless Embedding Table,” arXiv preprint arXiv:2209.07663v2, 2022.
[21] S. Wang et al., “Graph Learning based Recommender Systems: A Review,” in Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21), April 2021.
[22] A. Grover and J. Leskovec, “Node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855-864, KDD, 2016.
[23] B. Perozzi, R. Al-Rfou, and S. Skiena, “DeepWalk: Online learning of social representations,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701-710, KDD, 2014.
[24] J. Tang et al., “LINE: Large-scale information network embedding,” in Proceedings of the 24th International Conference on World Wide Web, pp. 1067-1077, WWW, 2015.
[25] M. A. Masethe, S. O. Ojo, and S. A. Odunaike, “Taxonomy of Recommender Systems for Educational Data Mining (EDM) Techniques: A Systematic Review,” in Proceedings of the World Congress on Engineering and Computer Science 2019 (WCECS 2019), October 22-24, 2019, San Francisco, USA.
[26] J. Li et al., “Neural Attentive Session-based Recommendation,” arXiv preprint arXiv:1711.04725v1, 13 Nov. 2017. [Online]. Available: https://arxiv.org/pdf/1711.04725.pdf
[27] Y. Pang et al., “Heterogeneous Global Graph Neural Networks for Personalized Session-based Recommendation,” arXiv preprint arXiv:2107.03813v4, 27 Feb. 2022. [Online]. Available: https://arxiv.org/pdf/2107.03813.pdf
[28] R. van den Berg et al., “Graph Convolutional Matrix Completion,” arXiv preprint arXiv:1706.02263v2 [stat.ML], 2017. [Online]. Available: https://arxiv.org/pdf/1706.02263.pdf
[29] H. Jegou, M. Douze, and J. Johnson, “Faiss: A library for efficient similarity search,” Facebook Engineering blog (AI Research, Data Infrastructure, ML Applications), March 29, 2017. [Online]. Available: https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/
[30] S. Wannasuphoprasit et al., “Solving Cosine Similarity Underestimation between High Frequency Words by L2 Norm Discounting,” arXiv preprint arXiv:2305.10610v1 [cs.CL], 17 May 2023. [Online]. Available: https://arxiv.org/pdf/2305.10610.pdf
[31] D. Valcarce et al., “Assessing ranking metrics in top-N recommendation,” Information Retrieval Journal, vol. 23, no. 5, pp. 411-448, 2020. https://doi.org/10.1007/s10791-020-
Data cleaning steps: drop duplicates, check for missing data, check for outliers.
The purpose of this code is to create a new table containing filtered and selected data from the “optimised_raw_data.parquet” file, focusing on sessions with multiple viewed products. This table can then be used for further analysis and exploration of user behaviour and product views.
Running this code creates the “product_views_graph” table, which pairs each product view with the next viewed product within the same user session. This table can be used for further analysis and modelling, such as building recommendation systems or studying user behaviour patterns. The three graph queries below derive edges from this table; a hedged pandas sketch of the same idea follows them.
a. Directed Graph Query:
b. Directed Weighted Graph Query:
c. Undirected Weighted Graph Query:
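A rough pandas equivalent of the directed weighted variant is shown below; the column names (user_session, product_id, event_time) are assumptions based on the dataset description rather than the exact schema used in this work.

    import pandas as pd

    # Hedged sketch: derive weighted directed edges from consecutive views per session.
    views = pd.read_parquet("optimised_raw_data.parquet")
    views = views.sort_values(["user_session", "event_time"])
    views["next_product_id"] = views.groupby("user_session")["product_id"].shift(-1)
    pairs = views.dropna(subset=["next_product_id"])

    # Directed weighted graph: count how often each (source -> target) transition occurs.
    directed_weighted = (
        pairs.groupby(["product_id", "next_product_id"])
             .size()
             .reset_index(name="weight")
    )
    # The plain directed variant keeps only the distinct pairs, and the undirected
    # weighted variant additionally merges (a, b) with (b, a) before counting.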
4 Embeddings Vector Search with FAISS:
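The embedding vector search can be sketched roughly as below. The index type, dimensionality, and the stand-in random vectors are assumptions chosen for illustration; the actual implementation may configure FAISS differently.

    import numpy as np
    import faiss

    dim = 64                                            # embedding dimensionality (assumed)
    item_embeddings = np.random.rand(1000, dim).astype("float32")  # stand-in item vectors

    index = faiss.IndexFlatL2(dim)                      # exact L2 nearest-neighbour search
    index.add(item_embeddings)                          # index all item embeddings

    query = item_embeddings[:1]                         # e.g. the item a user just viewed
    distances, neighbour_ids = index.search(query, 10)  # retrieve the 10 most similar items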
5 Hyperparameters implementation for training model:
This approach can potentially result in better embeddings compared to using any of the methods in isolation.

    from enum import Enum

    class Constants(Enum):
        GRAPH_FILE_NAME = 'undirected_weighted_product_views_graph.parquet'
        WEIGHTED = True
        DIRECTED = False
        NUM_WALK = 10
        WALK_LEN = 50
This code defines an “Enum” class called “Constants”. An enumeration is a set of named values representing a collection of related constants. In this case, the “Constants” enum class contains several constant values.
The meanings of the constant values in this code are as follows:
- “GRAPH_FILE_NAME”: the name of the graph file, specifically “undirected_weighted_product_views_graph.parquet”. It refers to the file containing the graph representation of the product-views data.
- “WEIGHTED”: a Boolean value (“True”) indicating whether the graph is weighted. It means the edges in the graph carry weights representing the strength or importance of the connections between products.
- “DIRECTED”: a Boolean value (“False”) indicating whether the graph is directed. If the value were “True”, the graph's edges would have a specific direction, indicating a one-way relationship between products.
- “NUM_WALK”: the number of random walks to be performed on the graph, i.e. the number of times a random walker traverses the graph from a starting node to generate training data (see the random-walk sketch after this list).
- “WALK_LEN”: the length of each random walk, i.e. the number of steps or nodes visited in each walk.
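A minimal random-walk generator driven by these two constants might look as follows. This is a plain uniform walk for illustration; the weighted DeepWalk/Node2Vec-style walks used in this work would instead sample neighbours in proportion to edge weights.

    import random
    import networkx as nx

    def generate_walks(graph, num_walk=10, walk_len=50):
        """Run NUM_WALK passes over all nodes, each producing walks of up to WALK_LEN steps."""
        walks = []
        nodes = list(graph.nodes())
        for _ in range(num_walk):
            random.shuffle(nodes)
            for start in nodes:
                walk = [start]
                while len(walk) < walk_len:
                    neighbours = list(graph.neighbors(walk[-1]))
                    if not neighbours:
                        break
                    walk.append(random.choice(neighbours))   # uniform choice; weighted in practice
                walks.append(walk)
        return walks

    # Example on a small built-in graph; the thesis uses the product-views graph instead.
    corpus = generate_walks(nx.karate_club_graph(), num_walk=10, walk_len=50)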
By using an “Enum” class, these constants can be accessed and referenced using dot notation, making the code more readable and organized.
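For example, assuming the “Constants” class shown above, the values are retrieved via dot notation:

    # Dot-notation access; .value unwraps the stored constant.
    graph_file = Constants.GRAPH_FILE_NAME.value   # 'undirected_weighted_product_views_graph.parquet'
    num_walks = Constants.NUM_WALK.value           # 10
    walk_length = Constants.WALK_LEN.value         # 50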
The reasons for choosing 10 and 50 for NUM_WALK and WALK_LEN, respectively, are as follows:
- Trade-off between exploration and computation time: increasing the number of random walks (“NUM_WALK”) and the length of each walk (“WALK_LEN”) provides more exploration of the graph, allowing better coverage of the product relationships, but it also increases the computation time required to generate the embeddings. The chosen values strike a balance between exploration and computational efficiency.