Hanoi University of Science and TechnologySchool of Information and Communication Technology Master Thesis in Data Science Video Sage: Video Recommendation Using Graph Convolution Neural
Overview
The rising demand for entertainment and information sharing has significantly increased the popularity of video content However, in an age flooded with information, many individuals struggle to find content that aligns with their interests This challenge presents a considerable obstacle for various online service providers, including social networks, websites, and entertainment platforms.
The development of an effective video recommendation system is crucial in addressing the challenges of content discovery in the rapidly evolving landscape of online video consumption With the rise of platforms like Netflix, Apple TV, YouTube, and TikTok, the significance of personalized video recommendations has grown immensely As users turn to these diverse services for entertainment and information, sophisticated recommendation systems play a vital role in enhancing user experience by providing tailored content that meets individual preferences.
Despite significant advancements in recommendation systems for images and audio, video recommendation systems remain underexplored in both research and practical applications The complexity of video content, characterized by its rich and diverse information, poses unique challenges in identifying and extracting distinct features necessary for tailoring personalized user recommendations.
Historically, video content has lagged behind photos and audio in terms of popularity,
1https://www.theinsaneapp.com/2021/03/system-design-and-recommendation- algorithms
The development of robust recommendation systems is becoming increasingly essential for gaining a competitive edge in the online platform landscape This trend is driven by the large file sizes of videos, which can pose challenges for user engagement However, significant advancements in telecommunications, particularly the widespread adoption of 4G and the emergence of 5G networks, are transforming this scenario These technologies enable high-speed data transmission, providing users with unprecedented access to a vast selection of video content.
The analysis and recommendation of videos have become crucial in the digital landscape, highlighting the importance of effectively utilizing the extensive array of video resources available With the rise of high-speed data networks, the significance of video recommendation systems continues to grow.
Despite the growing demand for video recommendations, the field remains underexplored, with research efforts often limited Traditional models struggle as they typically rely on only a fraction of available video features Given that videos consist of various attributes, including text, audio, and images, effectively leveraging these elements adds complexity Additionally, accurately capturing nuanced relationships between items and understanding user habits and preferences poses significant challenges.
While some studies have ventured into the exploration of user behavior sequences in video recommendation systems, employing core Recurrent Neural Network (RNN)
Long Short-Term Memory (LSTM) architectures face significant performance efficiency challenges due to their high computational demands This can result in slow processing speeds, making them less ideal for dynamic and real-time applications like modern recommendation systems in fast-paced digital environments Therefore, there is an urgent requirement for innovative and efficient solutions to effectively tackle these issues.
In contrast, the CER [5] model has the capability to learn the influence of individ- ual video features and subsequently integrate them effectively Importantly, CER
Recent advancements in video features allow for integration without major model architecture changes; however, CER struggles to capture user behavior due to high computational demands Current research has largely concentrated on user-side data analysis, as seen in studies [6], [7], and [8], which, while effective, often depend on extensive user data This reliance can result in subpar performance in situations with limited data, creating a cold-start issue Shifting the focus towards learning video representations, rather than exclusively on user data, can enhance recommendation flexibility and accommodate new user profiles.
Figure 1.2: The cold-start problem is a significant challenge for contemporary models as they primarily focus on analyzing user profiles instead of exploiting the distinctive features of videos 2
Introducing the Video Sage model, a revolutionary approach that integrates Graph Convolution Network (GCN) technology, effectively leveraging diverse video features to create a comprehensive video representation This model excels in analyzing user-video interactions, uncovering complex relationships between videos and user preferences for deeper insights.
2https://medium.com/@i.vikas/cracking-the-cold-start-challenge-personalized-video- recommendations-on-any-platform-47d732dfc90d
Furthermore, the scalability of our model positions it as a dynamic solution capable of gracefully accommodating the influx of new data, a paramount feature in today’s ever-evolving digital landscape.
Inspired by the Graph Sage model, Video Sage effectively learns video representations through a graph-based structure where each video is a node, and edges represent user-video interactions This innovative method utilizes various video attributes, including titles, thumbnail images, and engagement metrics like likes and views, to create node features Instead of training individual embedding vectors for each node, Video Sage utilizes specialized aggregator functions to gather and synthesize information from the immediate neighborhood of each node, allowing for versatile aggregation from nodes at different distances or search depths.
Video Sage enhances video processing by integrating advanced techniques for embedding complex features into unified node representations Recognizing the diversity of video features, which include various data types such as textual titles and visual thumbnails, our approach utilizes pre-trained models specific to each type for efficient embedding We also apply essential graph algorithms to analyze the constructed graph, extracting critical information that enriches node features This enriched representation captures the core essence of video content, facilitating the development of effective video recommendation systems in today’s evolving digital environment.
Objectives
This research aims to create the Video Sage model leveraging Graph Convolutional Neural Networks for enhanced video representation learning The resulting learned representation will be applied to multiple downstream tasks, including video recommendations and video classification.
We aim to develop a method that integrates rich video features with data derived from product relationship graphs, enabling the generation of input vectors for our model This approach significantly improves the model's learning performance and representation capabilities.
Subsequently, we conduct experiments to demonstrate the excellent performance of the Video Sage model on a dataset obtained from the Lotus social media platform.
We aspire to see widespread application of our research in the future.
Main contributions
The primary contributions of this study are as follows:
• Effective Feature Embedding: We propose a practical method for efficiently embedding video input features.
The Video Sage model is designed to effectively leverage various video features to learn representation vectors, which are essential for enhancing video recommendations and other downstream tasks.
• Empirical Validation: We conduct experiments using real-world data from ourLotus social network platform to confirm the superior performance of our VideoSage model in video recommendations.
Outline of the thesis
The rest of this thesis is organized as follows:
Chapter 2 Provides an overview and explores related research in the field of this study.
Chapter 3 Describing the specific architecture of the Video Sage model and the method for embedding video features.
Chapter 4 Describes the experimental content, the achieved results, and certain ablation study findings.
Chapter 5 Concludes the entirety of the thesis.
In Section 2.1, we will explore essential theories of graphs and graph representation learning, highlighting their practical applications Section 2.2 will review pertinent research in video recommendation systems Lastly, we will discuss the pre-trained deep learning models utilized to embed video features as node features within the Video Sage model.
Background
Graph basic concepts
Before analyzing specific problems related to Graph Convolutional Networks (GCN), it's essential to understand fundamental concepts of graph theory A graph is defined as G = (V, E), where V represents the set of vertices and E denotes the set of edges connecting these vertices.
• V is the set of nodes in the graph (vertices/nodes).
• E is the set of edges connecting the nodes in the graph.
In a graph, each edge (v i, v j) symbolizes the connection between nodes v i and v j The nodes v i and v j belong to the set of nodes V, while (v i, v j) signifies an edge within the set of edges E, linking node v i to node v j.
We use the notation N(v) ={u ∈V |(v, u) ∈ E} to denote the set of neighboring nodes, where N(v) represents the nodes adjacent to node v (i.e., the neighboring nodes that share an edge with node v).
G= (V, E), (v i , v j )∈E, N(v) ={u∈V |(v, u)∈E} (2.1) The adjacency matrix (A), also known as the weighted matrix, is a square matrix of sizen×n(wherenis the total number of nodes in the graph), defined as:
The adjacency matrix (A) reflects the weights of the edges within a graph In the provided example, all edges are assigned equal weights; however, these weights can differ based on the specific problem and data involved.
The degree matrix (D) is a square diagonal matrix of size n× n and contains information about the degree of each vertex, defined as:
Please note that for directed graphs, the degree of each node only considers edges directed towards that node.
Node Feature: Nodes in a graph can also include individual features or attributes.
These features can be extracted from the information associated with each node. For example:
• In a social network problem, where nodes represent people, node features may include attributes such as age, gender, occupation, education level, and more.
In document topic classification, nodes in a graph symbolize the documents, with node features typically represented as a binary vector corresponding to the vocabulary size, such as 50,000 words, where a value of 1 indicates word presence and 0 indicates absence More advanced representations can be achieved through language models or word embeddings, which create feature vectors for specific words or passages For example, employing a doc2vec model allows for the mapping of a document to a 300-dimensional vector.
Graph representation learning
Graph representation learning has emerged as a prominent research domain, focusing on the creation of graph representation vectors that effectively encapsulate the structure and features of large graphs The accuracy of these vectors is crucial, as it directly impacts their performance in various downstream tasks, including node classification, link prediction, and anomaly detection Techniques for generating effective graph representations generally fall into two main categories: traditional graph embedding methods and graph neural network (GNN)-based approaches, applicable to both static and dynamic graphs Static graphs remain fixed, while dynamic graphs evolve over time with changing nodes and edges This overview will concentrate on the two categories of graph representation learning specifically for static graphs, as our research primarily targets this type, with the Graph Convolutional Neural Network (a GNN model) discussed further in Chapter 3, Section 3.1.
Traditional static graph embedding methods are designed for unchanging graphs that maintain a constant set of nodes and edges These methods focus on preserving various properties of the graph's structure, including node proximities In this context, we categorize proximities into first-order and second-order types, each reflecting different relationships among nodes and edges.
A higher order of proximities can be similarly defined.
First-order proximity refers to the direct connection between nodes linked by an edge, with edge weights serving as the measurement of this proximity A higher edge weight indicates a greater similarity between the two connected nodes.
• Second-order proximity: The second-order proximity between two nodes is the similarity between their neighborhood structures Nodes sharing more neigh- bors are assumed to be more similar.
Traditional static graph embedding methods fall into three main categories: factorization-based, random walk-based, and non-GNN deep learning methods This article reviews these categories and the specific techniques employed within each approach.
Matrix factorization methods are foundational in graph representation learning, comprising two key steps First, a proximity-based matrix is created, where each element represents the proximity between nodes In the second step, a dimension reduction technique is employed to generate node embeddings The Graph Factorization algorithm utilizes the adjacency matrix as the proximity measure, optimizing a specific function to achieve the desired node representations.
In the context of graph representation learning, the node representation vectors for nodes \( v_i \) and \( v_j \) are denoted as \( z_i \) and \( z_j \), respectively The adjacency matrix element \( a_{ij} \) represents the connection between these nodes Notably, in methods like GraRep and HOPE, this adjacency value is often substituted with alternative similarity measures, which may include higher-order adjacency matrix representations, the Katz index, and Rooted PageRank.
[18], and the number of common neighbors STRAP [19] employs the personalized
PageRank serves as an effective proximity measure, approximating pairwise proximities between nodes to reduce computational costs A network embedding update algorithm has been proposed to compute higher-order proximities between node pairs more accurately However, relying on a single proximity matrix for learning node representations may restrict the effectiveness of matrix factorization methods To address this limitation, a new framework has been developed that simultaneously learns proximity measures and SVD decomposition parameters in an end-to-end manner.
Random walk-based methods have gained significant attention due to their effectiveness in graph representation These methods generate random walks for each node, allowing them to capture the graph's structure and produce similar embedding vectors for nodes that appear together in these walks This approach, which uses co-occurrence in random walks as a measure of node similarity, offers greater flexibility compared to traditional fixed proximity measures, demonstrating promising performance across various applications.
A random walk in a graph G = (V, E) is defined as a sequence of nodes v0, v1, , vk, beginning at node v0 Each transition from node vi to node vi+1 occurs only if the edge (vi, vi+1) is present in E, with k+1 representing the total length of the walk The selection of the next node in the sequence is determined by a probabilistic distribution.
DeepWalk and Node2vec are based on the Word2vec embedding technique, which is commonly used in natural language processing (NLP) and operates on the principle that words appearing together share semantic similarity These methods extend this idea to graphs, suggesting that nodes that co-occur in random walks also exhibit similarity, leading to the creation of comparable node embedding vectors Both algorithms involve two main steps: generating a set of random walks and using them to train a SkipGram model to produce embedding vectors The key difference between the two lies in their random walk generation methods; DeepWalk randomly selects the next node from neighboring nodes, while Node2vec employs a more sophisticated approach This section will first explain the random walk generation process in Node2vec and then explore the SkipGram model.
1 Random Walk Generation Assume we want to generate a random walk sequence v 0 , v 1 , , v k where v i ∈ V Given that the edge (v i−1 , v i ) has been traversed, the next node v i+1 in the walk is selected based on the following probability:
(2.5) where Z is a normalization factor, and α v v i+1 i is defined as: α v i v i+1 ⎧⎪
In Node2vec, the distance \( d(v_{i-1}, v_{i+1}) \) represents the shortest path length between nodes \( v_{i-1} \) and \( v_{i+1} \), with possible values of 0, 1, or 2 The user-defined parameters \( p \) and \( q \) influence the random walk's direction; a high \( p \) value promotes global exploration and minimizes revisiting nodes, while a high \( q \) value favors local exploration By leveraging these parameters, Node2vec effectively merges breadth-first search (BFS) and depth-first search (DFS) techniques in its random walk process.
2 SkipGram After generating random walks, the walks are input to a SkipGram model to generate the node embeddings SkipGram learns a language model that maximizes the probability of sequences of words in the training corpus. The objective function of SkipGram for node representation is: maxΦ v i ∈V logP(N(v i )|Φ(v i )) (2.7) whereN(v i ) is the set of neighbors of nodev i generated from the random walks. Assuming independence among the neighbor nodes, we have:
The conditional probability P(Φ(v k )|Φ(v i )) is modeled using a softmax func- tion:
The softmax function's numerator is derived from the dot product of node representation vectors, which encourages similarity among neighbor nodes when maximized However, calculating the denominator of the conditional probability between a target node and all nodes in the graph is computationally intensive To address this challenge, DeepWalk and Node2vec utilize hierarchical softmax and negative sampling, respectively, to approximate the process efficiently.
2.1.2.2 Graph Neural Net (GNN) based graph embedding
Graph Neural Network (GNN) based graph embedding methods represent a significant advancement in graph embedding techniques, utilizing GNNs to create embeddings that effectively generalize to unseen nodes Unlike traditional methods, GNN-based approaches leverage node and edge attributes more effectively A GNN operates as a deep learning model that generates node embeddings by aggregating the embeddings of neighboring nodes, reflecting the idea that a node's state is shaped by its interactions within the graph The following section delves into the fundamental architecture and training processes of GNNs.
Application of GNN
To effectively utilize graphs in machine learning and data mining, it is essential to represent graphs and their components, such as nodes and edges, as numerical features The resulting embedding vectors can be applied in various applications, including node classification, link prediction, and graph classification This article explores these key applications in detail.
Node classification involves assigning labels to nodes within a test dataset, with numerous applications across various domains For example, in social networks, one can predict an individual's political affiliation based on their connections In this process, each training instance consists of a node embedding vector paired with its corresponding label Various classification methods, including Logistic Regression and Random Forests, can be utilized to train on the dataset, ultimately generating classification scores for the test data.
Link prediction is a crucial application of node embedding methods, aiming to forecast the probability of edge formation between two nodes This process is essential in various contexts, such as recommending friends on social networks and uncovering biological relationships in networks The task can be framed as a classification problem, where edges are assigned labels—1 indicating a likely connection and 0 indicating otherwise During training, a dataset is created using both positive samples (existing edges) and negative samples (non-existing edges), with the latter's representation generated from node vectors Similar to node classification, any classification algorithm can be employed to train on this dataset and predict edge labels for new edge instances.
• Graph Clustering: In addition to classification tasks, graph embeddings can be used in clustering tasks as well This task can be useful in domains such
1https://blogs.nvidia.com/blog/2022/10/24/what-are-graph-neural-networks/
Knowledge across various scientific and industrial fields can be represented through graphs, facilitating the analysis of social networks for community detection and biological networks for identifying similar protein groups Clustering methods, such as the K-means algorithm, can be utilized to detect groups of similar graphs, nodes, or edges by analyzing their embedding vectors.
Node embedding methods play a crucial role in graph visualization by mapping nodes into lower-dimensional spaces, which enhances the visibility of nodes, edges, communities, and various graph properties This technique significantly aids the research community in gaining insights into complex graph data, particularly for large graphs that are challenging to visualize effectively.
Related work
Overview of Video Recommendation System
Current video recommendation systems primarily rely on traditional machine learning techniques, including Content-based filtering and Collaborative filtering methods like Weighted Matrix Factorization (WMF), Collaborative Topic Regression (CTR), Collaborative Deep Learning (CDL), and Bayesian Personalized Ranking (BPR) These approaches often face challenges such as high computational costs and limited scalability While the integration of Deep Learning in video recommendations is emerging, it largely remains focused on matrix factorization methods, particularly in platforms like YouTube.
Recommendation systems often face high computational costs, particularly when utilizing user behavior sequences with recurrent neural networks (RNNs) like GRU and LSTM, which can be slow and ineffective in dynamic environments Traditional methods are limited to specific video characteristics, such as ID, title, views, and likes, failing to leverage the full spectrum of available video data The CER model has made strides by incorporating multiple features to enhance recommendation efficiency, but it requires retraining for each new feature, leading to significant computational expenses Recent studies have focused on user-related data, achieving notable results but often struggling with the cold-start problem due to reliance on extensive user data By shifting the emphasis to learning video representations, systems can become more adaptable and better serve new user profiles Our approach, Video Sage, effectively learns video representations by utilizing a comprehensive array of features, enhancing recommendation capabilities.
Video Feature Engineering
Videos consist of various elements, including titles, thumbnails, descriptions, and engagement metrics such as likes and views Utilizing this diverse content can significantly improve the model's learning efficiency However, due to the different data types associated with each content element, a feature engineering process is crucial This process entails converting video content into the vector space domain and incorporating it as feature nodes within the data graph, alongside the video nodes.
In our research, we specifically employ pre-trained deep learning models to embed video features and then combine them as node features Specifically, we use PhoBert
[32] for titles, EfficientB0 [33] for thumbnail images, and SlowFast [34] to extract video features Below, we will provide a brief overview of these models.
PhoBert is a specialized BERT model designed for the Vietnamese language BERT, which stands for Bidirectional Encoder Representation from Transformers, utilizes a bidirectional approach to word representation through the Transformer architecture.
[35] technique BERT is designed to pre-train word embeddings The special fea- ture of BERT is its ability to balance context in both left and right directions.
The Transformer model employs an attention mechanism that processes all words in a sentence simultaneously, enabling non-directional training This bidirectional capability allows the model to understand the context of each word by considering all surrounding words, both to the left and right, enhancing its comprehension of language.
BERT introduces a unique capability not found in earlier pre-trained models: the ability to fine-tune training outcomes By incorporating an additional output layer into its architecture, BERT allows for enhanced customization during the training process.
The fine-tuning process is as follows:
Figure 2.3: Complete BERT pre-training and fine-tuning process 2
To begin, utilize a pre-trained model to embed all tokens from the sentence pair using word embedding vectors This process incorporates both the [CLS] token to signify the start of the question and the [SEP] token to delineate the two sentences These tokens are crucial as they will be predicted in the output to identify the Start and End Span components of the resulting sentence.
In Step 2, the embedding vectors are processed through a multi-head attention architecture, which consists of several blocks of code—usually 6, 12, or 24, based on the specific BERT architecture This process generates an output vector from the encoder.
In Step 3, we predict the probability distribution for each position in the decoder by combining the output vector from the encoder with the input embedding vector from the decoder at each time step This process involves computing encoder-decoder attention, projecting the result through a linear layer, and applying softmax to generate the probability distribution for the output at that specific time step.
• Step 4: In the output returned by the transformer, we will align the result of the Question sentence to match the input Question sentence The remaining posi-
I don't know!
During the training process, we will optimize all parameters of the BERT model, including those in the customized linear layer specifically designed to address the problem at hand.
Masked Machine Learning (ML) enables the fine-tuning of word representations using unsupervised text datasets across various languages By applying Masked ML, we can generate effective embeddings for different languages Notably, English datasets, which can range from hundreds to thousands of gigabytes, have achieved remarkable outcomes when trained on BERT.
The training process of BERT for Masked ML task can be described as follows:
Figure 2.4: BERT architecture diagram for Masked ML jobs 3
Approximately 15% of the tokens in an input sentence are replaced with the [MASK] token, indicating the masked words The model uses the surrounding non-masked words and their context to predict the original values of these masked words This limited masking of 15% ensures that the remaining 85% of the context remains significant for accurate predictions.
BERT's core architecture is based on a seq2seq model that includes an encoder phase for embedding input words and a decoder phase for predicting the probability distribution of output words The Transformer encoder structure is utilized in the Masked Language Modeling task, where self-attention and feed-forward processes generate embedding vectors, denoted as O1, O2, O3, O4, and O5, at the output.
3https://phamdinhkhanh.github.io/2020/05/23/BERTModel.html/3-c´ac-kin-tr´uc- model-bert/
To calculate the probability distribution of output words, a fully connected layer is added right after the Transformer Encoder The softmax function is then applied to determine this probability distribution, ensuring that the number of units in the fully connected layer corresponds to the size of the vocabulary.
We derive the embeddings for each word positioned at the MASK by utilizing dimension-reduced embedding vectors, which are generated after processing through the fully connected layer, as illustrated in the accompanying diagram.
BERT's loss function focuses solely on the loss from masked words, neglecting non-masked ones, which results in slower model convergence However, this approach enhances the model's contextual understanding By randomly masking 15% of the words, BERT creates a variety of input scenarios, necessitating extensive training for the model to fully grasp the diverse possibilities.
Currently, there are several different versions of the BERT model These versions are all based on changes to the Transformer architecture, focusing on three parameters:
Preliminary
Notations and Preliminaries
This section outlines the essential notations and foundational concepts for understanding graph convolutional networks (GCNs) We use bold uppercase letters for matrices, bold lowercase letters for vectors, and regular lowercase letters for scalars Specific elements within matrices are referenced using the notation A(i, j), indicating the entry at the intersection of the ith row and jth column Additionally, we discuss the transpose of a matrix.
In this section, our focus centers around graph convolutional network models applied to undirected connected graphsG={V, E, A} This graph consists of a set of nodes
In a graph represented by a set of vertices V with |V| = n and a set of edges E with |E| = m, the adjacency matrix A is utilized to denote relationships between nodes If there is an edge connecting node i to node j, the matrix entry A(i, j) reflects the weight of that edge; if no edge exists, A(i, j) is equal to 0 In the case of unweighted graphs, this is further simplified by assigning A(i, j) a value of 1.
We define the degree matrix ofAas a diagonal matrixD, whereD(i, i) = n j=1 A(i, j).
Subsequently, the Laplacian matrix ofAis denoted asL=D−A The correspond- ing symmetrically normalized Laplacian matrix is represented asL=I−D − 1 2 AD − 1 2, where I is an identity matrix.
A graph signal is represented as a vector x∈R n, where each element x(i) indicates the signal value at node i Node attributes derived from graph signals contribute to the overall characteristics of the graph.
X ∈R n×d as the node attribute matrix of an attributed graph, where the columns of X represent the dsignals associated with the graph.
Spectral graph convolutional networks
In this section and the subsequent section titled ”Spatial Graph Convolutional Networks,” we classify graph convolutional neural networks into two categories: spectral-based methods and spatial-based methods, respectively.
Spectral-based methods in classification involve the construction of frequency filters The pioneering spectral-based graph convolutional network was developed by Bruna et al and draws inspiration from traditional Convolutional Neural Networks (CNNs) This deep learning model for graphs features several spectral convolutional layers, where each layer processes an input feature map, represented as X p with dimensions n×d p, and generates an output feature map, X p+1, of size n×d p+1 through a specific formula.
In the context of feature maps, X p (:, i) and X p+1 (:, j) denote the i-th and j-th dimensions of the input and output, respectively The parameter θ (p) p,i,j represents a vector of learnable filter parameters for the p-th basis in the i-th column at the p-th layer Additionally, the matrix L is identified as the graph Laplacian, while σ(ã) refers to the activation function utilized in the process.
The convolutional structure encounters significant challenges, primarily due to the time complexity of computing the eigenvector matrix \( V \), which requires eigenvalue decomposition of the graph Laplacian \( L \) and has a complexity of \( O(n^3) \), rendering it impractical for large-scale graphs Moreover, even with precomputed eigenvectors, the convolution process still maintains a time complexity of \( O(n^2) \) Each layer also contains \( O(n) \) learnable parameters, and the non-parametric filters used are not localized within the vertex domain.
To overcome existing limitations in graph spectral convolutions, the authors proposed a rank-r approximation of eigenvalue decomposition, significantly reducing the number of parameters for each filter to O(1) This method allows for localized filters in graphs with clustering structures through rank-r factorization Building on this foundation, Henaf et al introduced an input smoothing kernel, such as splines, utilizing interpolated weights as filter parameters Although this approach enhanced spatial localization in the vertex domain, challenges related to computational complexity and localization persisted, hindering improved graph representations.
To address these challenges, Deferrard et al introduced ChebNet, which utilizes K-polynomial filters in its convolutional layers to enhance localization These filters are mathematically expressed as ˆy(λ l ) = K k=1 θ k λ k l, where θ k represents the filter parameters and λ l denotes the eigenvalues The K-polynomial filters demonstrate effective localization in the vertex domain by combining node features from within the K-hop neighborhood.
[36], reducing the number of trainable parameters to O(K) = O(1) Additionally,
Chebyshev polynomial approximation is used to compute spectral graph convolution efficiently.
Mathematically, the Chebyshev polynomial T k (x) of order k can be computed re- cursively by T k (x) = 2xT k−1 (x)−T k−2 (x) with initial valuesT 0 = 1 andT 1 (x) =x.
Deferrard et al [39] normalize the filters ˜λ l = 2 λ l λ max −1 to ensure scaled eigenvalues lie within the interval [-1, 1] The convolutional layer is expressed as:
Here, θ (p) i,j represents a K-dimensional parameter vector for the i-th column of the input feature map and thej-th column of the output feature map at thep-th layer.
The authors also introduced a max-pooling operation using the Graclus multilevel clustering method to efficiently uncover hierarchical graph structures.
A special variant known as the Graph Convolutional Network (GCN), introduced by Kipf et al., focuses on the semi-supervised node classification task on graphs [9].
In this model, the authors simplify the Chebyshev polynomial to the first order (i.e.,
In the context of Eq (3.8) with K set to 2, the parameters are defined as θ i,j (1) = −θ i,j (2) = θ i,j The eigenvalues of the modified Laplacian operator, ˜L, are constrained within the interval [0, 2] Notably, even with the relaxation of λ max = 2, it remains assured that the eigenvalues satisfy −1 ≤ λ˜ l ≤ 1 for all l ranging from 1 to n This leads to a more streamlined convolution layer.
The equation ˜A=I+A represents the addition of self-loops to the original graph, with ˜D serving as the diagonal degree matrix of ˜A, and Θ p as a parameter matrix of dimensions d p+1 × d p This formulation is closely tied to the Weisfeiler–Lehman isomorphism test and highlights how GCN aggregates node representations from direct neighborhoods, emphasizing its vertex localization concept As a result, GCN is often viewed as a connection between spectral-based and spatial-based methods However, GCN's training can be memory-intensive, particularly with large-scale graphs, and its transductive nature creates challenges for generalization, complicating the learning of representations for unseen nodes within the same graph and across different graphs.
Many graph structures, such as kNN graphs, are manually created based on the similarities of data points, but these predefined graphs often lack optimal learning capabilities for specific tasks To address this issue, Li et al introduce a spectral graph convolution layer that learns the graph Laplacian simultaneously This innovative approach parameterizes a function over the graph Laplacian through a concept called residual Laplacian, rather than directly parameterizing filter coefficients However, a notable limitation of this method is the unavoidable drawbacks it presents.
Spatial graph convolutional networks
Spectral graph convolution relies on the specific eigenfunctions of the Laplacian matrix, making it difficult to transfer knowledge from one graph to another with different eigenfunctions In contrast, vertex domain graph filtering provides a more adaptable method for convolution by aggregating graph signals from neighboring nodes This section categorizes spatial graph convolutional networks into two main types: classic CNN-based models and propagation-based models.
Classic CNN-based spatial graph convolutional networks
Traditional CNN models have demonstrated remarkable success in handling grid-like data, particularly in tasks such as image classification, object detection, and semantic segmentation This success can be largely attributed to the unique characteristics of grid-like data.
• (1) a fixed number of neighboring pixels for each pixel
• (2) a natural spatial order for scanning images, typically from left to right and top to bottom.
Unlike images, arbitrary graph data does not have a fixed number of neighboring units for each node or a predetermined spatial order To address these challenges, various methods have been created to modify traditional CNN architectures for direct application with graph data.
Niepert et al propose a novel approach to address challenges in graph analysis through their PATCHY-SAN model, which extracts locally connected regions from graphs The model begins by establishing a node ordering based on various graph labeling methods that consider structural characteristics such as degree, PageRank, and betweenness A fixed-length sequence of nodes is then selected, and to manage varying neighborhood sizes, a consistent neighborhood size is created around each node Subsequently, the neighborhood graph is normalized using graph labeling techniques to align nodes with similar structural roles This process culminates in representation learning utilizing traditional CNNs.
The PATCHY-SAN approach has a notable limitation: it determines the spatial order of nodes exclusively through the selected graph labeling method, usually reliant on the graph's structure As a result, PATCHY-SAN may lack the necessary flexibility and generalizability for diverse applications.
In contrast to PATCHY-SAN, which organizes nodes based on structural information
The LGCN model uniquely transforms irregular graph data into grid-like data by utilizing both structural information and the input feature map from the p-th layer For a node \( u \) in graph \( G \), it aggregates the feature maps of \( u \)’s neighboring nodes into a matrix \( M \), where the number of rows corresponds to the neighboring nodes and the columns represent the feature dimensions By retaining the largest values in each column, LGCN creates a new matrix \( M \sim \), which is then converted into a 1-D grid-like data format \( X p \sim \) This allows for the application of traditional 1-D CNN operations to derive new node representations \( X p+1 \) Additionally, a subgraph-based training method enhances the model's scalability for large-scale graphs.
To adapt classic CNNs for graph data, researchers have introduced structure-aware convolution operations that can effectively manage both Euclidean and non-Euclidean data Chang et al established a link between classical filters and univariate functions, known as functional filters, and enhanced their structural awareness by integrating graph structure into generalized functional filters Due to the infinite parameters needed for these convolutions, Chebyshev polynomials are utilized for approximation Alternatively, another approach involves re-engineering the classic CNN architecture by creating a set of fixed-size learnable filters, ranging from size-1 to size-K, to demonstrate their adaptability to the graph's topology.
Propagation-based spatial graph convolutional networks
In this section, we explore spatial graph convolutions that effectively propagate and aggregate node representations from neighboring nodes within the vertex domain A significant contribution to this area is highlighted in [50], which formulates the graph convolution for a node \( u \) at the \( p \)-th layer as follows: \( x^p_N(u) = X^p(u,:) + \sum_{v \in N(u)} \).
In the p-th layer, Θ p |N (u)| represents the weight matrix for nodes sharing the same degree as |N(u)| However, in the case of large graphs, the diversity of unique node degree values can become excessively high, which may lead to overfitting as a result of the multiple weight matrices required to be learned at each layer.
Hamilton et al present GraphSAGE, an aggregation-based inductive representation learning model In its full batch version, the algorithm processes a node \( u \) through several steps: first, it aggregates the representation vectors from all immediate neighbors using a learnable aggregator; second, it concatenates this aggregated representation with the vector of node \( u \); finally, it passes the resulting concatenated vector through a fully connected layer with a non-linear activation function \( \sigma(\tilde{a}) \) and applies normalization The p-th convolutional layer in GraphSAGE is mathematically defined as \( x^p_{N(u)} \leftarrow AGGREGATE^p (\{X^p(v,:), \forall v \in N(u)\}) \).
GraphSAGE offers various choices of aggregator functions, including mean aggrega- tors, LSTM aggregators, and pooling aggregators When using mean aggregators, the equation simplifies to:
This approximate formulation resembles the GCN model [9] Additionally, the pool- ing aggregator is expressed as:
To enable minibatch training, the authors also provide a variant that uniformly samples a fixed number of neighboring nodes for each node [10].
Methodology
Motivation
In this article, we have chosen to implement a propagation-based spatial graph convolutional network model, inspired by the GraphSAGE framework, for our video representation learning model This decision is driven by several important factors outlined in section 3.1.
Social media networks often present irregular graph structures, where nodes have differing numbers of neighbors, posing challenges for spectral graph convolutional networks However, GraphSAGE effectively addresses these irregularities by aggregating information from neighboring nodes, making it a suitable solution for such complex graph scenarios.
GraphSAGE offers flexible information aggregation by allowing the adaptation of neighboring node data through various aggregation functions, such as mean, LSTM, and pooling aggregators This versatility enables customization to meet the specific structure of the graph and the distinct needs of different applications Additionally, GraphSAGE demonstrates improved generalization capabilities when handling unseen data.
GraphSAGE utilizes a multi-layer learning model that enhances node representations by leveraging information from preceding network layers This advanced capability allows the model to develop complex representations, ultimately leading to improved video recommendations.
GraphSAGE demonstrates enhanced performance and computational efficiency, especially when handling large and complex graphs, in contrast to Spectral graph convolutional networks This advantage is primarily due to GraphSAGE's avoidance of resource-intensive processes such as eigenvalue calculations and Laplace matrix analysis.
GraphSAGE is highly effective in navigating the complexities of social network graphs, where aggregating information from various sources is crucial Its ability to learn hierarchical representations of users and content within social media platforms significantly improves the quality of video recommendations.
In the subsequent sections, we will present an overview of the model architecture and the development of a comprehensive framework for video recommendations to users.
Problem Setup
Online video streaming services like YouTube, Netflix, and TikTok are gaining immense popularity, attracting significant daily traffic and user engagement Each video features rich content, including titles, thumbnails, descriptions, and key metrics such as likes and views Our goal is to leverage this data, along with users' interaction history, to provide personalized video recommendations tailored to individual preferences.
Our goal is to create high-quality video embeddings that facilitate effective recommendations These embeddings can be employed for nearest neighbor searches to suggest related videos or integrated into downstream re-ranking systems to enhance user experience.
We model the Video Sage environment as a graph, where nodes represent videos and edges are formed based on user interaction history Specifically, an edge between two nodes is created when a user interacts with one video immediately after another Each video is associated with real-valued attributes, which may include metadata or content information, allowing us to utilize the rich content of the videos Our goal is to leverage these attributes and the bipartite graph structure to generate high-quality embeddings These embeddings are then used for candidate generation in recommendation systems through nearest neighbor lookup or as features in machine learning systems for ranking candidates.
Overall Architecture
Our research closely aligns with the GraphSAGE algorithm by Hamilton et al., utilizing localized convolutional modules to generate node embeddings We begin with input node features and train neural networks that modify and integrate these features throughout the graph to derive the final node embeddings.
Algorithm 1 operates on the principle that, in each iteration or search depth, nodes gather data from their local neighbors As this iterative process continues, nodes gradually acquire information from more remote areas of the graph.
Algorithm 1 describes the process of generating embeddings under the condition that the complete graph, G = (V,E), and features for all nodes x v ,∀v ∈ V, are provided as input At each step within the outer loop of Algorithm 1, denoted
Input: Graph G = (V,E); node input features x v ,∀v ∈ V; weight matrices W; aggregator functions α; set of neighbor weights β; set of neighbor embedding
9: z v ←h K v ,∀v∈ V by k representing the current iteration (or the search depth) andh k representing a node’s representation at this particular step, the process unfolds as follows: Initially, every node v ∈ V aggregates the representations of the nodes within its immediate neighborhood, h k−1 u ,∀u∈ N(v)
In the aggregation step, a single vector \( h_{k-1}^N(v) \) is formed, drawing on the representations created in the previous iteration of the outer loop (i.e., \( k-1 \)) For the base case when \( k = 0 \), the representations are defined by the input node features.
Video Sage aggregates neighboring feature vectors and concatenates the node's current representation, h k−1 v , with the aggregated neighborhood vector, h k−1 N(v) This combined vector is then transformed using a fully connected layer with a nonlinear activation function σ, generating new representations for the next step of the algorithm, h k v , for all nodes v in V.
In graph data, which is inherently unordered and lacks positional significance unlike sequences or images, it is essential for aggregator functions to exhibit symmetry, ensuring they remain unaffected by permutations of neighboring nodes In Video Sage, we utilize three specific aggregator functions to address this requirement.
Mean aggregation is a technique akin to Graph Convolutional Networks (GCN), which computes the average of neighboring nodes However, unlike GCN, which incorporates the node representation \( h_{v}^{k-1} \) into the mean, our approach concatenates the node representation with the mean of its neighbors This method effectively prevents the loss of node information.
The graph convolution layer architecture is depicted in Figure 3.1, showcasing Algorithm 1 for synthesizing the node embedding of node A from the k-th layer This process utilizes the embeddings of its neighboring nodes (B, C, D) and itself, which were previously synthesized from the k−1 th layer In this context, α k denotes the aggregation function of the k-th layer.
An LSTM aggregator effectively combines representations of neighboring nodes through an LSTM architecture, which maintains the sequential order of nodes However, since there is no inherent order among the neighboring nodes, we address this issue by introducing a random permutation of the nodes during input.
• Pooling aggregator In this aggregation, each neighbor node is fed through a fully connected neural net, and then an elementwise max operation is applied to the transformed nodes as follows:
The equation utilizes the max operator for pooling, but the mean operator is also applicable This pooling aggregator is both symmetric and learnable, capturing various aspects of a node's neighborhood set.
To achieve valuable anticipatory representations in an unsupervised framework, we utilize a graph-oriented loss function applied to the output representations \( z_u \) for all nodes \( u \) in the graph \( V \) We optimize the weight matrices \( W_k \) for \( k \) ranging from 1 to \( K \), as well as the parameters of the aggregator functions, through stochastic gradient descent This graph-based loss function enhances the similarity of representations for nearby nodes while ensuring that representations for distant nodes remain significantly different.
(3.11) wherev represents a node that appears in proximity touduring a fixed-length ran- dom walk[51] σ denotes the sigmoid function,P n represents the negative sampling distribution, and Qspecifies the number of negative samples.
A key innovation in our approach is the redefinition of node neighborhoods N(v) for convolution in Algorithm 1 Unlike traditional GCN methods that analyze k-hop graph neighborhoods, we define a node v's neighborhood as the N nodes exerting the greatest influence on v This is achieved through a weight random walk initiated from node v, where we compute the L1-normalized visit counts of the nodes encountered Consequently, the neighborhood of v is formed by selecting the top N nodes with the highest normalized visit counts related to node v.
The importance-based neighborhood definition offers two key benefits: it allows for controlled memory usage during training by selecting a fixed number of nodes for aggregation, and it enhances Algorithm 1's ability to weigh the significance of neighbors when merging their vector representations, as indicated by the β parameter.
Using Rich Contents As Node Feature
Videos contain a variety of content elements, such as titles, thumbnails, descriptions, likes, and views Leveraging this diverse information can enhance the model's learning capabilities Due to the varying data types of each content element, a feature engineering process is necessary to transform video content into a vector space format, allowing these features to be integrated as nodes within a data graph.
In our work, we leverage video content elements such as titles, thumbnail images, video frames, and key metrics like likes and views Additionally, we harness data from graphs, using it as a feature to enhance our content The specific processing approach for each content type is illustrated in Figure 3.2.
The title of a video serves to provide an overview and purpose, undergoing preprocessing that includes cleaning and removing stop-words We employ a pre-trained PhoBert model for Vietnamese language data to embed these video titles, extracting the vector from a layer near the output layer to create a distinctive feature vector For videos lacking titles, we initialize a default vector with all elements set to zero.
• Thumbnail Image Thumbnail images are also important content elements of a video, providing viewers with a preliminary visual representation of the
In our methodology, we preprocess video content to generate node features, specifically utilizing the pre-trained EfficientNet B0 model to extract features from thumbnail images We also derive a distinctive feature vector from a layer near the output layer, similar to the approach used for the video title, ensuring a comprehensive representation of the thumbnail image's content.
To effectively capture viewers' attention on social media, we focus on the first 30 seconds of videos, as they are typically short and impactful This approach avoids the computational inefficiency of processing entire videos The selected segment is resized and transformed into a sequence of 64 frames, which are then input into the SlowFast model The output vector generated serves as the representative feature for the video.
In our analysis, we focus solely on numerical data, including likes, views, shares, and comments for videos, due to dataset constraints We standardize these numerical values to create a Node Feature known as Numeric Feature Depending on the specific dataset, we can also integrate additional numerical attributes to enhance our analysis.
Videos possess unique characteristics, but the relationships between them, illustrated in a graph, offer valuable insights These insights can be utilized to enhance node features, which we define as "graph features."
In this study, we utilize graph algorithms to uncover hidden information within graphs, aggregating the findings into a vector that serves as a graph feature The specific algorithms employed in our analysis include various advanced techniques designed to enhance data extraction and representation.
The Weakly Connected Components (WCC) algorithm identifies groups of connected nodes in both directed and undirected graphs Nodes are deemed connected if a path exists between them, forming a component of mutually connected nodes regardless of the direction of their relationships For instance, in a directed graph with a relationship from node A to node B, both nodes A and B belong to the same component, even in the absence of a directed relationship from B back to A.
WCC is often used as a preliminary method in graph analysis to understand the structural arrangement of the graph By applying WCC, one can effectively identify clusters, enabling the independent application of additional algorithms on these identified groups.
The PageRank algorithm evaluates the significance of nodes within a graph by analyzing the quantity of incoming links and the authority of the linking nodes Essentially, it posits that a page's importance is determined by the quality and quantity of pages that link to it.
PageRank is introduced in the original Google paper as a function that addresses the following equation:
• we assume that page A has incoming links from pages T 1 to T n
• ”d” represents a damping factor, which can be set between 0 (inclusive) and 1 (exclusive), with a common value being 0.85.
• The term ”C(A)” denotes the number of links originating from page A.
The iterative application of this equation is used to update a candidate solution, leading to an approximate solution for the same equation.
There are some things to be aware of when using the PageRank algorithm:
• If there are no relationships from within a group of pages to outside the group, then the group is considered a spider trap.
• Rank sink can occur when a network of pages is forming an infinite cycle.
• Dead-ends occur when pages have no outgoing relationship.
Adjusting the damping factor can address various concerns by allowing for a probability that a web surfer will occasionally navigate to a random page, preventing them from becoming trapped in content sinks.
1https://neo4j.com/developer-blog/using-neo4j-graph-data-science-in-python-to- improve-machine-learning-models/
Figure 3.3: An example network where nodes are colored based on their PageRank score 1
Closeness centrality is a key metric in connected graphs that evaluates the centrality of a node within a network It is determined by calculating the sum of the shortest path lengths from the node to every other node in the graph Therefore, a node with higher closeness centrality is positioned closer to all other nodes, indicating its importance in the network's structure.
Figure 3.4: Example about Closeness centrality 2
Closeness was defined by Bavelas (1950) as the reciprocal of the farness, that is:

$$C(x) = \frac{1}{\sum_{y} d(y, x)} \tag{3.13}$$

where d(y, x) is the distance between vertices x and y. When discussing closeness centrality, it is typically given in its normalized form, which represents the average length of the shortest paths rather than their sum:

$$C(x) = \frac{N - 1}{\sum_{y} d(y, x)}$$

where N is the number of nodes in the graph. For large graphs, the subtraction of 1 is often omitted, as the distinction becomes negligible.
This adjustment facilitates the comparison of nodes across graphs of varying sizes. In undirected graphs, it makes no difference whether distances are measured to or from other nodes, but in directed graphs the two directions can yield vastly different outcomes. For instance, a website may exhibit high closeness centrality based on outgoing links yet low closeness centrality when assessed through incoming links.
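A minimal sketch with networkx, whose closeness routine returns the normalized form directly:

```python
import networkx as nx

G = nx.path_graph(5)  # chain 0-1-2-3-4

# Normalized closeness: (N - 1) / sum of shortest-path distances.
closeness = nx.closeness_centrality(G)
print(closeness[2])  # middle node: 4 / (2+1+1+2) ~ 0.667
```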
Video Recommendation System Architecture
In this section, we outline the complete architecture of Video Sage, a model designed for video recommendations on social media platforms. Building on the previously discussed Graph Convolutional Neural Network framework, we explore how to effectively integrate rich video features with the hidden information from the graph to create node features. These node features serve as essential inputs for the representation learning model, enhancing the overall performance of video recommendations.
The primary components of the Video Sage architecture include:
Figure 3.5: Overview architecture of Video Sage.
• Feature Extract: This module is used to construct the initial node features for videos, serving as input for the representation learning model, as described in Section 3.2.4.
• Graph Convolutional Neural Network: This is the core component of the model, where video representations are learned through the algorithm described in Section 3.2.3.
• Embedding Store: The graph structure and the embedding vectors representing the videos are stored here, facilitating retraining or video candidate recommendations.
Here, we will outline the video recommendation process using the Video Sage model:
When a user engages with a video, the system first determines whether the video exists in the learned representation dataset. If the video is new, it undergoes feature extraction and is processed through the Graph Convolutional Neural Network for representation learning before being saved in the Embedding Store.
The Find Candidate module analyzes the embedding vector of the input video and compares it with a dataset of video representations to identify the top K video candidates that closely resemble the input video, providing tailored recommendations for the user.
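A sketch of the Find Candidate step, assuming cosine similarity over the stored embeddings (the text does not pin down the similarity metric; function and variable names are illustrative):

```python
import numpy as np

def find_candidates(query_emb: np.ndarray, store: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the top-k stored embeddings most
    cosine-similar to the query video's embedding."""
    store_norm = store / np.linalg.norm(store, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    scores = store_norm @ q            # cosine similarity per stored video
    return np.argsort(-scores)[:k]     # indices of the k best matches
                                       # (exclude the query itself in practice)
```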
Video Sage and the cold-start problem
Video Sage is an innovative video recommendation model that focuses on learning video representations instead of relying solely on user-specific data. By utilizing user interaction history to establish relationships between videos in a graph, Video Sage effectively mitigates cold-start issues for new users or those with limited profiles. This allows the model to generate relevant video recommendations even with minimal user interactions, making it particularly advantageous for emerging social networks like Lotus, which have sparse user data. As Lotus grows and attracts more users, Video Sage's capabilities will continue to expand, offering substantial potential for enhanced video recommendations in the future.
This chapter outlines the data utilized, the experimental setup, and the results achieved, allowing for conclusions regarding the performance of the video recommendation model based on a Graph Convolutional Neural Network. Section 4.1 details the specific dataset employed, along with a thorough summary of hyperparameter values, augmentation techniques, and experimental configurations. Section 4.2 showcases the results from the experiments conducted, while Section 4.3 evaluates the effectiveness of specific components within the study.
Experimental Setup
Dataset
Current recommendation systems primarily rely on user behavior data, and most datasets that include both video content and user interactions are proprietary to large corporations. To address this gap, we leverage our own dataset from the Lotus social network platform, which includes comprehensive information on videos and user interactions as of August 2021.
Table 4.1: Information of Lotus Dataset
Due to hardware limitations, we utilized a subset of 10,411,554 user interaction records from the dataset, covering 165,667 videos and 301,539 users. Each video is described by information such as title, thumbnail image, likes, and views. The data was sourced from the Kafka message queue system of the Lotus social network and underwent several preprocessing steps to filter the relevant fields, resulting in an output similar to Figure 4.2.
Training Setup and Strategies
Table 4.2 below provides the hardware configuration that we used for model training:

Name | Specifications
CPU | Intel(R) Xeon(R) 2.30GHz
Table 4.3 lists the libraries and programming tools that we used during the experimentation process, along with their respective versions.
In our study, we define a set of positive training examples based on historical user engagement data, identifying pairs of videos (q, i) where users interacted with video i after video q. All other videos are treated as negative items; 100 negative videos are designated per example in our experiments. We divided the dataset into three subsets: 80% for training, 5% for validation, and 15% for testing. For model training, we utilized aggregation functions from the GraphSage research, including max-pooling, mean-pooling, and LSTM.
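A sketch of this example-construction step under the assumptions above (consecutive watches form positive pairs and 100 negatives are sampled per pair; names and the history format are illustrative):

```python
import random

def build_examples(histories: list, all_videos: list, n_neg: int = 100) -> list:
    """For each user's chronologically ordered watch history, emit a
    positive pair (q, i) for every consecutively watched pair of videos,
    together with n_neg videos sampled from the rest as negatives."""
    examples = []
    for watched in histories:
        for q, i in zip(watched, watched[1:]):
            candidates = [v for v in all_videos if v != q and v != i]
            negatives = random.sample(candidates, n_neg)  # assumes enough candidates
            examples.append((q, i, negatives))
    return examples
```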
Offline Evaluation
Metrics
To evaluate the effectiveness of video recommendations, we introduce the hit rate at rank k (HR@k) metric. Each positive video pair (q, i) from the test dataset is used, with q serving as the query video, to identify its top k nearest neighbors. The hit rate is then calculated as the fraction of queries q for which the video i appears in the top k rankings. This metric measures the algorithm's capability to recommend videos relevant to the query video q. Our experiments present results for k values of 10, 20, and 30.
We also assess the effectiveness of our methods through Mean Reciprocal Rank (MRR@k), which measures the position of an item among the top k recommended items for a given query. In our experiments, we present the results with k set to 30.
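A minimal sketch of both metrics, assuming each query's ranked neighbor list and its positive target are already available (names are illustrative):

```python
def hr_and_mrr(ranked: list, targets: list, k: int = 30) -> tuple:
    """ranked[j] is the ordered neighbor list for query j; targets[j]
    is the positive video i for that query. Returns (HR@k, MRR@k)."""
    hits, rr_sum = 0, 0.0
    for neighbors, target in zip(ranked, targets):
        top = neighbors[:k]
        if target in top:
            hits += 1
            rr_sum += 1.0 / (top.index(target) + 1)  # reciprocal rank
    n = len(targets)
    return hits / n, rr_sum / n
```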
Baselines for comparison
We conducted experiments comparing Video Sage with leading models in the video recommendation domain, using the CER model as the primary baseline due to its top performance on our platform. Additionally, we included a comparison with YouTubeDNN, which is frequently utilized in real-world systems such as YouTube, TikTok, and Tencent for its simplicity and effectiveness. The results of these comparisons are presented in Table 4.4.
• Weighted Matrix Factorization (WMF) [26]: a classical collaborative filtering (CF) model that operates within an in-matrix setting.
• Bayesian Personalized Ranking (BPR) [29]: BPR utilizes pairwise optimization for collaborative filtering. Like WMF, BPR is only suitable for the in-matrix scenario.
• YouTubeDNN: developed by Google, this recommendation system uses deep neural networks to evaluate vast amounts of user behavior data and video content, delivering tailored video recommendations on YouTube.
• Collaborative Embedding Regression (CER) [5]: combines collaborative filtering (CF) with arbitrary content features, employing a linear embedding to derive latent content vectors from the raw content vectors.
Table 4.4 highlights the strong performance of Video Sage, which outperforms the competing models, achieving an MRR@30 score approximately 6% higher than that of the best-performing model on our platform, CER [5].
Table 4.4: Performance metrics of Video Sage and different models
Model | HR@10 | HR@20 | HR@30 | MRR@30
Ablation Studies
Effectiveness of Using Rich Contents
Our experiments evaluated the effectiveness of combining multiple video features, revealing that leveraging diverse feature sets yields the best results, as shown in Table 4.5. This highlights the robustness of Video Sage, which can seamlessly integrate various features regardless of their original formats. We chose not to compare with numeric data or graph features at this stage, as they serve only as supplementary data and do not adequately represent the diversity of different videos; their effectiveness is assessed in the experiments below.
Table 4.5: Performance metrics of Video Sage with all features and with individual isolated features.
VideoSage with | HR@10 | HR@20 | HR@30 | MRR@30
Effectiveness of Using Numeric Data
The evaluation results in Figure 4.1 demonstrate that incorporating numeric data as additional features consistently boosts the performance of Video Sage. Specifically, performance increases by 5% with only the Title as the node feature, 4% with only the Thumbnail Image, 8% with only the Video Frame, and 2% when all features are combined. Graph features were not included in this experiment to prevent conflicts with the numeric data; their results are addressed in the next experiment.
Figure 4.1: Effectiveness of Using Numeric Data.
Effectiveness of Using Graph Features
This section validates the performance enhancement achieved by integrating information from the graph using basic graph algorithms. As shown in Table 4.6, the model experiences a 5-10% improvement with the inclusion of graph features. This indicates significant potential for future enhancement, especially considering that our current graph information extraction methods are still in their infancy. Additionally, this finding highlights a positive trend, as traditional models have not yet fully utilized the valuable insights derived from the relationships between products or users represented in the graph.
Table 4.6: Effectiveness of using graph features (performance metrics for MRR@30).
VideoSage with | graph features | no graph features
Influence of the Number of Neighbors
Video Sage utilizes the random walk algorithm to learn the vector representation of each node by analyzing its neighbors. To evaluate the impact of the number of neighbors on the quality of the model's representation, we conducted the assessment illustrated in Figure 4.2.
Figure 4.2: Influence of the Number of Neighbors
As the number of neighbors increases, performance improves, but the gains become marginal beyond 30 neighbors. This is due to the diminishing influence of distant neighbors, as reflected by the weighting mechanism. Furthermore, exceeding 100 neighbors can lead to model overfitting. We therefore selected 30 neighbors to achieve an optimal balance between performance and computational efficiency.
4.3.5 Evaluating Aggregator Functions and Weighted Random Walks
Our research utilized three aggregator functions: the mean, LSTM, and pooling aggregators. Table 4.7 compares the Video Sage model's performance using all three types versus each individually. The findings demonstrate that integrating multiple aggregator functions significantly improves the model's effectiveness, allowing nodes to share information in a more diverse and generalized way. Notably, LSTM outperformed the other aggregators, likely due to its capability to exploit the sequential historical behavior of users.
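For illustration, a sketch of a GraphSAGE-style mean aggregator (one layer, in the concatenate-then-project variant; shapes and names are assumed for the example, not taken from our implementation):

```python
import numpy as np

def mean_aggregate(node_feat: np.ndarray, neighbor_feats: np.ndarray,
                   W: np.ndarray) -> np.ndarray:
    """Average the neighbors' features, concatenate with the node's own
    feature, and apply a learned projection W followed by a ReLU.
    node_feat: (d,), neighbor_feats: (num_neighbors, d), W: (d_out, 2d)."""
    agg = neighbor_feats.mean(axis=0)          # aggregate the neighborhood
    h = np.concatenate([node_feat, agg])       # keep the node's own signal
    return np.maximum(W @ h, 0.0)              # project and apply ReLU
```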
In a further experiment, we compared the weighted random walk with the traditional random walk. The findings, presented in Table 4.8, highlight the superior effectiveness of the weighted random walk. This approach lets nodes prioritize information aggregation from nearby nodes, enhancing the model's efficiency.
Table 4.8: Evaluating Weighted Random Walks.
Method | MRR@30
Random Walk | 0.092574
Weighted Random Walk | 0.094803
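A sketch of a single weighted random walk, where each next node is drawn with probability proportional to its edge weight (the graph representation and names are illustrative assumptions):

```python
import random

def weighted_walk(graph: dict, weights: dict, start, length: int = 10) -> list:
    """Perform one weighted random walk of at most `length` steps.
    graph maps a node to its neighbor list; weights maps an edge
    (u, v) to its weight, so heavier edges are followed more often."""
    walk, node = [start], start
    for _ in range(length):
        nbrs = graph.get(node, [])
        if not nbrs:
            break  # dead end: stop the walk
        node = random.choices(nbrs, weights=[weights[(node, v)] for v in nbrs])[0]
        walk.append(node)
    return walk
```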
This thesis presented Video Sage, an innovative video recommendation model that integrates various video attributes with user interaction history. At its core, the model is based on the principles of Graph Convolutional Neural Networks (GCN).
We developed an innovative solution that enhances video recommendation efficiency by embedding and consolidating complex video features, which are used as input for the Video Sage model. Our method effectively discerns intricate user-video interactions while maintaining scalability for future data growth. Comprehensive testing on a dataset of users' video viewing histories from social networks demonstrated remarkably competitive results. This research highlights the critical need to bridge the gap between the rich information in videos and the effectiveness of recommendation systems. By integrating user interactions, video characteristics, and advanced graph-based techniques, we pave the way for an improved personalized video recommendation experience.
[1] S. Bhandarkar and R. Kammar, "4G technology," International Journal of Scientific Research and Modern Education, vol. 1, no. 2, pp. 96-99, 2016.
[2] R. Dangi, P. Lalwani, G. Choudhary, I. You, and G. Pau, "Study and investigation on 5G technology: A systematic review," Sensors (Basel), vol. 22, no. 1, p. 26, 2021.
[3] A. Sherstinsky, "Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network," arXiv preprint arXiv:1808.03314, 2018.
[4] R. C. Staudemeyer and E. R. Morris, "Understanding LSTM - a tutorial into long short-term memory recurrent neural networks," arXiv preprint arXiv:1909.09586, 2019.
[5] X. Du, H. Yin, L. Chen, Y. Wang, Y. Yang, and X. Zhou, "Personalized video recommendation using rich contents from videos," arXiv preprint arXiv:1612.06935, 2016.
[6] W. Cao, K. Zhang, H. Wu, T. Xu, E. Chen, G. Lv, and M. He, "Video emotion analysis enhanced by recognizing emotion in video comments," International Journal of Data Science and Analytics, vol. 14, no. 2, pp. 175-189, 2022.
[7] X. Xiao, H. Dai, Q. Dong, S. Niu, Y. Liu, and P. Liu, "Social4Rec: Distilling user preference from social graph for video recommendation in Tencent," arXiv preprint arXiv:2302.09971, 2023.
[8] K. Papadamou, S. Zannettou, J. Blackburn, E. De Cristofaro, G. Stringhini, and M. Sirivianos, ""It is just a flu": Assessing the effect of watch history on YouTube's pseudoscientific video recommendations," in Proceedings of the International AAAI Conference on Web and Social Media, vol. 16, 2022.
[9] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[10] W. L. Hamilton, R. Ying, and J. Leskovec, "Inductive representation learning on large graphs," arXiv preprint arXiv:1706.02216, 2017.
[11] P. Goyal and E. Ferrara, "Graph embedding techniques, applications, and performance: A survey," Knowledge-Based Systems, vol. 151, pp. 78-94, 2018.
[12] W. L. Hamilton, R. Ying, and J. Leskovec, "Representation learning on graphs: Methods and applications," arXiv preprint arXiv:1709.05584, 2017.
[13] C. Yang, M. Sun, Z. Liu, and C. Tu, "Fast network embedding enhancement via high order proximity approximation," in IJCAI, pp. 3894-3900, 2017.
[14] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola, "Distributed large-scale natural graph factorization," in Proceedings of the 22nd International Conference on World Wide Web, pp. 37-48, 2013.
[15] S. Cao, W. Lu, and Q. Xu, "GraRep: Learning graph representations with global structural information," in Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pp. 891-900, 2015.
[16] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, "Asymmetric transitivity preserving graph embedding," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1105-1114, 2016.
[17] L. Katz, "A new status index derived from sociometric analysis," Psychometrika, vol. 18, no. 1, pp. 39-43, 1953.
[18] H. H. Song, T. W. Cho, V. Dave, Y. Zhang, and L. Qiu, "Scalable proximity estimation and link prediction in online social networks," in Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement, 2009.
[19] Y. Yin and Z. Wei, "Scalable graph embeddings via sparse transpose proximities," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1429-1437, 2019.
[20] X. Zhang, K. Xie, S. Wang, and Z. Huang, "Learning based proximity matrix factorization for node embedding," in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 2243-2253, 2021.
[21] B. Perozzi, R. Al-Rfou, and S. Skiena, "DeepWalk: Online learning of social representations," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701-710, 2014.
[22] A. Grover and J. Leskovec, "node2vec: Scalable feature learning for networks," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855-864, 2016.
[23] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[24] Y. Luo, Q. Liu, and Z. Liu, "STAN: Spatio-temporal attention network for next location recommendation," in Proceedings of The Web Conference 2021, pp. 2177-.
[25] W. L. Hamilton, Graph Representation Learning. Morgan & Claypool Publishers, 2020.
[26] Y. Koren, R. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems," IEEE Computer, vol. 42, no. 8, pp. 30-37, 2009.
[27] S. Purushotham, Y. Liu, and C.-C. J. Kuo, "Collaborative topic regression with social matrix factorization for recommendation systems," arXiv preprint arXiv:1206.4684, 2012.
[28] H. Wang, N. Wang, and D.-Y. Yeung, "Collaborative deep learning for recommender systems," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.
[29] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, "BPR: Bayesian personalized ranking from implicit feedback," in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), pp. 452-461, 2009.
[30] J. Davidson, B. Liebald, J. Liu, P. Nandy, T. Van Vleet, U. Gargi, S. Gupta, Y. He, M. Lambert, B. Livingston, and D. Sampath, "The YouTube video recommendation system," in Proceedings of the 2010 ACM Conference on Recommender Systems (RecSys 2010), pp. 293-296, 2010.
[31] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[32] D. Q. Nguyen and A. T. Nguyen, "PhoBERT: Pre-trained language models for Vietnamese," arXiv preprint arXiv:2003.00744, 2020.
[33] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," arXiv preprint arXiv:1905.11946, 2019.
[34] C. Feichtenhofer, H. Fan, J. Malik, and K. He, "SlowFast networks for video recognition," arXiv preprint arXiv:1812.03982, 2018.
[35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[36] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains," IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83-98, 2013.
[37] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and locally connected networks on graphs," arXiv preprint arXiv:1312.6203, 2013.
[38] M. Henaff, J. Bruna, and Y. LeCun, "Deep convolutional networks on graph-structured data," arXiv preprint arXiv:1506.05163, 2015.
[39] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," Advances in Neural Information Processing Systems, vol. 29, 2016.
[40] N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt, "Weisfeiler-Lehman graph kernels," Journal of Machine Learning Research, vol. 12, no. 9, 2011.
[41] R. Li, S. Wang, F. Zhu, and J. Huang, "Adaptive graph convolutional neural networks," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, 2018.
[42] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, pp. 1097-1105, 2012.
[43] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440-1448, 2015.
[44] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.
[45] M. Niepert, M. Ahmed, and K. Kutzkov, "Learning convolutional neural networks for graphs," in International Conference on Machine Learning, pp. 2014-.
[46] H. Gao, Z. Wang, and S. Ji, "Large-scale learnable graph convolutional networks," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1416-1424, 2018.
[47] J. Chang, J. Gu, L. Wang, G. Meng, S. Xiang, and C. Pan, "Structure-aware convolutional neural networks," Advances in Neural Information Processing Systems, vol. 31, 2018.
[48] D. K. Hammond, P. Vandergheynst, and R. Gribonval, "Wavelets on graphs via spectral graph theory," Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129-150, 2011.
[49] J. Du, S. Zhang, G. Wu, J. M. Moura, and S. Kar, "Topology adaptive graph convolutional networks," arXiv preprint arXiv:1710.10370, 2017.
[50] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, "Convolutional networks on graphs for learning molecular fingerprints," Advances in Neural Information Processing Systems, vol. 28, 2015.