NGUYEN QUYNH ANH PHUONG
SESSION-BASED RECOMMENDATION SYSTEM
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU-HCM
Supervisor(s):
• Assoc. Prof. Thoai Nam, Ph.D.
• Nguyen Quang Hung, Ph.D.
Examiner 1: Assoc. Prof. Huynh Tuong Nguyen, Ph.D.
Examiner 2: Ha Viet Uyen Synh, Ph.D.
This master's thesis was defended at Ho Chi Minh City University of Technology, Ho Chi Minh City, on 18 June 2024.
VNU-HCM Master's Thesis Committee:
1. Chairman: Le Thanh Sach, Ph.D.
2. Secretary: Le Thanh Van, Ph.D.
3. Examiner 1: Assoc. Prof. Huynh Tuong Nguyen, Ph.D.
4. Examiner 2: Ha Viet Uyen Synh, Ph.D.
5. Commissioner: Nguyen Le Duy Lai, Ph.D.
Approval of the Chairman of the Master's Thesis Committee and the Dean of the Faculty of Computer Science and Engineering after the thesis is corrected (if any).
CHAIRMAN OF THESIS COMMITTEE
DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING
VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness
THE TASK SHEET OF MASTER’S THESIS
Full name: Nguyen Quynh Anh Phuong
Student ID: 2171069
Date of birth: 02/12/1996
Place of birth: Lam Dong Province
Major: Computer Science
Major ID: 8480101
I. THESIS TITLE: Session-based recommendation system in Fashion
(Hệ thống gợi ý dựa vào phiên làm việc ngành thời trang)
II. TASKS AND CONTENTS:
• Task 1: Research and experiment with data augmentation techniques in preprocessing the data of the session-based recommendation system to create more variations of the input data.
• Task 2: Research and propose suitable approaches for predicting the next items in a session-based recommendation system in the Fashion domain using deep neural network approaches.
• Task 3: Experiment with and evaluate the proposed approaches.
III. THESIS START DATE: 15/01/2024
IV. THESIS COMPLETION DATE: 20/05/2024
V. SUPERVISORS: Assoc. Prof. Thoai Nam, Ph.D. and Nguyen Quang Hung, Ph.D.
Ho Chi Minh City, 15/08/2024
SUPERVISOR 1    SUPERVISOR 2    CHAIR OF PROGRAM COMMITTEE
(Full name and signature)    (Full name and signature)    (Full name and signature)
DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING
(Full name and signature)
ACKNOWLEDGEMENTS

First and foremost, I would like to express my deepest appreciation to my beloved parents, my family, and especially my aunt, Mrs. Nguyen Thi Hanh, for inspiring and supporting me throughout my academic and professional endeavors.
Furthermore, I would like to thank Ho Chi Minh City University of Technology for giving me an invaluable opportunity to pursue my academic aspirations. The journey of learning and growing with the support and guidance of great teachers and colleagues has been the best transformation in cultivating my character and knowledge.
I want to express my profound appreciation to Associate Professor Thoại Nam of Computer Science and Engineering at the Ho Chi Minh City University of Technology for his supportive guidance, encouragement, and invaluable feedback throughout this study. I am immensely grateful for his patience in guiding me and reviewing my work. Without his guidance and support, this thesis would not have been possible.
I also would like to thank Mr. Nguyen Tan Sang, Mr. Dao Ba Dat, Mr. Ha Minh Duc, and Ms. Vo Thi Kim Nguyet for their enthusiastic cooperation and encouragement, and for offering valuable advice during the thesis.
Last but not least, I would like to thank my company and my leader for always giving me the chance to keep working while studying. This experience has not only enriched my professional career development but also reinforced my pursuit of knowledge.
ABSTRACT OF DISSERTATION
Recommendation systems play a key role for both businesses and customers, focusing on improving the customer shopping experience and creating customer satisfaction by understanding their preferences. Recently, session-based recommendation systems have arisen to capture short-term but dynamic user preferences. This helps businesses engage users quickly to explore products and make decisions, instead of requiring them to become loyal customers first. Recommendation systems are widely applied in various domains, including fashion. Fashion is a challenging domain due to its fast-changing trends and the enormous number of items that must adapt to customers' tastes.

Deep learning based approaches are widely used in many domains, including recommendation systems. Furthermore, more and more approaches and techniques are being applied at the intersection of Computer Vision, Deep Learning, and Natural Language Processing. These methods are applied to the problem of session-based recommendation as well. This thesis focuses on researching and conducting experiments to build a recommendation system with the target of predicting the next items in a customer's session based on deep neural network approaches. This thesis contributes to Fashion recommendation systems by experimenting with and applying various techniques from other fields to solve the problem. Some highlighted points are addressed here:

• The first highlight is the research, experiments, and evaluation of applying data augmentation techniques to create more varied input data for the session-based recommendation system.
• The second highlight is the research, experiments, and evaluation of approaches using Deep Neural Networks, focused on Attention and Neural Networks, to predict the next items in anonymous sessions, as well as the application of several techniques for building a recommendation system in the Fashion domain.
TÓM TẮT LUẬN VĂN (THESIS ABSTRACT)

Recommendation systems play an important role for both businesses and customers. For customers, a recommendation system improves the shopping experience and increases their affinity for products by demonstrating an understanding of their preferences and offering goods that match their needs. In addition, session-based recommendation systems are attracting more and more research and applications, since they help businesses quickly understand customers and provide good experiences after only a few interactions. This allows customers to explore and purchase easily without first having to become members. Recommendation systems are widely applied in many industries, including fashion, which brings many challenges such as rapidly changing trends, a large variety of products, and the need for personalization.

In recent years, deep learning based solutions have been widely applied to recommendation systems. Moreover, many hybrid techniques combining Deep Learning, Computer Vision, and Natural Language Processing are also being widely applied, especially to session-based recommendation systems. This thesis focuses on researching, experimenting with, and evaluating these solutions to build a recommendation system whose goal is to predict the next products in a customer's session. The research contributes to the fashion domain and to recommendation systems by experimenting with and evaluating various applied techniques. A few highlights are listed as follows:

• First, data augmentation techniques in the data preprocessing stage, which are widely used in other fields such as Computer Vision and Natural Language Processing, are researched, experimented with, and applied to the session-based recommendation problem.
• Next, the research, experiments, and evaluation of building a session-based recommendation system using Deep Neural Networks to predict the next products in anonymous sessions, applied to building a recommendation system for the fashion industry.
DECLARATION

I declare that this thesis is my own work, carried out under the supervision of Assoc. Prof. Thoại Nam, built to meet society's demands and within my ability to obtain information. The data and figures presented in this thesis for analysis, comments, and evaluation were gathered from various resources through my own work and have been duly acknowledged in the reference part.
In addition, contents produced with external assistance are recorded, referenced, and cited.
I take full responsibility for any fraud detected in my thesis. Ho Chi Minh City University of Technology (HCMUT) - VNU-HCM is not responsible for any copyright infringement caused by my work (if any).
Ho Chi Minh City, June 2024
Nguyen Quynh Anh Phuong
Trang 8List of Figures ix
List of Tables xii
1INTRODUCTION11.1 Overview 1
1.2 Recommendation system in Fashion 2
2.2 Recurrent Neural Networks 13
2.3 Gated Recurrent Unit 15
2.4 Transformer 18
2.5 Word2Vec 24
3RELATED WORKS283.1 Session based Recommendation System 28
3.2 Natural Language Processing and Recommendation system relationship 293.3 Data Augmentation 31
vi
Trang 93.4 Discussion 33
4APPROACHES344.1 Overview 34
4.2 Data exploration 35
4.3 Data augmentation 40
4.3.1 Data Augmentation methods 40
4.3.2 Data augmentation applied strategies 42
4.4 Feature Engineering 43
4.5 Model 43
4.5.1 Neural Attentive Session-based Recommendation 43
4.5.2 Behaviour Sequence Transformer 49
4.5.3 Bidirectional Encoder Representations from Transformer 54
4.6 Evaluation metrics 60
5EXPERIMENTS AND EVALUATION625.1 Dataset 62
5.2 Experiments 63
5.2.1 Experiment data augmentation 63
5.2.2 Experiment NARM model 64
5.2.3 Experiment Transformer-based model 65
5.2.4 Experiment BERT4Rec 66
5.3 Results and discussion 67
5.3.1 Data Augmentation techniques in preprocessing 67
5.3.2 Deep neural networks based approaches in processing 70
6CONCLUSION79References 81
LIST OF FIGURES

1.1 The difference between Sequential recommendation system and Session recommendation system [2]
1.2 A session of an anonymous user in the session-based recommendation system problem description
2.1 Neural Networks in Deep Learning
2.2 A neural unit [4]
2.3 Feedforward in Neural Networks in NLP [4]
2.4 Selecting the embedding vector for word V5 by multiplying the embedding matrix E with a one-hot vector with a 1 in index 5 [4]
2.5 Recurrent Neural Networks training [4]
2.6 GRU cell and its gates [5]
2.7 GRU explained with equations
2.8 Transformer model [6]
2.9 Mechanism of First Stack of Encoder [7]
2.10 One-head Attention in Transformer [7]
2.11 Decoder in Transformer
2.12 SkipGram model Word2Vec [8]
3.1 The categorization of session-based recommendation system approaches
3.2 Timeline relationship of Natural Language Processing and Recommendation system [11]
3.3 Data Augmentation in text classification Natural Language Processing taxonomy [25]
4.1 Item interaction counts distribution
4.2 Item interaction distributed by date
4.3 Item interaction distributed by week
4.4 Item interaction distributed by month
4.5 Data Augmentation methods: Noise Injection, Redundancy Injection, Random Swap, Random Deletion, Synonym Replacement
4.6 Data Augmentation strategies applied. Example: take a fraction of 20% of the dataset and use the Random Swap method, Naug = 4, keep the original sequence and generate 3 more sequences
4.7 Overview framework of encoder-decoder-based NARM [17]
4.8 Global encoder in NARM model - the last hidden state is interpreted as the user's sequential behaviour feature
4.9 Local encoder in NARM
4.10 Model of NARM
4.11 Transformer-based model based on Behaviour Sequence Transformer
4.12 Label smoothing with ε = 0.1
4.13 BERT4Rec model architecture overview
4.14 Transformer layer in BERT
4.15 The non-linearity in the negative range of GELU [35]
4.16 BERT4Rec model by stacking many Transformer layers
LIST OF TABLES

4.1 Comparison of the Dressipi dataset in Fashion with other popular datasets
5.1 Parameters for NARM model
5.2 Parameters for Transformer model
5.3 Parameters for BERT4Rec
5.4 Comparison when applying the Noise Injection data augmentation method with different fraction and Naug
5.5 Comparison when applying the Redundancy Injection data augmentation method with different fraction and Naug
5.6 Comparison when applying the Random Swap data augmentation method with different fraction and Naug
5.7 Comparison when applying the Random Deletion data augmentation method with different fraction and Naug
5.8 Comparison when applying the Synonym Replacement data augmentation method with different fraction and Naug
5.9 Overall comparison of the applied data augmentation methods
5.10 Recall@20 metric result when experimenting with the impact of the learning rate of the Adam optimizer. Epochs: 5; batch size: 512; hidden size: 100; embedding size: 50
5.11 Impact of max sequence length on the overall results with the Recall@20 metric. Learning rate of the Adam optimizer is 0.01, epochs is 10, batch size is 512, hidden size is 100, embedding size is 50
5.12 Impact of max sequence length on the Recall@20 metric. Experiment with learning rate of the Adam optimizer 0.01, batch size 512, hidden size 100, embedding size 50, max sequence length 19
5.13 Impact of hidden units of GRU. Experiments with 15 epochs, learning rate of the Adam optimizer 0.01, batch size 512, embedding size 50, max sequence length 19
5.14 Experiment on the number of stacked Transformer encoders in the BST model with sequence length 6, label smoothing 0.1, embedding dim 128, hidden d 128, number of Attention heads 2
5.15 Experiment on the number of Attention heads in the BST model with number of Transformer layers: 2; sequence length: 6; label smoothing: 0.1; embedding dim: 128; hidden d: 128
5.16 Experiment on label smoothing in the BST model with number of Transformer layers: 2; n head: 2; sequence length: 6; embedding dim: 128; hidden d: 128
5.17 Experiment on hidden dimensionality (d) in the BST model with number of Transformer layers: 2; n head: 2; sequence length: 6; embedding dim: 128; hidden d: 128
5.18 Experiment on sequence length (slide window is 2) in the BST model with number of Transformer layers: 2; n head: 2; embedding dim: 128; hidden d: 128
5.19 Experiment on the number of epochs when training the BST model with number of Transformer layers: 2; n head: 2; sequence length: 6; embedding dim: 128; hidden d: 128
5.20 Recall@20 metric result when experimenting with the number of Transformer layers. Experiment with mask probability = 0.3, max sequence length 20, hidden dimensionality d = 256
5.21 Recall@20 metric result when experimenting with the number of Attention heads. Experiment with number of Transformer layers: 2; max sequence length: 20; mask probability: 0.3; hidden dimensionality d: 128
5.22 Recall@20 metric result when experimenting with hidden dimensionality (d). Experiment with number of Transformer layers: 2; max sequence length: 20; mask probability: 0.3
5.23 Recall@20 metric result when experimenting with the mask probability of the sequences. Number of Transformer layers: 2; max sequence length: 20; hidden d: 256
5.24 Overall Recall@20 results when applying the three approaches
1 INTRODUCTION

1.1 Overview
A recommendation system (RS) [1] plays a key role in assisting customers by helping them to predict, narrow down, and find what they are looking for among an exponentially growing number of options. Traditional recommendation systems usually focus on long-term user engagement. In many applications nowadays, however, there are cases where no long-term user history exists, yet users are expected to click and buy quickly. In such cases, users may even be anonymous, and the only available information is obtained from the current session. The session-based recommendation problem deals with user history that is organized in activity sessions, where each session consists of sequentially ordered observations.

Even though the main problem in a recommendation system is to predict the next user action, there are some common related types based on [1]:
• Sequential recommendation system: the user history is sequentially ordered but not organized in sessions. User information is collected.
• Session-based recommendation system (SBRS): only the anonymous interactions of an ongoing session are known, and user information is not collected.
Figure 1.1 illustrates the difference between the two types of recommendation system: sequential and session-based recommendation.

Figure 1.1: The difference between Sequential recommendation system and Session recommendation system [2]

Session-based recommendation has become a hot research topic recently, especially as it crosses with the advances of deep learning and natural language processing in the last 10 years. Session-based recommendation systems have been researched and explored in many popular application areas of recommendation systems: e-commerce, news, video or music, education, and more.
1.2 Recommendation system in Fashion

Session-based recommendation systems have been experimented with in many domains, especially in e-commerce, news, and fashion. Building a RS in the fashion domain can be challenging, based on the research about recommendation systems for garments and fashion products [2]. Due to market dynamics and customer preferences, there is a large vocabulary of distinct fashion products, as well as high turnover. This leads to sparse purchase data, which challenges the usage of traditional recommender systems. Furthermore, precise and detailed product information is often not available, making it difficult to establish similarities between products. The paper also addressed some major challenges faced by recommender systems in the fashion domain:

• Firstly, building a recommendation system might face struggles with fashion item representation due to the rich product representation such as images, text descriptions, and customer reviews.
• Secondly, training a model that can predict whether two fashion items are compatible or can be combined to fit current trends is a challenging task.
• Thirdly, the best fashion products to recommend may depend on other factors such as location, season or occasion, and cultural or social backgrounds.
• Fourthly, forecasting customer preferences or fashion trends helps retailers optimize business operations such as marketing, product testing, advertising, and logistics.
• Finally, most of the existing models are black boxes and need to be more interpretable and well explained. This is also a common challenge in deep learning training.

1.3 Problem description
Given a sequence of item views, the label data for those items, and the label data for all candidate items, the task is to predict the item that was purchased in the session [3].
Figure 1.2: A session of an anonymous user in the session-based recommendation system problem description
The input of the session-based recommendation system is the known part of the current session, i.e., a list of interactions that have already happened: which items the user viewed and when. Besides, the current session is anonymous and contains a single type of action: view. A viewed item at the end of the session that was purchased is separated out to serve as the target. Therefore, the main challenge is to precisely capture the user's personalized preferences with limited contextual information in order to provide accurate recommendations.

The output is a recommendation according to the given session context. In detail, the output of this thesis' problem is a predicted interaction, or a predicted list of interactions, that happens subsequently in the current session.
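To make the input and output concrete, the following sketch shows one possible representation of an anonymous session and its prediction target; the field names here are hypothetical illustrations, not the actual dataset schema.

```python
# One anonymous session: an ordered list of item views with timestamps,
# plus the purchased item held out as the prediction target.
session = {
    "session_id": 42,
    "views": [
        {"item_id": 1501, "timestamp": "2021-03-01 10:02:11"},
        {"item_id": 8732, "timestamp": "2021-03-01 10:04:53"},
        {"item_id": 1501, "timestamp": "2021-03-01 10:06:20"},
    ],
    "purchase": {"item_id": 8732},   # the item bought at the end of the session
}

# The model receives only the ordered item views as input ...
input_sequence = [v["item_id"] for v in session["views"]]
# ... and must rank all candidate items so the purchased item appears near the top.
target_item = session["purchase"]["item_id"]
print(input_sequence, "->", target_item)
```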
1.4 Objectives and Missions

This thesis aims to explore techniques and model architectures to build a session-based recommendation system in the Fashion domain. Therefore, the main objectives of this thesis can be listed as:

• Understanding session-based recommendation systems and state-of-the-art methods in recommendation systems.
• Researching techniques and model architectures in Deep Learning (DL) and Natural Language Processing (NLP) for building a session-based recommendation system.
• Studying and stating the special challenges when applying them in the Fashion domain.
• Proposing suitable solutions for building a session-based recommendation system.

To achieve the objectives stated above, this thesis needs to complete the following tasks:

• Studying techniques and implementing model architectures using deep neural networks, and methods at the intersection of Natural Language Processing and Deep Learning, for the session-based recommendation problem.
• Researching the Fashion domain and observing some of its special features.
• Proposing some methods and techniques for building a session-based recommendation system.
• Experimenting with and evaluating the proposed approaches.
• Stating the contributions, the existing issues, and the future research directions.
1.5 Scope of work

Even though the problem is still open, with a lot of modern approaches bridging between Deep Learning, Natural Language Processing, and Recommendation Systems, this thesis focuses only on a few aspects, as the following list shows:

• Dataset: a session-based recommendation dataset in Fashion.
• Applying data augmentation techniques, which have been experimented with widely in DL and NLP, to the session-based recommendation system.
• Applying neural network model approaches to predict the next items of the sessions.
• Observing some special points of Fashion domain datasets.
• Applying the researched techniques in the preprocessing and processing stages of session-based recommendation to predict the next item of the sessions.
• Experimenting with and evaluating the approaches.
1.7 Thesis structure

The structure of this thesis includes six chapters:

• Chapter 1 INTRODUCTION: The introduction to this thesis, which provides a general overview of the thesis.
• Chapter 2 BACKGROUND: This chapter provides the necessary background for the approaches presented later.
• Chapter 3 RELATED WORKS: This chapter provides an overview of state-of-the-art approaches and the related problems for the further approaches.
• Chapter 4 APPROACHES: This chapter describes the proposed methods, their motivation, and how they work.
• Chapter 5 EXPERIMENTS AND EVALUATIONS: This chapter states the environment, tools, and configurations used to conduct the experiments and evaluations.
• Chapter 6 CONCLUSION: This chapter summarizes the contributions and discusses future improvements.
2 BACKGROUND

2.1 Neural Networks
Neural networks are also called Artificial Neural Networks (ANNs). The architecture forms the foundation of deep learning, which is a subset of machine learning concerned with algorithms that take inspiration from the structure and function of the human brain. Neural networks form the basis of architectures that mimic how biological neurons signal to one another.
Figure 2.1: Neural Networks in Deep Learning
The generic neural network architecture shown in Figure 2.1 consists of the following:

• Input layer: data is fed into the network through the input layer. The number of neurons in the input layer is equal to the number of features in the data.
• Hidden layers: the layers between the input and output layers are called hidden layers. A network can have an arbitrary number of hidden layers - the more hidden layers there are, the more complex the network.
• Output layer: the output layer is used to make a prediction.
• Neurons: each layer has a collection of neurons interacting with neurons in other layers.
• Activation function: performs non-linear transformations to help the model learn complex patterns from the data.
Neural unit
The core of neural networks is the neural unit, which can be visualized as in Figure 2.2. The unit takes three input values x1, x2, x3 and computes a weighted sum, multiplying each value by a weight (w1, w2, w3), adds a bias term b, and then passes the resulting sum through a sigmoid function to produce a number between 0 and 1.

Figure 2.2: A neural unit [4]

Once a neuron receives its inputs from the neurons in the preceding layer of the model, it adds up each signal multiplied by its corresponding weight and passes the result on to an activation function. Therefore, the weights are very important, as they are what the model adjusts while it is training.
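To make this computation concrete, the following is a minimal sketch in Python with NumPy (illustrative only, not code from the thesis) of the unit described above:

```python
import numpy as np

# A minimal sketch of the neural unit above: a weighted sum of the inputs
# plus a bias, passed through a sigmoid activation.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, w, b):
    z = np.dot(w, x) + b    # weighted sum plus bias
    return sigmoid(z)       # activation value a, a number in (0, 1)

# Example with three inputs, as in Figure 2.2 (the values are arbitrary)
x = np.array([0.5, 0.6, 0.1])
w = np.array([0.2, 0.3, 0.9])
b = 0.5
print(neural_unit(x, w, b))
```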
There are several common weight initialization schemes; a short sketch of two of them follows this list.

• Zero initialization: the weights are initialized as zero. This is not a good solution, as the neural network would fail to break symmetry.
• Random initialization: random initialization breaks the symmetry and is better than zero initialization. However, if the weights are randomly initialized with large values, each matrix multiplication can be expected to result in a significantly larger value. When a sigmoid activation function is applied in such scenarios, the result is a value close to one, which slows down learning.
• Xavier/Glorot initialization: an approach used to initialize weights that aims to keep the variance across the network equal, to prevent gradients from exploding or vanishing.
• He/Kaiming initialization: uses a different scaling factor for the weights that considers the non-linearity of activation functions. This technique is recommended when using the ReLU activation function.
Neural networks work by taking a weighted average plus a bias term and applying an activation function to add a non-linear transformation. In the weighted average formulation, each weight determines the importance of each feature:

z = w · x + b

Instead of using z, a linear function of x, as the output, neural units apply a non-linear function f to z. The output of this function is referred to as the activation value for the unit, a:

a = f(z)

The activation for the output node is the final output of the network.
There are several popular non-linear functions f(·):

• Sigmoid: the sigmoid function is characterized by an "S"-shaped curve that is bounded between zero and one. The sigmoid activation function is typically used for binary classification problems:

σ(z) = 1 / (1 + e^(−z))

• Tanh: the hyperbolic tangent (tanh) has the same "S"-shaped curve as the sigmoid function, except the values are bounded between −1 and 1. Thus, small inputs are mapped closer to −1, and larger inputs are mapped closer to 1:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)) = (1 − e^(−2x)) / (1 + e^(−2x))

• Softmax: the softmax function is generally used as an activation function in the output layer. It is a generalization of the sigmoid function to multiple dimensions. Thus, it is used in neural networks to predict class membership over more than two labels:

softmax(z)_i = e^(z_i) / Σ_j e^(z_j)
Trang 25A feedforward network is a multilayer network also called Multi-layer
Percep-trons (MLP) in which the units are connected with no cycles; the outputs from unitsin each layer are passed to units in the next higher layer, and no outputs are passedback to lower layers [4]
Simple feedforward networks have three kinds of nodes: input units, hidden units,and output units as presented in 2.1
Figure 2.3: Feedforward in Neural Networks in NLP [4]
To relate directly to the model architectures applied in this project, here is an example of forward inference in a neural language model: given an input, a forward pass is run on the network to produce a probability distribution over the possible outputs, which are the next words.

First of all, each of the N previous words is represented as a one-hot vector of length |V|, with one dimension for each word in the vocabulary. These one-hot vectors are then multiplied by the embedding matrix E, which has a column for each word, each a column vector of d dimensions, so its dimensionality is d × |V|. The embedding vectors are concatenated to produce e, the embedding layer. This is followed by a hidden layer and an output layer whose softmax produces a probability distribution over words.

Figure 2.4: Selecting the embedding vector for word V5 by multiplying the embedding matrix E with a one-hot vector with a 1 in index 5 [4]
The feedforward process can be explained by the following steps (a minimal code sketch follows the list). At each timestep t:

• Select three embeddings from E: given the three previous words, look up their indices, create three one-hot vectors, and then multiply each by the embedding matrix E. Since each column of the input matrix E is an embedding for a word, and the input is a one-hot column vector x_i for word V_i, the embedding layer for input w will be E x_i = e_i, the embedding for word i. The three embeddings for the three context words are then concatenated to produce the embedding layer e.
• Multiply by W: multiply by W (and add b) and pass the result through the ReLU (or another) activation function to get the hidden layer h.
• Multiply by U: h is then multiplied by U.
• Apply softmax: after the softmax, each node i in the output layer estimates the probability P(w_t = i | w_{t−1}, w_{t−2}, w_{t−3}).
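The sketch below (NumPy, illustrative only; the sizes are arbitrary, not the thesis configuration) walks through these four steps:

```python
import numpy as np

# Illustrative forward pass of the neural language model described above.
V, d, dh = 10000, 50, 100            # vocabulary size, embedding size, hidden size
E = np.random.randn(d, V) * 0.01     # embedding matrix, one column per word
W = np.random.randn(dh, 3 * d) * 0.01
b = np.zeros(dh)
U = np.random.randn(V, dh) * 0.01

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(context_ids):
    # Step 1: select and concatenate the three context-word embeddings
    e = np.concatenate([E[:, i] for i in context_ids])
    # Step 2: multiply by W, add b, and apply ReLU to get the hidden layer h
    h = np.maximum(0, W @ e + b)
    # Steps 3-4: multiply by U and apply softmax to get P(w_t | context)
    return softmax(U @ h)

y = forward([12, 7, 345])            # arbitrary indices of the three previous words
print(y.argmax())                    # index of the most probable next word
```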
The goal of training is to learn the parameters W_i and b_i for each layer i that make the prediction ŷ for each training observation as close as possible to the true y.

After the feedforward process, the first requirement is a loss function that models the distance between the system output and the gold output. The commonly used loss function is the cross-entropy loss. Secondly, to find the parameters that minimize this loss function, a gradient descent optimization algorithm is used.
Trang 27In backward pass, the goal is to compute the derivatives that we’ll need for theweight update In this example, to compute the derivative of the output function Lwith respect to each of the input variables ∂ L
∂ a, ∂ L∂ b and ∂ L
∂ c The derivative ∂ L
∂ a reflectshow much a small change in a affects L
Optimization in neural networks is a non-convex optimization problem In ral networks, initialize the weights with small random numbers It’s also helpful tonormalize the input values to have 0 mean and unit variance
neu-Various forms of regularization are used to prevent overfitting One of the mostimportant is dropout: randomly dropping some units and their connections from thenetwork during training Besides, tuning of hyperparameters is also important Theparameters of a neural network are the weights W and biases b that are learned bygradient descent The hyperparameters include the learning rate, mini-batch size, themodel architecture (n layers, hidden nodes per layers, activation functions)
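A minimal training sketch (PyTorch, illustrative only; the layer sizes, dropout rate, and learning rate are placeholders, not the thesis configuration) ties these pieces together:

```python
import torch
import torch.nn as nn

# One gradient-descent step with cross-entropy loss and dropout regularization.
model = nn.Sequential(
    nn.Linear(150, 100),    # hidden layer
    nn.ReLU(),
    nn.Dropout(p=0.2),      # dropout: randomly zero some hidden units during training
    nn.Linear(100, 10000),  # output scores over a 10,000-word vocabulary
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # learning rate is a hyperparameter

x = torch.randn(512, 150)               # one mini-batch of concatenated embeddings
y = torch.randint(0, 10000, (512,))     # gold next-word indices

optimizer.zero_grad()
loss = loss_fn(model(x), y)   # forward pass and loss
loss.backward()               # backward pass: compute the gradients
optimizer.step()              # update the weights W and biases b
```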
2.2 Recurrent Neural Networks
As a further related model architecture mentioned in this thesis project, Recurrent Neural Networks (RNNs) are explained here with language models as a detailed example. RNN language models process the input sequence one word at a time, attempting to predict the next word from the current word and the previous hidden state.

The input sequence X = [x_1; ...; x_t; ...; x_n] consists of a series of words, each represented as a one-hot vector of size |V| × 1, and the output prediction y is a vector representing a probability distribution over the vocabulary.

At each step, the model uses the word embedding matrix E to retrieve the embedding for the current word, and then combines it with the hidden layer from the previous step to compute a new hidden layer. This hidden layer is then used to generate an output layer, which is passed through a softmax layer to generate a probability distribution over the entire vocabulary. At time t:

e_t = E x_t
h_t = g(U h_{t−1} + W e_t)
y_t = softmax(V h_t)

The vector resulting from V h_t can be thought of as a set of scores over the vocabulary given the evidence provided in h_t. Passing these scores through the softmax normalizes the scores into a probability distribution.
The probability that a particular word i in the vocabulary is the next word is represented by y_t[i], the i-th component of y_t:

P(w_{t+1} = i | w_1, ..., w_t) = y_t[i]    (2.10)
Figure 2.5: Recurrent Neural Networks training [4]
To train an RNN as a language model, the model is trained to minimize the error in predicting the true next word in the training sequence, using cross-entropy as the loss function.
The input embedding matrix E and the final layer matrix V, which feeds the output softmax, are quite similar. Weight tying is a technique that dispenses with this redundancy and uses a single set of embeddings at the input and softmax layers. In this approach, the dimensionality of the final hidden layer is set to the same d_h (or an additional projection layer is added to do the same thing), and the same matrix is used for both layers. In addition to providing improved perplexity results, this approach significantly reduces the number of parameters required for the model.
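A compact sketch of an RNN language model with weight tying (PyTorch, illustrative only; the sizes are placeholders):

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size=10000, d=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d)   # input embedding matrix E
        self.rnn = nn.RNN(d, d, batch_first=True)      # hidden size set equal to the embedding size
        self.out = nn.Linear(d, vocab_size, bias=False)
        self.out.weight = self.embedding.weight        # weight tying: reuse E at the softmax layer

    def forward(self, x, h=None):
        e = self.embedding(x)        # (batch, seq_len, d)
        h_all, h = self.rnn(e, h)    # hidden state at every timestep
        return self.out(h_all), h    # scores over the vocabulary at each step

model = RNNLM()
logits, _ = model(torch.randint(0, 10000, (4, 12)))   # a batch of 4 sequences of length 12
print(logits.shape)                                    # torch.Size([4, 12, 10000])
```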
2.3 Gated Recurrent Unit
The problem of Short-Term Memory
Recurrent Neural Networks (RNNs) suffer from short-term memory. If a sequence is long enough, they have a hard time carrying information from earlier time steps to later ones. So if you are trying to process a paragraph of text to make predictions, RNNs may leave out important information from the beginning.

During backpropagation, recurrent neural networks suffer from the vanishing gradient problem. Gradients are the values used to update a neural network's weights. The vanishing gradient problem occurs when the gradient shrinks as it backpropagates through time. If a gradient value becomes extremely small, it does not contribute much to learning.

Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) were created as solutions to short-term memory. They have internal mechanisms called gates that can regulate the flow of information.

These gates can learn which data in a sequence are important to keep or throw away. By doing that, they can pass relevant information down the long chain of sequences to make predictions. Almost all state-of-the-art results based on recurrent neural networks are achieved with these two networks. LSTMs and GRUs can be found in speech recognition, speech synthesis, and text generation; they can even be used to generate captions for videos.
GRU
The GRU is a newer generation of recurrent neural network and is quite similar to an LSTM. GRUs got rid of the cell state and use the hidden state to transfer information. A GRU also has only two gates: a reset gate and an update gate.
Figure 2.6: GRU cell and its gates [5]
The reset gate is used to decide how much past information to forget:

r_t = σ(W_ir x_t + b_ir + W_hr h_{t−1} + b_hr)    (2.12)

The resulting reset vector r represents the information that will determine what will be removed from the previous hidden time steps. The forget operation is applied via element-wise multiplication. The reset vector is calculated as a linear combination of the input vector of the current timestep as well as the previous hidden state.

Both operations are calculated with matrix multiplication, and at the first timestep the hidden state is usually a vector filled with zeros. Finally, a non-linear activation is applied (i.e., sigmoid). Moreover, by using the sigmoid activation function the result lies in the range (0, 1), which accounts for training stability.

z_t = σ(W_iz x_t + b_iz + W_hz h_{t−1} + b_hz)    (2.13)

The input and output gates are merged in the GRU into what is called the update gate. It calculates another representation of the input vector x and the previous hidden state, but this time with different trainable matrices and biases. The vector z represents the update vector.

Figure 2.7: GRU explained with equations

With the output component:

n_t = tanh(W_in x_t + b_in + r_t ⊙ (W_hn h_{t−1} + b_hn))    (2.14)

The vector n consists of two parts: the first is a linear layer applied to the input, similar to the input gate in an LSTM. The second part consists of the reset vector r and is applied to the previous hidden state. Note that here the forget/reset vector is applied directly to the hidden state, instead of being applied to the intermediate representation of the cell vector c of an LSTM cell.

h_t = (1 − z_t) ⊙ n_t + z_t ⊙ h_{t−1}    (2.15)

Since the values of z lie in the range (0, 1), 1 − z also lies in the same range. However, the elements of the vector 1 − z take complementary values. Element-wise operations are applied with z and (1 − z).
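A minimal sketch of a single GRU step following equations (2.12)-(2.15) (NumPy, illustrative only; all weights are random placeholders):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    r = sigmoid(p["W_ir"] @ x_t + p["b_ir"] + p["W_hr"] @ h_prev + p["b_hr"])        # (2.12) reset gate
    z = sigmoid(p["W_iz"] @ x_t + p["b_iz"] + p["W_hz"] @ h_prev + p["b_hz"])        # (2.13) update gate
    n = np.tanh(p["W_in"] @ x_t + p["b_in"] + r * (p["W_hn"] @ h_prev + p["b_hn"]))  # (2.14) candidate state
    return (1 - z) * n + z * h_prev                                                  # (2.15) new hidden state

d_in, d_h = 50, 100
p = {name: np.random.randn(d_h, d_in if name.startswith("W_i") else d_h) * 0.01
     for name in ["W_ir", "W_hr", "W_iz", "W_hz", "W_in", "W_hn"]}
p.update({name: np.zeros(d_h) for name in ["b_ir", "b_hr", "b_iz", "b_hz", "b_in", "b_hn"]})

h = gru_step(np.random.randn(d_in), np.zeros(d_h), p)
print(h.shape)   # (100,)
```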
2.4 Transformer
The Transformer was proposed in the paper Attention Is All You Need [6] and became a ubiquitous building block in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications.

The encoder side is a stack of encoders: the input passes through each of them and is passed to the next one. All decoders present the same structure as well and get their input from the last encoder and the previous decoder. The optimal number of layers found in the experiments in [6] is 6, but variants of the Transformer are applied in many applications with different numbers of layers.
Figure 2.8: Transformer model [6]
Encoder
The mechanism of the first stack of the Encoder in the Transformer includes four layers: a self-attention layer, two Add & Norm layers, and a Feed-Forward layer, as illustrated in Figure 2.9.

Input embeddings

The encoder begins by converting input tokens - words or subwords, or every item interacted with in a sequence - into vectors using embedding layers. These embeddings capture the semantic meaning of the tokens and convert them into numerical vectors. All the encoders receive a list of vectors, each of a fixed size, for example 512. In the bottom encoder these are the word embeddings, while in the other encoders they are the output of the encoder directly below.

Figure 2.9: Mechanism of First Stack of Encoder [7]
Positional Encoding
Positional encodings are added to the input embeddings to provide information about the position of each token in the sequence. This allows the model to understand the position of each word within the sentence.

Various sine and cosine functions are combined to create the positional vectors, enabling the use of this positional encoder for sentences of any length. In this approach, each dimension is represented by unique frequencies and offsets of the wave, with values ranging from −1 to 1, effectively representing each position.
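A short sketch of the sinusoidal positional encoding from [6] (NumPy, illustrative only):

```python
import numpy as np

# PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]      # positions 0 .. max_len-1
    i = np.arange(d_model)[None, :]        # dimension indices 0 .. d_model-1
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])   # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])   # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=20, d_model=128)
print(pe.shape)   # (20, 128) -- added element-wise to the input embeddings
```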
Stack of Encoder Layers
The Transformer encoder consists of a stack of identical layers (6 in the original Transformer model). The encoder layer serves to transform all input sequences into a continuous, abstract representation that encapsulates the learned information from the entire sequence. This layer comprises two sub-modules:
• A multi-headed attention mechanism
• A fully connected network
Additionally, it incorporates residual connections around each sublayer, which are then followed by layer normalization.
Multi-headed Self-attention mechanism
First of all, to understand it clearly, single-head self-attention allows the model to relate words to each other. In this simple case we consider the sequence length seq = 6 and d_model = d_k = 512. The matrices Q, K and V are just the input sentence.
Figure 2.10: One-head Attention in Transformer [7]
• A query Q is a vector that represents a specific word or token from the input sequence in the attention mechanism.
• A key K is also a vector in the attention mechanism, corresponding to each word or token in the input sequence.
• Each value V is associated with a key and is used to construct the output of the attention layer. When a query and a key match well, which basically means that they have a high attention score, the corresponding value is emphasized in the output.
Trang 36It compute attention scores based on:
Attention(Q, K,V ) = so f tmax(QK
T
√dk
where√dk is the dimension of the key vector k and query vector q.Once the query, key, and value vectors are passed through a linear layer, a dotproduct matrix multiplication is performed between the queries and keys, resulting inthe creation of a score matrix The score matrix establishes the degree of emphasiseach word should place on other words Therefore, each word is assigned a score inrelation to other words within the same time step A higher score indicates greaterfocus This process effectively maps the queries to their corresponding keys
Then, scaling down by dividing them by the square root of the dimension of thequery and key vectors√dk to ensure more stable gradients
Then, applying softmax to the adjusted scores to obtain the attention weights Theresults in probability values ranging from 0 to 1 The softmax function emphasizeshigher scores while diminishing lower scores
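A compact sketch of one-head scaled dot-product attention implementing the equation above (NumPy, illustrative only; the shapes are arbitrary):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # score matrix, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)   # attention weights in (0, 1), each row sums to 1
    return weights @ V                   # weighted sum of the values

seq_len, d_k = 6, 64
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)
print(attention(Q, K, V).shape)          # (6, 64)
```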
This first self-attention module enables the model to capture contextual information from the entire sequence. Instead of performing a single attention function, the queries, keys, and values are linearly projected h times.

The multi-head attention mechanism performs the self-attention operation in parallel multiple times, each head using different learned projection matrices for the query Q, key K, and value V, to capture complex relationships between sentence parts from different semantic and grammatical perspectives. Eight heads were employed in the Transformer, as indicated in the paper [6], formulated in this equation:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (2.17)

where

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (2.18)
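A sketch of multi-head self-attention following equations (2.17)-(2.18) (NumPy, illustrative only; all projection matrices are random placeholders):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_self_attention(X, n_heads=2, d_model=128):
    d_head = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (np.random.randn(d_model, d_head) * 0.01 for _ in range(3))
        heads.append(attention(X @ Wq, X @ Wk, X @ Wv))   # head_i = Attention(XW_i^Q, XW_i^K, XW_i^V)
    W_o = np.random.randn(d_model, d_model) * 0.01
    return np.concatenate(heads, axis=-1) @ W_o           # Concat(head_1, ..., head_h) W^O

X = np.random.randn(6, 128)                 # a sequence of 6 token embeddings
print(multi_head_self_attention(X).shape)   # (6, 128)
```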
Feed-Forward Neural Network

The normalized residual output then passes through a pointwise feed-forward network for additional refinement: two linear layers with a ReLU activation between them. The output of this network is added back to its own input through another residual connection, followed by another round of layer normalization, keeping everything well-adjusted for the next steps.

The output of the final encoder layer is a set of vectors, each representing the input sequence with a rich contextual understanding. This output is then used as the input for the decoder in a Transformer model. This careful encoding paves the way for the decoder, guiding it to pay attention to the right words in the input when it is time to decode.
Decoder
In the decoder, the generated output sequence is produced one token at a time. Each output word is considered as new input, which is then passed through the decoder's attention mechanisms together with the encoder output. After N decoder stacks, a softmax output predicts the most probable next token.
Figure 2.11: Decoder in Transformer
The mechanism of each layer on the decoder side is similar to that on the encoder side, with some differences due to the causal masking effects.

In the attention mechanism of the decoder in the Transformer:

• Masked Multi-Head Attention Layer: this self-attention layer in the decoder also employs the attention mechanism, but with future-word masks to prevent access to future information. Thus, it is also called a causal self-attention layer. The causal masking mechanism is depicted on the right side of Figure 2.11.
• Cross Attention Layer: the subsequent attention layer is referred to as a cross-attention layer, which combines the encoder's output embeddings with the embeddings from the previous "Add & Norm" layer to perform another round of attention calculations.
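A small sketch of the causal (look-ahead) mask used by masked self-attention (NumPy, illustrative only): position i may only attend to positions up to and including i.

```python
import numpy as np

def causal_mask(seq_len):
    # Future positions (upper triangle) are set to -inf before the softmax,
    # so their attention weights become exactly zero.
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

def masked_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1]) + causal_mask(Q.shape[0])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V

X = np.random.randn(5, 64)
out = masked_attention(X, X, X)
print(out.shape)   # (5, 64); the first output row depends only on position 0
```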
2.5 Word2Vec
Word2Vec revolutionized Natural Language Processing by transforming words into dense vector representations that capture semantic relationships. Word2Vec embeddings are static embeddings, meaning that the method learns one fixed embedding for each word in the vocabulary [4]. The intuition of word2vec is that, instead of counting how often each word w occurs near, say, 'apricot', a classifier is trained on the binary prediction task: 'Is word w likely to show up near the word apricot?'. The purpose is to take the learned classifier weights as the word embeddings.

In other words, word2vec simplifies the task by making it binary classification instead of word prediction. Besides, word2vec simplifies the architecture by training a logistic regression classifier instead of a multi-layer neural network with hidden layers. Two different model architectures that can be used by Word2Vec to create the word embeddings are the Continuous Bag of Words (CBOW) and the Skip-Gram model.
The overview of the Skip-gram model is:

• Treat the target word and a neighboring context word as positive examples.
• Randomly sample other words in the lexicon to get negative samples.
• Use logistic regression to train a classifier to distinguish those two cases.
• Use the learned weights as the embeddings.
Figure 2.12: SkipGram model Word2Vec [8]
Trang 40is similar to the target embedding To compute similarity between these dense beddings, rely on intuition that two vectors are similar if they have a high dot product.So that:
The dot product c is not a probability but it is a number ranging from −∞ to ∞.To turn the dot product into a probability, use logistic or sigmoid function σ (x), thefundamental core of logistic regression:
P(−|w, c) = 1 − P(+|w, c) = σ (−c · w) = 1
1 + exp(x · w) (2.24)Skip-gram makes the simplifying assumption that all context words are indepen-dent, allowing us to just multiply their probabilities:
P(+ | w, c1:L) =
L∏i=1
log P (+ | w, c1:L) =
L∑i=1
In summary, skip-gram trains a probabilistic classifier that, given a test target word wand its context window of L words c1:L, assigns a probability based on how similar
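The scoring above can be sketched as follows (NumPy, illustrative only; the embeddings are random placeholders):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_positive(w_vec, c_vec):
    # P(+ | w, c) = sigma(c . w): probability that c is a real context word of w
    return sigmoid(np.dot(c_vec, w_vec))

def log_p_window(w_vec, context_vecs):
    # log P(+ | w, c_1:L) = sum_i log sigma(c_i . w), assuming independent context words
    return sum(np.log(p_positive(w_vec, c)) for c in context_vecs)

d = 50
w = np.random.randn(d)                             # target word embedding
window = [np.random.randn(d) for _ in range(4)]    # L = 4 context word embeddings
print(log_p_window(w, window))
```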