Along with the multi-head attention sub-layer, each encoder contains a feed-forward network (FFN). The FFN consists of two linear transformations with different learnable weights and a ReLU activation function in between:
As shown in figure 3.25, every position shares the same feed-forward network, with different weight parameters per layer; the output is then calculated as follows:
FFN(x) = max(0, xW1 + b1)W2 + b2 (Eq.1)
where W1 and W2 are the weight parameters of the two linear transformations, and the max function is the ReLU, which returns the value itself when it is positive and zero when it is negative.
The output of this layer is also the input to the next encoder, and this process repeats until the input has flowed through all twelve layers of the BERT architecture. One important detail is that a residual connection is added after both the multi-head attention and the feed-forward network, followed by layer normalization, everywhere except the embedding layer [10].
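As a rough illustration (not the actual BERT source), the following PyTorch sketch shows the feed-forward sub-layer of Eq. 1 together with the residual connection and layer normalization described above; the sizes 768 and 3072 follow the BERT-base configuration, and the class and variable names are our own.

```python
import torch
import torch.nn as nn

class PositionWiseFFN(nn.Module):
    """Feed-forward sub-layer of one encoder: FFN(x) = max(0, x*W1 + b1)*W2 + b2."""
    def __init__(self, hidden_size=768, intermediate_size=3072):
        super().__init__()
        self.linear1 = nn.Linear(hidden_size, intermediate_size)  # W1, b1
        self.linear2 = nn.Linear(intermediate_size, hidden_size)  # W2, b2
        self.relu = nn.ReLU()
        self.layer_norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        # x: (batch, sequence_length, hidden_size), coming from multi-head attention
        ffn_out = self.linear2(self.relu(self.linear1(x)))
        # residual connection followed by layer normalization
        return self.layer_norm(x + ffn_out)

x = torch.randn(2, 128, 768)          # toy batch of attention outputs
print(PositionWiseFFN()(x).shape)     # torch.Size([2, 128, 768])
```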
[Diagram of one encoder: input embeddings with positional encoding pass through self-attention and an Add & Normalize step inside Encoder #1.]
Figure 3.26: Residual connection in the encoding component, from Exploring the Depths of Recurrent Neural Networks with Stochastic Residual Learning
As figure 3.26 shows, the residual connection can mitigate the vanishing gradient problem and allows deeper networks to be trained [6].
3.5.8 Layer Normalization
In recent years, many techniques have been introduced to improve the efficiency of AI applications. Normalization is one of them: in many machine learning and deep learning applications it helps the model produce output probabilities as accurately as possible. Its main idea is to preserve the distribution of a layer's input, which prevents an updated weight from becoming too large or too small and makes each layer less dependent on the others. The first normalization technique, Batch Normalization, was introduced in [12].
Figure 3.27: The covariate shift problem
In figure 3.27, suppose we train a model to recognize black dogs and then apply the network to colored dogs; this is an X to Y mapping. If the distribution of X changes, we have to re-train the whole network, because the parameters are updated many times and depend on the outputs of previous layers in order to fit the distribution of Y. This is the covariate shift problem. Batch Normalization increases the stability of the network: it normalizes the output of the previous layer by subtracting the batch mean and dividing by the batch standard deviation. However, Batch Normalization needs reasonably large batches to reach the desired performance, which is a real problem when we work with audio or video, which are highly memory-consuming, or when we are limited by GPU memory. Layer normalization comes as a solution that avoids these limitations.
[Diagram: the output of the previous layer, a matrix of shape (H, W), is normalized element-wise as xi ← (xi − μi) / √(σi² + ε), where μi and σi² are the mean and variance computed over the layer and ε is a small constant that prevents division by zero; the normalized matrix is then passed to the next layer.]
Figure 3.28: Matrix in layer normalization
In figure 3.28, the matrix is the concatenation of the outputs of the twelve self-attention heads, with shape (Height, Width); this shape can change with the model architecture. During the backward pass, a new weight is obtained by updating the old one with a step scaled by the learning rate. Across many layers this chain of computations becomes very long, and the resulting values can grow very large, the exploding gradient problem, or very small, the vanishing gradient problem. By normalizing the layer inputs, we can keep the values flowing through a deep network under control.
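The following small sketch, assuming a toy activation matrix of shape (H, W) = (4, 768), shows layer normalization computed both by hand and with PyTorch's nn.LayerNorm; the epsilon value is the PyTorch default.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 768)                      # toy activations of shape (H, W)
eps = 1e-5                                   # small constant preventing division by zero

# manual layer normalization: statistics are taken over each row (the layer),
# not over the batch as in Batch Normalization
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
x_manual = (x - mean) / torch.sqrt(var + eps)

# the built-in module additionally learns a scale (gamma) and a shift (beta)
layer_norm = nn.LayerNorm(768, eps=eps)
x_module = layer_norm(x)

print(torch.allclose(x_manual, x_module, atol=1e-4))  # True (gamma=1, beta=0 at init)
```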
3.5.9 Residual Network
With the rise of efficient AI applications, we inherit the achievements of leaders in our field, such as model architectures, techniques and open-source tools. BERT, built by Google in 2018, is one of them. With a deeper network architecture, BERT can learn a better representation of the input and give a more accurate output, but this depth also exposes it to exploding and vanishing gradients during back-propagation. As shown in figure 3.29, a residual network is designed on the principle of summing the output of the current layer with that layer's input.
[Diagram: an input x passes through two hidden layers to produce F(x); the skip connection adds x so that the block outputs F(x) + x.]
Figure 3.29: Two layers of a neural network
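A minimal sketch of the idea in figure 3.29: two hidden layers compute F(x) and the block returns F(x) + x; the layer sizes here are arbitrary.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # two hidden layers computing F(x)
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x) + x   # skip connection: F(x) + x

print(ResidualBlock()(torch.randn(8, 256)).shape)  # torch.Size([8, 256])
```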
3.6 Classification
In the previous sections, we broke down the specific components that make BERT so powerful. However, to serve the Vietnamese triple classification task we need an additional part: a fully connected layer and a softmax function that indicate whether a triple exists or not based on its description. The last hidden vector of the [CLS] token is chosen as the input to this layer.
BERT is pre-trained on the next sentence prediction task, which uses the [CLS] token to classify whether the second sentence follows the first one. After pre-training, the [CLS] token therefore captures the structure and context of the words in the sequence. In fine-tuning, this token inherits the weights whose usefulness was demonstrated in next sentence prediction and combines them with the contextual information of the other words, learnt through the 12 layers of BERT, to classify the existence of a triple. As shown in figure 3.30, after the fully connected layer the result is the index with the higher probability under the softmax function.
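The sketch below illustrates this classification step with placeholder tensors: the last hidden vector of [CLS] goes through a fully connected layer and a softmax, and the class with the higher probability is taken as the prediction. The names and sizes are illustrative and are not taken from the KG-BERT code.

```python
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 2             # BERT-base hidden size, binary triple label

classifier = nn.Linear(hidden_size, num_labels)    # fully connected layer

# last_hidden_state would come from BERT: (batch, seq_len, hidden_size)
last_hidden_state = torch.randn(4, 128, hidden_size)
cls_vector = last_hidden_state[:, 0, :]      # hidden vector of the [CLS] token

logits = classifier(cls_vector)
probs = torch.softmax(logits, dim=-1)        # probabilities for "exists" / "does not exist"
prediction = probs.argmax(dim=-1)            # index with the higher probability
print(prediction)
```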
[Diagram of KG-BERT(a): the input sequence [CLS] head-entity tokens [SEP] relation tokens [SEP] tail-entity tokens [SEP] is fed into BERT.]
Figure 3.30: Classification layer to produce the result
In figure 3.30, the last hidden vector of the [CLS] token is used as the input to a fully connected layer to produce the output.
3.7 PyTorch
3.7.1 Overview
With the rise in demand for AI applications, Deep Learning has become the method that helps machines understand human language by using deep neural networks. Many big tech companies have shared their experiments with various model architectures. To optimize training time and speed, many frameworks have been introduced, each giving a certain level of performance. However, as the need for interaction between machines and people grows and the amount of data increases day by day, older frameworks could not maintain their performance. Facebook and Google have open-sourced their own products, named PyTorch and TensorFlow, respectively. (Plot below from 11.)
11 https://trends.google.com.vn/trends/explore?date=today%205-y&q=PyTorch,TensorFlow
[Google Trends plot: search volume over time for PyTorch versus TensorFlow.]
Figure 3.31: The comparison between PyTorch and TensorFlow
Figure 3.31 shows that, as the demand for understanding human language increases, developers need an optimized framework that supports them in completing their tasks.
TensorFlow was built by the Google Brain team and released under the Apache License 2.0 in 2015. After a long experimental period, version 1.0.0 was released on February 11, 2017. In its first years it was one of the most discussed topics in the industry, but over time TensorFlow has gradually lost ground to PyTorch, a framework launched in 2016 by Facebook AI. These two frameworks take turns solving hard Deep Learning problems, which need large amounts of data and deep networks for training and therefore consume considerable resources. (Plot below from 12.)
12 https://paperswithcode.com/trends
[Plot from Papers with Code, "Paper Implementations grouped by framework": share of implementations (PyTorch, TensorFlow, MXNet, Caffe2 and others) against repository creation date, September 2016 to September 2020.]
Figure 3.32: The number of repositories written in PyTorch
As figure 3.32 shows, after its release in 2016 the number of implementations written in PyTorch has increased steadily year by year, the opposite of TensorFlow and the other frameworks. This raises the question: why is PyTorch a choice for both research and production? Each framework has its strengths on specific tasks, so we cannot state in general which one is better. Based on the information above, we decided to use PyTorch as the backbone to implement and deploy our BERT model.
3.7.2 What is PyTorch?
PyTorch is a Python-based scientific computing package that utilizes the power of a graphics processing unit (GPU) for faster training. PyTorch is one of the most popular open-source Deep Learning frameworks; many researchers and developers choose it because it provides fast, flexible experimentation and a seamless transition to production deployment.13
13 https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html
[Diagram: PyTorch converts data into tensors, e.g. tensor([[0.34, 0.67], [0.78, 0.56]]), and highlights three advantages: fast, flexible and seamless.]
Figure 3.33: The advantages of PyTorch
As figure 3.33 suggests, development in PyTorch revolves around the tensor, which inherits nearly all the properties of a NumPy array. The difference is that tensors can be configured to run on a GPU or a CPU, so we can distribute resources and concentrate them on certain operations.
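A small example of these points: a tensor can be created from a NumPy array (the values below are the ones shown in figure 3.33), used with NumPy-like operations, and moved to the GPU when one is available.

```python
import numpy as np
import torch

array = np.array([[0.34, 0.67], [0.78, 0.56]])
tensor = torch.from_numpy(array)           # shares memory with the NumPy array

print(tensor * 2)                          # familiar NumPy-style operations

if torch.cuda.is_available():              # move to GPU only when one is present
    tensor_gpu = tensor.to("cuda")
    print(tensor_gpu.device)               # cuda:0
```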
3.7.3 The strength of PyTorch
In the limited time of this thesis, we have worked with the functionality of PyTorch. Facebook AI provides a flexible framework with many functions for experimenting, from research to production deployment. Based on a post from heartbeat.ai and our own working time, we would like to condense the key features that make PyTorch popular.14
The first time we laid out a code snippet, my partner exclaimed: "It is really Python, our job can be shortened." That is exactly right: PyTorch inherits almost all of the functionality and properties of Python. PyTorch mainly manipulates tensors, which can be created from NumPy arrays, so working in PyTorch feels the same as coding in plain Python.
14 https://bit.ly/3q0UTAT
BERT is one of the most memory-consuming models in Deep Learning. With millions of parameters and a large amount of training data, BERT requires substantial resources for both training and testing. PyTorch offers a mechanism called Data Parallelism, which distributes the computation over more than one GPU at a time to serve a huge data source. Data Parallelism shows its value when we work with a big model and want to reduce the computational pressure by spreading the computation across additional GPUs.
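A minimal sketch of this mechanism with torch.nn.DataParallel, which replicates a module and splits each batch across the visible GPUs; the small model below is only a stand-in, not our BERT configuration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))

if torch.cuda.device_count() > 1:
    # replicate the model and split every batch across the available GPUs
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

batch = torch.randn(64, 768).to(device)
output = model(batch)                      # each GPU processes a slice of the batch
print(output.shape)                        # torch.Size([64, 2])
```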
Users can build deep learning applications in PyTorch on top of dynamic graphs, which are constructed at runtime. Other frameworks were built on top of static graphs, where we cannot follow what the GPU or CPU is actually doing; in PyTorch, we can step through the code flow to understand in depth what the model is executing in each layer.
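A short sketch of what a dynamic graph allows: ordinary Python control flow and print statements run inside forward, so we can observe at runtime what each step actually computes (the module and the random loop bound are made up for illustration).

```python
import torch
import torch.nn as nn

class DynamicNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 10)

    def forward(self, x):
        # the graph is built while this Python code runs, so normal
        # control flow and debugging tools work on every pass
        for step in range(torch.randint(1, 4, (1,)).item()):
            x = torch.relu(self.layer(x))
            print(f"step {step}: mean activation = {x.mean().item():.4f}")
        return x

DynamicNet()(torch.randn(2, 10))
```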
In Deep Learning, before running a function we must understand its algorithm and then implement it. The PyTorch developer community contributes a large number of ready-made libraries built around popular algorithms, such as TorchAudio, AllenNLP and BoTorch. These help those working in PyTorch reduce coding time and focus on more important things.
3.8 Doccano
High-quality data brings strong performance to any application. In our project, to extract valuable triples from paragraphs while saving time and resources, we decided to use doccano, an open-source tool for triple labeling. Doccano allows us to define an unlimited number of label types:
[Screenshot of the doccano annotation interface: Vietnamese sentences about provinces such as Quảng Ngãi, Bắc Kạn, Kon Tum and Hải Phòng are tagged with label types such as FESTIVAL, h_PLACE, t_LANDSCAPE and h_SUBJECT.]
Figure 3.34: Doccano interface
As the doccano interface in figure 3.34 shows, the Vietnamese language has a great variety of writing structures. To collect triples from a paragraph, we created a logic program that separates them from the large number of non-useful words. From "Đà Lạt có đặc sản mứt dâu, nơi đây có danh lam thắng cảnh núi Langbiang", our program yields the two triples (Đà Lạt, có đặc sản, mứt dâu) and (Đà Lạt, có danh lam thắng cảnh, núi Langbiang).
3.9 Evaluation metrics
Evaluation metrics are an essential part of machine learning and deep learning problems. To determine the performance of a model, to check whether our classifier works as expected, or to compare separate models, reliable evaluation methods are absolutely required. Commonly used metrics in classification are accuracy, precision, recall, F-measure and ROC. The following examples apply to the binary classification setting.15
15 Classification task results in only two labels (classes).
3.9.1 Accuracy
Accuracy (or accuracy rate) is the ratio of all correct predictions to the total number of predictions across all labels, regardless of which label they belong to. In cases of imbalanced labels the accuracy value can be very misleading, and this is its main limitation.
Accuracy = (True Positive + True Negative) / (True Positive + False Positive + True Negative + False Negative)
True Positive: number of positive predictions that are correct.
False Positive: number of positive predictions that are incorrect.
True Negative: number of negative predictions that are correct.
False Negative: number of negative predictions that are incorrect.
In the example below, there are 3 correct predictions (in red) among 10 predictions, so the accuracy in this case is 0.3, or 30%.
pred = [0, 1, 0, 0, 0, 0, 1, 1, 1, 0]
true = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0]
Figure 3.35: Predictions for a binary task with a sample size of 10
For the task depicted in figure 3.35, the opposite of the accuracy rate is the error rate, defined by the formula:
Error rate = 1 − Accuracy
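As a quick check of these numbers, the accuracy and error rate of the predictions in figure 3.35 can be computed in a few lines of plain Python.

```python
pred = [0, 1, 0, 0, 0, 0, 1, 1, 1, 0]
true = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0]

correct = sum(p == t for p, t in zip(pred, true))  # 3 correct predictions
accuracy = correct / len(true)
error_rate = 1 - accuracy

print(accuracy, error_rate)   # 0.3 0.7
```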
3.9.2 Precision and Recall
In the case of imbalanced classes, metrics like precision and recall are very reasonable tools for measuring model performance, especially on the labels of interest.
3.9.2.1 Precision
Precision is the rate of correct predictions for a certain class, calculated as the quotient of the number of correct predictions for that class and the total number of predictions labeled with that class. Precision is considered a measure of exactness and is calculated by the formula:
Precision = True Positive / (True Positive + False Positive)
Which class is positive depends on the modeler's definition; usually the positive class is the one whose performance we are most concerned about.
In the example from figure 3.35:
• For a total of 4 observations predicted as 1, 1 case is predicted correctly, so the precision for class 1 here is 1/4 = 25,0%.
• For a total of 6 observations predicted as 0, 2 cases are predicted correctly, so the precision for class 0 here is 2/6 ≈ 33,3%.
3.9.2.2 Recall
Recall is the ratio between the number of correct predictions for a certain class and the total number of actual instances of that class. Recall is considered a measure of completeness and is calculated by the formula:
Recall = True Positive / (True Positive + False Negative)
In the example from figure 3.35:
• For a total of 5 observations actually labeled as 1, 1 case is predicted correctly, so the recall for class 1 here is 1/5 = 20,0%.
• For a total of 5 observations actually labeled as 0, 2 cases are predicted correctly, so the recall for class 0 here is 2/5 = 40,0%.
3.9.3 F-measure
There is a trade-off between precision and recall. The F-measure is the harmonic mean of the two quantities (assuming both are different from zero).
F1 = 2 × (Precision × Recall) / (Precision + Recall)
There are different variants of the F-measure:
• F1: evenly weighted (most common)
• F2: weights Recall more
• F0.5: weights Precision more
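The same example from figure 3.35, computed here with scikit-learn purely for illustration, reproduces the precision and recall values derived above and adds the corresponding F1 score.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

pred = [0, 1, 0, 0, 0, 0, 1, 1, 1, 0]
true = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0]

# class 1 as the positive label
print(precision_score(true, pred, pos_label=1))   # 0.25
print(recall_score(true, pred, pos_label=1))      # 0.2
print(f1_score(true, pred, pos_label=1))          # ~0.222

# class 0 as the positive label
print(precision_score(true, pred, pos_label=0))   # ~0.333
print(recall_score(true, pred, pos_label=0))      # 0.4
```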
3.9.4 ROC
ROC, or the Receiver Operating Characteristic curve, is used for controlling the output threshold of a model. A model is called productive when its True Positive Rate is high and its False Positive Rate is low. In the figure below, the example model is at an average level.
[ROC plot: True Positive Rate on the vertical axis against False Positive Rate on the horizontal axis, both ranging from 0.0 to 1.0.]
Figure 3.36: ROC curve example
The example in figure 3.36 indicates that the two labels are given the same prediction probabilities.
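For completeness, here is a hedged sketch of how an ROC curve is usually obtained from predicted probabilities with scikit-learn; the scores below are made up purely for illustration.

```python
from sklearn.metrics import roc_curve, roc_auc_score

true   = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0]
scores = [0.2, 0.6, 0.3, 0.4, 0.7, 0.8, 0.1, 0.5, 0.9, 0.3]   # made-up probabilities

fpr, tpr, thresholds = roc_curve(true, scores)   # sweep the output threshold
print(list(zip(fpr, tpr)))                        # points of the ROC curve
print(roc_auc_score(true, scores))                # area under the ROC curve
```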
Chapter 4
SYSTEM DESIGN AND EVALUATION
4.1 System design
4.1.1 Overview of the system
The evolution of Artificial Intelligence faces a growing challenge in the task of human language understanding. To handle it, we need qualified data so the model can learn enough to adapt and respond. We limited the wide range of data to three topics of the tourism domain: special dishes, festivals and landscapes. We follow the triple classification task as implemented in Knowledge Graph BERT (KG-BERT) [1], whose authors modified the Hugging Face implementation of BERT to serve as their backbone. Our project aims to demonstrate that BERT can indicate whether the information of a triple exists in a paragraph or not by understanding its description. Each component (h, r, t) has its own description; our system packs them into a single sequence that is learnt by the BERT model. BERT is pre-trained on a large dataset, so
it cannot fully cover a specific domain; we must therefore fine-tune it to adapt to Vietnamese tourism information.
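As a sketch of how one triple and its descriptions can be packed into a single input sequence in the KG-BERT(a) style of figure 3.30: the checkpoint name, the example descriptions and the maximum length below are assumptions for illustration, not the exact configuration we use.

```python
import torch
from transformers import BertTokenizer

# example checkpoint; the actual model used for Vietnamese may differ
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

head_desc = "Đà Lạt là một thành phố nghỉ dưỡng thuộc tỉnh Lâm Đồng"
relation  = "có đặc sản"
tail_desc = "mứt dâu là món ăn làm từ dâu tây Đà Lạt"

# [CLS] head description [SEP] relation [SEP] tail description [SEP]
tokens = (["[CLS]"] + tokenizer.tokenize(head_desc) + ["[SEP]"]
          + tokenizer.tokenize(relation) + ["[SEP]"]
          + tokenizer.tokenize(tail_desc) + ["[SEP]"])
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
print(input_ids.shape)   # one packed sequence ready for BERT
```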
4.1.2 Model building
We want an application with acceptable classification performance. In NLP this is harder because of the nature of human language, so BERT comes as a key to this challenge. But a good model alone is not enough; the remaining constraints form another part of the pipeline we show in figure 4.1 below:
[Diagram of the modeling pipeline: descriptions are collected and hand-checked, labelled in doccano, passed through the logic program to extract triples, stored in Neo4j, split into 70% training and 30% testing during data preparation, and finally encapsulated for deployment.]
Figure 4.1: Modeling pipeline
In the training phase, we first collect the paragraphs (descriptions) for each entity, mainly from Wikipedia and legitimate magazines. Before going further, a self-check step ensures that the data is legal, factually true and able to bring valuable information. Because of the variety of Vietnamese wording, we spend a lot of time labeling triples in a paragraph; for example, in "Tôi sống tại Vũng Tàu, nơi đây có đặc sản là bánh khọt", the phrase "nơi đây" can be replaced with "chỗ này", "ở đây", "nơi này", etc., which makes it difficult to capture the triple form because the rules are not consistent. Our team defined a logic program that captures a list of candidate triples and then removes the non-valuable ones. We can see these steps in the following example:
[Diagram: the sentence "Thủ đô Hà Nội có đặc sản là nem rán và là nơi đặt lăng Chủ tịch Hồ Chí Minh" passes through the logic program, which extracts the entities "nem rán" and "lăng Chủ tịch Hồ Chí Minh".]
Figure 4.2: Triple extraction program
In figure 4.2, the bold phrases are entities that can be combined to create triples. Our logic program generates two triples that follow the Vietnamese language structure.
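To make the idea concrete, below is a toy sketch of the kind of pattern matching such a logic program can perform; the two patterns and the helper function are purely illustrative, and the real program handles far more Vietnamese wordings.

```python
import re

# toy patterns for two relation types; the real logic program covers many more variants
PATTERNS = [r"có đặc sản(?: là)?", r"có danh lam thắng cảnh"]

def extract_triples(sentence, head):
    """Return (head, relation, tail) triples found in one sentence."""
    triples = []
    for clause in re.split(r",|\bvà\b", sentence):
        for pattern in PATTERNS:
            match = re.search(rf"({pattern})\s+(.+)", clause)
            if match:
                relation = match.group(1).replace(" là", "").strip()
                tail = match.group(2).strip(" .")
                triples.append((head, relation, tail))
    return triples

text = "Đà Lạt có đặc sản mứt dâu, nơi đây có danh lam thắng cảnh núi Langbiang"
print(extract_triples(text, "Đà Lạt"))
# [('Đà Lạt', 'có đặc sản', 'mứt dâu'), ('Đà Lạt', 'có danh lam thắng cảnh', 'núi Langbiang')]
```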
After getting a list of triples, we create a database in Neo4j and store the triples in key/value format. Each node has two keys, name and description, which are assigned their respective values. Figure 4.3 shows the knowledge graph when we query all of the nodes that have a relation to the node named 'ha_noi'.
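As a sketch only, such a query could be issued through the official neo4j Python driver as below. The connection details, the relationship type co_dac_san and the returned properties are assumptions for illustration; the actual schema in our database may differ.

```python
from neo4j import GraphDatabase

# connection details and schema names below are placeholders for illustration
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (s {name: $name})-[:co_dac_san]-(friends)
RETURN friends.name AS name, friends.description AS description
"""

with driver.session() as session:
    for record in session.run(query, name="ha_noi"):
        print(record["name"], "-", record["description"])

driver.close()
```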
[Screenshot of the Neo4j browser: the left panel lists node labels, relationship types and connection details; the query pane contains a Cypher MATCH query on the node named 'ha_noi' and displays the resulting graph.]
Figure 4.3: Neo4j platform user interface
In figure 4.3, the right side is the working space where we can type the query, with the resulting graph displayed below it, and the left side contains information about our database. To