Along with the multi-head attention sub-layer, each encoder contains a feed-forward network (FFN). The FFN consists of two linear transformations with different learnable weights and a ReLU activation function in between:
As shown in figure 3.25, every position shares the same feed-forward network, with different weight parameters per layer; the output is then calculated as follows:
FFN(x) = max(0, xW1 + b1)W2 + b2 (Eq.1)
where W1 and W2 are the weight parameters of the two linear transformations, and the max function is the ReLU, which returns the value itself when it is positive and zero when it is negative.
The output of this layer is also the input to the next encoder, and this process repeats until the input has flowed through all twelve layers of the BERT architecture. One important detail is that a residual connection is added after both the multi-head attention and the feed-forward network, followed by layer normalization, everywhere except the embedding layer [10].
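As a rough illustration (not the actual BERT source), the following PyTorch sketch shows the feed-forward sub-layer of Eq. 1 together with the residual connection and layer normalization described above; the sizes 768 and 3072 follow the BERT-base configuration, and the class and variable names are our own.

```python
import torch
import torch.nn as nn

class PositionWiseFFN(nn.Module):
    """Feed-forward sub-layer of one encoder: FFN(x) = max(0, x*W1 + b1)*W2 + b2."""
    def __init__(self, hidden_size=768, intermediate_size=3072):
        super().__init__()
        self.linear1 = nn.Linear(hidden_size, intermediate_size)  # W1, b1
        self.linear2 = nn.Linear(intermediate_size, hidden_size)  # W2, b2
        self.relu = nn.ReLU()
        self.layer_norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        # x: (batch, sequence_length, hidden_size), coming from multi-head attention
        ffn_out = self.linear2(self.relu(self.linear1(x)))
        # residual connection followed by layer normalization
        return self.layer_norm(x + ffn_out)

x = torch.randn(2, 128, 768)          # toy batch of attention outputs
print(PositionWiseFFN()(x).shape)     # torch.Size([2, 128, 768])
```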
[Diagram of one encoder: input embeddings with positional encoding pass through self-attention and an Add & Normalize step inside Encoder #1.]
Figure 3.26: Residual connection in the encoding component, from Exploring the Depths of Recurrent Neural Networks with Stochastic Residual Learning
As figure 3.26 shows, the residual connection can mitigate the vanishing gradient problem and allows deeper networks to be trained [6].
3.5.8 Layer Normalization
In recent years, many techniques have been introduced to improve the efficiency of AI applications. Normalization is one of them: in many machine learning and deep learning applications it helps the model produce output probabilities as accurately as possible. Its main idea is to preserve the distribution of a layer's input, which prevents an updated weight from becoming too large or too small and makes each layer less dependent on the others. The first normalization technique, Batch Normalization, was introduced in [12].
Figure 3.27: The covariate shift problem
In figure 3.27, suppose we train a model to recognize black dogs and then apply the network to colored dogs; this is an X to Y mapping. If the distribution of X changes, we have to re-train the whole network, because the parameters are updated many times and depend on the outputs of previous layers in order to fit the distribution of Y. This is the covariate shift problem. Batch Normalization increases the stability of the network: it normalizes the output of the previous layer by subtracting the batch mean and dividing by the batch standard deviation. However, Batch Normalization needs reasonably large batches to reach the desired performance, which is a real problem when we work with audio or video, which are highly memory-consuming, or when we are limited by GPU memory. Layer normalization comes as a solution that avoids these limitations.
[Diagram: the output of the previous layer, a matrix of shape (H, W), is normalized element-wise as xi ← (xi − μi) / √(σi² + ε), where μi and σi² are the mean and variance computed over the layer and ε is a small constant that prevents division by zero; the normalized matrix is then passed to the next layer.]
Figure 3.28: Matrix in layer normalization
In figure 3.28, the matrix is the concatenation of the outputs of the twelve self-attention heads, with shape (Height, Width); this shape can change with the model architecture. During the backward pass, a new weight is obtained by updating the old one with a step scaled by the learning rate. Across many layers this chain of computations becomes very long, and the resulting values can grow very large, the exploding gradient problem, or very small, the vanishing gradient problem. By normalizing the layer inputs, we can keep the values flowing through a deep network under control.
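The following small sketch, assuming a toy activation matrix of shape (H, W) = (4, 768), shows layer normalization computed both by hand and with PyTorch's nn.LayerNorm; the epsilon value is the PyTorch default.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 768)                      # toy activations of shape (H, W)
eps = 1e-5                                   # small constant preventing division by zero

# manual layer normalization: statistics are taken over each row (the layer),
# not over the batch as in Batch Normalization
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
x_manual = (x - mean) / torch.sqrt(var + eps)

# the built-in module additionally learns a scale (gamma) and a shift (beta)
layer_norm = nn.LayerNorm(768, eps=eps)
x_module = layer_norm(x)

print(torch.allclose(x_manual, x_module, atol=1e-4))  # True (gamma=1, beta=0 at init)
```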
3.5.9 Residual Network
With the rise of efficient AI applications, we inherit the achievements of leaders in our field, such as model architectures, techniques and open-source tools. BERT, built by Google in 2018, is one of them. With a deeper network architecture, BERT can learn a better representation of the input and give a more accurate output, but this depth also exposes it to exploding and vanishing gradients during back-propagation. As shown in figure 3.29, a residual network is designed on the principle of summing the output of the current layer with that layer's input.
[Diagram: an input x passes through two hidden layers to produce F(x); the skip connection adds x so that the block outputs F(x) + x.]
Figure 3.29: Two layers of a neural network
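A minimal sketch of the idea in figure 3.29: two hidden layers compute F(x) and the block returns F(x) + x; the layer sizes here are arbitrary.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # two hidden layers computing F(x)
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x) + x   # skip connection: F(x) + x

print(ResidualBlock()(torch.randn(8, 256)).shape)  # torch.Size([8, 256])
```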
3.6 Classification
In the previous sections, we broke down the specific components that make BERT so powerful. However, to serve the Vietnamese triple classification task we need an additional part: a fully connected layer and a softmax function that indicate whether a triple exists or not based on its description. The last hidden vector of the [CLS] token is chosen as the input to this layer.
BERT is pre-trained on the next sentence prediction task, which uses the [CLS] token to classify whether the second sentence follows the first one. After pre-training, the [CLS] token therefore captures the structure and context of the words in the sequence. In fine-tuning, this token inherits the weights whose usefulness was demonstrated in next sentence prediction and combines them with the contextual information of the other words, learnt through the 12 layers of BERT, to classify the existence of a triple. As shown in figure 3.30, after the fully connected layer the result is the index with the higher probability under the softmax function.
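The sketch below illustrates this classification step with placeholder tensors: the last hidden vector of [CLS] goes through a fully connected layer and a softmax, and the class with the higher probability is taken as the prediction. The names and sizes are illustrative and are not taken from the KG-BERT code.

```python
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 2             # BERT-base hidden size, binary triple label

classifier = nn.Linear(hidden_size, num_labels)    # fully connected layer

# last_hidden_state would come from BERT: (batch, seq_len, hidden_size)
last_hidden_state = torch.randn(4, 128, hidden_size)
cls_vector = last_hidden_state[:, 0, :]      # hidden vector of the [CLS] token

logits = classifier(cls_vector)
probs = torch.softmax(logits, dim=-1)        # probabilities for "exists" / "does not exist"
prediction = probs.argmax(dim=-1)            # index with the higher probability
print(prediction)
```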
[Diagram of KG-BERT(a): the input sequence [CLS] head-entity tokens [SEP] relation tokens [SEP] tail-entity tokens [SEP] is fed into BERT.]
Figure 3.30: Classification layer to produce the result
In figure 3.30, the last hidden vector of the [CLS] token is used as the input to a fully connected layer to produce the output.
3.7 PyTorch
3.7.1 Overview
With the rise in demand for AI applications, Deep Learning has become the method that helps machines understand human language by using deep neural networks. Many big tech companies have shared their experiments with various model architectures. To optimize training time and speed, many frameworks have been introduced, each giving a certain level of performance. However, as the need for interaction between machines and people grows and the amount of data increases day by day, older frameworks could not maintain their performance. Facebook and Google have open-sourced their own products, named PyTorch and TensorFlow, respectively. (Plot below from 11.)
11 https://trends.google.com.vn/trends/explore?date=today%205-y&q=PyTorch,TensorFlow
[Google Trends plot: search volume over time for PyTorch versus TensorFlow.]
Figure 3.31: The comparison between PyTorch and TensorFlow
Figure 3.31 shows that, as the demand for understanding human language increases, developers need an optimized framework that supports them in completing their tasks.
TensorFlow was built by the Google Brain team and released under the Apache License 2.0 in 2015. After a long experimental period, version 1.0.0 was released on February 11, 2017. In its first years it was one of the most discussed topics in the industry, but over time TensorFlow has gradually lost ground to PyTorch, a framework launched in 2016 by Facebook AI. These two frameworks take turns solving hard Deep Learning problems, which need large amounts of data and deep networks for training and therefore consume considerable resources. (Plot below from 12.)
12 https://paperswithcode.com/trends
[Plot from Papers with Code, "Paper Implementations grouped by framework": share of implementations (PyTorch, TensorFlow, MXNet, Caffe2 and others) against repository creation date, September 2016 to September 2020.]
Figure 3.32: The number of repositories written in PyTorch
As figure 3.32 shows, after its release in 2016 the number of implementations written in PyTorch has increased steadily year by year, the opposite of TensorFlow and the other frameworks. This raises the question: why is PyTorch a choice for both research and production? Each framework has its strengths on specific tasks, so we cannot state in general which one is better. Based on the information above, we decided to use PyTorch as the backbone to implement and deploy our BERT model.
3.7.2 What is PyTorch?
PyTorch is a Python-based scientific computing package that utilizes the power of a graphics processing unit (GPU) for faster training. PyTorch is one of the most popular open-source Deep Learning frameworks; many researchers and developers choose it because it provides fast, flexible experimentation and a seamless transition to production deployment.13
13 https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html
[Diagram: PyTorch converts data into tensors, e.g. tensor([[0.34, 0.67], [0.78, 0.56]]), and highlights three advantages: fast, flexible and seamless.]
Figure 3.33: The advantages of PyTorch
As figure 3.33 suggests, development in PyTorch revolves around the tensor, which inherits nearly all the properties of a NumPy array. The difference is that tensors can be configured to run on a GPU or a CPU, so we can distribute resources and concentrate them on certain operations.
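A small example of these points: a tensor can be created from a NumPy array (the values below are the ones shown in figure 3.33), used with NumPy-like operations, and moved to the GPU when one is available.

```python
import numpy as np
import torch

array = np.array([[0.34, 0.67], [0.78, 0.56]])
tensor = torch.from_numpy(array)           # shares memory with the NumPy array

print(tensor * 2)                          # familiar NumPy-style operations

if torch.cuda.is_available():              # move to GPU only when one is present
    tensor_gpu = tensor.to("cuda")
    print(tensor_gpu.device)               # cuda:0
```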
3.7.3 The strength of PyTorch
In the limited time of this thesis, we have worked with the functionality of PyTorch. Facebook AI provides a flexible framework with many functions for experimenting, from research to production deployment. Based on a post from heartbeat.ai and our own working time, we would like to condense the key features that make PyTorch popular.14
The first time we laid out a code snippet, my partner exclaimed: "It is really Python, our job can be shortened." That is exactly right: PyTorch inherits almost all of the functionality and properties of Python. PyTorch mainly manipulates tensors, which can be created from NumPy arrays, so working in PyTorch feels the same as coding in plain Python.
14 https://bit.ly/3q0UTAT
BERT is one of the most memory-consuming models in Deep Learning. With millions of parameters and a large amount of training data, BERT requires substantial resources for both training and testing. PyTorch offers a mechanism called Data Parallelism, which distributes the computation over more than one GPU at a time to serve a huge data source. Data Parallelism shows its value when we work with a big model and want to reduce the computational pressure by spreading the computation across additional GPUs.
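A minimal sketch of this mechanism with torch.nn.DataParallel, which replicates a module and splits each batch across the visible GPUs; the small model below is only a stand-in, not our BERT configuration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))

if torch.cuda.device_count() > 1:
    # replicate the model and split every batch across the available GPUs
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

batch = torch.randn(64, 768).to(device)
output = model(batch)                      # each GPU processes a slice of the batch
print(output.shape)                        # torch.Size([64, 2])
```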
Users can build deep learning applications in PyTorch on top of dynamic graphs, which are constructed at runtime. Other frameworks were built on top of static graphs, where we cannot follow what the GPU or CPU is actually doing; in PyTorch, we can step through the code flow to understand in depth what the model is executing in each layer.
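A short sketch of what a dynamic graph allows: ordinary Python control flow and print statements run inside forward, so we can observe at runtime what each step actually computes (the module and the random loop bound are made up for illustration).

```python
import torch
import torch.nn as nn

class DynamicNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 10)

    def forward(self, x):
        # the graph is built while this Python code runs, so normal
        # control flow and debugging tools work on every pass
        for step in range(torch.randint(1, 4, (1,)).item()):
            x = torch.relu(self.layer(x))
            print(f"step {step}: mean activation = {x.mean().item():.4f}")
        return x

DynamicNet()(torch.randn(2, 10))
```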
In Deep Learning, before running a function we must understand its algorithm and then implement it. The PyTorch developer community contributes a large number of ready-made libraries built around popular algorithms, such as TorchAudio, AllenNLP and BoTorch. These help those working in PyTorch reduce coding time and focus on more important things.
3.8 Doccano
High-quality data brings strong performance to any application. In our project, to extract valuable triples from paragraphs while saving time and resources, we decided to use doccano, an open-source tool for triple labeling. Doccano allows us to define an unlimited number of label types:
[Screenshot of the doccano annotation interface: Vietnamese sentences about provinces such as Quảng Ngãi, Bắc Kạn, Kon Tum and Hải Phòng are tagged with label types such as FESTIVAL, h_PLACE, t_LANDSCAPE and h_SUBJECT.]
Figure 3.34: Doccano interface
As the doccano interface in figure 3.34 shows, the Vietnamese language has a great variety of writing structures. To collect triples from a paragraph, we created a logic program that separates them from the large number of non-useful words. From "Đà Lạt có đặc sản mứt dâu, nơi đây có danh lam thắng cảnh núi Langbiang", our program yields the two triples (Đà Lạt, có đặc sản, mứt dâu) and (Đà Lạt, có danh lam thắng cảnh, núi Langbiang).
3.9 Evaluation metrics
Evaluation metrics are an essential part of machine learning and deep learning problems. To determine the performance of a model, to check whether our classifier works as expected, or to compare separate models, reliable evaluation methods are absolutely required. Commonly used metrics in classification are accuracy, precision, recall, F-measure and ROC. The following examples apply to the binary classification setting.15
15 Classification task results in only two labels (classes).
3.9.1 Accuracy
Accuracy (or accuracy rate) is the ratio of all correct predictions to the total number of predictions across all labels, regardless of which label they belong to. In cases of imbalanced labels the accuracy value can be very misleading, and this is its main limitation.
Accuracy = (True Positive + True Negative) / (True Positive + False Positive + True Negative + False Negative)
True Positive: number of positive predictions that are correct.
False Positive: number of positive predictions that are incorrect.
True Negative: number of negative predictions that are correct.
False Negative: number of negative predictions that are incorrect.
In the example below, there are 3 correct predictions (in red) among 10 predictions, so the accuracy in this case is 0.3, or 30%.
pred = [0, 1, 0, 0, 0, 0, 1, 1, 1, 0]
true = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0]
Figure 3.35: Predictions for a binary task with a sample size of 10
For the task depicted in figure 3.35, the opposite of the accuracy rate is the error rate, defined by the formula:
Error rate = 1 − Accuracy
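As a quick check of these numbers, the accuracy and error rate of the predictions in figure 3.35 can be computed in a few lines of plain Python.

```python
pred = [0, 1, 0, 0, 0, 0, 1, 1, 1, 0]
true = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0]

correct = sum(p == t for p, t in zip(pred, true))  # 3 correct predictions
accuracy = correct / len(true)
error_rate = 1 - accuracy

print(accuracy, error_rate)   # 0.3 0.7
```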
3.9.2 Precision and Recall
In the case of imbalanced classes, metrics like precision and recall are very reasonable tools for measuring model performance, especially on the labels of interest.
3.9.2.1 Precision
Precision is the rate of correct predictions for a certain class, calculated as the quotient of the number of correct predictions for that class and the total number of predictions labeled with that class. Precision is considered a measure of exactness and is calculated by the formula:
Precision = True Positive / (True Positive + False Positive)
Which class is positive depends on the modeler's definition; usually the positive class is the one whose performance we are most concerned about.
In the example from figure 3.35:
• For a total of 4 observations predicted as 1, 1 case is predicted correctly, so the precision for class 1 here is 1/4 = 25,0%.
• For a total of 6 observations predicted as 0, 2 cases are predicted correctly, so the precision for class 0 here is 2/6 ≈ 33,3%.
3.9.2.2 Recall
Recall is the ratio between the number of correct predictions for a certain class and the total number of actual instances of that class. Recall is considered a measure of completeness and is calculated by the formula:
Recall = True Positive / (True Positive + False Negative)
In the example from figure 3.35:
• For a total of 5 observations actually labeled as 1, 1 case is predicted correctly, so the recall for class 1 here is 1/5 = 20,0%.
• For a total of 5 observations actually labeled as 0, 2 cases are predicted correctly, so the recall for class 0 here is 2/5 = 40,0%.
3.9.3 F-measure
There is a trade-off between precision and recall. The F-measure is the harmonic mean of the two quantities (assuming both are different from zero).
F1 = 2 × (Precision × Recall) / (Precision + Recall)
There are different variants of the F-measure:
• F1: evenly weighted (most common)
• F2: weights Recall more
• F0.5: weights Precision more
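The same example from figure 3.35, computed here with scikit-learn purely for illustration, reproduces the precision and recall values derived above and adds the corresponding F1 score.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

pred = [0, 1, 0, 0, 0, 0, 1, 1, 1, 0]
true = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0]

# class 1 as the positive label
print(precision_score(true, pred, pos_label=1))   # 0.25
print(recall_score(true, pred, pos_label=1))      # 0.2
print(f1_score(true, pred, pos_label=1))          # ~0.222

# class 0 as the positive label
print(precision_score(true, pred, pos_label=0))   # ~0.333
print(recall_score(true, pred, pos_label=0))      # 0.4
```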
3.9.4 ROC
ROC, or the Receiver Operating Characteristic curve, is used for controlling the output threshold of a model. A model is called productive when its True Positive Rate is high and its False Positive Rate is low. In the figure below, the example model is at an average level.
[ROC plot: True Positive Rate on the vertical axis against False Positive Rate on the horizontal axis, both ranging from 0.0 to 1.0.]
Figure 3.36: ROC curve example
The example in figure 3.36 indicates that the two labels are given the same prediction probabilities.
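For completeness, here is a hedged sketch of how an ROC curve is usually obtained from predicted probabilities with scikit-learn; the scores below are made up purely for illustration.

```python
from sklearn.metrics import roc_curve, roc_auc_score

true   = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0]
scores = [0.2, 0.6, 0.3, 0.4, 0.7, 0.8, 0.1, 0.5, 0.9, 0.3]   # made-up probabilities

fpr, tpr, thresholds = roc_curve(true, scores)   # sweep the output threshold
print(list(zip(fpr, tpr)))                        # points of the ROC curve
print(roc_auc_score(true, scores))                # area under the ROC curve
```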
Chapter 4
SYSTEM DESIGN AND EVALUATION
4.1 System design
4.1.1 Overview of the system
The evolution of Artificial Intelligence faces a growing challenge in the task of human language understanding. To handle it, we need qualified data so the model can learn enough to adapt and respond. We limited the wide range of data to three topics of the tourism domain: special dishes, festivals and landscapes. We follow the triple classification task as implemented in Knowledge Graph BERT (KG-BERT) [1], whose authors modified the Hugging Face implementation of BERT to serve as their backbone. Our project aims to demonstrate that BERT can indicate whether the information of a triple exists in a paragraph or not by understanding its description. Each component (h, r, t) has its own description; our system packs them into a single sequence that is learnt by the BERT model. BERT is pre-trained on a large dataset, so
it cannot fully cover a specific domain; we must therefore fine-tune it to adapt to Vietnamese tourism information.
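As a sketch of how one triple and its descriptions can be packed into a single input sequence in the KG-BERT(a) style of figure 3.30: the checkpoint name, the example descriptions and the maximum length below are assumptions for illustration, not the exact configuration we use.

```python
import torch
from transformers import BertTokenizer

# example checkpoint; the actual model used for Vietnamese may differ
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

head_desc = "Đà Lạt là một thành phố nghỉ dưỡng thuộc tỉnh Lâm Đồng"
relation  = "có đặc sản"
tail_desc = "mứt dâu là món ăn làm từ dâu tây Đà Lạt"

# [CLS] head description [SEP] relation [SEP] tail description [SEP]
tokens = (["[CLS]"] + tokenizer.tokenize(head_desc) + ["[SEP]"]
          + tokenizer.tokenize(relation) + ["[SEP]"]
          + tokenizer.tokenize(tail_desc) + ["[SEP]"])
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
print(input_ids.shape)   # one packed sequence ready for BERT
```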
4.1.2 Model building
We want an application with acceptable classification performance. In NLP this is harder because of the nature of human language, so BERT comes as a key to this challenge. But a good model alone is not enough; the remaining constraints form another part of the pipeline we show in figure 4.1 below:
[Diagram of the modeling pipeline: descriptions are collected and hand-checked, labelled in doccano, passed through the logic program to extract triples, stored in Neo4j, split into 70% training and 30% testing during data preparation, and finally encapsulated for deployment.]
Figure 4.1: Modeling pipeline
In the training phase, we first collect the paragraphs (descriptions) for each entity, mainly from Wikipedia and legitimate magazines. Before going further, a self-check step ensures that the data is legal, factually true and able to bring valuable information. Because of the variety of Vietnamese wording, we spend a lot of time labeling triples in a paragraph; for example, in "Tôi sống tại Vũng Tàu, nơi đây có đặc sản là bánh khọt", the phrase "nơi đây" can be replaced with "chỗ này", "ở đây", "nơi này", etc., which makes it difficult to capture the triple form because the rules are not consistent. Our team defined a logic program that captures a list of candidate triples and then removes the non-valuable ones. We can see these steps in the following example:
[Diagram: the sentence "Thủ đô Hà Nội có đặc sản là nem rán và là nơi đặt lăng Chủ tịch Hồ Chí Minh" passes through the logic program, which extracts the entities "nem rán" and "lăng Chủ tịch Hồ Chí Minh".]
Figure 4.2: Triple extraction program
In figure 4.2, the bold phrases are entities that can be combined to create triples. Our logic program generates two triples that follow the Vietnamese language structure.
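To make the idea concrete, below is a toy sketch of the kind of pattern matching such a logic program can perform; the two patterns and the helper function are purely illustrative, and the real program handles far more Vietnamese wordings.

```python
import re

# toy patterns for two relation types; the real logic program covers many more variants
PATTERNS = [r"có đặc sản(?: là)?", r"có danh lam thắng cảnh"]

def extract_triples(sentence, head):
    """Return (head, relation, tail) triples found in one sentence."""
    triples = []
    for clause in re.split(r",|\bvà\b", sentence):
        for pattern in PATTERNS:
            match = re.search(rf"({pattern})\s+(.+)", clause)
            if match:
                relation = match.group(1).replace(" là", "").strip()
                tail = match.group(2).strip(" .")
                triples.append((head, relation, tail))
    return triples

text = "Đà Lạt có đặc sản mứt dâu, nơi đây có danh lam thắng cảnh núi Langbiang"
print(extract_triples(text, "Đà Lạt"))
# [('Đà Lạt', 'có đặc sản', 'mứt dâu'), ('Đà Lạt', 'có danh lam thắng cảnh', 'núi Langbiang')]
```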
After getting a list of triples, we create a database in Neo4j and store the triples in key/value format. Each node has two keys, name and description, which are assigned their respective values. Figure 4.3 shows the knowledge graph when we query all of the nodes that have a relation to the node named 'ha_noi'.
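As a sketch only, such a query could be issued through the official neo4j Python driver as below. The connection details, the relationship type co_dac_san and the returned properties are assumptions for illustration; the actual schema in our database may differ.

```python
from neo4j import GraphDatabase

# connection details and schema names below are placeholders for illustration
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (s {name: $name})-[:co_dac_san]-(friends)
RETURN friends.name AS name, friends.description AS description
"""

with driver.session() as session:
    for record in session.run(query, name="ha_noi"):
        print(record["name"], "-", record["description"])

driver.close()
```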
[Screenshot of the Neo4j browser: the left panel lists node labels, relationship types and connection details; the query pane contains a Cypher MATCH query on the node named 'ha_noi' and displays the resulting graph.]
Figure 4.3: Neo4j platform user interface
In figure 4.3, the right side is the working space where we can type the query, with the resulting graph displayed below it, and the left side contains information about our database. To