
Ho Chi Minh City University of Technology
Faculty of Computer Science and Engineering

GRADUATE THESIS

GROUNDED LANGUAGE LEARNING: IMPROVE TEXT REPRESENTATION WITH VISUAL INFORMATION

Major: Computer science

Council: Computer Science 11

Supervisor: Assoc. Prof. Quan Thanh Tho
Reviewer: Mr. Le Dinh Thuan

Student: Nguyen Tran Cong Duy (1710043)


I hereby declare that, except for the reference results from other related works specified in the thesis, the contents presented in this thesis are my own implementation and there is no part of the content applied for a degree at another school.

Ho Chi Minh City, July 11, 2021


First of all, I am immensely thankful to my advisor Assoc. Prof. Quan Thanh Tho for his consistent help and guidance throughout my thesis, and for the freedom in studying and researching that he gave me. Second, I also thank my parents and my best friends for their persistent encouragement, support, and care.


Nowadays, people learn languages through listening, speaking, reading, writing, and multimodal interactions with the real world. Even a child is taught from a young age to listen, to speak, to use gestures, and to learn through pictures; people do not learn language only from resources or books containing words alone, but from a combination of images, stories, and descriptive sentences. Today's language models, however, are mostly not learned from real-world factors but are trained on purely linguistic data sources. There have been a few works that incorporate language and other modalities into applications for robotics, visual and linguistic tasks, etc. with positive results, but this remains a major challenge in machine learning and deep learning today.


Contents

1.3 The Scope of the Thesis
1.4 Organization of the Thesis
2 Foundations
2.1 Neural Network - Multilayer perceptron
2.2 Convolutional Neural Network
3.2 Grounded Language Learning Approaches
4 Motivation
4.1 Motivation
4.2 Propose
5 Methodology
5.1 Approach 1
5.2 Approach 2
6 Experiments
6.1 Experimental Setup
6.2 Experimental Results
7 Analysis
7.1 Impact of visual dimension
7.2 The impact of visual grounding
7.3 Visualization of alignment between tokens and objects
7.4 Evaluation on Pre-training Tasks
7.5 Visualization of token-level scoring on first approach
8 Application
8.1 Django
8.2 System description
8.3 Mockup
8.4 Demo results
9 Conclusion
9.1 What have we done?
9.2 Future direction


List of Tables

4.1 Statistics of some common datasets used in the visual grounded language learning task
6.1 Task descriptions and statistics
6.2 Downstream task results of BERT and our GroundedBERT. We conduct the experiments on BERT-base and BERT-large architectures. MRPC and QQP results are F1 score, STS-B results are Pearson correlation, SQuAD v1.1 and SQuAD v2.0 results are exact matching and F1 score respectively. The results that outperform the other are marked in bold; all are scaled to the range 0-100. The ∆base and ∆large columns show the difference between our model and the baseline.
6.3 Downstream task results of BERT, V&L pretrained models and our ObjectGroundedBERT (OGBERT). We conduct the experiments on BERT-base architectures. MRPC and QQP results are F1 score, STS-B results are Pearson correlation, SQuAD v1.1 and SQuAD v2.0 results are exact matching and F1 score respectively. The results that outperform the other are marked in bold; all are scaled to the range 0-100. The ∆base column shows the difference between our model and the baseline.
7.1 Downstream task results of our ObjectGroundedBERT with different dimensions of visual embedding. The metrics and results are set up similar to Table 6.3.
7.2 Downstream task results and comparison of our ObjectGroundedBERT without training the Text-ground-image Module. The metrics and results are set up similar to Table 6.3. The first four rows report the fine-tuned results of our model without training with the visual grounded datasets; the last four rows show the difference to the results reported in Table 7.1.
7.3 Downstream task results on different pretraining tasks


List of Figures

1.1 Wikipedia, BookCorpus datasets
1.2 Wikipedia, BookCorpus datasets with additional visual datasets
2.1 A simple neural network
2.14 Recurrent Neural Network
2.15 RNN calculation in one node
2.16 RNN architectures
2.17 Encoder-Decoder
2.18 Encoder-Decoder with Attention
2.19 Transformer: attention is all you need
3.1 IMAGINET architecture from the paper Collell et al. (2017)
3.2 Cap2Both is the combination of Cap2Img and Cap2Cap; the figure is from the paper Kiela et al. (2018)
3.3 Bordes et al. (2019) paper architecture
3.4 Illustration of the BERT transformer model trained with a visually-supervised language model with two objectives: masked language model (on the left) and voken classification (on the right). The figure is from the paper Tan & Bansal (2020)
4.1 A grounded language learning example
5.1
5.2 Implementation of our ObjectGroundedBERT. The model consists of two components, i.e. Language encoder and Text-ground-image part. The new representation of the language model combines Textual embedding and Visual embedding.
5.3 Implementation of our pretraining framework for ObjectGroundedBERT. The model consists of four components, i.e. Object detection model (Faster-RCNN), Object encoder, Cross Modal Transformer and ObjectGroundedBERT. The detail of the Cross Modal Transformer layer is shown on the right. The pretraining tasks are Masked Visual Feature Prediction, Image-Text Matching and Masked Language Modeling.
6.1 MSCOCO Dataset
6.2 GLUE Benchmark
7.1 Illustration of the Attention map of Cross modal layers
7.2 First visualization of token-level scoring. This example chooses the caption "there is a clean bathroom counter and sink" with 3 images
7.3 Second visualization of token-level scoring. This example chooses the caption "bicycle parked in the grass by a tree" with 3 images
8.1 Django framework
8.2 Architecture diagram
8.3 Activity diagram
8.4 Home page
8.5 Model page - result page
8.6 Input of CoLA task at home page
8.7 Result of CoLA task at home page
8.8 Input of MNLI task at home page
8.9 Result of MNLI task at home page


Humans learn language through listening, speaking, reading, writing, and multimodal interactions with the real world. Even a child is taught from a young age to listen, to speak, to use gestures, and to learn through pictures; people do not learn language only from resources or books containing words alone, but from a combination of images, stories, and descriptive sentences. Today's language models are mostly not learned from real-world elements but are trained only on pure-language (text-only) data sources such as Wikipedia and BookCorpus, as illustrated in Figure 1.1.

Figure 1.1: Wikipedia, BookCorpus datasets


1.2 Our contribution

Previous studies of visual grounded language learning train the language encoder with both a language objective and visual grounding examples. However, due to the differences in distribution and scale between the visual-grounded datasets and language corpora, the language model tends to mix up the context of the tokens that occur in the grounded data with those that do not. As such, there is confusion between the visual information and the contextual meaning of text during embedding training. To overcome this limitation, we propose GroundedBERT - a grounded language learning method that enhances the BERT representation with visual grounded information. GroundedBERT comprises two components: (i) a text-ground-image part that captures the global and local semantic mapping between the visual and textual representations, learned via sentence-level and token-level mechanisms, and (ii) the original BERT embedding that captures the contextual representation of words learned from the textual language corpora. Our proposed method significantly outperforms the baseline language models on various language tasks of the GLUE and SQuAD datasets.

Moreover, these studies use a convolutional neural network (CNN) to extract features from the whole image for grounding with the sentence description. However, this approach has two main drawbacks: (i) the whole image usually contains more objects and backgrounds than the sentence itself; thus, matching them together will confuse the grounded model; (ii) a CNN only extracts the features of the image but not the relationship between the objects inside it, limiting the grounded model's ability to learn complicated contexts. To overcome such shortcomings, we propose a novel object-level grounded language learning framework that empowers the language representation with visual object-grounded information. The framework comprises two main components: (i) ObjectGroundedBERT, which captures the visual-object relations and textual representations by cross-modal pretraining via a text-ground-image mechanism, and (ii) a Cross-modal Transformer, which helps the object encoder and ObjectGroundedBERT learn the alignment and representation of image-text context.

Experimental results show that our proposed GroundedBERT and ObjectGroundedBERT consistently outperform the baseline language models on various language tasks of the GLUE and SQuAD datasets.

Our works have been submitted to two conferences, EMNLP 2021 and NeurIPS 2022 (https://nips.cc/).


Figure 1.2: Wikipedia, BookCorpus datasets with additional visual datasets

1.3 The Scope of the Thesis

The scope of this thesis covers:

• We propose GroundedBERT - a grounded language learning approach that enhances the BERT representation with visual-grounded information. Instead of grounding visual information into the language model, which changes the original contextual representation, the visual-grounded representation is first learned from the text-image pairs and then joined to the contextual representation to form a unified visual-textual representation. Moreover, with ObjectGroundedBERT, to the best of our knowledge, this study is the first to investigate grounded language learning at the object level with rich grounded information containing object features, attributes, and positions. By doing so, we can enhance the ability of the grounded language model to capture more complex relations and avoid confusion during the learning process.

• To this end, we introduce a Text-ground-image module that captures both global and local semantics between the contextual relation of words and the image via a novel token-level and sentence-level learning mechanism. We also propose a novel grounded language framework that enhances the language representation with visual-object-grounded information. Instead of using a CNN to encode the whole image, we embed the features of objects from an off-the-shelf object detector into the encoder and connect them with the language modality via a cross-modal Transformer. A Text-ground-image mechanism is also proposed to capture the visual object information and their relations found from the semantic correlation of words and image via a multi-task pretraining strategy.

• We conduct extensive experiments on various language downstream tasks in the GLUE and SQuAD datasets, and significantly outperform the baselines on these tasks.

• We build a demo app for the CoLA and MNLI tasks using our GroundedBERT.

1.4 Organization of the Thesis

• In chapter 2, we provide some basic math and machine learning concepts that might be helpful for the reader to understand the rest of the text.

• In chapter 3, we summarize some legacy research works in the area.


2 Foundations

In this chapter, I will present the background knowledge used in the process of making this thesis, including commonly used concepts in deep learning and natural language processing, popular network architectures for image processing, and the models used for language.

2.1 Neural Network - Multilayer perceptron

2.1.1 Logistic regression

Logistic regression is a binary classification method built from a function that can take any value and return a number between 0 and 1 (the sigmoid function), which is interpreted as the probability of the event given the input data. This is equivalent to a single-layer neural network. Figure 2.1 illustrates the use of logistic regression as a simple neural network.

In Figure 2.1, the vector x contains the attributes (features) of an input, the vector w holds the weights of the attributes, and b is the bias. After going through the calculation steps, the result ŷ is the probability that the label y has a value of 1, given x and w.
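A minimal sketch of this computation in Python (the values below are arbitrary examples, not taken from the thesis):

```python
import numpy as np

def sigmoid(z):
    # Maps any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def logistic_forward(x, w, b):
    # x: feature vector, w: weight vector, b: scalar bias.
    # Returns y_hat, the probability that the label y equals 1.
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 0.2
y_hat = logistic_forward(x, w, b)  # a value between 0 and 1
```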

2.1.2 Multilayer perceptron

A multilayer perceptron is a multi-layer neural network, usually consisting of an input layer, one or more hidden layers, and an output layer.


Figure 2.1: A simple neural network.

Figure 2.2 illustrates an MLP with 3 layers (input, hidden, output).

Figure 2.2: MLP with 3 layers.
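As a sketch of such a 3-layer MLP in PyTorch (the layer sizes are illustrative assumptions, not the ones used later in the thesis):

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim=784, hidden_dim=128, out_dim=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),   # input layer -> hidden layer
            nn.ReLU(),                       # non-linear activation
            nn.Linear(hidden_dim, out_dim),  # hidden layer -> output layer
        )

    def forward(self, x):
        return self.net(x)

model = MLP()
logits = model(torch.randn(32, 784))  # batch of 32 inputs -> (32, 10) outputs
```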

2.1.3 Activation functions

2.1.3.1 Sigmoid function

Equation:

σ(x) = 1 / (1 + e^(−x))

The sigmoid function takes a real value x and returns a value in the range (0, 1). If x is a very small negative real number, the result of the sigmoid function approaches 0; conversely, if x is a very large positive number, the result approaches 1.

2.1.3.2 Tanh function

Equation:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

Thus, compared with sigmoid and tanh, the ReLU function does not suffer from the vanishing gradient problem. The computation of the ReLU function is also faster than that of the previous two functions. However, ReLU also has a drawback: for x less than 0, the output of ReLU is 0. If the value of a node becomes 0, it contributes nothing to the next layer, and the corresponding coefficients of that node are not updated by the gradient. This phenomenon is called Dying ReLU.
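A small numerical sketch of the three activation functions discussed above (NumPy, illustrative only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # output in (0, 1)

def tanh(x):
    return np.tanh(x)                # output in (-1, 1)

def relu(x):
    return np.maximum(0.0, x)        # 0 for x < 0, identity for x >= 0

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(x))  # values approach 0 on the left and 1 on the right
print(tanh(x))     # values approach -1 and 1
print(relu(x))     # negative inputs are clamped to 0 (source of "Dying ReLU")
```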

2.2 Convolutional Neural Network

2.2.1 Introduction

Depending on whether we are dealing with black-and-white or color images, each pixel location may be associated with either one or several numerical values, respectively. Until now, our way of dealing with this rich structure was deeply unsatisfying: we simply discarded each image's spatial structure by flattening it into a one-dimensional vector and feeding it through a fully connected MLP. Because these networks are invariant to the order of the features, we would get similar results whether or not we preserve an order corresponding to the spatial structure of the pixels, or even if we permute the columns of our design matrix before fitting the MLP's parameters. Ideally, we would use our prior knowledge that nearby pixels are typically related to one another to build efficient models for learning from image data.

This section presents convolutional neural networks (CNNs), a powerful family of neural networks designed for precisely this purpose. CNN-based models are now ubiquitous in the field of computer vision and have become so dominant that hardly anyone today would develop a commercial application or enter a competition related to image recognition, object detection, or semantic segmentation without building on this approach.

Modern CNNs, as they are colloquially called, owe their design to inspirations from biology, group theory, and a healthy dose of experimental tinkering. In addition to their sample efficiency in achieving accurate models, CNNs tend to be computationally efficient, both because they require fewer parameters than fully connected architectures and because convolutions are easy to parallelize across GPU cores. Consequently, practitioners apply CNNs whenever possible, and increasingly they have emerged as credible competitors even on tasks with a one-dimensional sequence structure, such as audio, text, and time series analysis, where recurrent neural networks are conventionally used. Some clever adaptations of CNNs have also brought them to bear on graph-structured data and in recommender systems.

2.2.2 Detail

The convolutional layer is the core component of a CNN, performing its most important operations. The concepts and parameters for setting up a convolutional layer include the filter size (F), the input size (W), the stride (S), the zero-padding (P), and the depth, i.e. the number of filters we use.

• Depth of the output volume will be equal to the number of filters we decide to use. Each of these filters learns to detect different characteristics of the input. For example, with an image input, each filter in the first convolutional layer will in turn learn to detect edges and corners with different orientations, etc. We call the set of neurons connected to the same region of the input a depth column.

• Stride (displacement) is the rate at which the filter moves over the input. For example, with a stride of 1, we shift the filter 1 pixel at a time; with a stride of 2, we shift the filter 2 pixels at a time, and thus produce an output volume that is smaller in length and width.

• Zero padding is the parameter that determines the number of zero borders around the input border. This approach helps us control the length and width of the output volume (most commonly it is used to retain the width and length dimensions of the input, so that the output has the same width and length as the input).

We can calculate the spatial size of the output volume of the above convolutional layer using the formula:

(W − F + 2P) / S + 1

There is also the depth, which determines how many output maps are stacked.
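The output-size formula above can be checked with a short helper function (a sketch; the variable names follow the W, F, P, S notation of this section):

```python
def conv_output_size(W, F, P, S):
    """Spatial output size of a convolution with input size W,
    filter size F, zero-padding P and stride S."""
    return (W - F + 2 * P) // S + 1

# Example: a 32x32 input, 5x5 filter, padding 2, stride 1
# keeps the spatial size at 32.
print(conv_output_size(32, 5, 2, 1))  # 32
# Example: stride 2 roughly halves the spatial size.
print(conv_output_size(32, 3, 1, 2))  # 16
```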

The 2 important properties of convolutional networks compared to otherarchitectures are as follows:

• Locality: although each image has a fixed size and every image is different, in reality objects with the same characteristics can appear in different regions of the image. Therefore, we extract features from local regions to capture the characteristics of an object, instead of connecting all the neurons to the entire input to obtain the features.

• Parameter sharing: each neuron in the same filter shares the same parameters, which are the parameters of that filter. Each filter focuses on detecting whether a certain property is present or not. For example, low-level filters can detect straight lines and curves in different directions and shapes, and higher-level filters can combine previously discovered properties, for example four curves making a circle. This parameter-sharing mechanism helps the convolutional neural network work well without consuming too many resources, keeping the size (number of parameters to train) of the model small.

2.2.3 Pooling

After going through the convolutional layers, the output of a convolutional layer is normally pooled. The pooling layer gradually reduces the spatial size of the input, helping to reduce the number of parameters and the number of computations of the model. A common use of the pooling layer is to periodically insert it between successive convolutional layers in a CNN. The pooling layer applies a pooling function independently on each depth slice of the input, and thus only the length and width dimensions of the input are reduced, while the depth dimension is preserved. Since the pooling layer only performs pooling, there are no training parameters, only setup parameters. Figure 2.3 shows how 2x2 max pooling works; average pooling is another option.

Figure 2.3: Visualization of max pooling
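As a minimal illustration of a convolution followed by 2x2 max pooling (a PyTorch sketch with an arbitrary input shape):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)            # (batch, channels, height, width)

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)

y = conv(x)   # shape (1, 16, 32, 32): padding=1 keeps the spatial size
z = pool(y)   # shape (1, 16, 16, 16): only height/width shrink, depth is kept
print(y.shape, z.shape)
```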

2.2.4 CNN architectures

LeNet 1998. LeNet is one of the oldest and most famous CNNs, developed by Yann LeCun in 1998, as shown in Figure 2.4. LeNet's structure consists of 2 (convolution + max-pooling) layers and 2 fully connected layers, and the output is a softmax layer. Let's take a look at LeNet's architecture details for the MNIST data (accuracy up to 99%).


Figure 2.4: LeNet architecture

AlexNet 2012. Developed by Alex Krizhevsky in 2012 for the ImageNet 2012 competition.

The architecture is relatively similar to LeNet-5, as shown in Figure 2.5. The difference is that this network is designed to be larger and wider, with 60,000,000 parameters (1000 times more than LeNet-5). The architecture is shown below:

Figure 2.5: AlexNet architecture

VGGNet 2014. Developed in 2014, VGGNet is a deeper but simpler variant of the convolutional architecture commonly found in CNNs. The architecture is shown in Figure 2.6. Although its layers are simpler than those of LeNet and AlexNet, the network is larger and deeper.

Number of parameters: 138,000,000


GoogLeNet (Inception) 2014. To save computation, 1x1 convolutions are used to reduce the input channel depth. For each cell, the 1x1, 3x3 and 5x5 filters are used to extract features from the input. Below is the form of a cell, as shown in Figure 2.7.


Figure 2.7: Inception cell

Below is the Inception network architecture

The network is built by stitching together Inception cells, as shown in Figure 2.8.

Figure 2.8: Inception architecture

ResNets 2015. ResNet was developed by Microsoft in 2015 with the paper "Deep residual learning for image recognition" and won the ImageNet ILSVRC 2015 competition with an error rate of 3.57%. ResNet has a structure similar to VGG with many stacked layers making the model deeper. Unlike VGG, ResNet comes in deeper variants such as 34, 50, 101 and 152 layers. ResNet solves a problem of traditional deep learning: it makes it possible to easily train models with hundreds of layers.

ResNet has an architecture of many residual blocks; the main idea is to skip a layer by adding a shortcut connection to the previous layer, as shown in Figure 2.9. The idea of the residual block is to feed the input x forward through a few layers (conv-max-conv) to obtain F(x), then add x to get H(x) = F(x) + x. The model becomes easier to learn when we add features from the previous layer.

Figure 2.9: ResNets block

The ResNet model is depicted in Figure 2.10 below.

Figure 2.10: ResNets architecture
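A minimal residual block illustrating the skip connection H(x) = F(x) + x described above could look as follows in PyTorch (a simplified sketch, not the exact blocks of the original ResNet):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # F(x): two 3x3 convolutions that preserve the spatial size.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # H(x) = F(x) + x, the skip connection

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))  # output shape equals input shape
```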

2.3 Object Detection

2.3.1 Overall

In image classification tasks, we assume that there is only one major object in the image and we only focus on how to recognize its category. However, there are often multiple objects of interest in an image. We not only want to know their categories, but also their specific positions in the image (bounding boxes). In computer vision, we refer to such tasks as object detection.


Object detection has been widely applied in many fields. For example, self-driving needs to plan traveling routes by detecting the positions of vehicles, pedestrians, roads, lanes, and obstacles in the captured video images. Besides, robots may use this technique to detect and localize objects of interest throughout their navigation of an environment. Moreover, security systems may need to detect abnormal objects, such as intruders or bombs.

2.3.2 RCNN

This work from Girshick et al. (2014) is also known as R-CNN. It is perhaps the most popular work among deep learning approaches to object detection. The authors use a search algorithm to produce a reasonable number of about 2000 regions. These regions are called region proposals. A feature extractor takes each proposal as input and produces a corresponding feature map. Finally, an SVM is adopted to classify the regions.

In R-CNN, an embedding network extracts features for each of the 2000 region proposals within the underlying image. A classification network is then applied to classify the class label for each feature vector, as shown in Fig. 2.11.

Figure 2.11: R-CNN (figure from Girshick et al (2014))

2.3.3 Fast R-CNN

In this paper, Girshick (2015) proposes a technique called Fast Region-based Convolutional Network, or Fast R-CNN, to handle object detection problems. The work focuses on improving training and testing speed, being many times faster than R-CNN at the training and testing stages respectively. The author points out three significant drawbacks of the earlier approach: its multi-stage training pipeline, the complexity of training, and the speed of detection.

A convolution-based network takes an entire image and several object proposals as input.

The resulting feature map is then ROI-pooled to produce a set of fixed-size vectors, which are then post-processed into class probabilities and bounding-box offsets, as shown in Fig. 2.12.

Figure 2.12: Fast R-CNN (figure from Girshick (2015))

2.3.4 Faster R-CNN

Fast R-CNN first feeds an image and a set of object proposals to a convolution-based feature extractor. The output feature maps are then ROI-pooled into a fixed-size feature vector for each object proposal. Finally, a few fully connected layers output class probabilities and bounding-box offsets. Ren et al. (2015) come up with a much faster version of R-CNN in which the region proposal stage is learnable. More specifically, an embedding network extracts features from an input image. The resulting feature map is then fed to a region detector which produces region proposals for the image. Finally, the original feature map is ROI-pooled by the region proposals from the detector.

The region proposals are produced by a learnable module, which significantly improves time performance, as shown in Fig. 2.13.


Figure 2.13: Faster R-CNN (figure from Ren et al. (2015))
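For reference, a pretrained Faster R-CNN detector can be used off the shelf, for example through torchvision (a usage sketch only; this is not necessarily the detector configuration used later in this thesis):

```python
import torch
import torchvision

# Load a Faster R-CNN with a ResNet-50 FPN backbone pretrained on COCO.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)          # a dummy RGB image in [0, 1]
with torch.no_grad():
    outputs = model([image])             # list with one dict per image

boxes = outputs[0]["boxes"]              # (N, 4) bounding boxes
labels = outputs[0]["labels"]            # (N,) class indices
scores = outputs[0]["scores"]            # (N,) confidence scores
```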

2.4 Recurrent Neural Network

2.4.1 Introduction

Natural language processing problems, and sequence problems in general, involve data of non-fixed length that depends on position and time; recurrent networks were created for such data. In a traditional neural network, all inputs and outputs are independent of each other, that is, they are not chained together. But such models are not suitable for many problems. For example, if we want to guess the next word that may appear in a sentence, we also need to know how the previous words appear in turn. RNNs are called recurrent because they perform the same task for all elements of a sequence, with the output depending on previous computations. In other words, an RNN is capable of remembering previously computed information.

Figure 2.14 depicts a typical RNN structure:


Figure 2.14: Recurrent Neural Network

To better understand how RNNs work specifically, this section presents the mechanism for propagating a sequence of data through the network. We consider the basic network architecture of the RNN, which is represented in the form of a recursive formula as follows (note that vectors are represented as columns):

a0 = 0
ai = f(Wax · xi + Waa · ai−1 + ba)
ŷi = g(Wya · ai + by)

The variables and parameters in the above recursive formula are defined as:

• ai is the hidden state of the model; a0 = 0 is the convention for the starting state.

• ŷi is the predicted output value for step i.

• xi is the input value at step i.

• f and g are suitable activation functions. Usually, we choose f(x) = tanh(x) or f(x) = ReLU(x) and g(x) = σ(x) or g(x) = Softmax(x), depending on the design and problem requirements.

• W and b are the weights of the model, specifically:

– Wax is the matrix of the linear mapping that transforms x into a component of a.

– Waa is the matrix of the linear mapping that transforms the old state a into a component of the new state a.

– ba is the bias of the transformation producing a.

– Wya is the matrix of the linear mapping that transforms the current state a to the output ŷ.

– by is the bias of the transformation producing ŷ.

In computational graph form, a basic RNN node can be represented by:

Figure 2.15: RNN calculation in one node.
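The recurrence above can be written directly as a step function; the following NumPy sketch follows the ai, xi, Wax, Waa, Wya, ba, by notation of this section (sizes are arbitrary examples):

```python
import numpy as np

def rnn_step(x_i, a_prev, W_ax, W_aa, W_ya, b_a, b_y):
    # a_i = f(Wax·x_i + Waa·a_{i-1} + b_a), here with f = tanh
    a_i = np.tanh(W_ax @ x_i + W_aa @ a_prev + b_a)
    # y_hat_i = g(Wya·a_i + b_y), here with g = sigmoid
    y_hat_i = 1.0 / (1.0 + np.exp(-(W_ya @ a_i + b_y)))
    return a_i, y_hat_i

# Example with input dim 4, hidden dim 3, output dim 2.
rng = np.random.default_rng(0)
W_ax, W_aa = rng.normal(size=(3, 4)), rng.normal(size=(3, 3))
W_ya, b_a, b_y = rng.normal(size=(2, 3)), np.zeros(3), np.zeros(2)

a = np.zeros(3)                          # a_0 = 0
for x in rng.normal(size=(5, 4)):        # a sequence of 5 inputs
    a, y_hat = rnn_step(x, a, W_ax, W_aa, W_ya, b_a, b_y)
```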


2.4.2 Application of RNN

Figure 2.16: RNN architectures

• One-to-one: This is a regular neural network.

• One-to-many: With fixed input data, the output is a sequence. For example, the problem of generating paragraphs by topic: the input is a certain topic, or a word that starts a sentence, and the model is required to generate a paragraph or sentence.

• Many-to-one: The input is a sequence and the output is fixed. For example, the problem of "predicting the next word" or the problem of semantic classification of sentences (positive or negative comments), where the output can be a vector or a number such as the probability of the sentence being positive or not.

• Asymmetric sequence-to-sequence, or many-to-many format: The input and output data are both sequences, but they have different lengths and there is no one-to-one input-output matching. A common example is the machine translation problem: given a text string in language A as input, the system must return a text string in language B as output. It is clear that the same sentence in different languages will have different representations (sentence length, word count, grammar, ...).

• Symmetric many-to-many format: The input and output data are both sequences, have the same length, and have corresponding input-output matching at each step. Consider the problem of identifying word types in a sentence. Since each word belongs to only one type, we have corresponding symmetric input-output pairs. Moreover, the word type of a word also depends on the surrounding words and the order in which words appear in the sentence. Therefore, we classify this problem as symmetric many-to-many.

2.4.3 Attention Mechanism

2.4.3.1 Encoder - decoder

• Encoder: the phase that converts the input into learned features that are capable of supporting the learning task. For a neural network model, the encoder is the hidden layers. For a CNN model, the encoder is a sequence of convolutional + max-pooling layers. For an RNN model, the encoder consists of the Embedding and Recurrent Neural Network layers.

• Decoder: the output of the encoder is the input of the decoder. This phase aims to find the probability distribution from the features learned in the encoder and then determine the label of the output. The result can be a single label for classification models or a chronological sequence of labels for seq2seq models.

In NLP, the seq2seq model is a form of encoder-decoder, like the asymmetric many-to-many form mentioned above. Figure 2.17 is an example.


Figure 2.17: Encoder-Decoder

2.4.3.2 Encoder - decoder with attention

The seq2seq model is a sequence model, so it is ordered in time. In a machine translation task, the words in the input will have a greater relationship with the words in the output at the same position. Therefore, attention, in a simple way, helps the algorithm place a greater focus on (input, output) word pairs if they have similar or nearly equivalent positions, as in the seq2seq model with attention shown in Figure 2.18.


Figure 2.18: Encoder-Decoder with Attention

2. The scores after step 1 have not been normalized. To form a probability distribution, we pass them through the softmax function and obtain the attention weights:

αts = exp(score(ht, h̄s)) / Σs′=1..S exp(score(ht, h̄s′))


4. Calculate the attention vector used to decode the corresponding word in the target language. The attention vector is a combination of the context vector and the hidden state in the decoder. In this way, the attention vector does not only learn from the hidden state of the last unit, but also learns from all the words at other positions through the context vector. The formula for calculating the attention vector from the hidden state is similar to calculating the output of an input gate layer in an RNN:

at = f(ct, ht) = tanh(Wc[ct, ht])

The notation [ct, ht] is the concatenation of the two vectors ct and ht along their length. Suppose ct ∈ Rc and ht ∈ Rh; then the vector [ct, ht] ∈ Rc+h and Wc ∈ Ra×(c+h), where a is the length of the attention vector. The matrix we need to train is Wc.
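A compact sketch of the computation in the steps above (NumPy; the score function is taken here to be a dot product, which is one common choice and an assumption of this example rather than the only option):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(h_t, h_s_all, W_c):
    # h_t: decoder hidden state (h,), h_s_all: encoder states (S, h)
    scores = h_s_all @ h_t                            # score(h_t, h_s) via dot product
    alpha = softmax(scores)                           # attention weights alpha_ts
    c_t = alpha @ h_s_all                             # context vector: weighted sum
    a_t = np.tanh(W_c @ np.concatenate([c_t, h_t]))   # a_t = tanh(Wc [c_t, h_t])
    return a_t, alpha

# Example with hidden size h = 4, S = 6 encoder positions, attention size a = 5.
rng = np.random.default_rng(0)
h_t = rng.normal(size=4)
h_s_all = rng.normal(size=(6, 4))
W_c = rng.normal(size=(5, 8))                         # Wc in R^{a x (c + h)}
a_t, alpha = attention_step(h_t, h_s_all, W_c)
```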

2.5 Transformer

2.5.1 Introduction

The Transformer is a model from the article "Attention is all you need" (Vaswani et al. (2017)) that applies attention to sequence tasks (and was later applied to other tasks as well). The Transformer architecture is shown in Figure 2.19.

Figure 2.19: Transformer: attention is all you need


This architecture consists of 2 parts: the encoder on the left and the decoder on the right.

• Encoder: a stack of 6 identical layers. Each layer consists of 2 sub-layers. The first sub-layer is multi-head self-attention. The second is simply a fully-connected feed-forward layer.

• Decoder: the decoder is also a stack of 6 layers. The architecture is similar to the sub-layers in the encoder, except that one more sub-layer representing the attention distribution is added in the first position. This layer is no different from the multi-head self-attention layer except that it is adjusted so as not to bring future words into attention.

2.5.2 Attention Mechanism

Scaled dot-product attention. Each input token goes through the embedding layer to obtain its embedding vector, which is then processed in the Transformer's encoder as shown in Figure 2.20.

Figure 2.20: Transformer calculation

Self-attention is the idea of attending from each word to the whole sentence, as shown in Figure 2.21. In the first step, given the embedding vector, we create three vectors Query, Key and Value by multiplying the embedding vector by three matrices (with trainable parameters), as shown in Figure 2.22.


Q, K, V = Wq· X, Wk· X, Wv· X (2.7)

Figure 2.21: Transformer calculation

Figure 2.22: Transformer calculation


Calculate the attention score by multiplying the two vectors (matrices when generalized to a whole sentence) Q and K, then normalize by dividing by the square root of the key dimension. Take the Softmax over all the scores from all the tokens, multiply the scores by the Value vectors, and sum them to produce the output, as shown in Figure 2.23.

Figure 2.23: Transformer calculation
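A minimal sketch of the scaled dot-product self-attention computation described above (PyTorch; shapes are example values, and the row-vector convention X·W is used for Eq. (2.7)):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (sequence_length, d_k)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # attention scores, scaled
    weights = F.softmax(scores, dim=-1)             # softmax over all tokens
    return weights @ V                              # weighted sum of the values

seq_len, d_model = 7, 64
X = torch.randn(seq_len, d_model)                   # token embeddings
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = X @ W_q, X @ W_k, X @ W_v                 # Eq. (2.7)
out = scaled_dot_product_attention(Q, K, V)         # shape (7, 64)
```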

After self-attention, the output goes through the Feed Forward layer with the formula:

X = LayerNorm(x + Sublayer(x)) = LayerNorm(x + z) (2.8)

Multi-head Attention. Instead of applying only 1 layer of self-attention, we use many self-attention heads to learn richer semantics, with the formula:

MultiHead(Q, K, V) = concatenate(head1, head2, ..., headh) WO (2.9)


headi = Attention(Qi, Ki, Vi) (2.10)

This is illustrated in Figure 2.24:

Figure 2.24: Transformer multihead attention
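Multi-head attention is also available as a building block in PyTorch; a short usage sketch with example dimensions:

```python
import torch
import torch.nn as nn

embed_dim, num_heads, seq_len = 64, 8, 7
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, seq_len, embed_dim)       # (batch, sequence, embedding)
# Self-attention: the same tensor is used as query, key and value.
out, attn_weights = mha(x, x, x)
print(out.shape)            # (1, 7, 64)
print(attn_weights.shape)   # (1, 7, 7): attention averaged over the 8 heads
```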

Figure 2.25: BERT model
