Master's Thesis in Computer Science: Application of Visual Question Answering Using BERT Integrated with Knowledge Base to Answer Extensive Question

VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY


THIS THESIS IS COMPLETED AT

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU-HCM

Master’s Thesis Committee:

1. Chairman: Assoc. Prof. Dr. Võ Thị Ngọc Châu
2. Secretary: Dr. Phan Trọng Nhân
3. Examiner 1: Dr. Trần Tuấn Anh
4. Examiner 2: Dr. Bùi Thanh Hùng
5. Commissioner: Dr. Bùi Công Giao

Approval of the Chairman of the Master's Thesis Committee and the Dean of the Faculty of Computer Science and Engineering after the thesis has been corrected (if any).

CHAIRMAN OF THESIS COMMITTEE

HEAD OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING

VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

THE TASK SHEET OF MASTER’S THESIS

Full name: Phạm Điền Khoa
Student ID: 2170462

I. THESIS TITLE (In Vietnamese): Ứng Dụng Của Visual Question Answering Sử Dụng BERT Tích Hợp Với Knowledge Base Để Trả Lời Câu Hỏi Mở Rộng.

II. THESIS TITLE (In English): APPLICATION OF VISUAL QUESTION ANSWERING USING BERT INTEGRATED WITH KNOWLEDGE BASE TO ANSWER EXTENSIVE QUESTION

III. TASKS AND CONTENTS:

- Researching Visual Question Answering in natural language processing
- Proposing suitable approaches for Visual Question Answering
- Experimenting and evaluating proposed approaches

IV. THESIS START DATE: 06/02/2023

V. THESIS COMPLETION DATE: 09/06/2023

VI. SUPERVISORS: Assoc. Prof. Dr. Quản Thành Thơ, Dr. Bùi Hoài Thắng

Ho Chi Minh City, date ………

SUPERVISOR 1 (Full name and signature)

SUPERVISOR 2 (Full name and signature)

CHAIR OF PROGRAM COMMITTEE

(Full name and signature)

DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING

(Full name and signature)


ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to my parents for their unwavering support and encouragement throughout my academic journey. Their love, patience, and belief in me have been a constant source of motivation and strength.

I am also immensely grateful to my thesis instructor, Assoc. Prof. Quản Thành Thơ, for his invaluable guidance, expertise, and continuous support throughout the research process. His insightful feedback, constructive criticism, and dedication to excellence have greatly contributed to the success of this thesis.

I would like to extend my appreciation to all the faculty members and staff of the Department of Computer Science for providing me with a conducive learning environment and valuable resources. Their expertise and assistance have been instrumental in shaping my research work.

Finally, I would like to thank my friends and colleagues for their friendship, encouragement, and insightful discussions, which have enriched my understanding and perspectives on the subject matter.

To everyone who has played a part, big or small, in the completion of this master's thesis, I am sincerely thankful for your support and assistance.

ABSTRACT

Fusing different modalities, such as image and text, to obtain important information has long been an issue in artificial intelligence. Visual Question Answering (VQA) is an emerging field that aims to develop intelligent systems capable of understanding and answering questions based on visual content. This dissertation presents a comprehensive study on VQA with the integration of a knowledge base, focusing on the fusion of language understanding, image processing, and knowledge retrieval techniques.

• The primary objective of this research is to investigate the effectiveness of different baselines in the Knowledge-based Visual Question Answering task and explore their strengths and weaknesses. Two baselines were considered: KBVQA with BERT base and CNN, and KBVQA with BERT large and CLIP. These models were evaluated using the ViQuAE dataset, which contains pre-classified questions with ground-truth answers.

• To improve the KBVQA system, several future directions are proposed. First, training the BERT models with larger and more diverse datasets could enhance their language understanding capabilities. Additionally, efforts should be made to address the challenges associated with object detection and face recognition, which were not fully implemented within the limited timeframe of this research.

In conclusion, this dissertation contributes to the field of Knowledge-based Visual Question Answering by evaluating and comparing different baseline models. It highlights the effectiveness of BERT large and CLIP in achieving accurate and relevant answers. The findings and proposed future directions provide valuable insights for researchers and practitioners interested in advancing KBVQA systems.

TÓM TẮT LUẬN VĂN

Fusing different modalities, such as text and images, to obtain necessary information has long been a problem in the field of artificial intelligence. Visual Question Answering (VQA) is a recently emerging field that aims to develop AI systems capable of understanding and answering questions based on images. This thesis presents a study on VQA integrated with a Knowledge Base, focusing on the combination of language understanding, image processing, and information retrieval techniques.

• The main objective of this study is to investigate the effectiveness of different baselines in the knowledge-based visual question answering task and to explore their strengths and weaknesses. Two baselines were considered: KBVQA with BERT base and CNN, and KBVQA with BERT large and CLIP. These models were evaluated using the ViQuAE dataset, which consists of pre-classified questions with ground-truth answers.

• To improve the KBVQA system, several future directions are proposed. First, training the BERT model with larger and more diverse datasets could enhance its language understanding capability. In addition, effort is needed to address the challenges related to object detection and face recognition, which were not fully implemented within the limited timeframe of this study.

In summary, this thesis contributes to the field of Knowledge-based Visual Question Answering by evaluating and comparing different baselines. The experiments show the effectiveness of the baseline using BERT large and CLIP in producing accurate and relevant answers. Further research in the future will provide valuable insights for researchers and practitioners interested in improving KBVQA systems.

THE COMMITMENT OF THE THESIS’S AUTHOR

I declare that this thesis, written under the supervision of Assoc. Prof. Quan Thanh Tho, was created to suit the needs of society and my abilities to obtain information. All external support has been documented, referenced, and cited.

AUTHOR

(Full name and signature)


TABLE OF CONTENTS

CHAPTER I: INTRODUCTION
1.1 RESEARCH PROBLEM
1.2 OVERVIEW OF VQA AND KBVQA PROBLEM
1.2.1 Visual Question Answering (VQA)
1.2.2 Knowledge-based Visual Question Answering (KBVQA)
1.3 TARGET AND SCOPE OF THE THESIS
1.4 LIMITS OF THE THESIS
1.5 CONTRIBUTION OF THE THESIS
1.6 RESEARCH SUBJECTS
CHAPTER 2: BACKGROUND KNOWLEDGE
2.1 CONVOLUTIONAL NEURAL NETWORKS (CNNS)
2.1.1 Architecture of CNN
2.1.2 Residual Network (ResNet)
2.2 LONG SHORT-TERM MEMORY (LSTM)
2.2.1 Overview of Recurrent Neural Networks (RNNs)
2.2.2 Long short-term memory
2.3 BERT
2.4 CONTRASTIVE LANGUAGE-IMAGE PRE-TRAINING (CLIP)
2.5 WIKIPEDIA KNOWLEDGE BASE
CHAPTER 3: RELATED WORKS
3.1 APPROACHES OF KNOWLEDGE-BASED VISUAL QUESTION ANSWERING
4.2.1 Metric Evaluation: Precision, Recall and F1 score
4.2.2 Dataset for evaluation
4.5 FIRST BASELINE: BERT BASE WITH CNN
4.5.1 Motivation and idea
4.5.2 System Description
4.5.3 Result and Discussion
4.6 SECOND BASELINE: BERT LARGE AND CLIP
4.6.1 Motivation and idea
4.6.2 System Description
4.6.3 Result and Discussion
4.7 EXPERIMENT WITH GPT-2 FOR 2ND BASELINE
4.7.1 Motivation and idea
4.7.2 System Description
4.7.3 Result and Discussion
CHAPTER 5: CONCLUSION
REFERENCES

LIST OF TABLES

Table 1.1: Computer vision sub-tasks required to be solved by VQA
Table 4.1: Google Colab's Specs
Table 4.2: GPUs available in Colab, Colab Pro, and Colab Pro+
Table 4.3: KBVQA system (version BERT base + CNN)'s parameters
Table 4.4: Result of KBVQA system (BERT base + CNN)
Table 4.5: KBVQA system (version BERT large + CLIP)'s parameters
Table 4.6: Result of KBVQA system (BERT large + CLIP)

TABLE OF FIGURES

Figure 1.1: Illustration of VQA's tasks
Figure 1.2: The question and relevant item in the Knowledge Base
Figure 2.1: Convolutional Neural Network
Figure 2.2: Comparison of 20-layer vs 56-layer architecture
Figure 2.3: Skip (Shortcut) connection of ResNet
Figure 2.4: ResNet-34 architecture
Figure 2.5: Recurrent Neural Networks
Figure 2.6: Long-short-term-memory's structure
Figure 2.7: BERT's Encoder
Figure 2.8: BERT model
Figure 2.9: Contrastive Pre-training CLIP
Figure 2.10: Create dataset classifier from label text and use it for zero-shot prediction
Figure 3.1: Knowledge-based Visual Question Answering's Taxonomy
Figure 4.1: Types of question that you can ask between traditional VQA and Knowledge-based VQA
Figure 4.2: The overview of "Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering" VQA system
Figure 4.3: "KVQA: Knowledge-Aware Visual Question Answering" VQA system
Figure 4.4: KBVQA system with BERT base and CNN
Figure 4.5: Generated answer about Barack Obama from the KBVQA (BERT base + CNN) system
Figure 4.6: Bad generated answer about Isaac Newton from the KBVQA (BERT base + CNN) system
Figure 4.7: KBVQA system with BERT and CLIP
Figure 4.8: Good generated answer about Amelia Earhart from the KBVQA (BERT large + CLIP) system
Figure 4.9: KBVQA system with BERT, CLIP and GPT-2
Figure 4.10: Generated answer with more information from KBVQA system (BERT Large + CLIP + GPT-2) about Elvis Presley

CHAPTER I: INTRODUCTION

1.1 Research Problem:

Fusing multiple modalities, such as image and text, to retrieve relevant information is a long-standing problem in the field of artificial intelligence. In recent years, significant progress has been made in many subdomains of machine learning. Neural networks are now capable of solving Computer Vision and Natural Language Processing tasks in a much different way, with greater speed and accuracy. One of the most challenging and promising tasks in this area is Visual Question Answering (VQA), which aims to automatically generate an accurate answer to a natural language question about an image. It was once thought that developing a computer vision system capable of answering arbitrary natural language questions about images was an ambitious but intractable goal. However, since 2016, there has been tremendous progress in developing systems with these capabilities. VQA systems aim to correctly answer natural language questions about an image input and comprehend the contents of an image in the same way that humans do, while also communicating effectively about that image in natural language.

With the increasing availability of large-scale annotated datasets, deep learning-based approaches have recently achieved remarkable progress in VQA. However, most existing VQA models only focus on answering questions about objects or actions in images; one of the main challenges is the lack of knowledge representation in VQA systems, which limits their ability to handle complex questions that require a deeper understanding of the image content. One idea is to use the visual question answering system to answer more extensive questions about the information of the entities in the image, and to do that, the idea of retrieving information contained in a knowledge base has been applied to VQA. Combining all of this, a new task called Knowledge-based Visual Question Answering (KBVQA) has been proposed, which requires a model to retrieve relevant information about an entity in the image from a knowledge base and use it to answer questions.

To address the limitation of traditional Visual Question Answering that we mentioned before, Knowledge-based Visual Question Answering (KBVQA) has emerged as another, more specialized direction for the VQA field. KBVQA requires external knowledge (knowledge bases - KBs) beyond the image to answer the question. By utilizing structured knowledge sources, KBVQA systems can reason about the relationships between visual concepts and provide more accurate and detailed answers to complex questions.

The integration of knowledge representation techniques in KBVQA has the potential to significantly improve the performance of VQA systems and enable them to tackle more complex and sophisticated questions. This master's thesis aims to investigate the challenges and opportunities of KBVQA, explore the existing approaches and techniques in the field, and propose solutions to enhance the accuracy and efficiency of KBVQA systems.

1.2 Overview of VQA and KBVQA problem

1.2.1 Visual Question Answering (VQA)

Visual question answering systems attempt to correctly answer natural language questions about an image input. The overarching goal of this problem is to create systems that can comprehend the contents of an image in the same way that humans do and communicate effectively about that image in natural language. This is a difficult task because image-based models and natural language models must interact and complement each other.

Figure 1.1: Illustration of VQA's tasks

The problem is widely regarded as AI-complete: one that addresses the problem of Artificial General Intelligence, namely making computers as intelligent as humans.

Table 1.1 gives an idea of the sub-problems involved in the task of visual question answering:

Table 1.1: Computer vision sub-tasks required to be solved by VQA

Computer Vision Task | Representative VQA Question
Attribute classification | What color is the umbrella?
Object recognition | What is in the image?
Spatial relationships among objects | What is between the cat and the sofa?
Commonsense reasoning | Does this have 20/20 vision?
Knowledge-base reasoning | Is this a vegetarian pizza?

People can now create models that can recognize objects in images with high accuracy. However, we are still a long way from human-level image comprehension. When we look at images, we see objects but also understand how they interact and can tell their state and properties. Visual question answering (VQA) is particularly intriguing because it allows us to learn about what our models actually see. We present the model with an image and a question in the form of natural language, and the model generates an answer again in the form of natural language.

1.2.2 Knowledge-based Visual Question Answering (KBVQA):

Knowledge-Based Visual Question Answering (KBVQA) is a new task that combines the capabilities of computer vision, natural language processing, and knowledge representation. KBVQA systems aim to answer questions about entities in images by utilizing external knowledge sources, such as the Wikipedia knowledge base. Therefore, the interaction between the modalities is paramount to retrieve information and must be captured with complex fusion models. To address the task, one must thus retrieve relevant information from a KB (Knowledge Base). The main challenge in KBVQA is to enable machines to reason about the relationships between visual concepts and knowledge sources to provide accurate and detailed answers to complex questions. This contrasts with standard Visual Question Answering, where questions target the content of the image (e.g. the color of an object or the number of objects).

KBVQA is an extension of the Visual Question Answering (VQA) task, which aims to generate accurate answers to natural language questions about an image. However, the limitation of traditional VQA models is their inability to handle complex questions that require a deeper understanding of the image content. By utilizing structured knowledge sources, KBVQA systems can address this limitation and enhance the performance of VQA systems. In KBVQA, both text and image modalities bring useful information that must be combined, together with knowledge base querying techniques. Therefore, the task is more broadly related to Multimodal Information Retrieval (IR) and Multimodal Fusion.

Figure 1.2: The question and relevant item in the Knowledge Base

Recent advances in deep learning, natural language processing, and knowledge representation have led to significant progress in KBVQA. Several approaches have been proposed to address the challenges of KBVQA, such as knowledge graph-based methods, multi-modal fusion methods, and attention-based methods. These approaches aim to improve the accuracy and efficiency of KBVQA systems and enable them to tackle more complex and sophisticated questions.

In this master's thesis, I will investigate the challenges and opportunities of KBVQA, explore the existing approaches and techniques in the field, and propose solutions to enhance the accuracy and efficiency of KBVQA systems using a multi-modal fusion method, consisting of BERT (Bidirectional Encoder Representations from Transformers) as a language model working with CNNs (Convolutional Neural Networks) or CLIP (Contrastive Language-Image Pre-Training) as image processing models, and the Wikipedia Knowledge Base. I will also evaluate the performance of different KBVQA models on different baselines and datasets and analyze the results to gain insights into the strengths and weaknesses of each approach.

1.3 Target and Scope of the Thesis

The target of the thesis is to research and build a Knowledge-based Visual Question Answering system using deep learning methods and natural language processing techniques. Specifically:

- Understand and use deep learning models and processing techniques in the natural language and vision domains.

- Understand and use the combination of models with different functions, which is called multi-modal fusion, especially the late fusion technique.

- Understand and apply forms of information retrieval, and apply information retrieval to the knowledge base.

- Understand problem-solving methods, especially recent ones using modern deep learning models like BERT and CLIP. From there, the advantages and disadvantages of each method are shown.

- Make suggestions that can improve the performance of the system.

- After the thesis, the student has a more accurate view of natural language processing in particular and deep learning and machine learning in general, and better understands the problems, challenges and feasibility of applying deep learning and machine learning to solve a real-world problem.

From the above objectives, the student set out the following tasks to be performed while carrying out the thesis:

- Learn the problem of visual question answering combined with a knowledge base, related works, problem-solving methods, and the advantages and disadvantages of the methods.

- Propose models to improve the accuracy of the knowledge-based visual question answering task.

- Experiment and evaluate the results of the proposed systems.

- Raise the outstanding issues and propose future research.

1.4 Limits of the Thesis

Knowledge-based Visual Question Answering is a complex problem and has many tasks and methods, so the content of the thesis will be limited as follows:

- Focus on Knowledge-based visual question answering in the direction of multimodal fusion, especially the late fusion technique.

- The language of the datasets is English.

- Deep learning models: CNN, LSTM, CLIP, BERT, GPT-2.

- The model is evaluated based on precision, recall, and F1-score for the knowledge-base query and for answering the question.

1.5 Contribution of the Thesis

In this thesis, the student proposes a method to improve the answers from a VQA system by retrieving information with a knowledge base retrieval method and combining the BERT and CLIP models in the system. GPT-2 is integrated into the answering module to expand on information related to the entity in question and provide an appropriate answer about that entity.

1.6 Research subjects:

This thesis consists of 5 chapters:

- Chapter 1 Introduction: Introduce the development of the current multimodal fusion trend, describe the knowledge-based visual question answering problem, and the commonly used datasets as well as evaluation methods.

- Chapter 2 Background: Discuss the basic knowledge background in deep learning, from Convolutional Neural Networks (CNNs) and Long short-term memory (LSTM) to Bidirectional Encoder Representations from Transformers (BERT), Contrastive Language-Image Pre-Training (CLIP) and Generative Pre-trained Transformer 2 (GPT-2).

- Chapter 3 Related works: Discuss related studies, deep learning models and multimodal fusion methods, thereby giving direction for the topic.

- Chapter 4 The Proposed Models: Chapter 4 specifically describes the student's proposed models for the Knowledge-based Visual Question Answering problem and the experimental results.

- Chapter 5 Conclusion: Summarize the contributions of the thesis, the outstanding issues of the KBVQA problem, and discuss future research.

CHAPTER 2: BACKGROUND KNOWLEDGE

Knowledge-based Visual Question Answering (KBVQA) is a recent and challenging research area in computer vision and natural language processing that requires the integration of knowledge representation techniques to enable machines to answer complex and sophisticated questions about visual content. In this thesis, we focus on the main tasks involved in KBVQA, including natural language question processing by Bidirectional Encoder Representations from Transformers (BERT) or Long Short-Term Memory (LSTM), image understanding using Convolutional Neural Networks (CNNs), and the use of advanced pre-trained models such as Contrastive Language-Image Pre-Training (CLIP) together with OpenAI's GPT-2, to generate relevant and accurate answers to questions.

2.1 Convolutional Neural Networks (CNNs):

Convolutional neural networks are a specialized type of artificial neural networks that use a mathematical operation called convolution in place of general matrix multiplication in at least one of their layers. They are specifically designed to process pixel data and are used in image recognition and processing.

A convolutional neural network (CNN, or ConvNet) is a type of artificial neural network (ANN) that is most commonly used in deep learning to analyze visual imagery. CNNs are also referred to as Shift Invariant or Space Invariant Artificial Neural Networks (SIANN), due to the shared-weight architecture of the convolution kernels or filters that slide along input features and provide translation-equivariant responses known as feature maps.

Due to the downsampling operation used on the input, most convolutional neural networks are not translationally invariant. Image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, natural language processing, brain-computer interfaces, and financial time series are just a few of the applications.

CNNs are regularized versions of multilayer perceptrons. Multilayer perceptrons are typically fully connected networks, in which each neuron in one layer is connected to all neurons in the next layer. These networks' "full connectivity" makes them prone to data overfitting. Typical methods of regularization, or preventing overfitting, include penalizing parameters during training (such as weight decay) or trimming connectivity. CNNs take a different approach to regularization: they use the hierarchical pattern in data to assemble patterns of increasing complexity using smaller and simpler patterns embossed in their filters. As a result, CNNs are at the lower end of the connectivity and complexity scale.

Convolutional neural networks were inspired by biological processes in that the pattern of connectivity between neurons resembles the organization of the animal visual cortex. Individual cortical neurons only respond to stimuli in a small region of the visual field known as the receptive field. Different neurons' receptive fields partially overlap to cover the entire visual field.

In comparison to other image classification algorithms, CNNs require little pre-processing. This means that, unlike traditional algorithms, the network learns to optimize the filters (or kernels) through automated learning. This lack of reliance on prior knowledge and human intervention in feature extraction is a significant benefit.

Figure 2.1: Convolutional Neural Network

2.1.1 Architecture of CNN:

A convolutional neural network consists of an input layer, hidden layers and an output layer. In any feed-forward neural network, any middle layers are called hidden because their inputs and outputs are masked by the activation function and final convolution. In a convolutional neural network, the hidden layers include layers that perform convolutions. Typically this includes a layer that performs a dot product of the convolution kernel with the layer's input matrix. This product is usually the Frobenius inner product, and its activation function is commonly ReLU. As the convolution kernel slides along the input matrix for the layer, the convolution operation generates a feature map, which in turn contributes to the input of the next layer. This is followed by other layers such as pooling layers, fully connected layers, and normalization layers.

2.1.2 Residual Network (ResNet):

Following the first CNN-based architecture (AlexNet), which won the ImageNet 2012 competition, each subsequent winning architecture employs more layers in a deep neural network to reduce error rates. This works for fewer layers, but as the number of layers increases, we encounter a common problem in deep learning known as the vanishing/exploding gradient. As a result, the gradient becomes 0 or too large. As the number of layers increases, so does the training and testing error rate.

Figure 2.2 Comparison of 20-layer vs 56-layer architecture

The above plot shows that a 56-layer CNN architecture produces more error on both the training and testing datasets than a 20-layer CNN architecture. After further investigation into the error rate, the authors concluded that it is caused by a vanishing/exploding gradient.

ResNet, which was proposed in 2015 by Microsoft Research researchers, introduced a new architecture called Residual Network.

Residual Network: This architecture introduced the concept of Residual Blocks to solve the problem of the vanishing/exploding gradient. In this network, we employ a technique known as skip connections. The skip connection connects layer activations to subsequent layers by skipping some layers in between. This results in the formation of a residual block. These residual blocks are stacked together to form ResNets.

Instead of having the layers learn the underlying mapping directly, this network allows them to fit the residual mapping. So, rather than using, say, H(x) as the initial mapping, let the network fit

F(x) := H(x) - x, which gives H(x) := F(x) + x

Figure 2.3 Skip (Shortcut) connection of ResNet

The benefit of including this type of skip connection is that if any layer degrades architecture performance, it will be skipped by regularization. As a result, training a very deep neural network is possible without the issues caused by vanishing/exploding gradients. The researchers conducted experiments with 100-1000 layers on the CIFAR-10 dataset.
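As a minimal sketch of such a block (assuming equal input and output channel counts, so no projection shortcut is needed; this is an illustration, not the thesis's own implementation), the skip connection can be written in PyTorch as follows:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal ResNet-style basic block: output = ReLU(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # skip (shortcut) connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))           # F(x)
        out = out + identity                      # H(x) = F(x) + x
        return self.relu(out)
```

Because the shortcut is a simple addition, the gradient can flow directly through the identity path, which is what makes very deep stacks of such blocks trainable.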


A similar approach is known as "highway networks," and these networks, too, use skip connections. These skip connections, like LSTM, make use of parametric gates. These gates control how much data passes through the skip connection. However, this architecture has not provided greater accuracy than the ResNet architecture.

Network Architecture: This network employs a 34-layer plain network architecture inspired by VGG-19, to which the shortcut connections are added. These shortcut connections then transform the architecture into a residual network.

Figure 2.4 ResNet-34 architecture

Image Embeddings by ResNet:

To compute a high-level representation φ of the input image I, I use a pretrained convolutional neural network (CNN) model based on the residual network (ResNet) architecture:

φ = CNN(I)

φ is a 14 × 14 × 2048-dimensional three-dimensional tensor from the last layer of the residual network (ResNet) before the final pooling layer. I also apply l2 normalization to the depth (last) dimension of the image features, which improves learning dynamics.
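A sketch of this feature-extraction step (assuming a ResNet-152 backbone and a 448 × 448 input, which yields the 14 × 14 × 2048 map described above; the backbone choice and file name are illustrative assumptions):

```python
import torch
import torch.nn.functional as F
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ResNet truncated before the final pooling layer, giving a 2048 x 14 x 14 feature map.
resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()

preprocess = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")    # hypothetical input image
with torch.no_grad():
    phi = backbone(preprocess(image).unsqueeze(0))  # shape: (1, 2048, 14, 14)
    phi = F.normalize(phi, p=2, dim=1)              # l2-normalize the depth dimension
```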


2.2 Long Short-Term Memory (LSTM):

2.2.1 Overview of Recurrent Neural Networks (RNNs):

A recurrent neural network (RNN) is a type of artificial neural network in which connections between nodes can form a cycle, allowing output from one node to affect subsequent input to the same node. This enables it to exhibit temporal dynamic behavior. RNNs, which are derived from feedforward neural networks, can process variable-length sequences of inputs using their internal state (memory). As a result, they can be used for tasks like unsegmented, connected handwriting recognition or speech recognition. Recurrent neural networks are Turing complete in theory and can execute arbitrary programs to process arbitrary sequences of inputs.

The term "recurrent neural network" refers to a type of network with an infinite impulse response, whereas "convolutional neural network" refers to a type of network with a finite impulse response. Both types of networks exhibit temporal dynamics. An infinite impulse recurrent network is a directed cyclic graph that cannot be unrolled and replaced with a strictly feedforward neural network, whereas a finite impulse recurrent network can be unrolled.

Figure 2.5 Recurrent Neural Networks

Both finite impulse and infinite impulse recurrent networks can have additional stored states, and the storage can be controlled directly by the neural network. If the storage includes time delays or feedback loops, it can also be replaced by another network or graph. These controlled states are known as gated states or gated memory, and they are used in long short-term memory networks (LSTMs) and gated recurrent units. This is also referred to as a Feedback Neural Network.

2.2.2 Long short-term memory:

Classic RNNs, in theory, can track arbitrary long-term dependencies in input sequences. The problem with vanilla RNNs is computational (or practical) in nature: when back-propagating a vanilla RNN, the long-term gradients can "vanish" (that is, tend to zero) or "explode" (that is, tend to infinity) due to the computations involved, which use finite-precision numbers. Because LSTM units allow gradients to flow unchanged, RNNs with LSTM units partially solve the vanishing gradient problem.

The long short-term memory (LSTM) is a type of artificial neural network used in artificial intelligence and deep learning. Unlike traditional feedforward neural networks, LSTM includes feedback connections. A recurrent neural network (RNN) of this type can process not only single data points (such as images), but also entire data sequences (such as speech or video). LSTM, for example, is useful for unsegmented, connected handwriting recognition, speech recognition, machine translation, robot control, video games, and healthcare. The LSTM neural network has become the most cited neural network of the twentieth century.

Figure 2.6: Long-short-term-memory’s structure

The name of LSTM refers to the analogy that a standard RNN has both "long-term memory" and "short-term memory". The connection weights and biases in the network change once per episode of training, analogous to how physiological changes in synaptic strengths store long-term memories; the activation patterns in the network change once per time-step, analogous to how the moment-to-moment change in electric firing patterns in the brain stores short-term memories. The LSTM architecture aims to provide a short-term memory for RNN that can last thousands of timesteps, thus "long short-term memory".

A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell.

Because there can be lags of unknown duration between important events in a time series, LSTM networks are well-suited to classifying, processing, and making predictions based on time series data. LSTMs were created to address the vanishing gradient problem that can occur when training traditional RNNs. In many applications, LSTM outperforms RNNs, hidden Markov models, and other sequence learning methods due to its relative insensitivity to gap length.

An RNN with LSTM units can be trained in a supervised manner on a set of training sequences, using an optimization algorithm such as gradient descent combined with backpropagation through time to compute the gradients required during the optimization process, in order to change each weight of the LSTM network in proportion to the derivative of the error (at the LSTM network's output layer) with respect to the corresponding weight. A problem with using gradient descent for standard RNNs is that error gradients vanish exponentially quickly with the size of the time lag between important events. When error values are back-propagated from the output layer to LSTM units, the error remains in the LSTM unit's cell. This "error carousel" feeds error back to each of the LSTM unit's gates indefinitely until they learn to cut off the value.

Question Embeddings by LSTM:

We tokenize and encode a given question q into word embeddings Eq = {e1, e2, ..., eP}, where ei ∈ R^D, D is the length of the distributed word representation, and P is the number of words in the question. After that, the embeddings are fed into a long short-term memory (LSTM) network.
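As a small sketch of this step (the vocabulary size, embedding dimension, hidden size, and token ids below are illustrative assumptions, not the thesis's actual hyperparameters):

```python
import torch
import torch.nn as nn

# Embed each question token, then run an LSTM and keep its final hidden state
# as the question representation.
vocab_size, embed_dim, hidden_dim = 10000, 300, 1024
embedding = nn.Embedding(vocab_size, embed_dim)         # e_i in R^D, D = 300
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

token_ids = torch.tensor([[12, 845, 3, 921, 67]])       # hypothetical encoded question, P = 5
word_embeddings = embedding(token_ids)                  # shape: (1, P, D)
outputs, (h_n, c_n) = lstm(word_embeddings)
question_vector = h_n[-1]                               # shape: (1, hidden_dim)
```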

2.3 BERT

BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its most basic form, the Transformer consists of two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Because the goal of BERT is to generate a language model, only the encoder mechanism is required. The Transformer encoder reads the entire sequence of words at once, as opposed to directional models, which read the text input sequentially (left-to-right or right-to-left). As a result, it is considered bidirectional, though it is more accurate to say non-directional. This feature enables the model to learn the context of a word based on its surroundings (to the left and right of the word).

The chart below is a high-level description of the Transformer encoder. The input is a sequence of tokens, which are first embedded into vectors and then processed in the neural network. The output is a sequence of vectors of size H, in which each vector corresponds to an input token with the same index.

Figure 2.7: BERT’s Encoder

When training language models, there is a challenge of defining a prediction goal. Many models predict the next word in a sequence (e.g. "The child came home from _"), a directional approach which inherently limits context learning. To overcome this challenge, BERT uses two training strategies:

Masked LM (MLM)

Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence. In technical terms, the prediction of the output words requires:

- Adding a classification layer on top of the encoder output.

- Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.

- Calculating the probability of each word in the vocabulary with softmax.

The BERT loss function takes into consideration only the prediction of the masked values and ignores the prediction of the non-masked words. As a consequence, the model converges more slowly than directional models, a characteristic which is offset by its increased context awareness.
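For example, a masked-word prediction of this kind can be reproduced with a pretrained BERT checkpoint (the model name and the sentence are illustrative assumptions, echoing the example above):

```python
from transformers import pipeline

# Predict the most likely fillers for the [MASK] position.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
predictions = fill_mask("The child came home from [MASK].")
for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 3))
```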

Next Sentence Prediction (NSP)

In the BERT training process, the model receives pairs of sentences as input and learns to predict if the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are a pair in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will be disconnected from the first sentence.

To help the model distinguish between the two sentences in training, the input is processed in the following way before entering the model:

- A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.

- A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings with a vocabulary of 2.

- A positional embedding is added to each token to indicate its position in the sequence. The concept and implementation of positional embedding are presented in the Transformer paper.
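This packing of a sentence pair can be inspected directly from the tokenizer output (a small sketch; the two sentences are arbitrary examples):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The man went to the store.",
                    "He bought a gallon of milk.",
                    return_tensors="pt")

# [CLS] sentence A tokens [SEP] sentence B tokens [SEP]
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0].tolist()))
# Segment (sentence) embedding ids: 0s for sentence A, 1s for sentence B
print(encoded["token_type_ids"][0])
```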

Figure 2.8: BERT model

During pretraining, BERT learns to generate contextualized word representations by training on a large corpus of unlabeled text. The objective is to predict the masked words within a sentence, as well as to predict whether two sentences appear consecutively in the original text or not. This pretraining process helps BERT to learn rich representations of words that capture both syntactic and semantic information. After pretraining, BERT can be fine-tuned on specific downstream tasks by adding task-specific layers on top of the pretrained model. During fine-tuning, BERT is trained on labeled data for tasks such as text classification, named entity recognition, question answering, and more. The model's parameters are updated to optimize performance on the specific task using supervised learning.

There are two popular versions of BERT, namely BERT base and BERT large. To compare both versions, we will compare them based on the following aspects:

- Model Size: BERT base has 12 transformer layers, while BERT large has 24 transformer layers. BERT large is larger and more powerful, but it requires more computational resources and time for training and inference.

- Parameters: BERT base has around 110 million parameters, whereas BERT large has around 340 million parameters. The increased number of parameters in BERT large allows it to capture more complex patterns and dependencies in the input data.

- Performance: BERT large tends to outperform BERT base on various NLP tasks due to its larger capacity to learn from data. BERT large can achieve higher accuracy and better generalization on tasks such as question answering, natural language inference, and text classification. However, BERT base is often sufficient for many practical applications and offers a good balance between performance and computational requirements.

- Training Time: Training BERT large takes significantly more time compared to BERT base due to the increased model size and the number of parameters. BERT large requires access to more computational resources and longer training times to converge.

- Inference Speed: Inference with BERT large is slower compared to BERT base due to its larger size and increased computational requirements. BERT large may not be suitable for applications where real-time or low-latency predictions are required.

Overall, BERT large provides better performance than BERT base, especially on complex NLP tasks. However, it comes with increased computational requirements, longer training times, and slower inference speed. The choice between BERT base and BERT large depends on the specific task, available computational resources, and the trade-off between performance and efficiency.

In this project, BERT is used for both the question and the Wikipedia text. I use both versions of BERT: BERT base combined with CNN, and BERT large combined with CLIP, for the two baselines. The input to the BERT model is created by concatenating the question and a summary of the Wikipedia page for the named entity provided by the user. The start and end positions of the answer within the concatenated input are then predicted using the BERT model's output.

The program takes a user-input question and picture to retrieve the summary of the Wikipedia page for the entity. The summary is then truncated to fit the maximum input length for the BERT model.

The program then encodes the question and summary using the BERT tokenizer, adding special tokens such as [CLS] and [SEP], truncating if necessary, and returning a PyTorch tensor. This tensor is then passed as input to the BERT model, which returns the start and end scores of the answer span.

The start and end indices of the answer span are then obtained by finding the index of the maximum score for each, and the answer span is extracted using the tokenizer's convert_tokens_to_string and convert_ids_to_tokens methods.

Finally, the start and end scores, along with the extracted answer span, are concatenated with the image features from CLIP, and the resulting tensor is used to output the final answer to the question.
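A minimal sketch of the span-prediction step described above (the checkpoint name and the question/summary strings are illustrative assumptions, not the thesis's configuration):

```python
import torch
from transformers import BertTokenizer, BertForQuestionAnswering

checkpoint = "bert-large-uncased-whole-word-masking-finetuned-squad"  # assumed QA checkpoint
tokenizer = BertTokenizer.from_pretrained(checkpoint)
model = BertForQuestionAnswering.from_pretrained(checkpoint)

question = "Where was this person born?"
summary = "Barack Obama was born in Honolulu, Hawaii. He served as the 44th president of the United States."

# Encode question + summary with [CLS]/[SEP], truncating to the model's input limit.
inputs = tokenizer(question, summary, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

start_idx = int(torch.argmax(outputs.start_logits))   # start score -> index
end_idx = int(torch.argmax(outputs.end_logits))       # end score -> index
tokens = tokenizer.convert_ids_to_tokens(
    inputs["input_ids"][0][start_idx:end_idx + 1].tolist())
answer = tokenizer.convert_tokens_to_string(tokens)
print(answer)
```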

2.4 Contrastive Language-Image Pre-Training (CLIP)

CLIP (Contrastive Language–Image Pre-training) builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. The idea of zero-data learning dates back over a decade but until recently was mostly studied in computer vision as a way of generalizing to unseen object categories. A critical insight was to leverage natural language as a flexible prediction space to enable generalization and transfer.

Most inspirational for CLIP is the work of Ang Li and his co-authors at FAIR, who in 2016 demonstrated using natural language supervision to enable zero-shot transfer to several existing computer vision classification datasets, such as the canonical ImageNet dataset. They achieved this by fine-tuning an ImageNet CNN to predict a much wider set of visual concepts (visual n-grams) from the text of titles, descriptions, and tags of 30 million Flickr photos and were able to reach 11.5% accuracy on ImageNet zero-shot.

CLIP is part of a group of papers revisiting learning visual representations from natural language supervision in the past year. This line of work uses more modern architectures like the Transformer and includes VirTex, which explored autoregressive language modeling, ICMLM, which investigated masked language modeling, and ConVIRT, which studied the same contrastive objective used for CLIP but in the field of medical imaging.

Scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on a great variety of image classification datasets. The method uses an abundantly available source of supervision: the text paired with images found across the internet. This data is used to create the following proxy training task for CLIP: given an image, predict which out of a set of 32,768 randomly sampled text snippets was actually paired with it in the dataset.

In order to solve this task, the intuition is that CLIP models will need to learn to recognize a wide variety of visual concepts in images and associate them with their names. As a result, CLIP models can then be applied to nearly arbitrary visual classification tasks. For instance, if the task of a dataset is classifying photos of dogs vs. cats, we check for each image whether a CLIP model predicts the text description "a photo of a dog" or "a photo of a cat" is more likely to be paired with it.

Figure 2.9: Contrastive Pre-training CLIP

CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in the dataset. We then use this behavior to turn CLIP into a zero-shot classifier: we convert all of a dataset's classes into captions such as "a photo of a dog" and predict the class of the caption CLIP estimates best pairs with a given image.

Figure 2.10: Create dataset classifier from label text and use it for zero-shot prediction
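As a concrete sketch of this zero-shot use, with the openly released `clip` package (the image path and the two candidate captions are illustrative assumptions):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model_clip, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("pet.jpg")).unsqueeze(0).to(device)      # hypothetical image
texts = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)

with torch.no_grad():
    # Similarity between the image and each caption, turned into class probabilities.
    logits_per_image, _ = model_clip(image, texts)
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # e.g. tensor([[0.97, 0.03]]) -> "a photo of a dog" is the better match
```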

CLIP has been trained with a contrastive objective in a weakly-supervised manner over 400M image and caption pairs. It has demonstrated better generalization capacities than fully-supervised models and is efficient for KBVQA.

With CLIP, OpenAI tested whether task-agnostic pre-training on internet-scale natural language, which has powered a recent breakthrough in NLP, can also be leveraged to improve the performance of deep learning in other fields. The results seen so far from applying this approach to computer vision are encouraging. Like the GPT family, CLIP learns a wide variety of tasks during pre-training, which is demonstrated via zero-shot transfer. The findings on ImageNet also suggest that zero-shot evaluation is a more representative measure of a model's capability.

CLIP was trained on a large, diverse dataset called the Image-Text (IT) dataset. This dataset consists of 400 million image-text pairs, which were collected from the internet and then filtered to ensure a diverse range of concepts and text sources. The IT dataset is used to pre-train the CLIP model to recognize and understand the relationship between images and text. The Image-Text (IT) dataset is a large-scale dataset that contains diverse and complex images paired with natural language descriptions. It was introduced by the paper "Unicoder-VL: A Universal Encoder for Vision and Language" by Lu et al. in 2020. The dataset contains 3.3 million image-text pairs with 80.1 million text tokens and 1.2 million object instances. The images were collected from COCO, Visual Genome, and Flickr30k, while the text descriptions were sourced from COCO Captions, Visual Genome, and SBU Captions. The text descriptions consist of captions, questions, and answers. The IT dataset is considered to be one of the largest and most diverse datasets for training vision and language models, and has been used to pretrain models such as UniVL and CLIP.

In this thesis, CLIP is used to encode the image into a feature vector. The image is loaded using the PIL library and preprocessed using torchvision.transforms. The pre-trained CLIP model is then loaded using the clip.load() function, along with the preprocess function that converts the image to a tensor and applies normalization. The model_clip.encode_image() function is used to extract a feature vector from the preprocessed image tensor. This feature vector is then used for two tasks: the first is feeding it into the "object detection / face recognition" module to find the named entity; the second is "image retrieval" against the KB, which contains the name, image, and paragraphs of each entity.
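A sketch of the image-retrieval use of these features (the precomputed KB feature file and the entity names are hypothetical placeholders, standing in for the ViQuAE knowledge base):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model_clip, preprocess = clip.load("ViT-B/32", device=device)

query = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)   # hypothetical query image
with torch.no_grad():
    query_feat = model_clip.encode_image(query).float()
    query_feat = query_feat / query_feat.norm(dim=-1, keepdim=True)   # l2-normalize

# Hypothetical precomputed, l2-normalized CLIP features for KB entity images.
kb_feats = torch.load("kb_image_features.pt").to(device)              # shape: (3, 512)
kb_names = ["Barack Obama", "Amelia Earhart", "Elvis Presley"]

scores = (query_feat @ kb_feats.T).squeeze(0)                         # cosine similarities
best = int(scores.argmax())
print("Closest KB entity:", kb_names[best])
```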

2.5 Wikipedia Knowledge Base

The Wikipedia Knowledge Base (KB) is part of the ViQuAE dataset (we will talk more about this dataset in Chapter 4, Section 4.2.2). The KB is built upon Wikipedia, more precisely the 2019/08/01 dump available in KILT, which consists of 5.9M articles. Each one is mapped to a Wikidata entity, hence we use both terms interchangeably. Information retrieval aims at retrieving relevant sources of knowledge from the KB with respect to the query (question and image). The knowledge base of this project is the information present on the English Wikipedia. When the "Object detection / Face Recognition" module outputs a named entity, the program retrieves the corresponding Wikipedia page for that entity. It then extracts the summary section from the page and uses it to answer the user's question using BERT. The idea behind using Wikipedia as the knowledge base is that it contains a large amount of information on a wide range of topics, making it a useful resource for answering factual questions.
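This retrieval step can be sketched, for illustration, with the third-party `wikipedia` package (an assumption; the thesis does not name the library it uses), with the summary later truncated to fit the BERT input limit:

```python
import wikipedia  # pip install wikipedia

def get_entity_summary(entity_name: str, max_chars: int = 2000) -> str:
    """Fetch the Wikipedia summary for the entity produced by the
    object detection / face recognition module."""
    page = wikipedia.page(entity_name, auto_suggest=False)
    return page.summary[:max_chars]

context = get_entity_summary("Barack Obama")  # hypothetical entity found in the image
```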

In the KBVQA process, the system first processes the visual input using techniques like image processing and computer vision to extract relevant features and understand the content of the visual data. Then, the textual question is analyzed using NLP techniques to comprehend its meaning and extract key information. Next, the system utilizes the KB to retrieve relevant knowledge related to the entities mentioned in the question. This involves querying the KB using the extracted information and retrieving the necessary facts or relationships. Finally, the KBVQA system combines the visual understanding, textual comprehension, and retrieved knowledge to generate a well-informed answer to the user's question. The KB provides a structured and organized source of information that helps in understanding the context of the question and retrieving appropriate answers based on factual knowledge.
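Putting the pieces of this chapter together, the overall flow can be summarized in a short pseudocode-style sketch (the helper names are hypothetical stand-ins for the modules described above; `get_entity_summary` is the Wikipedia lookup sketched earlier):

```python
def answer_question(image_path: str, question: str) -> str:
    """Illustrative end-to-end sketch of the KBVQA flow:
    entity linking -> KB retrieval -> reading comprehension."""
    entity = identify_entity(image_path)           # CLIP features + object detection / face recognition
    context = get_entity_summary(entity)           # Wikipedia KB lookup (see sketch above)
    return extract_answer_span(question, context)  # BERT start/end span prediction (Section 2.3)
```

This is only a high-level illustration; the concrete baselines built on this flow are described in Chapter 4.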
