
Computer Science Graduation Thesis: Object Semantic Matching for Vietnamese Image Captioning


DOCUMENT INFORMATION

Basic information

Title: Object Semantic Matching for Vietnamese Image Captioning
Authors: Hua Van Son, Nguyen Thinh Quyen
Supervisor: Nguyen Vinh Tiep, Ph.D.
Institution: University of Information Technology
Major: Computer Science
Document type: Thesis
Year: 2021
City: Ho Chi Minh City

Format
Pages: 84
File size: 42.01 MB

Structure

  • 1.1.1 Image Captioning task
  • 1.1.2 Applications
  • 1.2 Objectives
  • 1.3 Contributions
  • 2.1 Visual Feature Encoding
    • 2.1.1 Background
    • 2.1.2 Faster R-CNN
  • 2.2 Linguistic Feature Decoding
    • 2.2.1 Attention Mechanism
    • 2.2.2 BERT

Content

List of Figures (Figures 2.1–2.6; captions appear with the figures below)

Applications

The standard of living of humans has substantially improved as science and technology have progressed. Our lives have been elevated by these conveniences to the point that we can enjoy our surroundings. In reality, robots are replacing humans in jobs that used to be done by humans. For instance, you might ask a virtual assistant to phone someone for you while you are driving. A virtual assistant, such as Siri, is an example of AI that will access your contacts, recognize the words, and dial the phone number. These assistants utilize natural language processing (NLP), machine learning (ML), statistical analysis, and algorithmic execution to figure out what you want and try to obtain it for you. As another example, interacting with customer service as a consumer may be time-consuming and frustrating, and it is an inefficient, costly, and difficult-to-manage department for businesses. AI chatbots are one artificially intelligent option that is becoming increasingly popular: machines can answer commonly asked inquiries, accept and track orders, and route calls thanks to pre-programmed algorithms. Also, with the help of technology, we stay close to each other via social networks and the internet despite physical distance. Technology is being applied to every area of human existence with the goal of making life more convenient and efficient.

The development of technology leads to the growth of computer science, in which computer vision (CV) and natural language processing (NLP) play an important role. Image captioning, automatically generating natural language descriptions according to the content observed in an image, is an important part of scene understanding, which combines the knowledge of CV and NLP. Automatic image captioning is widely used by search engines to retrieve and show relevant results via annotation keywords, to categorize personal multimedia collections, for automatic product tagging in online catalogs, in computer vision development, and in other areas of business and research. The applications of image captioning are extensive and significant; let us take some exemplary applications to illustrate the strong potential of this field.

E-Commerce  PIM (Product Information Management) systems with AI can analyze photos and automatically provide rich and detailed attributes for web catalogs. Image captioning software can evaluate product photos and automatically recommend appropriate attributes and categories, which may save time and money. The system may, for example, recognize the type of fashion item, its material, color, design, and garment fit. Customers can navigate through categories more easily with AI-powered visual suggestions. For efficient user engagement, brands like Asos, eBay, and Forever21 already utilize AI-based visual search and image recognition.

Aid for the Visually Impaired  Image descriptions automatically generated by a computer aren't as good as those written by a human who can include additional context, but they can be accurate and helpful. An image description might help a blind person read a restaurant menu, or better understand what their friends are posting on social media.

An app called Seeing AI¹, developed by Microsoft, allows people with eye problems to perceive the world around them using smartphones. When the camera is directed at text, the application can read it and provide auditory suggestions.

It can distinguish between printed and handwritten text, as well as objects and people.

Google also launched a program that can generate a written description for a picture, helping blind people or others with vision issues to comprehend the image's context. There are multiple levels to this machine learning technique. In the illustration, the first model identifies text and handwritten numerals. Another model detects simple items in the environment, such as vehicles, trees, and animals. A third layer is a sophisticated model capable of extracting the essential concept from a lengthy written exposition.

¹ https://www.microsoft.com/en-us/ai/seeing-ai

Security  Security data annotation can be used in various security-related applications:

  • Image recognition for detection of weapons and/or dangerous objects.
  • Image annotation for face recognition.
  • Object classification on security monitors.
  • Object/person detection, tagging, and tracking through multiple frames, and much more.

CCTV cameras are now ubiquitous, but if we can create appropriate captions in addition to watching the world, we will be able to trigger warnings as soon as any dangerous conduct is detected. AI-based algorithms assist in assigning labels to any form of security image data, allowing systems to learn how to respond to any potentially harmful event. This is likely to lower crime rates and the number of accidents.

Automatic image captioning remains challenging despite the recent impressive progress in neural image captioning. Some of the big challenges include:

The lack of naturalness  The first challenge stems from the compositional nature of natural language and visual scenes. While the training dataset includes co-occurrences of particular items in their contexts, a captioning system should be able to generalize by combining objects in other contexts. Traditional captioning systems suffer from a lack of naturalness as they often generate captions in a sequential manner, i.e., the next generated word depends on both the previous word and the image feature. This can frequently lead to syntactically correct, but semantically irrelevant, language structures, as well as to a lack of diversity in the generated captions.

Generalization  The second challenge is the dataset bias impacting current captioning systems. The trained models overfit to the common objects that co-occur in a common context (e.g., bed and bedroom), which leads to a problem where such systems struggle to generalize to scenes where the same objects appear in unseen contexts (e.g., bed and forest).

Challenges in the Vietnamese language  The lack of data is always one of the leading problems when researching image captioning for Vietnamese. Although there are ways to address this problem, such as using machine translation models or manually collecting data, there are still certain limitations. While manual data collection is time- and labor-demanding, using a machine translation model makes the translated sentences unstable. Some sentences, for example, contain vocabulary that is infused with Western culture and will not be translated accurately into Vietnamese. Even with a good machine translation model, we cannot change the style of a sentence translated from English to Vietnamese, so the generated captions are not written in the style of Vietnamese speakers. Another problem is that the activities and landscapes depicted in the photos are completely different from those in Vietnam. Therefore, when we try to evaluate the model on a "pure Vietnamese" image, it does not really bring a good result, and in fact it is also difficult to apply to systems related to this field in Vietnam.

Objectives and Contributions

The COVID-19 pandemic has exacerbated the ongoing shortage of health workers globally, posing an urgent need for smart assistants that can effectively cooperate with humans to fill the gap. Toward this end goal, this project aims to study a modern approach that has achieved high results in English and apply it to Vietnamese for describing visual content in healthcare settings. Specifically, we have three main objectives in this thesis:

1. Research Oscar: We study a new method, Oscar, which uses object tags detected in images as anchor points to significantly ease the learning of image-text alignments, and how to fine-tune Oscar on downstream tasks.

2. Adapt to the Vietnamese language: We implement a system, based on the Oscar model, that can automatically generate text for images in the healthcare domain in Vietnamese.

3. Experimental setup: We examine the performance of the Oscar model compared to previous approaches. In addition, we test the Oscar model with and without object tags to learn about their impact on our dataset.

The main contributions of this work can be summarized as follows. We introduce Oscar, a powerful VLP method to learn generic image-text representations for V+L understanding and generation tasks. We have developed an Oscar model that can generate captions in the Vietnamese language. Our model outperforms an existing approach based on the Encoder-Decoder framework, which uses a Convolutional Neural Network as an encoder to encode images and a Recurrent Neural Network (Long Short-Term Memory) decoder to turn image features into text. We present experiments and analysis to provide insights on the effectiveness of using object tags as anchor points for cross-modal representation learning on the image captioning task in Vietnamese.
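To make the baseline concrete, the following is a minimal PyTorch sketch of such an Encoder-Decoder captioner: a frozen CNN encodes the image into a single vector, and an LSTM decodes it into a token sequence. This is only an illustration of the general framework, not the Oscar model or the exact baseline implementation used in this thesis; the embedding size, hidden size, and vocabulary size are placeholder values.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    """Encodes an image into a fixed-size embedding with a pretrained CNN backbone."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the classification head
        self.fc = nn.Linear(backbone.fc.in_features, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                       # keep the backbone frozen in this sketch
            feats = self.features(images).flatten(1)
        return self.fc(feats)                       # (batch, embed_dim)

class LSTMDecoder(nn.Module):
    """Decodes the image embedding into caption logits, one token at a time."""
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_emb: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        tok_emb = self.embed(captions)                              # (batch, T, embed_dim)
        inputs = torch.cat([img_emb.unsqueeze(1), tok_emb], dim=1)  # image embedding as the first "token"
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                                     # (batch, T + 1, vocab_size)
```

At inference time the decoder would be run autoregressively, feeding each predicted token back in until an end-of-sentence token is produced.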

Visual Feature Encoding

Background and Faster R-CNN

Overview of the Object Detection Pipeline

Traditional object detection techniques follow three major steps: region proposal generation, feature extraction, and classification. The initial phase is creating a number of different region proposals. These region proposals are candidates for containing objects. The number of these regions is frequently in the thousands, with 2,000 or more being common. Examples of algorithms that generate region proposals are Selective Search [16] and EdgeBoxes [17]. A fixed-length feature vector is then extracted from each region proposal using an image descriptor such as the histogram of oriented gradients (HOG). This feature vector is crucial to the object detector's performance: even if the object changes due to a transformation like scaling or translation, the vector should appropriately reflect it. Each region proposal is then assigned to either the background class or one of the object classes using the feature vector. As the number of classes grows, so does the difficulty of creating a model that can distinguish between all of these objects. The support vector machine (SVM) [18] is one of the most often used models for classifying region proposals.
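To make the three stages concrete, here is a rough Python sketch of how such a classical pipeline could be assembled from common libraries (opencv-contrib for Selective Search, scikit-image for HOG, scikit-learn for the SVM). The patch size, HOG parameters, and the commented-out training data are illustrative assumptions, not values from any particular detector.

```python
import cv2
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import LinearSVC

def propose_regions(image_bgr: np.ndarray, max_regions: int = 2000) -> np.ndarray:
    """Stage 1: generate class-agnostic region proposals with Selective Search."""
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image_bgr)
    ss.switchToSelectiveSearchFast()
    return ss.process()[:max_regions]          # array of (x, y, w, h) boxes

def describe_region(image_gray: np.ndarray, box) -> np.ndarray:
    """Stage 2: extract a fixed-length HOG descriptor from one proposal."""
    x, y, w, h = box
    patch = resize(image_gray[y:y + h, x:x + w], (64, 64))
    return hog(patch, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

# Stage 3: a linear SVM assigns each descriptor to the background or an object class.
# X_train / y_train would come from labelled proposals and are placeholders here:
# classifier = LinearSVC().fit(X_train, y_train)
# label = classifier.predict([describe_region(image_gray, box)])
```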

In 2014, a group of researchers at UC Berkeley developed a deep convolutional network called R-CNN (short for region-based convolutional neural network) [19] that can detect 80 different types of objects in images. In comparison to the general object detection pipeline presented above, R-CNN's contribution is simply extracting the features using a convolutional neural network (CNN); everything else is the same as in the general object detection pipeline. The R-CNN model's operation is depicted in Figure 2.1.


Figure 2.1: The system (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large convolutional neural network (CNN), and then (4) classifies each region using class-specific linear SVMs [19].

The R-CNN consists of 3 main modules:

  • The first module generates 2,000 region proposals using the Selective Search algorithm [16].
  • The second module extracts a feature vector of length 4,096 from each region proposal after it has been resized to a set, pre-defined size (a rough sketch of this step follows the list).
  • The third module uses a pre-trained SVM algorithm to classify each region proposal as either the background or one of the object classes.
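As a rough sketch of the second module, the snippet below warps each proposal to a fixed input size and extracts a 4,096-dimensional vector from the penultimate layer of a pretrained AlexNet in PyTorch. The original R-CNN fine-tuned its own network and then trained per-class SVMs on these vectors, so this is only an illustration of the idea; the box format and the 224x224 input size are assumptions.

```python
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

# AlexNet's penultimate fully connected layer yields the 4,096-d vector referred to above.
backbone = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
backbone.classifier = torch.nn.Sequential(*list(backbone.classifier.children())[:-1])
backbone.eval()

def rcnn_features(image: torch.Tensor, boxes) -> torch.Tensor:
    """Warp each (x, y, w, h) proposal to the CNN input size and extract its feature vector."""
    vectors = []
    for x, y, w, h in boxes:
        crop = TF.resized_crop(image, top=y, left=x, height=h, width=w, size=[224, 224])
        with torch.no_grad():
            vectors.append(backbone(crop.unsqueeze(0)).squeeze(0))  # shape: (4096,)
    return torch.stack(vectors)
```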

The R-CNN model has a few flaws:

  • It is a multi-stage model with each stage functioning independently; as a result, it cannot be trained end to end.
  • It saves the features extracted by the pre-trained CNN to disk so that the SVMs can be trained later, which requires hundreds of gigabytes of storage.
  • For generating region proposals, R-CNN uses the Selective Search method, which takes a long time; furthermore, this approach cannot be adapted to the detection problem.
  • Each region proposal is fed to the CNN separately for feature extraction, so R-CNN cannot be run in real time.

As an extension of the R-CNN model, the Fast R-CNN model was proposed [20] to overcome some of these limitations. A quick overview of Fast R-CNN is given in the next section.

Fast R-CNN [20] overcomes several issues in R-CNN. As its name suggests, one advantage of Fast R-CNN over R-CNN is its speed. Here is a summary of the main contributions of Fast R-CNN:

  • Proposed a new layer called ROI Pooling [21] that extracts equal-length feature vectors from all proposals (i.e., RoIs) in the same image.
  • Fast R-CNN is a network with only one stage, unlike R-CNN, which includes three stages (region proposal generation, feature extraction, and classification using SVM).
  • Fast R-CNN is faster because it shares computations (i.e., the convolutional layer calculations) across all proposals (i.e., RoIs) rather than doing them separately for each proposal. This is accomplished by employing the new ROI Pooling layer, which allows Fast R-CNN to outperform R-CNN in speed.
  • Fast R-CNN does not cache the extracted features, hence it requires far less disk space than R-CNN, which requires hundreds of gigabytes.
  • Fast R-CNN is more accurate than R-CNN.


Figure 2.2: Fast R-CNN architecture. An input image and multiple regions of interest (RoIs) are input into a fully convolutional network. Each RoI is pooled into a fixed-size feature map and then mapped to a feature vector by fully connected layers (FCs). The network has two output vectors per RoI: softmax probabilities and per-class bounding-box regression offsets. The architecture is trained end-to-end with a multi-task loss [20].

The general architecture of Fast R-CNN is shown in Fig. 2.2. The model consists of a single stage, compared to the 3 stages in R-CNN. It simply accepts an image as input and returns the class probabilities and bounding boxes of the detected objects. The feature map from the last convolutional layer is fed to an ROI Pooling layer [21] in order to extract a fixed-length feature vector from each region proposal (a minimal usage sketch of this operation follows the list below). The extracted feature vector from ROI Pooling is then passed to some FC layers. The output of the last FC layer is split into 2 branches:

1. A softmax layer to predict the class scores.

2. An FC layer to predict the bounding boxes of the detected objects.
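A minimal usage sketch of the RoI pooling step, using the torchvision implementation with made-up tensor sizes and an assumed 1/16 backbone stride, looks like this:

```python
import torch
from torchvision.ops import roi_pool

# A dummy backbone feature map: batch of 1, 256 channels, 50x50 spatial grid.
feature_map = torch.randn(1, 256, 50, 50)

# Two RoIs in (batch_index, x1, y1, x2, y2) format, given in input-image coordinates.
rois = torch.tensor([[0, 0.0, 0.0, 320.0, 320.0],
                     [0, 100.0, 80.0, 400.0, 300.0]])

# spatial_scale maps image coordinates onto the feature map (1/16 for a 16-pixel stride).
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -- one fixed-size map per RoI
```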

In R-CNN, each region proposal is fed to the model separately from the other region proposals. This means that if computing a single region takes S seconds, N regions will take S x N seconds. Because it shares calculations over numerous proposals, Fast R-CNN is quicker than R-CNN. Despite the advantages of the Fast R-CNN model, there is a critical drawback: it depends on the time-consuming Selective Search algorithm [16] to generate region proposals. The Selective Search method cannot be customized for a specific object detection task, and thus it may not be accurate enough to detect all target objects in the dataset.

Faster R-CNN [22] is an extension of Fast R-CNN [20]. As its name suggests, Faster R-CNN is faster than Fast R-CNN thanks to the region proposal network (RPN). The main contributions in [22] are:

  • Proposing the region proposal network (RPN), a fully convolutional network that generates proposals with various scales and aspect ratios. The RPN applies the neural-network notion of attention to tell the object detector (Fast R-CNN) where to look.
  • Proposing the notion of anchor boxes instead of pyramids of images (i.e., multiple instances of the image at different scales) or pyramids of filters (i.e., multiple filters with different sizes). An anchor box is a reference box with a specific scale and aspect ratio. With several reference anchor boxes, a single region is covered at numerous scales and aspect ratios; this may be compared to a pyramid of reference anchor boxes. Each region is then mapped to each reference anchor box, allowing objects of various sizes and aspect ratios to be detected (a small sketch of how such anchors can be enumerated follows this list).
  • Sharing the convolutional computations between the RPN and the Fast R-CNN detector, which reduces the computational time.
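To illustrate the anchor-box idea from the list above, the following small sketch enumerates reference boxes at 3 scales and 3 aspect ratios (the 9 anchors per location used in the original paper); the base size and scale values follow the paper's defaults and are only illustrative here.

```python
import torch

def make_anchors(base_size: int = 16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)) -> torch.Tensor:
    """Enumerate reference anchor boxes (x1, y1, x2, y2) centred on a single feature-map cell."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            area = (base_size * scale) ** 2
            w = (area / ratio) ** 0.5   # width shrinks as the ratio (height / width) grows
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return torch.tensor(anchors)        # shape: (len(scales) * len(ratios), 4)

print(make_anchors().shape)  # torch.Size([9, 4]) -- 3 scales x 3 aspect ratios
```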

The architecture of Faster R-CNN consists of 2 modules:

  • RPN: for generating region proposals.
  • Fast R-CNN: for detecting objects in the proposed regions.

The RPN module is responsible for generating region proposals. It applies the concept of attention in neural networks, guiding the Fast R-CNN detection module to where to look for objects in the image. Faster R-CNN works as follows:

  • The RPN generates region proposals.
  • For all region proposals in the image, a fixed-length feature vector is extracted from each region using the ROI Pooling layer [20].
  • The extracted feature vectors are then classified using Fast R-CNN.
  • The class scores of the detected objects, in addition to their bounding boxes, are returned.

Region Proposal Network (RPN)  The R-CNN [19] and Fast R-CNN [20] models depend on the Selective Search algorithm [16] for generating region proposals, and each proposal is fed to a pre-trained CNN for classification. Faster R-CNN [22] proposed a network called the region proposal network (RPN) that can produce the region proposals itself. This has several advantages:

  • The region proposals are now generated using a network that can be trained and customized according to the detection task.
  • Because the proposals are generated by a network, the generator can be trained end-to-end and customized to the detection task; as a result, it produces better region proposals than general approaches like Selective Search and EdgeBoxes.
  • The RPN uses the same convolutional layers as the Fast R-CNN detection network to process the image, so the RPN takes less time to generate proposals than algorithms like Selective Search.
  • Due to sharing the same convolutional layers, the RPN and Fast R-CNN can be merged/unified into a single network, so training is done only once.
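For completeness, a unified RPN-plus-detector of this kind is available off the shelf in torchvision. The minimal sketch below runs the pretrained Faster R-CNN on a random image; it is not the specific detector trained in this thesis, just an illustration of how the two modules operate as a single network.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights

# Load a detector whose RPN and detection head share one convolutional backbone.
weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()

# The model accepts a list of 3xHxW tensors with values in [0, 1]; a random image stands in here.
image = torch.rand(3, 480, 640)
with torch.no_grad():
    predictions = model([image])[0]

# Each prediction carries the boxes, class labels, and confidence scores described above.
print(predictions["boxes"].shape, predictions["labels"].shape, predictions["scores"].shape)
```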

Linguistic Feature Decoding
