Tài liệu hạn chế xem trước, để xem đầy đủ mời bạn chọn Tải xuống
1
/ 127 trang
THÔNG TIN TÀI LIỆU
Thông tin cơ bản
Định dạng
Số trang
127
Dung lượng
2,69 MB
Nội dung
VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY HO CHI MINH UNIVERSITY OF TECHNOLOGY FACULTY OF COMPUTER SCIENCE AND ENGINEERING GRADUATION THESIS EFFECTIVELY APPLY VIETNAMESE FOR VISUAL QUESTION ANSWERING SYSTEM (Old title: Development of a VQA system) Major: Computer Science Council : Software Engineering Instructor: Dr Quan Thanh Tho Reviewer : Mr Le Dinh Thuan Authors : Nguyen Bao Phuc Tran Hoang Nguyen Ho Chi Minh City, July 2021 1712674 1712396 - - KHOA:KH & KT Máy tính KHMT _ rình MSSV: 1712396 NGÀNH: KHMT MT17KH03 _ MSSV: 1712674 NGÀNH: KHMT MT17KH04 _ Co- -end, back- 01/02/2021 01/08/2021 1) 2) 3) _ _ _ TRƯỜNG ĐẠI HỌC BÁCH KHOA KHOA KH & KT MÁY TÍNH CỘNG HỊA XÃ HỘI CHỦ NGHĨA VIỆT NAM Độc lập - Tự - Hạnh phúc -Ngày tháng năm PHIẾU CHẤM BẢO VỆ LVTN (Dành cho người hướng dẫn/phản biện) Họ tên SV: Trần Hoàng Nguyên MSSV: 1712396 Ngành (chuyên ngành): KHMT Họ tên SV: Nguyễn Bảo Phúc MSSV: 1712674 Ngành (chuyên ngành): KHMT Đề tài: Phát triển hệ thống Visual Question Answering Họ tên người hướng dẫn/phản biện: ThS Lê Đình Thuận Tổng quát thuyết minh: Số trang: Số chương: Số bảng số liệu Số hình vẽ: Số tài liệu tham khảo: Phần mềm tính tốn: Hiện vật (sản phẩm) Tổng qt vẽ: - Số vẽ: Bản A1: Bản A2: Khổ khác: - Số vẽ vẽ tay Số vẽ máy tính: Những ưu điểm LVTN: - - - Đề tài xây dựng hệ thống thông minh trả lời câu hỏi nội dung hình ảnh Sinh viên xây dựng hệ thống VQA huấn luyện mơ hình thành cơng Đề tài cịn có phát triển việc huấn luyện mơ hình để hỗ trợ tiếng Việt Đề tài đánh giá khó Khối lượng đánh giá nhiều Địi hỏi khả tự học SV việc kết hợp nhiều thành phần kiến thức Kết đề tài tổng kết thành báo khoa học hội nghị khoa học FAIR 2021 (Ghi chú: thời điểm phản biện, chưa có kết việc chấp nhận hội nghị với báo khoa học) Luận văn trình bày đầy đủ rõ ràng Sinh viên ý xếp phần demo để thể trọng tâm cơng việc đề tài Những thiếu sót LVTN: Đề nghị: Được bảo vệ □ Bổ sung thêm để bảo vệ □ câu hỏi SV phải trả lời trước Hội đồng: 10 Đánh giá chung (bằng chữ: giỏi, khá, TB): Không bảo vệ □ Điểm : 10/10 Ký tên (ghi rõ họ tên) ThS Lê Đình Thuận TRƯỜNG ĐẠI HỌC BÁCH KHOA KHOA KH & KT MÁY TÍNH CỘNG HỊA XÃ HỘI CHỦ NGHĨA VIỆT NAM Độc lập - Tự - Hạnh phúc -Ngày tháng năm PHIẾU CHẤM BẢO VỆ LVTN (Dành cho người hướng dẫn/phản biện) Họ tên SV: Trần Hoàng Nguyên MSSV: 1712396 Ngành (chuyên ngành): KHMT Họ tên SV: Nguyễn Bảo Phúc MSSV: 1712674 Ngành (chuyên ngành): KHMT Đề tài: Phát triển hệ thống Visual Question Answering Họ tên người hướng dẫn/phản biện: PGS.TS Quản Thành Thơ Tổng quát thuyết minh: Số trang: Số chương: Số bảng số liệu Số hình vẽ: Số tài liệu tham khảo: Phần mềm tính tốn: Hiện vật (sản phẩm) Tổng qt vẽ: - Số vẽ: Bản A1: Bản A2: Khổ khác: - Số vẽ vẽ tay Số vẽ máy tính: Những ưu điểm LVTN: - - - Sinh viên hồn thành hệ thống VQA yêu cầu đề Sinh viên nắm vững, hiểu rõ nội dung lý thuyết phát triển xây dựng ứng dụng model thành công Sinh viên dịch tập liệu huấn luyện sang tiếng Việt để hỗ trợ cho việc trả lời câu hỏi tiếng Việt Một phần công việc luận văn viết thành báo khoa học nộp cho hội nghị FAIR Phần công việc luận văn tiếp tục mở rộng cho dự án hợp tác nghiên cứu với nhóm nghiên cứu giáo sưc khác Đài Loan Luận văn viết tiếng Anh tương đối chuẩn rõ ràng Những thiếu sót LVTN: Đề nghị: Được bảo vệ □ Bổ sung thêm để bảo vệ □ câu hỏi SV phải trả lời trước Hội đồng: 10 Đánh giá chung (bằng chữ: giỏi, khá, TB): Không bảo vệ □ Điểm : 9.8/10 Ký tên (ghi rõ họ tên) PGS.TS Quản Thành Thơ Declaration of Authenticity We assure that the graduation thesis "Effectively apply Vietnamese for Visual Question Answering system" is the original report of our research We have finalized our graduation thesis honestly and guarantee the truth of our work for this thesis We are solely responsible for the precision and reliability of the above information Ho Chi Minh City, August, 9th , 2021 Acknowledgements First and foremost, we would like to thank Dr Quan Thanh Tho, Associate Professor in the Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT) for the support throughout our research work It has been our great fortune, to work and finish our thesis under his supervision He is the most knowledgeable and insightful person we have ever met He helped us throughout the project with his wise knowledge and enthusiasm in deep learning From him, we have learned how to deep learning research by a critical way and have a chance to widen our knowledge He also let us join his research group, URA This opportunity not only allow us to get more useful suggestions about our thesis from everybody in group, but we also have learned many new things, new skills day by day, such as by joining seminars held by members in group From that, we can create more interesting ideas focusing on our thesis Our sincere thanks also goes to Mr Le Dinh Thuan, Master of Engineering in the Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, for being our reviewer His feedback, suggestion and advice was essential and influential for the completion of our thesis We are thankful for having such a good reviewer like him Last but not least, we would like to thank the entire teachers at HCMC University of Technology, especially Faculty of Computer Science and Engineering, where it has been our pleasure and honor to studied for the last four years Also our beloved friends and family, who always support us with a constant love and encouragement AUTHORS Abstracts In recent years, deep learning has emerged as a promising technology with the hope that it can be designed to tackle practical problems, which had been considered inconceivable for previous approaches Specifically, the blind and visual impaired are usually afraid to be burdensome for their family, their friends, when they need visual guidance However, there is still a lack of modern systems, which can be a virtual friend to help them interact with the surrounding environment Therefore, we research and develop a novel deep learning application, which can capture the complex relationship between the surrounding objects and deliver assistance to the blind and visual impaired With this dissertation, we propose a novel visual question answering model in Vietnamese and a development of practical systems that utilize our model to address the aforementioned problems CONTENTS List of figures x List of tables xiii Chapter INTRODUCTION 1.1 Motivation 1.2 Topic’s scientific and practical importance 1.3 Thesis objectives and scope 1.4 Our contribution 1.5 Thesis structure Chapter THEORETICAL OVERVIEW 2.1 Deep learning neural network 2.2 2.3 2.4 2.1.1 Perceptron 2.1.2 Multi layer perceptron 2.1.3 Activation functions 10 2.1.4 Loss functions 12 2.1.5 Backpropagation and optimization Computer vision theoretical background 2.2.1 Convolutional Network 14 16 16 2.2.2 Pooling 2.2.3 CNNs variants 2.2.4 Regional-based Convolutional Neural Networks Natural language processing theoretical background 2.3.1 Word Embedding 2.3.2 Recurrent Neural Network (RNN) 2.3.3 LSTM - Long Short Term Memory 2.3.4 GRU - Gated Recurrent Network 2.3.5 Attention mechanism 2.3.6 Bidirectional Encoder Representations from Transformers Visual and Language tasks related to VQA 2.4.1 Image Captioning 19 19 22 27 27 33 34 36 37 39 43 43 viii CONTENTS 2.4.2 2.4.3 Visual Commonsense Reasoning Other Visual and Language tasks 43 44 Chapter RELATED WORK 3.1 Overall 3.2 Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering 3.3 Pythia v0.1: the Winning Entry to the VQA Challenge 2018 3.4 Deep Modular Co-Attention Networks for Visual Question Answering 3.5 ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised 45 46 Image-Text Data 51 Chapter METHODOLOGY 4.1 Feature extraction and co-attention method 4.1.1 Visual feature 4.1.2 Textual feature 4.2 Co-attention layer 4.3 Our proposal model 47 49 49 53 54 54 54 55 58 61 62 63 63 64 65 Statistical Analysis Data sample 65 66 Chapter EXPERIMENTAL ANALYSIS 6.1 Experimental setup 6.1.1 Computing Resources 6.1.2 Dataset 68 69 69 69 Chapter VIETNAMESE VQA DATASET 5.1 VQA-v2 dataset 5.2 Visual Genome dataset 5.3 Challenge 5.4 Automatic data generation 5.5 Data refinement 5.6 5.7 6.2 6.1.3 Evaluation metric 6.1.4 Implementation details 6.1.5 Training strategy Experimental results 69 70 70 71 ix CONTENTS Chapter APPLICATION 7.1 Technology terminology 7.1.1 Flask 7.1.2 7.1.3 7.1.4 7.1.5 77 77 78 78 80 80 80 81 82 83 85 87 System components Our result 87 90 Chapter CONCLUSION 8.1 Summary 8.2 Limitation and broader future works 94 95 95 7.2 7.3 7.4 ReactJS React Native Docker C4 Model: Describing Software Architecture 75 76 76 System functionality 7.2.1 Web application system 7.2.2 Mobile application system System diagram 7.3.1 Overview 7.3.2 Use case diagram 7.3.3 Activity diagram System architecture 7.4.1 7.4.2 8.2.1 Improve existing Vietnamese VQA models 95 8.2.2 Give Vietnamese VQA a new direction 96 References 97 Appendices 101 Chapter A FAIR 2021 CONFERENCE PAPER 102 Chapter B SATU PROJECT 111 CHAPTER REFERENCES 99 [21] Z Yu, J Yu, J Fan, and D Tao, Multi-modal factorized bilinear pooling with coattention learning for visual question answering, 2017 arXiv: 1708.01471 [cs.CV] [22] J Lu, J Yang, D Batra, and D Parikh, Hierarchical question-image co-attention for visual question answering, 2017 arXiv: 1606.00061 [cs.CV] [23] H Nam, J.-W Ha, and J Kim, Dual attention networks for multimodal reasoning and matching, 2017 arXiv: 1611.00471 [cs.CV] [24] P Anderson, X He, C Buehler, D Teney, M Johnson, S Gould, and L Zhang, Bottom-up and top-down attention for image captioning and visual question answering, 2018 arXiv: 1707.07998 [cs.CV] [25] J Redmon, S Divvala, R Girshick, and A Farhadi, “You only look once: Unified, real-time object detection”, 2016 eprint: https://arxiv.org/abs/1506.02640 [26] W Liu, D Anguelov, D Erhan, C Szegedy, S Reed, and A C B Cheng-Yang Fu, “Ssd: Single shot multibox detector”, eprint: https : / / arxiv org / abs / 1512 02325 [27] Y Jiang, V Natarajan, X Chen, M Rohrbach, D Batra, and D Parikh, “Pythia v0 1: The winning entry to the vqa challenge 2018”, 2018 eprint: https://arxiv.org/ abs/1807.09956 [28] Z Yu, J Yu, Y Cui, D Tao, and Q Tian, “Deep modular co-attention networks for visual question answering”, 2019 eprint: https://arxiv.org/abs/1906.10770 [29] K Li, Z Wu, K.-C Peng, J Ernst, and Y Fu, “Tell me where to look: Guided attention inference network”, 2018 eprint: https://arxiv.org/abs/1802.10171 [30] J Devlin, M.-W Chang, K Lee, and K Toutanova, “Pre-training of deep bidirectional transformers for language understanding”, 2018 eprint: https://arxiv.org/abs/ 1810.04805 [31] D Qi, L Su, J Song, E Cui, T Bharti, and A Sacheti, “Cross-modal pre-training with large-scale weak-supervised image-text data”, 2020 eprint: https://arxiv org/abs/2001.07966 REFERENCES 100 [32] Y Liu, M Ott, N Goyal, J Du, M Joshi, D Chen, O Levy, M Lewis, L Zettlemoyer, and V Stoyanov, Roberta: A robustly optimized bert pretraining approach, 2019 arXiv: 1907.11692 [cs.CL] [33] T Vu, D Q Nguyen, D Q Nguyen, M Dras, and M Johnson, “Vncorenlp: A vietnamese natural language processing toolkit”, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, 2018 DOI: 10.18653/v1/n18-5012 [Online] Available: http: //dx.doi.org/10.18653/v1/N18-5012 [34] D Q Nguyen and A T Nguyen, Phobert: Pre-trained language models for vietnamese, 2020 arXiv: 2003.00744 [cs.CL] [35] Y Goyal, T Khot, D Summers-Stay, D Batra, and D Parikh, Making the v in vqa matter: Elevating the role of image understanding in visual question answering, 2017 arXiv: 1612.00837 [cs.CV] [36] R Krishna, Y Zhu, O Groth, J Johnson, K Hata, J Kravitz, S Chen, Y Kalantidis, L.-J Li, D A Shamma, M S Bernstein, and F.-F Li, Visual genome: Connecting language and vision using crowdsourced dense image annotations, 2016 arXiv: 1602.07332 [cs.CV] Appendices A FAIR 2021 CONFERENCE PAPER As a part of our work, we have submitted the paper that explain our contribution to Fundamental and Applied IT Research (FAIR) 2021 Conference 21- 22/10/2021 EFFECTIVELY APPLYING VIETNAMESE FOR VISUAL QUESTION ANSWERING Nguyen Bao Phuc1, Tran Hoang Nguyen1, Quan Thanh Tho1 Ho Chi Minh City University of Technology phuc.nguyenbao@hcmut.edu.vn, nguyen.tran1312@hcmut.edu.vn, qttho@hcmut.edu.vn ABSTRACT - In recent years, the task of Visual Question Answering (VQA) has evolved into a very attractive research field Normally, this task requires a simultaneous understanding of both the visual content of the image and the textual content of the question To produce an accurate answer, particularly in Vietnamese, we need to address the following four sub-problems effectively: (1) to efficiently extract visual representations of the image; (2) to develop fine-grained language processor for Vietnamese; (3) to implement proper multimodal learning framework to focus on the complex interactions between image and text modalities; and (4) to provide prediction to consider the complex correlations between the image and the question to automatically obtain the answer Not only handling all those tasks in this research, we also release a novel dataset for Vietnamese VQA of 590,083 question-answer pairs with 123,387 MS-COCO images With an ensemble of our models, we achieve 68.76% of overall accuracy on our test set Keywords - Visual Question Answering, multimodal learning, attention-based learning, co-attention learning, deep learning I INTRODUCTION Visual Question Answering (VQA) [1] is one of multimodal machine learning tasks aiming to answer questions based on an image, that consists of two inputs, an image and a natural language question about this image, and requires an answer VQA affords several challenges to AI systems spanning the fields of natural language processing and computer vision Over the past few years, the advent of deep learning, availability of large datasets to train VQA models have contributed to a surge of interest in VQA amongst researchers The VQA-v2 dataset, a dataset in Visual Genome datasets, is a very large dataset consisting of two types of images: natural images (referred to as real images) as well as synthetic images (referred to as abstract scenes) and comes in two answering modalities: multiple choice question answering (the task of selecting the right answer amongst a set of choices) as well as open-ended question answering (the task of generating an answer with an open-ended vocabulary) Figure General flowchart of the VQA task Giving an arbitrary image and a Vietnamese question, the model will generate the answer in Vietnamese Since VQA tasks appeared, there have been many proposals for the VQA models that are more and more complex and capable of answering previously unseen questions and obtain higher and higher accuracy Nowadays, a number of recent works have proposed attention models for VQA Co-attention mechanism is applied for VQA model more and more popularly, and it gets better results in VQA challenges In the rest of this paper, the first relevant related work is reviewed and the problem of traditional approaches is summarized in Section II Then, in section III, the process of transforming the VQA-v2 dataset into the Vietnamese VQA dataset is introduced In Section IV, the description of the feature extraction method, co-attention method, and our architecture of the Vietnamese VQA model is proposed We present our experiments in Section V This paper is concluded in Section VI II RELATED WORK In this section, we will briefly review related work on the VQA problem The multimodal fusion of global features is the basic type of VQA algorithms [2] The image and question are first represented by two vectors: a vector represents the visual content of the whole image and a vector describes the semantic content of the whole question, then fused by a multimodal fusion model to predict the answer More complex methods that apply residual networks to the multimodal EFFECTIVELY APPLYING VIETNAMESE FOR VISUAL QUESTION ANSWERING fusion model are introduced to learn better image and question representations [3] The biggest drawback to this approach is that representing an image as a global feature may lose critical information, which can make a significant error answering the questions about local image regions Recent approaches have introduced the attention mechanism into VQA [4-6] When looking at an image, the focus is necessarily on a certain part of the image In other words, for a given question that is involved in an image, there is a way that select useful information from large amounts of image data, which leads to learning effectively the attended image features, then use the fusion of visual features from image and text features from question to get an accurate answer Many methods based on the attention mechanism achieved great success in the VQA tasks Co-attention learning framework has proven to be among the most efficient approaches for VQA With coattention mechanism, it is necessary to learn the visual attention for the image and the textual attention for the question at the same time, then use fusion methods to merge all components A series of proposals applying co-attention framework to build models for the VQA problem are introduced [7][8], but those mainly learn separate attention distributions for each modality (image and question), ignore the interaction between the visual context of image and the semantic context of the question That bottleneck makes it difficult to obtain a deep understanding of the relationships of those multimodal features, causes a decrease in VQA performance To solve the problem, dense co-attention models have been proposed [8], which means associate the keywords in question with the critical regions in the image, improve the power of VQA models to correctly answer the question, and obtain better results in VQA challenges III VIETNAMESE VQA DATASET Since there is no available dataset for our task, we use the standard machine translation method to create a raw version of the Vietnamese dataset from the VQA-v2 dataset Due to the misunderstanding when converting the English dataset to the Vietnamese dataset, we have carefully refined the raw version to obtain a high-quality Vietnamese dataset for VQA tasks 3.1 Automatic data generation To generate our dataset, we use the VQA-v2 benchmark dataset which contains images from MS-COCO dataset and human-annotated question-answer pairs, each image has related questions and 10 answers for each question The dataset is originally in English Therefore, we need to translate it into Vietnamese From the train dataset and validation dataset split from the VQA-v2 dataset, we obtain 123,287 images, 658,111 question-answer pairs But there is still something that we need to carefully improve later: There are 136,201 different possible answers in total (a huge number of answers lead to the consequence that our model can easily mislead or generate wrong answers for the VQA model) Most of the answers in the raw dataset are meaningless or misspelled words 3.2 Manual data refinement The automatically generated dataset was significantly noisy due to the complex grammar and vocabulary relations when converting such a very large-scale English dataset into Vietnamese To refine our dataset to be more favorable and able to achieve better results in the Vietnamese VQA task, we use a simple language model to correct misspelled words in questions and answers Then, we manually refine the dataset by removing most of the meaningless and useless answers This careful procedure helps us reduce the number of total answers from 136,201 to 70,343 The final dataset we achieved is split into three parts: train set (80k images and 399,059 question-answer pairs), validation set (20k images and 95,887 question-answer pairs), and test set (20k images and 95,887 question-answer pairs) The results consist of three types of accuracy (Yes/No, Number, Other) and overall accuracy Moreover, with this fine-grained dataset, our model can generalize the relationship between Vietnamese questions and answer to deliver some impressive results in VQA task Figure Distribution of the most frequent answers in Vietnamese dataset Nguy n B o Phúc, Tr n Hoàng Nguyên, Qu IV FEATURE EXTRACTION AND CO-ATTENTION METHOD In this section, we first introduce the method we applied to develop two separate pipelines which can deal with Vietnamese text and image modalities Then we revisit the co-attention learning framework on the VQA tasks and give a detailed explanation on the implementation of our approach, which is motivated by [9] 4.1 Visual feature Several recent works have shown that bottom-up attention can achieve excellent performance in obtaining visual region features for VQA tasks By adopting a bottom-up attention network as our visual processor, whose feature extractor is Faster R-CNN object detector built upon Resnet-101 backbone and pre-trained on Visual Genome dataset For each image, we extract a dynamic number n [10,100] of region proposals base on a confidence threshold After obtained n detected region, we extract its 2048-dim features by apply mean-pooling to each detected region Finally, the input image is represented as a feature matrix X R (n x 2048) 4.2 Textual feature Unlike other languages, Vietnamese has a complex grammar system, a wide range of synonyms and compound words To overcome these issues, we develop a Vietnamese language processor for the VQA task, which not only can efficiently extract textual features but able to deal with the aforementioned problems Thanh et al., 2018 [10] presented an NLP annotation pipeline for Vietnamese, providing rich linguistic annotations through key NLP components of word segmentation, POS tagging, named entity recognition (NER), and dependency parsing To address compound word problems, we adopt this toolkit to segment and tokenize input question into words Due to being carefully segmented before tokenized, the language processor not only can reduce the number of the tokens without losing any semantic understanding but it also improves the quality of the textual features Then, we trim these tokens into a maximum of only 10 tokens for each question Traditional works aim to extract textual features by sourcing embedding vectors for each word in the input question and then feed them through a recurrent neural network The question feature is the hidden state of the last recurrent unit In contrast, the input question in our model is considered as a whole block instead of separate word tokens to guarantee an adequate semantic understanding for the question modality which increases overall accuracy We use PhoBERT [11], which is a recently state-of-the-art technique in the Vietnamese NLP community, as our language model to extract features for the input question PhoBERT has shown its strength in various natural language processing tasks in Vietnam by significantly boosting accuracy A lot of works [12][13] have shown that GRU is better than LSTM not only in terms of accuracy and precision but also in terms of computational runtime efficiency For that reason, after being fed through a feed-forward layer to reduce features dimension from 768 to 300, the word embeddings are then passed through a GRU unit to produce textual features Figure Our proposed question processor for Vietnamese VQA task 4.3 Co-attention layer Co-attention layer consists of two attention units, the self-attention unit and the guided-attention unit Being inspired by Transformer [13], a well-known model for tasks such as machine translation and text summarization, the operating principle of both attention units is similar to multi-head attention This mechanism is based on scaled dotproduct attention Scale dot-product attention takes input including three components: query vector q, key vector k and value vector v Three vectors have the same dimension d To process many inputs concurrently, the queries, keys and values are packed into matrices Q, K, V respectively For each query vector q, the output attended feature vector is obtained by the following formula: F = Attention (q, K, V) = Softmax((q*KT) / d0.5) * V (1) EFFECTIVELY APPLYING VIETNAMESE FOR VISUAL QUESTION ANSWERING Q, K, V stands for query, key, value matrix respectively d stands for dimension of query vector q It should be noted that this attention function is performed on all queries (matrix Q) at the same time Instead of getting only an output vector, we obtain an attended feature matrix To boost the representation capacity of the attended features, multi-head attention is proposed It is composed of a fixed number of paralleled independent scaled dothead is executed with given queries Q, keys K, values V simultaneously, generate output attended features, followed by (1) Then, all of them are concatenated to obtain the final output, as shown in the following formulas: F = Multi-head Attention (q, K, V) = Concatenation ( head headX = Attention (Q*WXQ, K*WXK, V*WXV) N ) * WO (2) (3) th headX stands for X head a scaled dot-product attention WXQ, WXK, WXV stand for projection matrices of the Xth head WO stands for learned weight matrix Figure The architecture of multi-head attention module Self-attention unit consists of a multi-head attention layer and a pointwise feed-forward layer Its input is visual features from the image representations Given input visual feature matrix is decomposed into three components: query matrix, key matrix and value matrix Then, they are forwarded to a multi-head attention layer to learn the pairwise relationship between each region pair in the image, obtain output attended visual feature matrix after being transformed by the feed-forward layer Guided-attention unit takes input which is composed of both visual features and textual features The visual features are guided by textual features In other words, the semantic content of the question can help VQA model capture suitable image regions, which makes it easier to understand the image better Given input visual feature matrix is transformed into query matrix, given input textual feature matrix is decomposed into two components: key matrix and value matrix Then, they are forwarded to the multi-head attention layer to learn the pairwise relationship between each pair that come from image and question, obtain output attended mixed feature matrix after being transformed by the feedforward layer Nguy n B o Phúc, Tr n Hoàng Nguyên, Qu Figure Architecture of self-attention unit (left) and guided-attention unit (right) 4.4 Our proposal model In this section, we demonstrate our proposal Vietnamese VQA model in detail The detailed structure of our model is shown in Figure Firstly, we transform image and question into visual feature matrix and contextual feature matrix respectively, which is described in detail in section 3.1 Secondly, we utilize the power of co-attention layers After that, all components are forwarded to two fully connected layers, and we obtain the final-stage visual feature matrix and finalstage textual feature matrix They are fused by a fusion layer, and we obtain a joint presentation of both image and question Finally, by being forwarded to a non-linear layer, the score of each candidate answer is predicted by linear mapping, which finalizes answer prediction Figure Architecture of our proposed model Binary cross entropy (BCE) is used as the loss function to train our Vietnamese VQA model as the classifier of top N answers V EXPERIMENTAL RESULTS In this section, we perform experiments to evaluate the accuracy of different variants of our model We first revisit the Vietnamese VQA dataset and some implementation details in Section 4.1 In Section 4.2, we conduct several ablation studies to show the effectiveness of our proposed model Finally, we discuss our model after obtained optimal parameters 5.1 Dataset and implementation details EFFECTIVELY APPLYING VIETNAMESE FOR VISUAL QUESTION ANSWERING We use our Vietnamese VQA to conduct some ablation studies To extract semantic features from the input question, we use a pre-trained phobert-base model And only the hidden state corresponding to the last attention block features is extracted as semantic word embeddings, thereby obtaining 768-dim for each token of the input question The dimensionality of the input images, the output of the GRU unit (also the input question features) are 2048 and 512 respectively Following some previous works, we set the dimension of multi-head attention as 512 for the small model, and 1024 the for large model = 0.98 All models are trained up to 15 epochs with the same batch size of 128 5.2 Ablation studies We conduct a number of ablations on our Vietnamese VQA dataset to investigate how efficient our model is The results shown in Figure and Table are discussed in detail below: Figure Accuracies vs co-attention depths on Vietnamese VQA dataset In Figure 7, we demonstrate that the overall accuracy is considerably affected by the co-attention depth With increasing co-attention depth, they show the fact that the overall accuracies steadily improve in both variants This explains why incorporating a self-attention unit and a guided-attention unit can help our model deal with the problem of multimodal learning Table Ablation experiments for our model on the Vietnamese VQA dataset From table 1, we can see that: First, our language processor with PhoBERT as a language model to extract word embeddings and a segmentation layer can significantly improve the accuracy More specifically, our proposed model with a language processor outperforms the baseline about more than 2% accuracy on the whole test set, in terms of both candidates use the setting of small model with hidden layers This is because our language processor captures rich semantics over the input question features through its word - segmenter and the PhoBERT-base model Second, by increasing hidden layer and hidden dimension, the performances of our models steadily improve and achieve 67.90% overall accuracy for our large model This verifies the fact that the co-attention mechanism plays a key role in multimodal learning tasks, particularly in Vietnamese VQA 5.3 Experimental results Our best single model achieves the result of 67.90% accuracy on our test set We also ensemble several models and conduct major votes to obtain a more accurate answer With an ensemble of models, the result has a great improvement and achieves 68.76% accuracy on the test set Nguy n B o Phúc, Tr n Hoàng Nguyên, Qu VI CONCLUSION In this paper, we framed the problem into a VQA task with Vietnamese using co-attention learning framework and state-of-the-art technique in the Vietnamese NLP community Additionally, since there was no suitable dataset which is directly supported for our task, we presented a novel Vietnamese VQA dataset Our dataset can help the Vietnamese VQA task achieve impressive performance By quantitatively and qualitatively evaluating our model on our proposed Vietnamese VQA benchmark dataset, our results shows that image and the language-specific processor can significantly improve the final accuracy of the Vietnamese VQA task VII ACKNOWLEDGEMENTS We would like to thank Dr Quan Thanh Tho, Associate Professor in the Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT) for his support for our research work VIII REFERENCES [1] Aishwarya Agrawal, Jiansen Lu, Stanislaw Antol, Margeret Mitchell, C.Lawrence Zitnick, Dhruv Batra, Devi nt arXiv:1505.00468, 2015 [2] [3] for visual Jin-Hwa Kim, Sang-Woo Lee, Donghyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, Byoung-Tak (NIPS), pp 361 369, 2016 [4] Stacked attention networks for image (CVPR), pp 21 29, 2016 [5] compact bilinear pooling for visual question answering and visual arXiv:1606.01847, 2016 [6] -attention , IEEE International Conference on Computer Vision (ICCV), pp 1839 1848, 2017 [7] [8] [9] -image co-attention for visual ems (NIPS), pp 289 297, 2016 Hyeonseob Nam, Jung-attention networks for Visual Question [10] nce of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pp 56-60, 2018 [11] of the Association for Computational Linguistics: EMNLP, pp1037-1042, 2020 in Findings [12] S Yang, X Yu and Y Zhou, "LSTM and GRU Neural Network Performance Comparison Study: TakingYelp Review Dataset as an Example," 2020 International Workshop on Electronic Communication and Artificial Intelligence (IWECAI), pp 98-101, 2020 [13] S Islam, D Valles and M R J Forstner, "Performance Analysis and Evaluation of LSTM and GRU Architectures for Houston toad and Crawfish frog Call Detection," 2020 11th IEEE Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), pp 0106-0111, 2020 EFFECTIVELY APPLYING VIETNAMESE FOR VISUAL QUESTION ANSWERING TRI N KHAI MƠ HÌNH TR L I CÂU H I D A TRÊN HÌNH NH B NG TI NG VI T - MSm - B SATU PROJECT It’s our honor to work with Dr Quan Thanh Tho, Associate Professor in the Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT) and Dr Shahab Shamshirband, Associate Professor at the National Yunlin University of Science and Technology (Taiwan) on a project concerning about an AI application for environmental protection In a more detailed way, we are discussing about a optimized solution utilising VQA for predicting and warning natural disasters as early as possible The following figure shows an application of VQA for predicting flood - a natural disaster that we have implemented by being based on a public flood dataset for VQA task CHAPTER B SATU PROJECT 112 Figure B.1: Our application of VQA for predicting the potential of natural disaster For example, giving a question "what is the overall condition of the given image ?", VQA can generate answer based on the visual content of the image The answer here is "Non-flooded" From that, we think we can apply VQA to develop a surveillance system to warning about natural disasters We now are mainly focusing on wildfires Until now, we have finalized a complete pipeline for building a wildfires surveillance system Based on our best knowledge, there have not been an available wildfires dataset for VQA task yet This is the reason why we decide to firstly prepare dataset for wildfires by collecting wildfires images taken by unmanned aerial vehicle (UAV, commonly known as a drone) All of them are manually annotated Inspired from VQA-v2 dataset, for each image, we provide three or four questions and three or four ground truth answers respectively Each question is classified into a type of question We have four types: conditional recognition, simple counting, complex counting and yes/no Dataset preparation is one of the most challenging tasks It requires much effort and time At the present, we are trying our best to complete wildfires benchmark dataset as we described above As our plan, a classifier is needed to determine what wildfires images are, which prevents us from spending much time on unsuitable images Then, we gather information from those wildfires images by forwarding them into a well-trained VQA model to make a decision In fact, each decision made from visual content of wildfires image can be a trigger, which determines what we need to to cope with natural disaster From that, we can minimize the damage the wildfires causes The following figure describes clearly how we develop our propose wildfires surveil- CHAPTER SATU PROJECT lance system that makes use of VQA Figure B.2: Our proposed wildfires surveillance system pipeline 113 ... Declaration of Authenticity We assure that the graduation thesis "Effectively apply Vietnamese for Visual Question Answering system" is the original report of our research We have finalized our... Usecase diagram of Vietnamese VQA Web System Usecase diagram of Vietnamese VQA Mobile System Activity diagram of Vietnamese VQA Web System Activity diagram of Vietnamese VQA Mobile System Component... and visual impaired With this dissertation, we propose a novel visual question answering model in Vietnamese and a development of practical systems that utilize our model to address the aforementioned