Using retrieval augmentation and deep generative models to build question answering systems

VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

GRADUATION THESIS

Using Retrieval Augmentation and Deep Generative Models to Build Question Answering Systems

THESIS COMMITTEE: COMPUTER SCIENCE
SUPERVISOR: ASSOC. PROF. PHAM TRAN VU
REVIEWER: DR. LE THANH VAN
STUDENT: NGUYEN KHAC HAO - 1852346

HO CHI MINH CITY, JANUARY 2023

Declaration of Authenticity

I declare that the research for this thesis is my own work, conducted under the supervision and guidance of Assoc. Prof. Pham Tran Vu. The results of this research are legitimate and have not been published in any form prior to this. All materials used within this research were collected by myself from various sources and are appropriately listed in the references section. In any case of plagiarism, I stand by my actions and will be responsible for them. Ho Chi Minh City University of Technology is therefore not responsible for any copyright infringements conducted within my research.

Acknowledgement

I would like to thank my instructor, Assoc. Prof. Pham Tran Vu, not only for his academic guidance and assistance, but also for his patience and personal support, for which I am truly grateful. I also express my gratitude to Ho Chi Minh City University of Technology for giving me the opportunity to work on this research. Finally, I would like to express my deep sense of gratitude to the research teams and individuals working in computer science who made their incredible work publicly accessible. Open research papers, datasets and tools are the fundamental elements that helped make this thesis possible.

Abstract

Recent developments in information technology have given rise to a new generation of conversational applications. A large number of these applications are question answering systems, where the user can ask an information-seeking question and the application replies with the corresponding information. Recognizing the growth of this kind of application, I set out to build a general-purpose Vietnamese dialogue system that can be quickly adapted to any domain, reducing the amount of development work needed to build a question answering application. In this work, I introduce a Vietnamese retrieval-augmented question answering system and explore ways to improve question answering accuracy in Vietnamese. I also introduce new Vietnamese question answering datasets, created using machine translation, and two question answering applications.

Contents

1 Introduction
  1.1 Problem Statement
    1.1.1 High computational complexity of deep language models
    1.1.2 Cost of training deep language models
    1.1.3 Limited amount of Vietnamese training data
  1.2 Objective
  1.3 Significance
    1.3.1 Academic significance
    1.3.2 Practical significance
  1.4 Scope
  1.5 Thesis Structure

2 Theoretical Background
  2.1 Transformer
    2.1.1 The encoder
    2.1.2 The decoder
    2.1.3 Use
  2.2 BARTpho
    2.2.1 Architecture
    2.2.2 Pre-training
    2.2.3 Pre-training data
    2.2.4 Pre-training objective
    2.2.5 Use
  2.3 mBERT
    2.3.1 Architecture
    2.3.2 Pre-training
    2.3.3 Pre-training data
    2.3.4 Pre-training objective
    2.3.5 Use
  2.4 XLM-R
    2.4.1 Architecture
    2.4.2 Pre-training
    2.4.3 Pre-training data
    2.4.4 Pre-training objective
    2.4.5 Use
  2.5 Contriever
    2.5.1 Architecture and pre-training
    2.5.2 Use
  2.6 VinAI Translate
  2.7 mT5
    2.7.1 Use
  2.8 Retrieval Augmentation
  2.9 Question Answering
    2.9.1 Reading comprehension question answering
    2.9.2 Closed-book question answering
    2.9.3 Closed-domain question answering
    2.9.4 Open-domain question answering
3 Implementation
  3.1 Creating Vietnamese Question Answering Datasets
    3.1.1 Preprocessing
    3.1.2 Translation
    3.1.3 Post-processing
    3.1.4 Result
  3.2 Reading Comprehension Model
    3.2.1 Experimenting with fine-tuning data mixtures
    3.2.2 Further analysis on multilingual fine-tuning
    3.2.3 Combining best strategies
    3.2.4 Model experiment 1: BARTpho VQA
    3.2.5 Model experiment 2: mBERT VQA
    3.2.6 Model experiment 3: XLM-R VQA
  3.3 Retrieval-Augmentation System
    3.3.1 The Retriever
    3.3.2 The Generator

4 Application
  4.1 Academic Counselling System
    4.1.1 Software design

5 Evaluation
  5.1 Vietnamese Question Answering Accuracy
  5.2 English Question Answering Accuracy
  5.3 Answer Retrieval Accuracy
    5.3.1 Mean reciprocal rank
    5.3.2 Exact match
  5.4 Real-life Testing

6 Conclusion and Future Development
  6.1 Accomplished Results
  6.2 Limitations
  6.3 Future Plans

Chapter 4: Application

4.1 Academic Counselling System
4.1.1 Software design

Functional requirements

  - Input: a question related to an academic topic
  - Output: a short answer to the question

• Users can view alternative answers when the default answer is not correct.
• Administrators can quickly update the system by updating the documents relating to academic rules, without changing the system code or models.

Non-functional requirements

• Processing time for each question is less than seconds.
• The system can work with at least 200 documents.
• The system can process documents of any length.

Use-case diagram

Figure 4.1: Use-case diagram for the user and administrator of the Academic Counselling system.

System architecture

Figure 4.2: A diagram of the architecture of the academic counselling application.
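
Figure 4.2 survives here only as a caption, so as a rough illustration of the pipeline it depicts, the sketch below wires a mContriever-style dense retriever to a reading comprehension reader using the Hugging Face transformers library. The checkpoint names are placeholders (the deployed system uses its own fine-tuned mContriever and XLM-R VQA checkpoints), and the mean pooling follows Contriever's published recipe; this is a minimal sketch under those assumptions, not the application's actual code.

```python
# Minimal sketch of the Retriever + Generator flow (illustrative only).
# Checkpoint names are placeholders for the system's own fine-tuned models.
import torch
from transformers import AutoTokenizer, AutoModel, pipeline

retriever_tok = AutoTokenizer.from_pretrained("facebook/mcontriever")
retriever = AutoModel.from_pretrained("facebook/mcontriever")
reader = pipeline("question-answering", model="path/to/xlm-r-vqa")  # placeholder path

def embed(texts):
    # Contriever-style embedding: mean-pool token states with the attention mask.
    batch = retriever_tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = retriever(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def answer(question, documents, top_k=3):
    # Rank documents by dot-product relevance, then read the best ones.
    scores = (embed([question]) @ embed(documents).T).squeeze(0)
    best = scores.topk(min(top_k, len(documents))).indices
    context = " ".join(documents[i] for i in best)
    return reader(question=question, context=context)["answer"]
```

In a deployed system the document embeddings would be precomputed and cached, since only the question changes at query time.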
Document format

Figure 4.3: Format of an academic document in the Academic Counselling System.

System user interface

Figure 4.4: Example of the Academic Counselling System answering questions by reading academic documents and generating an answer.

Figure 4.5: Example of the Academic Counselling System answering questions related to transfers.

Figure 4.6: Example of the Academic Counselling System answering questions related to courses and scheduling.

Figure 4.7: Example of the Academic Counselling System answering questions related to grading.

Figure 4.8: Example of alternative answers in the Academic Counselling System.

Chapter 5: Evaluation

5.1 Vietnamese Question Answering Accuracy

For Vietnamese question answering evaluation, I use the official evaluation script for the SQuAD v1.1 dataset, but replace the SQuAD dataset with the Vietnamese subset of the XQuAD multilingual question answering benchmark. The Vietnamese subset of XQuAD contains 1,190 data samples in multiple domains, written by professional human translators. The metrics used to evaluate model performance are F1 and exact match (EM).
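
For reference, a simplified sketch of the two SQuAD-style metrics is shown below. The official script additionally lower-cases, strips punctuation and English articles, and takes the maximum score over all gold answers; this version keeps only the core token-overlap logic.

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> float:
    # 1.0 only if the normalized answer strings are identical.
    return float(prediction.strip().lower() == gold.strip().lower())

def f1(prediction: str, gold: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over shared tokens.
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```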
I compared the models created in this work, BARTpho VQA and mBERT VQA, against baseline models and models from other works. The baseline models are:

• BARTpho Vietnamese: a BARTpho model trained on the Vietnamese-translated SQuAD dataset.
• BARTpho English: a BARTpho model trained on the English SQuAD dataset. For this model, this is a zero-shot transfer learning evaluation.
• BARTpho English + Vietnamese: a BARTpho model trained on a mixture of the English SQuAD dataset and the Vietnamese-translated SQuAD dataset.

BARTpho VQA is trained with the data mixture discussed in Chapter 3: a mixture of the English SQuAD dataset and translated Vietnamese data from multiple sources. The two models from other works are the two variants of the mT5 model.

Model                          Parameters   F1     EM
BARTpho English                150M         33.1   22.8
BARTpho Vietnamese             150M         54.9   36.1
BARTpho Vietnamese + English   150M         56.1   37.1
BARTpho VQA                    150M         60.1   41.2
mT5-Small                      300M         63.5   46.0
mBERT VQA                      110M         71.0   50.1
XLM-R VQA (Base)               270M         75.8   56.5
mT5-Base                       600M         75.9   56.3
XLM-R VQA (Large)              550M         80.0   58.1

Table 5.1: Evaluation scores on Vietnamese question answering for the baseline models, BARTpho VQA, mBERT VQA, XLM-R VQA and mT5. Scores for the mT5 models are taken from [16]. XLM-R VQA (Large) is the Generator model for the final system of this thesis.

As observed in Chapter 3, mixing high-quality English data with translated Vietnamese data improves Vietnamese question answering performance, and increasing the amount of Vietnamese data improves performance further: BARTpho Vietnamese + English outperformed BARTpho Vietnamese, and BARTpho VQA outperformed both. BARTpho VQA also achieves scores similar to mT5-Small, a 300-million-parameter model, despite having half as many parameters (150 million).

The encoder-only mBERT VQA outperformed encoder-decoder models such as BARTpho VQA and mT5-Small, despite having the fewest parameters (110 million), thanks to its architecture being more efficient for NLU tasks than encoder-decoder models.

XLM-R VQA was trained on a larger pre-training corpus and more languages, and has a larger vocabulary and more parameters than mBERT. It also does not have the Vietnamese word tokenization problem that mBERT has. It naturally outperformed mBERT, becoming the best model created in this work. XLM-R VQA (Large) outperformed mT5-Base by a significant margin, despite having fewer parameters.

5.2 English Question Answering Accuracy

Since the fine-tuning data mixture contains English question answering data, I checked whether the resulting models can perform English question answering by evaluating them on the English SQuAD v1.1 dataset.

Model               F1 (%)   EM (%)
mT5-Small           84.7     76.4
XLM-R VQA (Base)    86.6     79.0
mT5-Base            89.6     83.8
XLM-R VQA (Large)   90.5     84.2

Table 5.2: English question answering scores for XLM-R VQA and mT5. Scores for the mT5 models are taken from [16]. XLM-R VQA (Large) is the Generator model for the final system of this thesis.

The model also achieves impressive scores in English question answering, making it a bilingual (Vietnamese and English) question answering model, a benefit of the English + Vietnamese fine-tuning data mixture.

5.3 Answer Retrieval Accuracy

In this evaluation, I test the accuracy of the Retriever component of the system, mContriever. The evaluation dataset is a small set of academic questions collected from real life. The set has 36 data samples.

Figure 5.1: Examples of the retrieval testing data for the Academic Counselling System. The data is collected from real-life frequently asked questions.

5.3.1 Mean reciprocal rank

In the mean reciprocal rank test, the model grades the relevance between a given question and each answer in the answer set. Mean reciprocal rank (MRR) is then calculated from the rank of the correct answer: the score increases as the most suitable document for the query is ranked higher. In this evaluation, both systems are only allowed to read answer data. In actual use, the system is allowed to read both answer data and example question data, which increases accuracy.
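
Concretely, MRR averages the reciprocal rank of the correct answer over all test questions: MRR = (1/|Q|) * sum over queries of 1/rank_i, where rank_i is the 1-based position of the correct answer in the system's ranking. A minimal sketch of the computation:

```python
def mean_reciprocal_rank(rankings):
    # rankings: for each query, the 1-based rank of the correct answer
    # in the retriever's sorted result list.
    return sum(1.0 / rank for rank in rankings) / len(rankings)

# Example: correct answers ranked 1st, 3rd, and 2nd for three queries.
print(mean_reciprocal_rank([1, 3, 2]))  # (1 + 1/3 + 1/2) / 3 ≈ 0.611
```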
                       mContriever   TF-IDF
Mean reciprocal rank   0.5353        0.5352

Table 5.3: Mean reciprocal rank scores for answer retrieval in the academic counselling system.

The mean reciprocal rank test showed that mContriever performed better than TF-IDF, though the difference is not significant. I therefore used another evaluation metric, exact match (EM), to better measure the difference.

5.3.2 Exact match

In the exact match test, I run the same testing set over the two systems. Only the top-1 retrieval result is collected. If this result matches the correct answer, the score is 1.0; otherwise, the score is 0.0. In this evaluation, both systems are only allowed to read answer data. In actual use, the system is allowed to read both answer data and example question data, which increases accuracy.
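
A short sketch of this top-1 protocol is shown below; retrieve is a hypothetical function standing in for either retriever, returning candidate answers sorted best-first.

```python
def top1_exact_match(test_set, retrieve):
    # test_set: list of (question, correct_answer) pairs.
    # retrieve(question) -> list of candidate answers, best first.
    hits = sum(1 for question, gold in test_set
               if retrieve(question)[0] == gold)
    return hits / len(test_set)
```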
      mContriever   TF-IDF
EM    0.44          0.36

Table 5.4: Exact match scores for answer retrieval in the academic counselling system.

In the exact match test, mContriever showed a more significant performance gain over TF-IDF. This suggests that for answer retrieval, a dense retriever like mContriever is a better choice than traditional methods.

5.4 Real-life Testing

The Academic Counselling System was presented to the Academic Department of HCMUT for real-life, non-scientific testing and feedback collection. In a small test, the system passed 4/5 frequently asked questions from students. The employees estimated that the system could potentially reduce their workload by 50%. However, it was also noted that more academic data should be added to the system.

Chapter 6: Conclusion and Future Development

6.1 Accomplished Results

In this work, I have accomplished the following:

• New datasets for Vietnamese question answering, with a total of 441,999 data samples. The Vietnamese datasets were made using neural machine translation and various processing techniques, applied to the following English question answering datasets:
  - SQuAD
  - MS MARCO
  - Natural Questions
  - NarrativeQA
  - Quoref
  These datasets were released for public access.
• A new bilingual question answering model, XLM-R VQA Base, which outperformed Google's mT5-Small in Vietnamese and English reading comprehension despite being smaller in size.
• A new bilingual question answering model, XLM-R VQA Large, a larger variant of XLM-R VQA Base, which outperformed Google's mT5-Base in Vietnamese and English reading comprehension despite being smaller in size.
• A Vietnamese retrieval-augmented question answering system, which can be adapted to any domain without modification. The system itself is an open-domain question answering system.
• A new application built using the retrieval-augmented system: the Academic Counselling System.
• A small test set for evaluating academic counselling systems, with data collected from real life.

The Academic Counselling System was also presented to the Academic Department of Ho Chi Minh City University of Technology for real-life testing and feedback collection.

6.2 Limitations

Some limitations of this work include:

• Since the Vietnamese datasets were built using machine translation, even with careful post-processing algorithms they still show a noticeable reduction in quality compared to the original English datasets.
• Though the question answering and retrieval models in this work achieved competitive scores against models from other works, they are still not sufficiently accurate.

6.3 Future Plans

With self-developed datasets and a careful data mixture strategy, the models in this work achieved competitive scores compared to models from other works. Unfortunately, the accuracy of these models is still not sufficiently high. I suggest some future development directions that could improve model accuracy:

• Build a high-quality, large-scale Vietnamese question answering dataset, using data from crowdworkers instead of translation.
• Add user feedback and suggestion features to the academic counselling application. The user feedback and suggestions can also become useful training data for the models.
• Experiment with training the Retriever and Generator end-to-end.
• Build a larger academic counselling evaluation dataset.

As mentioned before, the reading comprehension models created in this work are bilingual, which means the Academic Counselling System can already perform English question answering. With some small adaptation work, the system could be used by international students.

Many dialogue systems today have multi-turn conversational capabilities, meaning they can memorize and use information from past interactions. Multi-turn conversational capability can be added to the retrieval-augmented system by training it on datasets such as CoQA [31] and QuAC [32].
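
One common, simple way to adapt a single-turn reader to such multi-turn data is to flatten the dialogue history into the current question before retrieval and reading. The sketch below assumes this concatenation format; it is one conventional choice for illustration, not CoQA's or QuAC's prescribed input encoding.

```python
def build_multiturn_input(history, question, context):
    # history: list of (past_question, past_answer) turns, oldest first.
    # Flatten earlier turns into the current question so a single-turn
    # reader can condition on the conversation so far.
    past = " ".join(f"Q: {q} A: {a}" for q, a in history)
    full_question = f"{past} Q: {question}".strip()
    return {"question": full_question, "context": context}
```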
Bibliography

[1] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu (2020), Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv preprint arXiv:1910.10683.

[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017), Attention Is All You Need. arXiv preprint arXiv:1706.03762.

[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (2018), BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.

[4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei (2020), Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165.

[5] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le (2020), XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237.

[6] Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto (2020), LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention. arXiv preprint arXiv:2010.01057.

[7] Nguyen Luong Tran, Duong Minh Le, Dat Quoc Nguyen (2021), BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese. arXiv preprint arXiv:2109.09701.

[8] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer (2019), BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv preprint arXiv:1910.13461.

[9] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer (2020), Multilingual Denoising Pre-training for Neural Machine Translation. arXiv preprint arXiv:2001.08210v2.

[10] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut (2020), ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv preprint arXiv:1909.11942.

[11] Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf (2020), DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

[12] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov (2019), Unsupervised Cross-lingual Representation Learning at Scale. arXiv preprint arXiv:1911.02116v1.

[13] Guillaume Lample, Alexis Conneau (2019), Cross-lingual Language Model Pretraining. arXiv preprint arXiv:1901.07291.

[14] Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, Edouard Grave (2021), Unsupervised Dense Information Retrieval with Contrastive Learning. arXiv preprint arXiv:2112.09118.

[15] Thien Hai Nguyen, Tuan-Duy H. Nguyen, Duy Phung, Duy Tran-Cong Nguyen, Hieu Minh Tran, Manh Luong, Tin Duy Vo, Hung Hai Bui, Dinh Phung, Dat Quoc Nguyen (2022), A Vietnamese-English Neural Machine Translation System. In Proceedings of the 23rd Annual Conference of the International Speech Communication Association: Show and Tell (INTERSPEECH).

[16] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel (2020), mT5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934.

[17] Omar Khattab, Christopher Potts, Matei Zaharia (2021), Building Scalable, Explainable, and Adaptive NLP Models with Retrieval. https://ai.stanford.edu/blog/retrieval-based-NLP/

[18] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, Ming-Wei Chang (2020), REALM: Retrieval-Augmented Language Model Pre-Training. arXiv preprint arXiv:2002.08909.

[19] Kurt Shuster, Mojtaba Komeili, Leonard Adolphs, Stephen Roller, Arthur Szlam, Jason Weston (2022), Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion. arXiv preprint arXiv:2203.13224.

[20] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela (2021), Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv preprint arXiv:2005.11401.

[21] Kiet Van Nguyen, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen (2020), A Vietnamese Dataset for Evaluating Machine Reading Comprehension. arXiv preprint arXiv:2009.14725.

[22] Kiet Van Nguyen, Tin Van Huynh, Duc-Vu Nguyen, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen (2020), New Vietnamese Corpus for Machine Reading Comprehension of Health News Articles. arXiv preprint arXiv:2006.11138.

[23] Pranav Rajpurkar, Robin Jia, Percy Liang (2018), Know What You Don't Know: Unanswerable Questions for SQuAD. arXiv preprint arXiv:1806.03822.

[24] Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang (2016), MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv preprint arXiv:1611.09268.

[25] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, Slav Petrov (2019), Natural Questions: a Benchmark for Question Answering Research.

[26] Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, Edward Grefenstette (2017), The NarrativeQA Reading Comprehension Challenge. arXiv preprint arXiv:1712.07040.

[27] Pradeep Dasigi, Nelson F. Liu, Ana Marasović, Noah A. Smith, Matt Gardner (2019), Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning. arXiv preprint arXiv:1908.05803.

[28] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang (2016), SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv preprint arXiv:1606.05250.

[29] Mikel Artetxe, Sebastian Ruder, Dani Yogatama (2019), On the Cross-lingual Transferability of Monolingual Representations. arXiv preprint arXiv:1910.11856.

[30] Kiet Van Nguyen, Khiem Vinh Tran, Son T. Luu, Anh Gia-Tuan Nguyen, Ngan Luu-Thuy Nguyen (2020), Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice machine reading comprehension. arXiv preprint arXiv:2001.05687.

[31] Siva Reddy, Danqi Chen, Christopher D. Manning (2018), CoQA: A Conversational Question Answering Challenge. arXiv preprint arXiv:1808.07042.

[32] Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, Luke Zettlemoyer (2018), QuAC: Question Answering in Context. arXiv preprint arXiv:1808.07036.
