Question generation with T5 model

Nguyen Duc Huy - 20020672 - K65K
University of Engineering and Technology

Abstract

The goal of question and answer generation (QAG) is to create a set of questions and answers based on a given context (e.g. a paragraph). This task can be useful for many purposes, such as improving question answering (QA) models, finding information and teaching. In this paper, we compare three different QAG methods that use sequence-to-sequence language model (LM) fine-tuning. We show that a simple QAG model, which is fast and easy to train and use, is generally better than other more complex methods. However, the performance depends on the type of generative LM. We also show that QA models trained only on generated questions and answers can be close to supervised QA models trained on human-labeled data.

1 Introduction

Question and answer generation (QAG) is the task of creating pairs of questions and answers based on an input text such as a document, a paragraph or a sentence. QAG can be used to train question answering (QA) models without human supervision ([16]; [24]; [18]) and to enhance QA model understanding ([23]; [1]). Moreover, QAG is useful for educational systems ([6]; [10]), information retrieval models ([19]; [9]), and model interpretation ([17]; [7]).

QAG is derived from question generation (QG) ([12]; [5]; [28]; [4]), the task of generating a question given an answer and an input text. QG has been widely studied in the era of language models ([14]; [26]), but QAG is a more challenging task, since the answer also needs to be generated rather than given as part of the input. It is therefore not clear which kinds of QAG models work well in practice, as there has been no comprehensive comparison so far.

In this paper, we define QAG as the task of generating question-answer pairs given a text, and compare three simple QAG methods based on fine-tuning encoder-decoder language models (LMs) such as T5 ([20]) and BART ([8]). Our three methods (shown in Figure 1) are: (1) pipeline QAG, which splits the task into answer extraction and question generation and learns a separate model for each subtask; (2) multitask QAG, which trains both subtasks with a single shared model instead of independent ones; and (3) end2end QAG, which uses end-to-end sequence-to-sequence learning to generate question-answer pairs directly. Finally, we compare these three methods in a multi-domain QA-based evaluation, where QA models are trained on the question-answer pairs that each QAG model generates.

2 Preliminary

2.1 Pipeline Question Answer Generation (QAG)

The QAG task can be broken down into two distinct subtasks: answer extraction (AE) and question generation (QG). In the AE step, the model P_ae produces an answer candidate ã from a given sentence s within its context c. Subsequently, the QG model P_qg generates a question q̃ designed to be answered by the generated answer ã within the context c. Both the AE and QG models can be trained independently on paragraph-level QG datasets that comprise quadruples (c, s, a, q) by maximizing the conditional log-likelihood; at inference time the predictions are obtained as

    ã = argmax_a P_ae(a | c, s)        (1)
    q̃ = argmax_q P_qg(q | c, s, a)     (2)

The log-likelihood is factorized into token-level predictions, following the standard sequence-to-sequence learning setting [25]. In practical terms, the input to the AE model is structured as

    [c_1, ..., <hl>, s_1, ..., s_|s|, <hl>, ..., c_|c|]

where s_i and c_i denote the i-th token of s and c respectively, |·| is the number of tokens in a text, and <hl> is the highlight token used to mark the sentence within the context. This follows the QG formulation of [2] and [27]. Similarly, the input to the QG model highlights the answer instead:

    [c_1, ..., <hl>, a_1, ..., a_|a|, <hl>, ..., c_|c|]

where a_i is the i-th token of a. During inference, the gold answer a in the QG model (2) is replaced with the prediction from the AE model (1). Inference is then run over all sentences in the context c to generate question-answer pairs. As a result, the pipeline approach can generate at most as many pairs as there are sentences in c.
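To make the pipeline formulation concrete, the sketch below shows how the AE and QG inputs can be assembled with the <hl> highlight token and passed through two fine-tuned sequence-to-sequence models. It is a minimal illustration under stated assumptions: the checkpoint paths are placeholders (not published model names), greedy decoding is assumed, and the helper names are ours.

```python
# Minimal sketch of pipeline QAG (Section 2.1) with Hugging Face transformers.
# The checkpoint paths below are hypothetical placeholders for AE and QG models
# fine-tuned as described in the paper.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

AE_CKPT = "path/to/answer-extraction-model"    # hypothetical fine-tuned AE model
QG_CKPT = "path/to/question-generation-model"  # hypothetical fine-tuned QG model

ae_tok = AutoTokenizer.from_pretrained(AE_CKPT)
ae_model = AutoModelForSeq2SeqLM.from_pretrained(AE_CKPT)
qg_tok = AutoTokenizer.from_pretrained(QG_CKPT)
qg_model = AutoModelForSeq2SeqLM.from_pretrained(QG_CKPT)


def generate(model, tok, text, max_new_tokens=64):
    """Greedy seq2seq generation for a single input string."""
    input_ids = tok(text, return_tensors="pt", truncation=True).input_ids
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tok.decode(output_ids[0], skip_special_tokens=True)


def highlight(context, span):
    """Wrap one occurrence of a span in <hl> tokens, as in the AE/QG inputs above."""
    return context.replace(span, f"<hl> {span} <hl>", 1)


def pipeline_qag(context, sentences):
    """Generate (question, answer) pairs: one AE call and one QG call per sentence (Eqs. 1-2)."""
    pairs = []
    for sentence in sentences:
        answer = generate(ae_model, ae_tok, highlight(context, sentence))
        if answer not in context:  # skip predictions that cannot be highlighted in the context
            continue
        question = generate(qg_model, qg_tok, highlight(context, answer))
        pairs.append((question, answer))
    return pairs
```

The multitask variant described next would use a single shared model for both calls, prepending the task prefix "extract answer" or "generate question" to the input to distinguish the subtasks.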
2.2 Multitask Question Answer Generation (QAG)

Rather than training an independent model for each subtask, a single shared model can be fine-tuned on both answer extraction (AE) and question generation (QG) with multitask learning. Specifically, the training instances for AE and QG are combined, and batches are randomly sampled from the mixture at each fine-tuning iteration. To distinguish between the subtasks, a task prefix is added at the beginning of the input text: "extract answer" for AE and "generate question" for QG.

2.3 End-to-End Question Answer Generation (QAG)

Instead of decomposing QAG into two separate components, an alternative is to model it directly by converting the question-answer pairs of a context into a single flattened sentence y, and fine-tuning a sequence-to-sequence model to generate y from c. A function T maps the set of question-answer pairs Q_c to a sentence as

    T(Q_c) = "t(q_1, a_1) | t(q_2, a_2) | ..."    (3)
    t(q, a) = "question: q, answer: a"            (4)

where each pair is textualized with the template (4) and the pairs are joined by the separator "|". The end-to-end QAG model P_qag is then optimized by maximizing the conditional log-likelihood of y given c, and at inference time the flattened output is predicted as

    ỹ = argmax_y P_qag(y | c)    (5)
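As a small illustration of the textualization in equations (3)-(4), the helper below flattens a paragraph's question-answer pairs into the target sentence used for end2end fine-tuning, and parses a generated flattened sentence back into pairs. The separator and template strings mirror the formulation above; the function names are illustrative.

```python
# Sketch of the end2end QAG target construction (Eqs. 3-4) and its inverse.
# Each pair is textualized as "question: ..., answer: ..." and pairs are
# joined with the separator " | ", following the formulation above.

def flatten_pairs(pairs):
    """Turn [(q1, a1), (q2, a2), ...] into one flattened target sentence y."""
    return " | ".join(f"question: {q}, answer: {a}" for q, a in pairs)


def parse_pairs(flat):
    """Recover (question, answer) pairs from a generated flattened sentence."""
    pairs = []
    for chunk in flat.split(" | "):
        if "question:" in chunk and ", answer:" in chunk:
            q_part, a_part = chunk.split(", answer:", 1)
            pairs.append((q_part.replace("question:", "", 1).strip(), a_part.strip()))
    return pairs


# The end2end model is fine-tuned to generate flatten_pairs(Q_c) directly from the
# raw paragraph c, so a single generation call yields every pair for the paragraph.
example = [("What is QAG?", "question and answer generation")]
assert parse_pairs(flatten_pairs(example)) == example
```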
3.1 Evaluation Data

QAG models are trained on the SQuAD dataset ([21]). Since the output of these models consists of arbitrary questions and answers, traditional reference-based natural language generation (NLG) evaluation metrics used in question generation (QG) research, such as BLEU ([15]), METEOR ([3]), ROUGE (Lin, 2004) and [13], are deemed unsuitable.

3.2 Extrinsic Evaluation

To evaluate the QAG models, an extrinsic evaluation is conducted by training question answering (QA) models on data generated by the QAG models. SQuADShifts ([11]), an English reading comprehension dataset spanning four domains (Amazon, Wikipedia, News, Reddit), is employed. The train/validation/test splits from the QG-Bench dataset ([26]) are used for both SQuAD and SQuADShifts.

3.3 Multi-domain QA Evaluation

The evaluation involves generating question-answer pairs on each domain of SQuADShifts and subsequently fine-tuning DistilBERT ([22]) on the generated pseudo QA pairs. The evaluation metrics, F1 and exact match on the test set, serve as the target metrics. This multi-domain QA-based evaluation assesses each model's robustness across domains, with an overall performance figure obtained by averaging the metrics over the domains. Hyperparameter optimization during QA model fine-tuning is carried out with Tune, an efficient grid-search engine.
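For reference, the target metrics of this multi-domain evaluation are standard SQuAD-style scores. The sketch below is a simplified, self-contained version of exact match and token-level F1 (the official SQuAD script additionally strips articles and punctuation during normalization); the DistilBERT fine-tuning step itself is omitted.

```python
# Simplified sketch of the SQuAD-style target metrics (exact match and token-level F1)
# used in the multi-domain evaluation. Normalization here is lowercasing and whitespace
# tokenization only; the official SQuAD evaluation also removes articles and punctuation.
from collections import Counter


def exact_match(prediction: str, gold: str) -> float:
    return float(prediction.strip().lower() == gold.strip().lower())


def f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def domain_scores(examples):
    """examples: list of (predicted_answer, gold_answer) for one SQuADShifts domain."""
    em = 100 * sum(exact_match(p, g) for p, g in examples) / len(examples)
    f = 100 * sum(f1(p, g) for p, g in examples) / len(examples)
    return em, f

# The overall score is the average of the per-domain metrics over the four
# SQuADShifts domains (Amazon, Wikipedia, News, Reddit).
```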
3.4 Base Models

For the comparison of the different systems (pipeline, multitask and end2end), T5 ([20]) and BART ([8]) serve as the base language models (LMs). The model weights used are t5-small, t5-base and t5-large, and facebook/bart-base and facebook/bart-large, as shared on HuggingFace. Additionally, results from a QG-only model are reported: similar to the pipeline method, this model uses the gold answers from the provided QA training set as input, excluding the answer extraction (AE) component.

3.5 QAG Model Comparisons

So far, we have compared the three QAG approaches in terms of performance. However, performance is not the only criterion to consider when choosing a QAG model, since each approach has its own advantages and limitations in terms of computational cost and usability. From the perspective of computational complexity, end2end QAG is faster than the others at both training and inference, because it generates a number of question-answer pairs at once in a single paragraph pass. In contrast, both multitask and pipeline QAG need to process every sentence separately, and a single prediction consists of two generations (i.e. answer extraction and question generation). Essentially, the relative increase in computational cost from end2end QAG to pipeline/multitask QAG can be approximated by the average number of sentences per paragraph. In terms of memory requirements, both multitask and end2end QAG rely on a single model, whereas pipeline QAG consists of two models and therefore requires twice as much storage. Finally, while end2end is computationally the lightest model, both the pipeline and multitask approaches can generate a larger number of question-answer pairs on average, with the added benefit of being able to run the models on individual sentences.

Result

Table 1: SQuADShifts QA evaluation results (F1 and exact match on the test set).

Model        Domain       F1     Exact Match
Gold QA      Amazon       77.3   65.2
Gold QA      Wikipedia    77.8   66.1
Gold QA      News         75.6   62.4
Gold QA      Reddit       73.1   60.3
Gold QA      Average      76.0   63.5
BART-large   Amazon       75.7   63.1
BART-large   Wikipedia    76.2   64.0
BART-large   News         74.4   61.0
BART-large   Reddit       72.7   59.6
BART-large   Average      75.9   62.9
T5-large     Amazon       76.6   64.3
T5-large     Wikipedia    76.5   64.1
T5-large     News         75.2   61.8
T5-large     Reddit       72.5   59.4
T5-large     Average      74.5   61.2
T5-small     Amazon       73.4   60.1
T5-small     Wikipedia    73.7   60.5
T5-small     News         72.8   59.7
T5-small     Reddit       70.2   57.1

From Table 1 we can see that:

- Top-Performing Models: BART-large (multitask) and T5-large (end2end) emerge as the top two models, surpassing Gold QA in two out of four domains for both F1 score and exact match.
- Competitiveness of Smaller Models: even smaller models like T5-small exhibit competitive performance compared to using gold-standard question-answer pairs.
- Performance Discrepancies: the choice between BART-large (multitask) and T5-large (end2end) for the best performance is unclear; BART-large achieves the best average F1 score, while T5-large obtains the best average exact match.
- Model Suitability: T5 consistently performs better with the end2end QAG approach, while BART is less effective when used end-to-end. This discrepancy may be attributed to T5's exposure to structured information during multitask pre-training, unlike BART, which was trained solely on a denoising sequence-to-sequence objective.

Conclusion

The choice between BART-large (multitask) and T5-large (end2end) depends on which evaluation metric is prioritized, with both models demonstrating competitive performance across domains. The results also highlight the role of model size: even smaller models like T5-small can yield competitive results.

My demo

I have built a UI that generates questions and answers to be used as "FAQ" (Frequently Asked Questions) entries for products.
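A minimal sketch of the generation step behind such a demo is shown below, assuming an end2end QAG checkpoint is available. The checkpoint path and decoding settings are illustrative rather than the exact demo configuration; the parsing mirrors the flattened format of Section 2.3.

```python
# Sketch of the FAQ generation step behind the demo UI: one end2end QAG call per
# product description, followed by parsing of the flattened output into pairs.
# The checkpoint path is a hypothetical placeholder.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

CKPT = "path/to/end2end-qag-model"  # hypothetical fine-tuned end2end QAG model
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSeq2SeqLM.from_pretrained(CKPT)


def generate_faq(product_description: str, max_new_tokens=256):
    """Generate a list of (question, answer) FAQ entries from a product description."""
    input_ids = tokenizer(product_description, return_tensors="pt", truncation=True).input_ids
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens, num_beams=4)
    flat = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    faq = []
    for chunk in flat.split(" | "):  # same parsing as the Section 2.3 sketch
        if "question:" in chunk and ", answer:" in chunk:
            q, a = chunk.split(", answer:", 1)
            faq.append((q.replace("question:", "", 1).strip(), a.strip()))
    return faq
```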
References

[1] Max Bartolo et al. "Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation". In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021, pp. 8830–8848.
[2] Ying-Hong Chan and Yao-Chung Fan. "A Recurrent BERT-based Model for Question Generation". In: Empirical Methods in Natural Language Processing (2019).
[3] Michael Denkowski and Alon Lavie. "METEOR Universal: Language Specific Translation Evaluation for Any Target Language". In: Proceedings of the Ninth Workshop on Statistical Machine Translation. Baltimore, Maryland, USA: Association for Computational Linguistics, 2014, pp. 376–380.
[4] Xinya Du and Claire Cardie. "Harvesting Paragraph-level Question-Answer Pairs from Wikipedia". In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, 2018, pp. 1907–1917.
[5] Xinya Du, Junru Shao, and Claire Cardie. "Learning to Ask: Neural Question Generation for Reading Comprehension". In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics, 2017, pp. 1342–1352.
[6] Michael Heilman and Noah A. Smith. "Good Question! Statistical Ranking for Question Generation". In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2010, pp. 609–617.
[7] Dong Bok Lee et al. "Generating Diverse and Consistent QA Pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs". In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, 2020, pp. 208–224.
[8] Mike Lewis et al. "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension". In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, 2020, pp. 7871–7880.
[9] Patrick Lewis et al. "PAQ: 65 Million Probably-Asked Questions and What You Can Do with Them". In: Transactions of the Association for Computational Linguistics (2021), pp. 1098–1115.
[10] David Lindberg et al. "Generating Natural Language Questions to Support Learning On-Line". In: Proceedings of the 14th European Workshop on Natural Language Generation. Sofia, Bulgaria: Association for Computational Linguistics, 2013, pp. 105–114.
[11] John Miller et al. "The Effect of Natural Distribution Shift on Question Answering Models". In: International Conference on Machine Learning. PMLR, 2020, pp. 6905–6916.
[12] Ruslan Mitkov and Le An Ha. "Computer-Aided Generation of Multiple-Choice Tests". In: Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing, 2003, pp. 17–22.
[13] Alireza Mohammadshahi et al. "RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question". 2022. arXiv: 2211.01482 [cs.CL].
[14] Lidiya Murakhovs'ka et al. "MixQG: Neural Question Generation with Mixed Answer Types". In: Findings of the Association for Computational Linguistics: NAACL 2022. Seattle, United States: Association for Computational Linguistics, 2022, pp. 1486–1497.
[15] Kishore Papineni et al. "BLEU: A Method for Automatic Evaluation of Machine Translation". In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, 2002, pp. 311–318.
[16] Patrick Lewis, Ludovic Denoyer, and Sebastian Riedel. "Unsupervised Question Answering by Cloze Translation". In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4896–4910.
[17] Ethan Perez et al. "Unsupervised Question Decomposition for Question Answering". In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, 2020, pp. 8864–8880.
[18] Raul Puri et al. "Training Question Answering Models from Synthetic Data". In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 5811–5826.
[19] Valentina Pyatkin et al. "Asking It All: Generating Contextualized Questions for Any Semantic Role". In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021, pp. 1429–1441.
[20] Colin Raffel et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". In: Journal of Machine Learning Research 21 (2020), pp. 1–67.
[21] Pranav Rajpurkar et al. "SQuAD: 100,000+ Questions for Machine Comprehension of Text". In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics, 2016, pp. 2383–2392.
[22] Victor Sanh et al. "DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter". In: arXiv preprint arXiv:1910.01108 (2019).
[23] Siamak Shakeri et al. "End-to-End Synthetic Data Generation for Domain Adaptation of Question Answering Systems". In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, 2020, pp. 5445–5460.
[24] Shiyue Zhang and Mohit Bansal. "Addressing Semantic Drift in Question Generation for Semi-Supervised Question Answering". In: Empirical Methods in Natural Language Processing (2019).
[25] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. "Sequence to Sequence Learning with Neural Networks". In: Advances in Neural Information Processing Systems 27 (2014).
[26] Asahi Ushio, Fernando Alva-Manchego, and Jose Camacho-Collados. "Generative Language Models for Paragraph-Level Question Generation". In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, U.A.E.: Association for Computational Linguistics, Dec. 2022.
[27] Asahi Ushio, Fernando Alva-Manchego, and Jose Camacho-Collados. "Generative Language Models for Paragraph-Level Question Generation". In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, U.A.E.: Association for Computational Linguistics, 2022.
[28] Qingyu Zhou et al. "Neural Question Generation from Text: A Preliminary Study". In: National CCF Conference on Natural Language Processing and Chinese Computing. Springer, 2017, pp. 662–671.
