VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Nguyen Quoc An

AN ONTOLOGY-BASED IMPROVEMENT FOR MULTI-ANSWER SUMMARIZATION IN CONSUMER HEALTH QUESTION ANSWERING SYSTEM

GRADUATION THESIS
Major: Computer Science
Supervisors: Assoc. Prof. Tran Trong Hieu, MSc. Can Duy Cat

HA NOI - 2022

Abstract

Automatic question answering (QA) systems help users get quick answers to everyday questions. During the COVID-19 pandemic, healthcare has been one of the topics users care about most. In the era of information explosion, however, distilling helpful information from the responses of a QA system takes time. Multi-answer summarization is studied to address this problem: the model takes the user's question and all of its answers as input and returns a summary, which has been shown to aid information absorption. This thesis focuses on extractive summarization and presents ontology-based improvements to a baseline multi-answer summarization model for a consumer health question answering system. The work has two main sub-tasks: ontology construction and building an extractive multi-answer summarization model. The ontology construction task builds an ontology that is used to extend biomedical knowledge such as related terms, chemicals, diseases, and symptoms. Additionally, WordNet is used to add common-sense knowledge. In the summarization phase, several sentence scoring methods are proposed to make use of the extended keywords. Compared to the baseline, the improved model performs better, and it outperforms the current state-of-the-art comparison models with a ROUGE-2 F1 of 0.511. An application is built to create a question-answering summarization model
from the websites of five of the world's leading independent biotechnology companies in Japan.

Keywords: multi-answer summarization, extractive summarization, query-based summarization, ontology construction, ROUGE

Acknowledgements

I want to thank my supervisors, Assoc. Prof. Tran Trong Hieu and MSc. Can Duy Cat. They always had insightful comments both on my work and on this thesis. Their dedication has given me more motivation to complete the thesis in the best way. Furthermore, I am very thankful to Dr. Le Hoang Quynh and the members of the Data Science and Knowledge Technology Laboratory at the VNU University of Engineering and Technology. We had many discussion meetings, and their comments will help me improve myself and become more mature in the future. Finally, my deep thanks go to my family, relatives, and friends, who are always with me during the most challenging times and always encourage me in life and at work. Although I have tried my best to complete this report, it will undoubtedly contain minor errors; I sincerely welcome the understanding and guidance of the teachers and professors.

Declaration

I declare that the thesis has been composed by myself and that the work has not been submitted for any other degree or professional qualification. I confirm that the work submitted is my own, except where work which has formed part of jointly-authored publications has been included. My contribution and those of the other authors to this work have been explicitly indicated below. I confirm that appropriate credit has been given within this thesis where reference has been made to the work of others. I certify that, to the best of my knowledge, my thesis does not infringe upon anyone's copyright nor violate any proprietary rights and that any ideas, techniques, quotations, or any other
material from the work of other people included in my thesis, published or otherwise, are fully acknowledged in accordance with the standard referencing practices. I take full responsibility for these commitments and accept all prescribed disciplinary actions. I declare that this thesis has not been submitted for a higher degree to any other University or Institution.

Student
Nguyen Quoc An

Table of Contents

Abstract
Acknowledgements
Declaration
Table of Contents
List of Figures
List of Tables
1 Introduction
    1.1 Motivation
    1.2 Problem Statement
    1.3 Difficulties and Challenges
    1.4 Contributions of the thesis
2 Related work
    2.1 Summarization approach
    2.2 Ontology Construction Approach
3 Proposed model
    3.1 Summarization baseline model
        3.1.1 Pre-processing
        3.1.2 Single-answer extractive summarization
        3.1.3 Multi-answer extractive summarization
    3.2 Ontology Construction
        3.2.1 Motivation
        3.2.2 Overview of proposed ontology construction
        3.2.3 Biomedical databases
        3.2.4 Independent Ontology Construction
        3.2.5 Ontologies Integration
        3.2.6 Ontology Population
    3.3 Apply Ontology-based Improvements to Summarization model
        3.3.1 Baseline Model Improvements
        3.3.2 Question's Keyword Expanding
        3.3.3 Customised scoring methods
4 Experiments and Results
    4.1 Implementation and Configurations
    4.2 Dataset and Evaluation methods
        4.2.1 Metrics and Evaluation
    4.3 Experimental results
        4.3.1 Ontology Construction
        4.3.2 Summarization Experiments
        4.3.3 Errors Analysis
    4.4 Application on medical website
        4.4.1 System overview
        4.4.2 Application's result
Conclusions
List of Publications
References

List of Figures

1.1 The evolution of MEDLINE citations between 1986 and 2019
1.2 Typical tasks / competitions in the field of natural language processing for biomedical data
1.3 Classification of Text Summarization Approaches
1.4 Multi-Answer Summarization pipeline
2.1 Summarization approaches
3.1 Summarization baseline model
3.2 Overview of proposed ontology construction
3.3 CTD disease-chemical relations
3.4 Proposed summarization model overview
3.5 Ontology expanding method
3.6 WordNet expanding method
4.1 The statistic of nodes and terms in three independent ontologies
4.2 The statistic of nodes and terms in three integrated ontology
4.3 The reduction of ROUGE-2 F1 per each scoring method when replacing the proposed weighted score with the before version
4.4 Ablation test results for various components

List of Tables

1.1 The result summary example responses to a question in medical question and answer system (MEDIQA)
3.1 MeSH's topic category list
4.1 Configurations and parameters of proposed model
4.2 The statistics of extract summary in datasets
4.3 The statistic of relations and terms in ontology population
4.4 Comparison model's results of the MEDIQA 2021 Task - Extractive Summarization
4.5 Examples of some errors in test set
4.6 Five biotechnology companies' websites in Japan

Chapter 1
Introduction

This chapter presents the motivation and the urgency of the thesis topic in Section 1.1. The summarization problem and the query-based summarization problem are discussed in Section 1.2.

1.1 Motivation

Many experts and leaders have identified data as an invaluable asset in the era of information explosion. For example, Clive Humby, a British mathematician and entrepreneur in the field of data science, said, “Data is the new oil”. Indeed, exploiting data effectively brings great value. Biomedical text mining is a
topic of increasing interest in the research community. For example, the expansion of MEDLINE (the US National Library of Medicine's biomedical database) is depicted in Figure 1.1 [20]. It is one of the largest and best-known online biomedical databases in the world. From million in 1970 to 13.5 million in 2005, the number of citations doubled in 14 years to 26.2 million in 2019. However, in this age of information abundance and overload, the sheer volume of data has made it difficult for humans to absorb. In that context, automatic question-answering systems have been built. For example, a question-answering system can retrieve information about the treatment of common COVID-19 symptoms from reliable data, which allows users to handle infection situations more scientifically and easily.

Training and Testing Environment

The model can be installed on a personal computer. Components such as the sentence scoring model and the sentence embedding method are optimized with multi-threading. The configuration of the system is listed below:

• Operating System: macOS Monterey
• Chip: Apple M1 Pro
• CPU: cores
• GPU: cores
• Neural Engine: 16 cores
• RAM: 16 GB

Model settings: Table 4.1 lists the hyper-parameter settings. Depending on the dataset and the additional techniques, the number of hyper-parameters and their values may vary.

4.2 Dataset and Evaluation methods

Table 4.2 provides sentence-level statistics of the extractive summaries in the given datasets.

4.2.1 Metrics and Evaluation

In natural language processing,
the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score [11] is used to evaluate automatic summarization and machine translation systems. The metric compares the generated summary or translation to one or more human references. ROUGE-n precision (P) and recall (R) are shown in Formulas 4.1 and 4.2.

$$\text{ROUGE-}n\ P = \frac{|\text{Matched N-grams}|}{|\text{Predict summary N-grams}|} \tag{4.1}$$

$$\text{ROUGE-}n\ R = \frac{|\text{Matched N-grams}|}{|\text{Reference summary N-grams}|} \tag{4.2}$$

Additionally, instead of fixing n to a constant, ROUGE-L uses the Longest Common Subsequence (LCS) to evaluate the model's performance. ROUGE-L precision (P) and recall (R) are shown in Formulas 4.3 and 4.4.

$$\text{ROUGE-L}\ P = \frac{\text{Length of the LCS}}{|\text{Predict summary tokens}|} \tag{4.3}$$

$$\text{ROUGE-L}\ R = \frac{\text{Length of the LCS}}{|\text{Reference summary tokens}|} \tag{4.4}$$

The F1 score takes the harmonic mean of precision and recall to produce a single evaluation metric, as presented in Formula 4.5.

$$F1 = 2 \times \frac{P \times R}{P + R} \tag{4.5}$$

Table 4.1: Configurations and parameters of proposed model

Information | Configuration
Ontology Construction | Having Mondo = True; Having Symp = True; Having Gene = False; Having Chemical-disease = False
Ontology Extending | Spreading activation function: Initial weight = ; Decay = 0.3
WordNet Extending | Spreading activation function: Initial weight = ; Decay = 0.6
Sentence scoring: TF-IDF score | Ratio threshold = 0.7
Sentence scoring: LexRank score | Threshold = 0.065; TF-IDF for all answer = True
Sentence scoring: Keyword-based score | Word threshold = 0.8; Keyword threshold = 0.8
Final score | TF-IDF weight = ; Query-based score weight = ; wRWMD weight = ; LexRank weight = 2
Single-answer Summarization | Number of top sentences = ; Related radius =
Multi-answer Summarization | Compression ratio = 0.9

4.3 Experimental results

This section presents the results of the ontology construction process in Section 4.3.1 and of the question-answering summarization model in Section 4.3.2. In general, the number of nodes and terms in the integrated ontology increases after each process.

Table 4.2: The statistics of extract summary in datasets

Statistic | Training (Article) | Training (Section) | Validation | Test
Average: Answers per Question | 3.54 | 3.54 | 3.85 | 3.8
Average: Sentences per Answer | 84.93 | 29.07 | 14.50 | 13.03
Average: Sentences per Single-answer Summary | 6.31 | 6.31 | - | -
Average: Sentences per Multi-answer Summary | 10.30 | 10.30 | 11.06 | -
Compression ratio: Single-answer Summary | 0.12 | 0.49 | - | -
Compression ratio: Multi-answer Summary | 0.06 | 0.18 | 0.33 | -

4.3.1 Ontology Construction

This section presents the results of the ontology construction process. The results are reported after each step: Independent Ontology Construction, Ontologies Integration, and Ontology Population.

Independent Ontology Construction

The results of this step are independent ontologies whose numbers of nodes and terms are shown in Figure 4.1. MeSH has the largest number of nodes and terms, with 29,917 nodes and 252,059 terms in 16 topics. Mondo contains 24,409 nodes and 120,288 terms that focus on human diseases. SYMP has 944 nodes and 1179 terms about symptoms.

Ontologies Integration

Figure 4.2 shows the change in the number of nodes and terms before and after the integration process. The
number of nodes and terms in the disease topic increases dramatically after integrating the Mondo ontology. The number of nodes increased from 29,917 to 54,326 (to about 180% of the original count) and the number of terms increased from 252,059 to 372,347 (to about 150%).

Ontology population

Table 4.3 shows the number of chemical-disease relations and symptom-disease relations after the population process. In the gene enrichment process, 4602 nodes are populated with 48172 different genes.

Figure 4.1: The statistic of nodes and terms in three independent ontologies (bar chart of Nodes and Terms for MeSH, Mondo, and SYMP)

Figure 4.2: The statistic of nodes and terms in three integrated ontology

Table 4.3: The statistic of relations and terms in ontology population

Quantity | CTD database | Extracting term's definition
Chemical-disease relations | 67,456 | 24,726
Symptom-disease relations | - | 17,425
Genes terms | 48,176 | -

4.3.2 Summarization Experiments

In this section, the model's performance is presented and compared with other models on the MEDIQA 2021
dataset. After that, the contribution of each component is discussed.

Comparative models: Several published papers evaluated on the MEDIQA test dataset are used as comparative models. ROUGE-2 F1 is the main metric used to rank the participating teams in the competition. The other evaluation metrics are ROUGE-1 F1 and ROUGE-L F1.

• Fine-tuning RoBERTa model (Zhu et al.): The model achieved first place in the MEDIQA 2021 Task official results. They used an ensemble model including coarse ranking, a Markov chain, and a RoBERTa model fine-tuned on the MS MARCO task.
• BART model (Mrini et al.): The model ranked 7th in the official results. They cast the problem as an Answer Sentence Selection problem and trained a BART model.
• Fine-tuning Text-to-Text Transfer Transformer (T5) relevance-based re-ranking model (Yadav et al.): The model ranked 9th in the official results. The model is first based on fine-tuning T5 on the MS MARCO passage and the MEDIQA-QA 2019 datasets.

Final results and comparison: Table 4.4 shows the performance of the comparative models. Based on the validation set experiments, the number of significant sentences in the single-answer summarization phase is per answer. Any other sentence is also selected if at least two significant sentences occur among the three sentences surrounding it on the left and right. The number of sentences in the final output depends on the number of single-answer summaries and the number of sentences per single-answer summary. The ontology-based improvement model performs best when the integrated ontology is pruned: nodes about drugs, diseases, symptoms, and drug-disease relations are kept, while other related topics and related genes are removed.

The impact of each proposed component on the model results

Figure 4.3 shows the reduction of ROUGE-2 F1 for each scoring method when the proposed weighted score is replaced with the previous version. The keyword weights provided by the ontology improve the scoring methods. The keyword-based score and wRWMD gain significantly from using the ontology.
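
Because ROUGE-2 F1 is the main ranking metric in this comparison, a small sketch of how Formulas 4.1 to 4.5 can be computed may help in reading the result tables. This is only an illustration, not the official scorer used in the competition: lowercasing and whitespace tokenization are assumptions, the function names `rouge_n`, `rouge_l`, `lcs_length`, and `f1` are hypothetical, and steps such as stemming or sentence splitting are omitted.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(precision, recall):
    """Harmonic mean of precision and recall (Formula 4.5)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def rouge_n(predicted, reference, n=2):
    """ROUGE-n precision, recall, and F1 (Formulas 4.1, 4.2, 4.5)."""
    # Assumption: lowercase + whitespace tokenization.
    pred = ngrams(predicted.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    matched = sum((pred & ref).values())
    p = matched / sum(pred.values()) if pred else 0.0
    r = matched / sum(ref.values()) if ref else 0.0
    return p, r, f1(p, r)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(predicted, reference):
    """ROUGE-L precision, recall, and F1 (Formulas 4.3, 4.4, 4.5)."""
    pred = predicted.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(pred, ref)
    p = lcs / len(pred) if pred else 0.0
    r = lcs / len(ref) if ref else 0.0
    return p, r, f1(p, r)


if __name__ == "__main__":
    # Toy sentences, not taken from the MEDIQA data.
    predicted = "the cat sat on the mat"
    reference = "the cat lay on the mat"
    print("ROUGE-2 (P, R, F1):", rouge_n(predicted, reference, n=2))
    print("ROUGE-L (P, R, F1):", rouge_l(predicted, reference))
```

The intersection of two `Counter` objects clips each matched n-gram to the smaller of its two counts, so a repeated n-gram in the predicted summary is not credited more often than it appears in the reference.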

Table 4.4: Comparison model's results of the MEDIQA 2021 Task - Extractive Summarization

Model name | ROUGE-1 P | ROUGE-1 R | ROUGE-1 F1 | ROUGE-2 P | ROUGE-2 R | ROUGE-2 F1 | ROUGE-L F1
Fine-tuning RoBERTa | 0.475 | 0.878 | 0.585 | 0.407 | 0.767 | 0.508 | 0.435
BART | 0.616 | 0.672 | 0.607 | 0.473 | 0.531 | 0.472 | 0.429
Fine-tuning T5 | 0.420 | 0.899 | 0.547 | 0.358 | 0.774 | 0.468 | 0.328
Baseline model | 0.528 | 0.814 | 0.611 | 0.432 | 0.680 | 0.504 | 0.441
Ontology-based improvement model | 0.530 | 0.821 | 0.616 | 0.437 | 0.687 | 0.511 | 0.446

Figure 4.3: The reduction of ROUGE-2 F1 per each scoring method when replacing the proposed weighted score with the before version

Contribution of each component: The contribution of each component is assessed by ablating each of them in turn from the model and evaluating the resulting model on the test set; the results are shown in Figure 4.4. Question-driven scoring and frequency-based scoring, which are customised by the ontology improvements, play important roles in the model.

Figure 4.4: Ablation test results for various components

4.3.3 Errors Analysis

Several kinds of errors were identified by analyzing the results on the test set. Table 4.5 shows examples of these errors together with the underlying problems and their effects.

The integrated ontology does not yet cover all biomedical keywords. In Question #183, Bartholin Abscess is not in the ontology. As a result, the extending methods cannot add related keywords and relations.

Some other issues related to the integrated ontology are illustrated by Question #153. Symptom-disease relations are inferred from each term's definition. However, the definition of Waldenstrom macroglobulinemia does not mention its symptoms, which leads to a lack of related symptoms.

Short words or acronyms can cause faulty expansion in the expanding functions. For example, Question #38 has the keyword GBS, which is an acronym. As a result, the methods extend the wrong terms from this word.

In the test set, there are no questions related to gene-disease relationships. Thus, extending with the relevant genes did not increase the model's performance. In addition, genes have complex names that are often abbreviated, so they add noise to the model.

The model has a problem with answers that are too long. Question #119 is one example, with three answers of 88, 16, and 132 sentences respectively. Too many sentences containing important keywords can make it difficult for the model to select the important sentences.

The distribution of target sentences is also a problem. Question #125 has three answers with 5, 35, and 17 sentences respectively. However, all target sentences belong to the first answer.

Besides the problems related to the model
components, some problems relate to the input data, for which Question #5 is an example. The question is “Is there gene therapy for persistent cough?” while the chiQA answers do not mention this topic. Therefore, the model does not have enough linguistic information to summarize these documents.

Table 4.5: Examples of some errors in test set

# | Question | Problems | Effect
183 | How to stop getting Bartholin Abscess? | Missing related terms | Summary is on the wrong direction (low precision)
153 | What are the symptoms of Waldenstrom macroglobulinemia? | Missing related symptoms | Summary is on the wrong direction (low precision)
38 | How long does it take to recover from GBS? | Keyword expansion error | Missing output sentences (low precision)
119 | Can i take zoledronic acid infusion if my mother has scleroderma? | Long answer | Missing output sentences (low precision)
125 | What body parts are affected by systemic lupus erythematosus? | Distribution of target sentences | Missing output sentences (low precision)
5 | Is there gene therapy for persistent cough? | The problem in chiQA answers | Not enough information to summarize

4.4 Application on medical website

A project was built to create a question-answering summarization model from the websites of five of the world's leading independent biotechnology companies in Japan, which are shown in Table 4.6. I focus on two tasks: building the summarization model and ontology construction. Section 4.4.1 presents the application's components and Section 4.4.2 shows some examples of the application's results.

4.4.1 System overview

The application has three main components: the website crawler, ontology construction, and the question-answering summarization model.

Table 4.6: Five biotechnology companies' websites in Japan

Name | Domain
Amgen | amgen.com
Takeda | takeda.com, takedaoncology.com, takedavaccines.com
GlaxoSmithKline | gsk.com
Bristol Myers Squibb | bms.com

• Website crawler: The website data are crawled with a focused strategy based on human control and each website's sitemap.
• Ontology Construction: The integrated ontology is built for the keyword extending method. However, this ontology does not include the SYMP ontology for symptom terms or the disease-gene relations.
• Question-answering summarization model: The summarization model takes the user's query and the website data as input, where the user's query is a list of keywords. Because the website data is very large, a Filter-Cluster method is added in preprocessing. This method receives raw data and returns document clusters as paragraphs. It applies the K-means algorithm to TF-IDF embedding vectors. The documents and the query then go to the summarization model
to generate the relevant summary.

4.4.2 Application's result

For a query about COVID-19 and HIV, with query keywords such as covid19 vaccine sanofi, plantderived viruslike particles vaccine medicago, covid19, etc., two summaries are produced: one by the baseline and one by the ontology-based improvements.

• Baseline: “In addition, our vaccines for non-bacterial infections such as influenza, rotavirus and malaria prevent the development of diseases that can trigger the use of antibiotics – to treat secondary infections, for example. The principle objectives in phase I are to: make sure that the new medicine presents no major safety issues clarify that it can reach the targeted body area, and remain there long enough to deliver its benefits gain preliminary evidence that it could offer therapeutic value, or prevent the disease or condition. Phase II If phase I is successful, approval will be sought for a trial involving a larger group of people. She is a recognised health policy expert in the development of global and national programmes for improving healthcare access for all patient groups and in ensuring the voice of patients is reflected in research programmes. Brian holds authenticity, courage, and inclusion as fundamentals to great leadership and outside of work, enjoys spending time with his wife and daughters and is an exercise enthusiast; particularly running, skiing, mountain climbing and cycling.”

• Ontology-based improvements: “During this period, we developed the first treatment for HIV, enhanced our consumer brands
and developed new treatments for asthma. We are actively engaged with governments and other stakeholders now to strengthen global pandemic preparedness; drawing on learnings from COVID-19 and previous outbreaks, reaffirming the role of innovation and promoting sustainable approaches for the biopharmaceutical sector and for public health. Our HIV business is managed through ViiV Healthcare, a global specialist company in HIV that we majority own, with Pfizer and Shionogi as shareholders. Global health Our focus is on diseases of the developing world, such as malaria, tuberculosis, enteric and parasitic infections.”

The summary produced with the weighted ontology focuses directly on the research topics of the GSK website: HIV, immunity, and diseases of the developing world. It is also shorter than the baseline summary.

Conclusions

This thesis presents several studies aimed at the multi-answer summarization problem. Using external databases to expand the knowledge in a question is an approach that has gained attention from the research community. However, biomedical data sources usually focus on only one topic, such as human diseases, chemicals, or genes, while user questions are diverse. Therefore, the thesis also studies ontology integration and population methods to build an ontology for summarization purposes.

After a review of related studies, some ontology-based improvements are proposed for multi-answer summarization in a consumer health question answering system. There are two main tasks: Ontology construction and Building
extractive multi-answer summarization model. The proposed ontology construction process has three main steps: independent ontologies construction, ontology integration, and ontology population. The summarization model uses three scoring strategies to estimate sentence scores, for which customized methods are proposed based on TF-IDF, the keyword-based score, and the weighted-relaxed word mover's distance score.

As a result, the integrated ontology has 54,326 nodes with 420,523 biomedical terms, 92,182 chemical-disease relations, and 17,425 symptom-disease relations. The proposed model performs better than most other published papers in the MEDIQA 2021 contest, with a ROUGE-2 F1 of 0.511. In addition, it trains quickly, so it can be applied to large data sources. One related application is implemented on the websites of five of the world's leading independent biotechnology companies in Japan.

In the future, the proposed model can be extended in several ways: building the ontology for more biomedical topics, using text data sources to infer related terms and biomedical relations, and implementing a function that infers the number of selected sentences per answer. Finally, I hope to spend time experimenting with the biomedical ontology and the summarization model for the Vietnamese language, to meet the need of all Vietnamese people to learn about medical knowledge.

List of Publications

[Pub 1] Quoc-An Nguyen, Quoc-Hung Duong, Minh-Quang Nguyen, Huy-Son Nguyen, Hoang-Quynh Le, Duy-Cat Can, Tam
Doan Thanh, and Mai-Vu Tran. “A Hybrid Multi-answer Summarization Model for the Biomedical Question-Answering System.” In 2021 13th International Conference on Knowledge and Systems Engineering (KSE). IEEE, 2021.

[Pub 2] Duy-Cat Can, Quoc-An Nguyen, Quoc-Hung Duong, Minh-Quang Nguyen, Huy-Son Nguyen, Linh Nguyen Tran Ngoc, Quang-Thuy Ha, and Mai-Vu Tran. “UETrice at MEDIQA 2021: A Prosper-thy-neighbor Extractive Multi-document Summarization Model.” In Proceedings of the 20th SIGBioMed Workshop on Biomedical Language Processing, NAACL-BioNLP 2021. Association for Computational Linguistics, 2021.

[Pub 3] Hoang-Quynh Le, Quoc-An Nguyen, Quoc-Hung Duong, Minh-Quang Nguyen, Huy-Son Nguyen, Tam Doan Thanh, Hai-Yen Thi Vuong, and Trang M. Nguyen. “UETfishes at MEDIQA 2021: Standing-on-the-Shoulders-of-Giants Model for Abstractive Multi-answer Summarization.” In Proceedings of the 20th SIGBioMed Workshop on Biomedical Language Processing, NAACL-BioNLP 2021. Association for Computational Linguistics, 2021.

[Pub 4] Quoc-An Nguyen, Quoc-Hung Duong, Minh-Quang Nguyen, and Huy-Son Nguyen. “Nghiên cứu đề xuất giải pháp tóm tắt đa văn tự động cho câu trả lời hệ thống hỏi đáp y sinh học” [Research proposing an automatic multi-document summarization solution for the answers of a biomedical question-answering system]. In Hội thảo khoa học toàn cầu Hội sinh viên Việt Nam [Global Scientific Conference of the Vietnamese Students’ Association], 2021.