UNIVERSITY OF ENGINEERING AND TECHNOLOGY
Nguyen Quoc An
MULTI-ANSWER SUMMARIZATION IN CONSUMER HEALTH QUESTION ANSWERING SYSTEM
GRADUATION THESIS
Major: Computer Science
HA NOI - 2022
AN ONTOLOGY-BASED IMPROVEMENT FOR MULTI-ANSWER SUMMARIZATION IN CONSUMER HEALTH QUESTION ANSWERING SYSTEM
Supervisors: Assoc.Prof Tran Trong Hieu
MSc Can Duy Cat
Abstract

Automatic question answering (QA) systems help customers quickly resolve everyday questions. During the COVID-19 pandemic, healthcare has been one of the topics users care about most. In an era of information explosion, distilling helpful information from QA system responses takes time. Multi-answer summarization is studied to address this problem: the model takes the customer’s question and all answers as input and returns a summary. Summaries have been shown to aid information absorption.

This thesis focuses on the extractive summarization problem and presents ontology-based improvements to the baseline multi-answer summarization model in the consumer health question answering system, with two main sub-tasks: ontology construction and building an extractive multi-answer summarization model. The ontology construction task builds an ontology that is leveraged to extend biological knowledge such as related terms, chemicals, diseases, and symptoms. Additionally, WordNet is used to enhance common-sense knowledge. In the summarization phase, several sentence scoring methods are proposed to make use of the expanded keywords. Compared to the baseline, the improved model performs better by a large margin. As a result, the proposed model outperforms current state-of-the-art comparisons with a ROUGE-2 F1 of 0.511. An application is built to create a question-answering summarization model from the websites of five of the world’s leading independent biotechnology companies in Japan.

Keywords: multi-answer summarization, extractive summarization, query-based summarization, ontology construction, ROUGE
Acknowledgements

I want to thank my supervisors, Assoc. Prof. Tran Trong Hieu and MSc. Can Duy Cat. They always had insightful comments on both my work and this thesis. Their dedication gave me more motivation to complete the thesis in the best way.

Furthermore, I am very thankful to Dr. Le Hoang Quynh and the members of the Data Science and Knowledge Technology Laboratory at the VNU University of Engineering and Technology. We had many discussion meetings, and their comments will help me improve myself and become more mature in the future.

Finally, my deep thanks go to my family, relatives, and friends, who have always been with me during the most challenging times, always encouraging me in life and at work.

Although I have tried my best to complete this report, it will undoubtedly contain minor errors. I sincerely hope to receive the understanding and guidance of my teachers and professors.
Declaration

I declare that this thesis has been composed by myself and that the work has not been submitted for any other degree or professional qualification. I confirm that the work submitted is my own, except where work which has formed part of jointly-authored publications has been included. My contribution and those of the other authors to this work have been explicitly indicated below. I confirm that appropriate credit has been given within this thesis where reference has been made to the work of others.

I certify that, to the best of my knowledge, my thesis does not infringe upon anyone’s copyright nor violate any proprietary rights, and that any ideas, techniques, quotations, or any other material from the work of other people included in my thesis, published or otherwise, are fully acknowledged in accordance with the standard referencing practices.

I take full responsibility for my commitments and accept all prescribed disciplinary actions. I declare that this thesis has not been submitted for a higher degree to any other University or Institution.

Student
Nguyen Quoc An
Table of Contents
Abstract
Acknowledgements
Declaration
Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Problem Statement
1.3 Difficulties and Challenges
1.4 Contributions of the thesis
2 Related work
2.1 Summarization approach
2.2 Ontology Construction Approach
3 Proposed model
3.1 Summarization baseline model
3.1.1 Pre-processing
3.1.2 Single-answer extractive summarization
3.1.3 Multi-answer extractive summarization
3.2 Ontology Construction
3.2.1 Motivation
3.2.2 Overview of proposed ontology construction
3.2.3 Biomedical databases
3.2.4 Independence Ontology Construction
3.2.5 Ontologies Integration
3.2.6 Ontology Population
3.3 Apply Ontology-based Improvements to Summarization model
3.3.1 Baseline Model Improvements
3.3.2 Question’s Keyword Expanding
3.3.3 Customised scoring methods
4 Experiments and Results
4.1 Implementation and Configurations
4.2 Dataset and Evaluation methods
4.2.1 Metrics and Evaluation
4.3 Experimental results
4.3.1 Ontology Construction
4.3.2 Summarization Experiments
4.3.3 Errors Analysis
4.4 Application on medical website
4.4.1 System overview
4.4.2 Application’s result
Conclusions
List of Publications
References
List of Figures
1.1 The evolution of MEDLINE citations between 1986 and 2019
1.2 Typical tasks / competitions in the field of natural language processing for biomedical data
1.3 Classification of Text Summarization Approaches
1.4 Multi-Answer Summarization pipeline
2.1 Summarization approaches
3.1 Summarization baseline model
3.2 Overview of proposed ontology construction
3.3 CTD disease-chemical relations
3.4 Proposed summarization model overview
3.5 Ontology expanding method
3.6 WordNet expanding method
4.1 The statistic of nodes and terms in three independent ontologies
4.2 The statistic of nodes and terms in three integrated ontology
4.3 The reduction of ROUGE-2 F1 per each scoring method when replacing the proposed weighted score with the before version
4.4 Ablation test results for various components
List of Tables
1.1 The result summary example responses to a question in medical question and answer system (MEDIQA)
3.1 MeSH’s topic category list
4.1 Configurations and parameters of proposed model
4.2 The statistics of extract summary in datasets
4.3 The statistic of relations and terms in ontology population
4.4 Comparison model’s results of the MEDIQA 2021 Task 2 - Extractive Summarization
4.5 Examples of some errors in test set
4.6 Five biotechnology companies’ websites in Japan
Chapter 1
Introduction

This chapter presents the motivation and the urgency of the thesis topic in Section 1.1. The summarization problem and the query-based summarization problem are then discussed in Section 1.2.
1.1 Motivation

[...] in the world. From 1 million in 1970 to 13.5 million in 2005, the number doubled in 14 years to 26.2 million in 2019.
However, in this age of information abundance and overload, the overabundance of data has made it difficult for humans to absorb. In that context, automatic question-answering systems have been built. For example, a question-answering system can retrieve information about treatments for common symptoms of COVID-19 from reliable data, which allows users to handle infection situations more scientifically and easily.
1 the US National Library of Medicine’s biomedical database
Figure 1.1: The evolution of MEDLINE citations between 1986 and 2019.
The vertical axis represents the number of citations (in millions). For a clearer representation, the statistics from before 2005 are shown at 5-year intervals.
Nowadays, several automatic question answering systems about health have been built, such as Pubmed², CHiQA³, and Google⁴. Although the answers returned by these search engines have been selected, independent answers from different sources still overlap. For instance, for the question “How long have SARS-CoV-2 existed?”, Pubmed provides about 1000 long answers, and Google returns 5,070,000,000 responses⁵.

The idea is to use a summary engine to condense all the responses into a short paragraph. The summary answer gathers all of the necessary information and eliminates any duplicates. Therefore, users can read one paragraph instead of a massive number of documents. This thesis focuses on the summarization model in the health question-answering system. According to experts, question answering and summarization for biomedical text are the two most demanding tasks (Figure 1.2 [6]).
Figure 1.2: Typical tasks / competitions in the field of natural language processing for biomedical data.

Realizing the potential of biomedical summarization, a number of competitions have been launched in recent years to support research and development in this field. The BioNLP workshop series, which is co-hosted by the ACL SIGBIOMED specialized research community, has grown into an exceptional yearly event for researchers to present their research ideas in the field of natural language processing for biological and medical data (bioNLP). In 2021, the BioNLP workshop with the topic MEDIQA 2021: Summarization in the Medical Domain⁶ was held, consisting of three separate tasks. My team chose the Summarization of Multiple Answers task, which is similar to the summary engine in a question-answering system. Our team won second prize (in extractive summarization) and third prize (in abstractive summarization) in this contest. In addition, our team won second prize in the student scientific research competition at my university and has four papers on summarization.

After participating in this competition, the error analysis indicated that the model only focuses on the terms mentioned in the question. Meanwhile, related terms such as synonyms, related chemicals, and related diseases also have a certain degree of importance. This is the main reason for this thesis to continue the research on question-driven improvements. This thesis proposes some ontology-based improvements that yield a significant improvement over the previous model.
6 https://sites.google.com/view/mediqa2021
1.2 Problem Statement

Text summarization aims to select or generate important information from the original text(s) to create a short version [7]. Humans often read all documents to develop an understanding and then write a summary highlighting the main points. Because machines lack human experience and understanding, generating a text summary is exceedingly difficult and time-consuming for them.
Based on the different characteristics of the summary paragraphs, text summarization can be classified in many different ways, as shown in Figure 1.3 [3].

[Figure: Text Summarization Approach, classified as Based on Input Document, Based on Summary Usage, Based on Techniques, and Based on Characteristics]
Figure 1.3: Classification of Text Summarization Approaches
• According to the input document(s): Single-document summarization and Multi-document summarization. The difference is that single-document summarization focuses on a single text, while multi-document summarization uses multiple documents as input.
• According to the summary usage: Generic and Query-based. The generic approach does not focus on a specific topic or aspect; it gives an overview of the sources. In the query-based approach, the result focuses on the user’s question.
• According to techniques: Supervised and Unsupervised. Unsupervised approaches are based on algorithms that do not depend on human support, such as labelling training datasets. These models are suitable for big data, such as website data. Supervised learning methods are based on a sentence-level classification approach in which the model learns to distinguish between summary and non-summary sentences.
• According to output characteristics: Extractive summarization and Abstractive summarization. The extractive method extracts the most crucial sentences from the documents and forms the summary by combining them. As a result, every sentence in the summary comes from the original documents. The abstractive approach, in contrast, tries to rewrite the summary based on the original sentences.
Formal definition. According to the Multi-Answer Summarization task requirements⁷, different answers can bring complementary perspectives that are likely to benefit the users of QA systems. The purpose of this task is to build a multi-answer summarization model that can summarize the numerous relevant replies to a medical question. The input to the model is the customer’s question $Q$ and all answers $A = \{A_1, A_2, \ldots, A_n\}$. The output is a summary that answers the given question (Figure 1.4). Table 1.1 shows an example of a resulting summary.
[Figure: User's question; Multiple related answers; Summarization]
Figure 1.4: Multi-Answer Summarization pipeline
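
The input/output contract of the task can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `MultiAnswerInput` and `lead_sentence_summary`, the naive sentence splitter, and the lead-sentence strategy are hypothetical placeholders and are not the model proposed in this thesis.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class MultiAnswerInput:
    """One task instance: a question Q and its answers A = {A_1, ..., A_n}."""
    question: str
    answers: List[str]


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def lead_sentence_summary(item: MultiAnswerInput) -> str:
    """Placeholder summarizer: keep the first sentence of every answer,
    skipping exact duplicates, and join the result into one paragraph."""
    seen = set()
    kept = []
    for answer in item.answers:
        sentences = split_sentences(answer)
        if sentences and sentences[0] not in seen:
            seen.add(sentences[0])
            kept.append(sentences[0])
    return " ".join(kept)
```

Any real summarizer for this task would expose the same shape: one question and a list of answers in, one short paragraph out.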
Thesis scope. In this work, the model focuses on the query-based multi-document extractive summarization approach. In terms of the classification above, the model has four properties: Multi-document, Query-based, Unsupervised, and Extractive. The extractive approach has many advantages, such as (i) quick summarization time, (ii) low cost of hardware resources, and (iii) easy management of summary quality. Besides, compressing multiple replies into a single answer saves time and effort for users. The paragraph is summarised based on the user’s question, which makes it highly applicable.
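
To make the four properties concrete, the following minimal sketch ranks the sentences of all answers by how many question keywords they contain and returns the best ones unchanged. It is an assumption-laden illustration rather than the scoring used in this thesis: the keyword-overlap score, the tiny stop-word list, the naive sentence splitter, and the function names are all hypothetical.

```python
import re
from typing import List, Set

# Tiny illustrative stop-word list (assumption); a real system would use a fuller one.
STOP_WORDS: Set[str] = {"a", "an", "the", "is", "are", "of", "to", "in",
                        "and", "or", "what", "how"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphanumeric tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(question: str, answers: List[str], max_sentences: int = 2) -> List[str]:
    """Query-based, multi-document, unsupervised, extractive selection.

    Sentences from all answers are pooled (duplicates dropped), scored by the
    fraction of question keywords they contain, and the top ones are returned
    verbatim in their original order.
    """
    keywords = set(tokenize(question))
    candidates: List[str] = []
    seen = set()
    for answer in answers:
        for sentence in split_sentences(answer):
            if sentence not in seen:
                seen.add(sentence)
                candidates.append(sentence)

    def score(sentence: str) -> float:
        if not keywords:
            return 0.0
        return len(keywords & set(tokenize(sentence))) / len(keywords)

    # Sort by descending score; ties keep the earlier sentence first.
    ranked = sorted(range(len(candidates)), key=lambda i: (-score(candidates[i]), i))
    chosen = sorted(ranked[:max_sentences])
    return [candidates[i] for i in chosen]
```

Because the selected sentences are copied verbatim, the output satisfies the extractive property by construction, and no labelled training data is needed.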
7 https://www.aicrowd.com/challenges/mediqa-2021/problems/mediqa-2021-multi-answer-summarization-mas
Table 1.1: The result summary example responses to a question in medical question and answer system (MEDIQA).

[...] to expose the spine. On your side, if you are having surgery on your lower back. The surgeon will use tools called retractors to gently separate, hold the soft tissues and blood vessels apart, and have room to work. A synthetic bone substitute is used. With a cut on the front of the neck, toward the side. The surgeon will use a graft (such as bone) to hold (or fuse) the bones together permanently. There are several ways of fusing vertebrae together. Strips of bone graft material may be placed over the back part of the spine. Bone graft material may be placed between the vertebrae. Special cages may be placed between the vertebrae. These cages are packed with bone graft material. The surgeon may get the bone graft from different places. From another part of your body (usually around your pelvic bone). This is called an autograft. Your surgeon will make a small cut over your hip and remove some bone from the back of the rim of the pelvis. From a bone bank. This is called an allograft. A synthetic bone substitute can also be used. The vertebrae may also fixed together with rods, screws, plates, or cages. They are used to keep the vertebrae from moving until the bone grafts are fully healed. Surgery can take 3 to 4 hours.

A bone graft can be taken from the person’s own healthy bone (this is called an autograft). Or, it can be taken from frozen, donated bone (allograft). In some cases, a manmade (synthetic) bone substitute is used. You will be asleep and feel no pain (general anesthesia). During surgery, the surgeon makes a cut over the bone defect. The bone graft can be taken from areas close to the bone defect or more commonly from the pelvis. The bone graft is shaped and inserted into and around the area. The bone graft can be held in place with pins, plates, or screws.

Extractive summary: A bone graft can be taken from the person’s own healthy bone (this is called an autograft). Or, it can be taken from frozen, donated bone (allograft). In some cases, a manmade (synthetic) bone substitute is used. The vertebrae may also fixed together with rods, screws, plates, or cages. They are used to keep the vertebrae from moving until the bone grafts are fully healed.