UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Nguyen Quoc An

MULTI-ANSWER SUMMARIZATION IN CONSUMER HEALTH QUESTION ANSWERING SYSTEM

GRADUATION THESIS Major: Computer Science

HA NOI - 2022

AN ONTOLOGY-BASED IMPROVEMENT FOR MULTI-ANSWER SUMMARIZATION IN CONSUMER HEALTH QUESTION ANSWERING SYSTEM

Supervisors: Assoc.Prof Tran Trong Hieu

MSc Can Duy Cat

Abstract

Automatic question answering (QA) systems help customers quickly address daily questions. During the COVID-19 pandemic, healthcare was one of the topics that users cared about most. In the era of information explosion, distilling helpful information from the responses of a QA system takes time. Multi-answer summarization is studied to address this problem. A model for this task takes the customer's question and all answers as input and returns a summary. Such a summary has been shown to aid information absorption.

This thesis focuses on the extractive summarization problem and presents several ontology-based improvements to the baseline multi-answer summarization model in a consumer health question answering system, with two main sub-tasks: ontology construction and building an extractive multi-answer summarization model. The ontology construction task focuses on building an ontology that is leveraged to extend biological knowledge such as related terms, chemicals, diseases, and symptoms. Additionally, WordNet is used to enhance common-sense knowledge. In the summarization phase, several sentence scoring methods are proposed that use the extended keywords. The improved model outperforms the baseline by a large margin and outperforms current state-of-the-art systems, reaching 0.511 ROUGE-2 F1. An application is also built to create a question-answering summarization model from the websites of five of the world's leading independent biotechnology companies in Japan.

Keywords: multi-answer summarization, extractive summarization, query-based summarization, ontology construction, ROUGE

Acknowledgements

I want to thank my supervisors, Assoc.Prof Tran Trong Hieu and MSc Can Duy Cat. They always had insightful comments both on my work and on this thesis. Their dedication has given me more motivation to complete the thesis in the best way.

Furthermore, I am very thankful to Dr Le Hoang Quynh and the members of the Data Science and Knowledge Technology Laboratory at the VNU University of Engineering and Technology. We had many discussion meetings, and their comments will help me improve myself and become more mature in the future.

Finally, my deep thanks go to my family, relatives, and friends, who are always with me during the most challenging times and always encourage me in life and at work.

Although I have done my best to complete this thesis, it undoubtedly contains minor errors; I sincerely hope for the teachers' and professors' understanding and guidance.

Declaration

I declare that the thesis has been composed by myself and that the work has not been submitted for any other degree or professional qualification. I confirm that the work submitted is my own, except where work which has formed part of jointly-authored publications has been included. My contribution and those of the other authors to this work have been explicitly indicated below. I confirm that appropriate credit has been given within this thesis where reference has been made to the work of others.

I certify that, to the best of my knowledge, my thesis does not infringe upon anyone's copyright nor violate any proprietary rights, and that any ideas, techniques, quotations, or any other material from the work of other people included in my thesis, published or otherwise, are fully acknowledged in accordance with the standard referencing practices.

I take full responsibility and accept all prescribed disciplinary actions for my commitments. I declare that this thesis has not been submitted for a higher degree to any other University or Institution.

Student

Nguyen Quoc An

Table of Contents

Abstract
Acknowledgements
Declaration
Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Problem Statement
1.3 Difficulties and Challenges
1.4 Contributions of the thesis
2 Related work
2.1 Summarization approach
2.2 Ontology Construction Approach
3 Proposed model
3.1 Summarization baseline model
3.1.1 Pre-processing
3.1.2 Single-answer extractive summarization
3.1.3 Multi-answer extractive summarization
3.2 Ontology Construction
3.2.1 Motivation
3.2.2 Overview of proposed ontology construction
3.2.3 Biomedical databases
3.2.4 Independence Ontology Construction
3.2.5 Ontologies Integration
3.2.6 Ontology Population
3.3 Apply Ontology-based Improvements to Summarization model
3.3.1 Baseline Model Improvements
3.3.2 Question’s Keyword Expanding
3.3.3 Customised scoring methods
4 Experiments and Results
4.1 Implementation and Configurations
4.2 Dataset and Evaluation methods
4.2.1 Metrics and Evaluation
4.3 Experimental results
4.3.1 Ontology Construction
4.3.2 Summarization Experiments
4.3.3 Errors Analysis
4.4 Application on medical website
4.4.1 System overview
4.4.2 Application’s result
Conclusions
List of Publications
References

List of Figures

1.1 The evolution of MEDLINE citations between 1986 and 2019
1.2 Typical tasks / competitions in the field of natural language processing for biomedical data
1.3 Classification of Text Summarization Approaches
1.4 Multi-Answer Summarization pipeline
2.1 Summarization approaches
3.1 Summarization baseline model
3.2 Overview of proposed ontology construction
3.3 CTD disease-chemical relations
3.4 Proposed summarization model overview
3.5 Ontology expanding method
3.6 WordNet expanding method
4.1 The statistic of nodes and terms in three independent ontologies
4.2 The statistic of nodes and terms in three integrated ontology
4.3 The reduction of ROUGE-2 F1 for each scoring method when replacing the proposed weighted score with the previous version
4.4 Ablation test results for various components

List of Tables

1.1 An example summary of the responses to a question in a medical question answering system (MEDIQA)
3.1 MeSH’s topic category list
4.1 Configurations and parameters of proposed model
4.2 The statistics of extract summary in datasets
4.3 The statistic of relations and terms in ontology population
4.4 Comparison model’s results of the MEDIQA 2021 Task 2 - Extractive Summarization
4.5 Examples of some errors in test set
4.6 Five biotechnology companies’ websites in Japan

Chapter 1

Introduction

This chapter presents the motivation and the urgency of the thesis topic in section 1.1. The summarization problem and the query-based summarization problem are then discussed in section 1.2.

1.1 Motivation

... in the world. From 1 million in 1970 to 13.5 million in 2005, the number doubled in 14 years to 26.2 million in 2019.

However, in this age of information abundance and overload, the overabundance of data has made it difficult for humans to absorb. In that context, automatic question-answering systems have been built. For example, a question-answering system can retrieve information about treatments for common symptoms of COVID-19 from reliable data, which allows users to handle infection situations more scientifically and easily.

1 the US National Library of Medicine’s biomedical database

Figure 1.1: The evolution of MEDLINE citations between 1986 and 2019. The vertical axis represents the number of citations (in millions). For a clearer representation, the statistics from before 2005 are shown at five-year intervals.

Nowadays, several automatic question answering systems about health have been built, such as PubMed, CHiQA, and Google. Although the answers returned by the search engines have been selected, independent answers from different sources still overlap. For instance, for the question “How long have SARS-CoV-2 existed?”, PubMed provides about 1000 long answers, and Google returns 5,070,000,000 responses.

The idea is to use a summary engine to summarize all the responses into a short paragraph. The summary answer gathers all of the necessary information and eliminates any duplicates. Therefore, users can read one paragraph instead of a massive number of documents. This thesis focuses on the summarization model in the health question-answering system. However, question answering and summarization are the two most demanding tasks for biomedical text, according to experts (Figure 1.2 [6]).

Realizing the potential of biomedical summarization, a number of competitions have been launched in recent years to support research and development in this field. The BioNLP workshop series, which is co-hosted by the ACL SIGBIOMED specialized research community, has grown into an exceptional yearly event for researchers to present their research ideas in the field of natural language processing for biological and medical data (bioNLP). In 2021, the BioNLP workshop was held with the topic MEDIQA 2021: Summarization in the Medical Domain⁶, consisting of three separate tasks. The Multi-Answer Summarization task, which is similar to the summary engine in a question-answering system, was chosen by my team. Our team won second prize (in extractive summarization) and third prize (in abstractive summarization) in this contest. Besides, our team won second prize in the student scientific research competition at my university and has four papers on summarization.

Figure 1.2: Typical tasks / competitions in the field of natural language processing for biomedical data

After participating in this competition, the error analysis indicated that the model only focuses on the terms mentioned in the question. Meanwhile, related terms such as synonyms, related chemicals, and related diseases also have a certain degree of importance. This is the main reason for this thesis to continue the research on question-driven improvements. This thesis proposes several ontology-based improvements that yield a significant gain over the previous model.
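The limitation described above can be illustrated with a minimal sketch. It is not the thesis model: the table `RELATED_TERMS` is a hypothetical, hand-written stand-in for an ontology lookup, and the helper names `expand_keywords` and `overlap` are assumptions introduced only for this example. It shows how a sentence that shares no word with the question can still be matched once the question keywords are expanded with related terms.

```python
from typing import Dict, List, Set

# Hypothetical related-term table standing in for an ontology lookup.
# The entries are illustrative only and are not taken from the thesis.
RELATED_TERMS: Dict[str, List[str]] = {
    "flu": ["influenza", "fever", "oseltamivir"],
    "graft": ["autograft", "allograft"],
}


def expand_keywords(keywords: Set[str], related: Dict[str, List[str]]) -> Set[str]:
    """Return the question keywords plus every related term found in the table."""
    expanded = set(keywords)
    for word in keywords:
        expanded.update(related.get(word, []))
    return expanded


def overlap(sentence: str, keywords: Set[str]) -> int:
    """Count how many distinct keywords occur in the sentence (case-insensitive)."""
    tokens = {tok.strip(".,;:()").lower() for tok in sentence.split()}
    return len(tokens & keywords)


question_keywords = {"flu", "treatment"}
sentence = "Oseltamivir is commonly prescribed for influenza."
expanded_keywords = expand_keywords(question_keywords, RELATED_TERMS)

# Without expansion the sentence matches nothing; with expansion it matches
# the two related terms "oseltamivir" and "influenza".
print(overlap(sentence, question_keywords), overlap(sentence, expanded_keywords))
```

In a real system the lookup table would be replaced by queries against a constructed ontology or a lexical resource, and the raw overlap count by a weighted sentence score.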

6 https://sites.google.com/view/mediqa2021

1.2 Problem Statement

Text summarization aims to select or generate important information from the original text(s) to create a short version [7]. Humans often read all documents to develop an understanding and then write a summary highlighting the main points. Because machines lack human experience and understanding, generating a text summary is exceedingly tough, time-consuming, and laborious for them.

Based on the different characteristics of the summary paragraphs, text summarization can be classified in many different ways, as shown in Figure 1.3 [3].

[Figure 1.3 diagram: Text Summarization Approach, branching into Based on Input Document, Based on Summary Usage, Based on Techniques, and Based on Characteristics]

Figure 1.3: Classification of Text Summarization Approaches

• According to the input document(s): Single-document summarization and Multi-document summarization. The difference is that single-document summarization focuses on a single text, while multi-document summarization uses multiple documents as input.

• According to the summary usage: Generic and Query-based. The generic approach does not focus on a specific topic or aspect and gives an overview of the sources, while in the query-based approach the result is focused on the user's question.

• According to techniques: Supervised and Unsupervised. Unsupervised approaches are based on algorithms that do not depend on human support, such as labelling training datasets. These models are suitable for big data, such as website data. Supervised learning methods are based on a sentence-level classification approach in which the model learns to distinguish between summary and non-summary sentences.

• According to output characteristics: Extractive summarization and Abstractive summarization. The extractive method extracts the most crucial sentences from the documents, and the summary is then made by combining all of these sentences. As a result, every sentence in the summary belongs to the original document. In contrast, the abstractive approach tries to recreate the summary based on the original sentences.

Formal definition. According to the Multi-Answer Summarization task requirements⁷, different answers can bring complementary perspectives that are likely to benefit the users of QA systems. The purpose of this task is to develop a multi-answer summarization model that can summarize the numerous relevant replies to a medical question. The input to the model is the customer’s question $Q$ and all answers $A = \{A_1, A_2, \ldots, A_n\}$. The output is a summary that answers the given question (Figure 1.4). Table 1.1 shows an example of a result summary.

[Figure 1.4 diagram: User's question; Multiple related answers; Summarization]

Figure 1.4: Multi-Answer Summarization pipeline
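The input and output of the task can be made concrete with a minimal sketch of a query-based, unsupervised, extractive summarizer. This is an illustration under simplifying assumptions, not the baseline or proposed model of the thesis: the function names `split_sentences`, `tokenize`, and `summarize`, the naive sentence splitter, and the plain word-overlap score (without stop-word removal) are all choices made for this example only.

```python
import re
from typing import List, Set


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text: str) -> Set[str]:
    """Lowercased alphabetic tokens; a stand-in for real pre-processing."""
    return set(re.findall(r"[a-z]+", text.lower()))


def summarize(question: str, answers: List[str], max_sentences: int = 2) -> str:
    """Select the sentences that share the most words with the question.

    Sentences repeated across answers are kept only once, and the selected
    sentences are returned in their original reading order.
    """
    query = tokenize(question)
    seen: Set[str] = set()
    candidates: List[str] = []
    for answer in answers:
        for sentence in split_sentences(answer):
            key = sentence.lower()
            if key not in seen:
                seen.add(key)
                candidates.append(sentence)
    # Rank by overlap with the question; ties keep earlier sentences first.
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: (-len(tokenize(candidates[i]) & query), i),
    )
    chosen = sorted(ranked[:max_sentences])
    return " ".join(candidates[i] for i in chosen)


question = "How is a bone graft taken?"
answers = [
    "A bone graft can be taken from the person's own healthy bone. "
    "You will be asleep and feel no pain.",
    "A bone graft can be taken from the person's own healthy bone. "
    "The bone graft is shaped and inserted into the area.",
]
summary = summarize(question, answers)
print(summary)
```

Because the summary is assembled only from sentences of the input answers, every output sentence can be traced back to a source answer, which is the defining property of the extractive setting.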

Thesis scope. In this work, the model focuses on the Query-based Multi-document Extractive summarization approach. According to the classification above, the model has four properties: Multi-document, Query-based, Unsupervised, and Extractive. The extractive approach has many advantages, such as (i) quick summarization time, (ii) low cost of hardware resources, and (iii) easily managed summary quality. Besides, compressing multiple replies into a single answer saves time and effort for users. The paragraph is summarised based on the user’s question, which makes it highly applicable.

7 https://www.aicrowd.com/challenges/mediqa-2021/problems/mediqa-2021-multi-answer-summarization-mas

Table 1.1: An example summary of the responses to a question in a medical question answering system (MEDIQA).

... to expose the spine. On your side, if you are having surgery on your lower back. The surgeon will use tools called retractors to gently separate, hold the soft tissues and blood vessels apart, and have room to work. A synthetic bone substitute is used. With a cut on the front of the neck, toward the side. The surgeon will use a graft (such as bone) to hold (or fuse) the bones together permanently. There are several ways of fusing vertebrae together. Strips of bone graft material may be placed over the back part of the spine. Bone graft material may be placed between the vertebrae. Special cages may be placed between the vertebrae. These cages are packed with bone graft material. The surgeon may get the bone graft from different places. From another part of your body (usually around your pelvic bone). This is called an autograft. Your surgeon will make a small cut over your hip and remove some bone from the back of the rim of the pelvis. From a bone bank. This is called an allograft. A synthetic bone substitute can also be used. The vertebrae may also be fixed together with rods, screws, plates, or cages. They are used to keep the vertebrae from moving until the bone grafts are fully healed. Surgery can take 3 to 4 hours.

A bone graft can be taken from the person’s own healthy bone (this is called an autograft). Or, it can be taken from frozen, donated bone (allograft). In some cases, a manmade (synthetic) bone substitute is used. You will be asleep and feel no pain (general anesthesia). During surgery, the surgeon makes a cut over the bone defect. The bone graft can be taken from areas close to the bone defect or more commonly from the pelvis. The bone graft is shaped and inserted into and around the area. The bone graft can be held in place with pins, plates, or screws.

Extractive summary

A bone graft can be taken from the person’s own healthy bone (this is called an autograft). Or, it can be taken from frozen, donated bone (allograft). In some cases, a manmade (synthetic) bone substitute is used. The vertebrae may also be fixed together with rods, screws, plates, or cages. They are used to keep the vertebrae from moving until the bone grafts are fully healed.

Title	An Ontology-Based Improvement For Multi-Answer Summarization In Consumer Health Question Answering System
Author	Nguyen Quoc An
Supervisors	Assoc.Prof. Tran Trong Hieu, MSc. Can Duy Cat
University	Vietnam National University, Hanoi University of Engineering and Technology
Major	Computer Science
Document type	Graduation Thesis
Year of publication	2022
City	Hanoi

Format
Number of pages	55
File size	3.59 MB