AN ONTOLOGY-BASED IMPROVEMENT FOR MULTI-ANSWER SUMMARIZATION IN CONSUMER HEALTH QUESTION ANSWERING SYSTEM

UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Major: Computer Science

Supervisors: Assoc. Prof. Tran Trong Hieu, MSc. Can Duy Cat

HA NOI - 2022

Abstract

Automatic question answering (QA) systems help customers quickly resolve everyday questions. During the COVID-19 pandemic, healthcare has been one of the topics users care about most. In the era of information explosion, distilling helpful information from the responses of a QA system takes time. The multi-answer summarization problem is studied to address this issue. A model for this task takes the customer's question and all answers as input, then returns a summary. Summaries have been shown to aid information absorption.

This thesis focuses on the extractive summarization problem and presents several ontology-based improvements to the baseline multi-answer summarization model in the consumer health question answering system, with two main sub-tasks: ontology construction and building an extractive multi-answer summarization model. The ontology construction task focuses on building an ontology, which is leveraged to extend biological knowledge such as related terms, chemicals, diseases, and symptoms. Additionally, WordNet is used to enhance common-sense knowledge. In the summarization phase, several sentence scoring methods are proposed that make use of the expanded keywords. Compared to the baseline, the improved model performs better by a large margin. As a result, the proposed model outperforms current state-of-the-art systems, reaching a ROUGE-2 F1 of 0.511. An application is also built that applies the question-answering summarization model to the websites of five of the world's leading independent biotechnology companies in Japan.

Keywords: multi-answer summarization, extractive summarization, query-based summarization, ontology construction, ROUGE.

Acknowledgements

I want to thank my supervisors, Assoc. Prof. Tran Trong Hieu and MSc. Can Duy Cat. They always had insightful comments both on my work and on this thesis. Their dedication has given me more motivation to complete the thesis in the best way.

Furthermore, I am very thankful to Dr. Le Hoang Quynh and the members of the Data Science and Knowledge Technology Laboratory at the VNU University of Engineering and Technology. We had many discussion meetings, and their comments will help me improve myself and become more mature in the future.

Finally, my deep thanks go to my family, relatives, and friends, who are always with me during the most challenging times and always encourage me in life and at work.

Although I have tried my best to complete this thesis, it undoubtedly contains minor errors. I sincerely welcome the understanding and guidance of my teachers and professors.

Declaration

I declare that the thesis has been composed by myself and that the work has not been submitted for any other degree or professional qualification. I confirm that the work submitted is my own, except where work which has formed part of jointly-authored publications has been included. My contribution and those of the other authors to this work have been explicitly indicated below. I confirm that appropriate credit has been given within this thesis where reference has been made to the work of others.

I certify that, to the best of my knowledge, my thesis does not infringe upon anyone's copyright nor violate any proprietary rights, and that any ideas, techniques, quotations, or any other material from the work of other people included in my thesis, published or otherwise, are fully acknowledged in accordance with the standard referencing practices.

I take full responsibility for my commitments and accept all prescribed disciplinary actions. I declare that this thesis has not been submitted for a higher degree to any other University or Institution.

Nguyen Quoc An

Table of Contents

1.3 Difficulties and Challenges

1.4 Contributions of the thesis

3.1.2 Single-answer extractive summarization

3.1.3 Multi-answer extractive summarization

3.2.4 Independence Ontology Construction

3.2.5 Ontologies Integration

3.2.6 Ontology Population

3.3 Apply Ontology-based Improvements to Summarization model

3.3.1 Baseline Model Improvements

3.3.2 Question's Keyword Expanding

3.3.3 Customised scoring methods

4 Experiments and Results

4.1 Implementation and Configurations

4.2 Dataset and Evaluation methods

4.2.1 Metrics and Evaluation

List of Figures

1.1 The evolution of MEDLINE citations between 1986 and 2019

1.2 Typical tasks / competitions in the field of natural language processing for biomedical data

1.3 Classification of Text Summarization Approaches

1.4 Multi-Answer Summarization pipeline

2.1 Summarization approaches

3.1 Summarization baseline model

3.2 Overview of the proposed ontology construction

3.3 CTD disease-chemical relations

3.4 Proposed summarization model overview

3.5 Ontology expanding method

3.6 WordNet expanding method

4.1 Statistics of nodes and terms in the three independent ontologies

4.2 Statistics of nodes and terms in the three integrated ontologies

4.3 The reduction in ROUGE-2 F1 for each scoring method when the proposed weighted score is replaced with the previous version

4.4 Ablation test results for various components

List of Tables

1.1 An example summary of responses to a question in the medical question answering system (MEDIQA)

3.1 MeSH's topic category list

4.1 Configurations and parameters of the proposed model

4.2 Statistics of the extractive summaries in the datasets

4.3 Statistics of relations and terms in ontology population

4.4 Comparison of model results on the MEDIQA 2021 Task 2 - Extractive Summarization

4.5 Examples of some errors in the test set

4.6 Five biotechnology companies' websites in Japan

Chapter 1 Introduction

This chapter presents the motivation and the urgency of the thesis topic in section 1.1. The summarization problem and the query-based summarization problem are discussed in section 1.2.

1.1 Motivation

Many experts and leaders have identified data as an invaluable asset in the era of information explosion. For example, Clive Humby, a British mathematician and entrepreneur in the field of data science, said "Data is the new oil". Indeed, exploiting data effectively will bring great value. Biomedical text mining is a topic of increasing interest in the research community. For example, the expansion of MEDLINE (the US National Library of Medicine's biomedical database) is depicted in Figure 1.1 [20]. It is one of the largest and most well-known biomedical online databases in the world. The number of citations grew from 1 million in 1970 to 13.5 million in 2005, then nearly doubled in 14 years to 26.2 million in 2019.

However, in this age of information abundance and overload, the sheer volume of data has made it difficult for humans to absorb. In that context, automatic question answering systems have been built. For example, a question answering system can retrieve information about treatments for common symptoms of COVID-19 from reliable data, which allows users to handle infection situations more scientifically and easily.

Figure 1.1: The evolution of MEDLINE citations between 1986 and 2019.

The vertical axis represents the number of citations (in millions). For a clearer representation, the statistics before 2005 are reported at 5-year intervals.

Nowadays, several automatic question answering systems about health have been built, such as PubMed, CHiQA, and Google. Although the answers returned by these search engines have been selected, independent answers from different sources still overlap. For instance, for the question "How long has SARS-CoV-2 existed?", PubMed provides about 1000 long answers, and Google returns 5,070,000,000 responses.

The idea is to use a summary engine to condense all the responses into a short paragraph. The summary answer gathers all of the necessary information and eliminates any duplicates. Therefore, users can read one paragraph instead of a massive number of documents. This thesis focuses on the summarization model in the health question answering system. However, question answering and summarization for biomedical text are the two most demanding tasks (according to experts; see Figure 1.2 [6]).

Realizing the potential of biomedical summarization, a number of competitions have been launched in recent years to support research and development in this field. The BioNLP workshop series, which is co-hosted by the ACL SIGBIOMED specialized research community, has grown into an exceptional yearly event for researchers to present their research ideas in the field of natural language processing for biological and medical data (bioNLP). In 2021, the BioNLP workshop with the topic MEDIQA 2021: Summarization in the Medical Domain was held, consisting of three separate tasks. My team chose the Multi-Answer Summarization task, which is similar to the summary engine in a question answering system. Our team won second prize (in extractive summarization) and third prize (in abstractive summarization) in this contest. Besides, our team won second prize in the student scientific research competition at my university and has four papers on summarization.

Figure 1.2: Typical tasks / competitions in the field of natural language processing for biomedical data

After participating in this competition, the error analysis indicated that the model only focuses on the terms mentioned in the question. Meanwhile, related terms such as synonyms, related chemicals, and related diseases also carry a certain degree of importance. This is the main reason for this thesis to continue the research on question-driven improvements. This thesis proposes several ontology-based improvements that yield a significant gain over the previous model.
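
The gap described above, where only the literal question terms are matched, can be illustrated with a minimal sketch of keyword expansion. The `RELATED_TERMS` table, its toy entries, the stopword list, and both function names are hypothetical; the table merely stands in for the ontology and WordNet lookups that the thesis develops later, and this is not the thesis implementation.

```python
from __future__ import annotations

import re

# Hypothetical lookup table standing in for an ontology / WordNet query.
# The entries are toy examples, not data from the thesis.
RELATED_TERMS: dict[str, set[str]] = {
    "covid-19": {"sars-cov-2", "coronavirus"},
    "fever": {"pyrexia", "high temperature"},
}

# Small illustrative stopword list (an assumption, not from the thesis).
STOPWORDS = {"what", "is", "the", "a", "an", "of", "for", "are", "to", "in", "how"}


def extract_keywords(question: str) -> set[str]:
    """Lowercase the question, tokenize it, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9][a-z0-9\-]*", question.lower())
    return {t for t in tokens if t not in STOPWORDS}


def expand_keywords(keywords: set[str]) -> set[str]:
    """Add the related terms of every keyword that has an entry in the table."""
    expanded = set(keywords)
    for keyword in keywords:
        expanded |= RELATED_TERMS.get(keyword, set())
    return expanded


keywords = extract_keywords("What is the treatment for COVID-19 fever?")
expanded = expand_keywords(keywords)
```

With the expanded set, a sentence that mentions "SARS-CoV-2" or "pyrexia" but never the literal question words can still be matched during sentence scoring.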


1.2 Problem Statement

Text summarization aims to select or generate important information from the original text(s) to create a short version [7]. Humans often read all documents to develop understanding, and then write a summary highlighting the main points. Because machines lack human experience and understanding, generating a text summary automatically is exceedingly difficult and time-consuming.

Based on the different characteristics of the summary paragraphs, text summarization can be classified in many different ways, as shown in Figure 1.3 [3].

Figure 1.3: Classification of Text Summarization Approaches

• According to the input document(s): Single-document summarization and Multi-document summarization. The difference is that single-document summarization focuses on a single text, while multi-document summarization uses multiple documents as input.

• According to the summary usage: Generic and Query-based. The generic approach does not focus on a specific topic or aspect and gives an overview of the sources. In the query-based summarization approach, the result is focused on the user's question.

• According to techniques: Supervised and Unsupervised. Unsupervised approaches rely on algorithms that do not depend on human support, such as labelling training datasets. These models are suitable for big data, such as website data. Supervised learning methods are based on a sentence-level classification approach in which the model learns to distinguish summary from non-summary sentences.

• According to output characteristics: Extractive summarization and Abstractive summarization. The extractive method extracts the most crucial sentences from the documents. The summary is then made by combining all of the critical sentences. As a result, every sentence in the summary belongs to the original document in this approach. In contrast, the abstractive approach tries to recreate the summary based on the original sentences.

Formal definition. According to the Multi-Answer Summarization task requirements⁷, different answers can bring complementary perspectives that are likely to benefit the users of QA systems. The purpose of this task is to build a multi-answer summarization model that can summarize the numerous relevant answers to a medical question. The input to the model is the customer's question $Q$ and all answers $A = \{A_1, A_2, \ldots, A_n\}$. The output is a summary that answers the given question (Figure 1.4). Table 1.1 shows an example of a resulting summary.

Figure 1.4: Multi-Answer Summarization pipeline (panels: user's question, multiple related answers, summarization)
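
The input/output contract of the task (a question Q, a set of answers A, and an extractive summary as output) can be sketched as follows. This is a minimal illustration under stated assumptions: the word-overlap score, the naive sentence splitter, the duplicate check, and all function names are hypothetical choices made for the example, not the scoring methods proposed in the thesis.

```python
from __future__ import annotations

import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text: str) -> frozenset[str]:
    """Lowercased set of alphanumeric tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def summarize(question: str, answers: list[str], max_sentences: int = 2) -> str:
    """Pick the sentences sharing the most words with the question.

    Sentences with identical token sets are treated as duplicates and kept
    once. Selected sentences are emitted in their original order, so every
    summary sentence is copied verbatim from one of the answers.
    """
    q_terms = tokenize(question)
    seen: set[frozenset[str]] = set()
    candidates: list[tuple[int, int, str]] = []  # (score, position, sentence)
    position = 0
    for answer in answers:
        for sentence in split_sentences(answer):
            tokens = tokenize(sentence)
            if not tokens or tokens in seen:
                continue  # skip empty or duplicate sentences
            seen.add(tokens)
            candidates.append((len(q_terms & tokens), position, sentence))
            position += 1
    # Highest score first; ties are broken by earlier position.
    top = sorted(candidates, key=lambda c: (-c[0], c[1]))[:max_sentences]
    return " ".join(s for _, _, s in sorted(top, key=lambda c: c[1]))


question = "What bone graft materials are used for spinal fusion?"
answers = [
    "You will be asleep and feel no pain. "
    "Bone graft material may be placed between the vertebrae.",
    "A synthetic bone substitute is used. "
    "Bone graft material may be placed between the vertebrae.",
]
summary = summarize(question, answers)
```

Here the sentence repeated across both answers is kept only once, which mirrors the goal of removing duplicated content from overlapping answers.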

Thesis scope. In this work, the model focuses on the query-based multi-document extractive summarization approach. According to the classification above, the model has four properties: Multi-document, Query-based, Unsupervised, and Extractive. The extractive approach has many advantages, such as (i) quick summarization time, (ii) low hardware cost, and (iii) summary quality that is easy to control. Besides, compressing multiple replies into a single answer saves time and effort for users. The paragraph is summarized based on the user's question, which makes the approach highly applicable.

⁷ https://www.aicrowd.com/challenges/mediqa-2021/problems/mediqa-2021-multi-answer-summarization-mas

Table 1.1: An example summary of responses to a question in the medical question answering system (MEDIQA).

Question: What bone graft materials are used for spinal fusion?

Answer 1: You will be asleep and feel no pain (general anesthesia). The doctor will make a surgical cut (incision) to view the spine. Other surgery, such as a diskectomy, laminectomy, or a foraminotomy, is almost always done first. Spinal fusion may be done. On your back or neck over the spine. You will be lying face down. Muscles and tissue will be separated to expose the spine. On your side, if you are having surgery on your lower back. The surgeon will use tools called retractors to gently separate, hold the soft tissues and blood vessels apart, and have room to work. A synthetic bone substitute is used. With a cut on the front of the neck, toward the side. The surgeon will use a graft (such as bone) to hold (or fuse) the bones together permanently. There are several ways of fusing vertebrae together. Strips of bone graft material may be placed over the back part of the spine. Bone graft material may be placed between the vertebrae. Special cages may be placed between the vertebrae. These cages are packed with bone graft material. The surgeon may get the bone graft from different places. From another part of your body (usually around your pelvic bone). This is called an autograft. Your surgeon will make a small cut over your hip and remove some bone from the back of the rim of the pelvis. From a bone bank. This is called an allograft. A synthetic bone substitute can also be used. The vertebrae may also fixed together with rods, screws, plates, or cages. They are used to keep the vertebrae from moving until the bone grafts are fully healed. Surgery can take 3 to 4 hours.

Answer 2: A bone graft can be taken from the person's own healthy bone (this is called an autograft). Or, it can be taken from frozen, donated bone (allograft). In some cases, a manmade (synthetic) bone substitute is used. You will be asleep and feel no pain (general anesthesia). During surgery, the surgeon makes a cut over the bone defect. The bone graft can be taken from areas close to the bone defect or more commonly from the pelvis. The bone graft is shaped and inserted into and around the area. The bone graft can be held in place with pins, plates, or screws.

Extractive summary: A bone graft can be taken from the person's own healthy bone (this is called an autograft). Or, it can be taken from frozen, donated bone (allograft). In some cases, a manmade (synthetic) bone substitute is used. The vertebrae may also fixed together with rods, screws, plates, or cages. They are used to keep the vertebrae from moving until the bone grafts are fully healed.
