Machine learning-based extraction of semantic relations from biomedical literature


It is not substantially the same as any I have submitted for a degree, diploma or other qualification at any other university; and no part has already been, or is currently being submitted for any degree, diploma or other qualification.

Hanoi, January 2022

Le Hoang Quynh


1.2.1 Literature review of biomedical named entity recognition
1.2.2 Literature review of biomedical relation extraction
1.2.3 Related doctoral dissertations
1.3 Related resources
1.3.1 Datasets for named entity recognition experiments
1.3.2 Datasets for relation classification experiments
1.4.1 Evaluation metrics
1.4.2 Named entity recognition evaluation
1.4.3 Relation classification evaluation


2 AN END-TO-END PIPELINE MODEL FOR BIOMEDICAL RELATION EXTRACTION
2.1 Distant supervision learning with silverCID corpus
2.2 Proposed UET-CAM system
2.2.1 Joint model of named entity recognition and normalization (DNER)
2.2.2 Coreference resolution
2.2.3 Intra-sentence relation classification with support vector machine
2.3 Experimental results and discussion
2.3.1 Choosing the combining manner of SSI and skip-gram for named entity normalization results
2.3.2 Named entity recognition and normalization results
2.3.3 CID relation classification results
2.3.4 Discussion

3 AN IMPROVED CRF-BILSTM MODEL FOR BIOMEDICAL NAMED ENTITY RECOGNITION
3.1 Introduction to deep learning for named entity recognition
3.2 Proposed D3NER model
3.2.1 Data pre-processing
3.2.2 The TPAC embeddings layer
3.2.3 Context representing biLSTM layer

4 HYBRID, ATTENTION-BASED AND ENSEMBLE DEEP LEARNING MODELS FOR BIOMEDICAL RELATION CLASSIFICATION
4.1 The shortest dependency path
4.1.1 Dependency tree


4.1.2 The shortest dependency path
4.1.3 Dependency unit
4.2 A hybrid adaptive deep learning model for biomedical relation extraction
4.2.1 Proposed MASS model
4.2.2 Experimental corpora and comparative models
4.2.3 Experimental environment and model settings
4.2.4 Experimental results and discussion
4.3 An attentive augmented deep learning model for biomedical relation extraction
4.3.1 Richer-but-smarter SDP
4.3.2 Proposed RbSP model
4.3.3 Experimental environment and model settings
4.4 A multi-fragment ensemble deep learning model for biomedical relation extraction
4.4.1 Over-fitting problem of deep learning-based models
4.4.2 Bagging with bootstrap training data
4.4.3 Proposed multi-fragment ensemble architecture
4.5 Summary

5 GRAPH-BASED INTER-SENTENCE RELATION CLASSIFICATION IN BIOMEDICAL TEXT
5.1 Inter-sentence relations classification problem
5.2 Proposed graph-based inter-sentence relation classification model
5.2.1 Model overview
5.2.2 Document sub-graph construction
5.2.3 Paths finding, merging and choosing
5.2.4 Shared-weight convolutional neural network
5.3 Experimental results and discussion
5.3.1 Experimental environment and model settings
5.3.2 Contribution of the added virtual edges in document sub-graph
5.3.3 Different sliding window size w for training and testing
5.3.4 Contribution of the model components
5.3.5 Comparison to comparative model


BC5 CDR corpus

Bacteria Biotope Task

BioCreative V Chemical-Disease relation

Disease Named Entity Recognition

Deep Neural Network

Dependency Unit

Embeddings from Language Models

False Negative


FP False Positive

FSU-PRGE The FSU PRotein GEne Corpus

GD Gradient Descent

HAScO Human-Aware Science Ontology

HHEAR Human Health Exposure Analysis Resource

HMM Hidden Markov Model

IAA Inter-annotator Agreement

IE Information Extraction

KB Knowledge-base

LSTM Long Short-term Memory

MASS Man for All SeasonS

MESH Medical Subject Headings

mf Multi-fragment

MLP Multilayer Perceptron

MUC Message Understanding Conferences

NCBI National Center for Biotechnology Information

NCIT National Cancer Institute Thesaurus

NE Named Entity

NEN Named Entity Normalization

NER Named Entity Recognition

NLP Natural Language Processing

OOV Out-Of-Vocabulary

OWL Orthology Ontology

P Precision

PMC PubMed Central


Radiology Gamuts Ontology

Recurrent Neural Network

The Shortest Dependency Path

A Silver-standard Corpus for Chemical-induced Disease Relation Extraction

Systematized Nomenclature of Medicine

Supervised Semantic Indexing

Standard Deviation

Support Vector Machine

Shared-weight Convolutional Neural Network

True Negative

True Positive

The Token-POS tag-Abbreviation-Character Embeddings

Unified Medical Language System

Without Replacement


Growth of MEDLINE citations from 1986 to 2019
Challenges’ subtasks/tracks organized based on NLP perspectives [64]
The dissertation outline
An example taken from the BC5 CDR corpus with recognized names of Disease, Chemical and Species
Examples of (a) inter-sentence relation and (b) intra-sentence relation
Examples of relations with specific and unspecific location
Examples of (a) Promotes - a directed relation and (b) Associated - an undirected relation taken from Phenebank corpus
Named entity recognition approaches taxonomy
Relation extraction approaches taxonomy
The statistics of corpora used in our experiments for relation classification
Analysis of the Direct Evidence field in the CTD databases
An example of constructing silverCID corpus
Architecture of the proposed UET-CAM system
Advanced SSI model using skip-gram information for NEN
Hybrid model of SSI and skip-gram model for NEN
Sequential back-off model of SSI and skip-gram model for NEN
An example of coreference in text
An example of using multi-pass sieve for coreference resolution
The D3NER architecture
The TPAC embedding architecture of D3NER
Example of a dependency tree
Examples of the shortest dependency paths
Examples of the dependency unit in the shortest dependency paths
The architecture of MASS model for relation classification
The multi-channel LSTM for word representation


4.6 Ablation test results for various components and information sources of MASS model
4.7 Examples of SDPs and attached child nodes
4.8 The architecture of RbSP model for relation classification
4.9 The multi-layer attention architecture to extract the augmented information from the children of a token on SDP
4.10 Ablation test results for compositional embeddings of RbSP model
4.11 Ablation test results for augmented information of RbSP model
4.12 Training loss, training accuracy, validation loss and validation accuracy of our RbSP model in BC5 CDR corpus
4.13 The range of RbSP model’s results on BC5 CDR test set
4.14 The multi-fragment ensemble architecture
4.15 The changes of multi-fragment ensemble model’s results with different size of training data
4.16 The changes in F1 of multi-fragment ensemble model with different vote
5.1 Examples of complicated cross-sentence relations
5.2 The proposed model for inter-sentence relation classification
5.3 Use sliding window to choose adjacent sentences for building document sub-graph
5.4 Examples of a document sub-graph
5.5 Examples of two unexpected problems while generating the instance from
5.6 Example of an abstract with many NER annotations that leads to the explosion of similar paths
5.7 Diagram illustrating a swCNN architecture
5.8 Ablation test results for virtual edges of the document sub-graph
5.9 The change of results with different size of sliding window


List of Tables

Example sentences labeled using different tagging schema
Examples for different relation types
Information about the BC5 CDR, NCBI and FSU-PRGE corpora for NER
Information about the BC5 CDR, BB3, DDI and Phenebank corpora for relation classification
Defining the test metrics
Detailed Input/Output and the objectives of UET-CAM components
Large-scale feature set used in the intra-sentence relation extraction mod-
Named Entity Normalization results with different combining architectures
Disease named entity recognition results on BC5 CDR corpus of UET-CAM system
Relation classification results on BC5 CDR corpus of UET-CAM system
Analysis of the contribution of methods and resources used in the UET-CAM system for capturing CID relationships
Sources of errors by our system on the CDR test set
Configurations and parameters of D3NER model
Experimental results of D3NER for 20 runs each with different random initialization on BC5 CDR and NCBI corpora
Performance of D3NER and compared state-of-the-art models on two benchmark corpora for Disease and Chemical NER
Experimental results of D3NER for 20 runs each with different random initialization on FSU-PRGE corpus (4-fold cross validation)
Performance of D3NER and compared state-of-the-art model on FSU-PRGE corpus for Gene/protein NER
Ablation test results for different embeddings of D3NER model
Impact of fine-tuning embeddings as the D3NER’s hyper-parameters
D3NER confusion matrix on the CDR corpus


3.9 Examples for errors caused by D3NER on the BC5 CDR and FSU-PRGE corpora
4.1 Examples for different relation types
4.2 Configurations and parameters of MASS model
4.3 Results of MASS model on the BC5 CDR corpus
4.4 Results of MASS model on the DDI-2013 corpus
4.5 Results of MASS model on the BB3 corpus
4.6 Results of MASS model on the Phenebank corpus
4.7 Examples of MASS model’s errors
4.8 Configurations and parameters of RbSP model
4.9 The RbSP model’s performance on BC5 CDR corpus
4.10 Multi-fragment ensemble results on BC5 CDR corpus
4.11 The comparison of our ensemble proposed models with other comparative models on BC5 CDR corpus
4.12 The comparison of our ensemble proposed models with other comparative models on DDI corpus
5.1 Tuned hyper-parameter of proposed model
5.2 Ablation test results for added virtual edges in the document sub-graph
5.3 Results of the document sub-graph based model on BC5 CDR corpus with different size of sliding window for training and testing
5.4 Ablation test results for various components of the document sub-graph based model on BC5 CDR corpus
5.5 The performance of document sub-graph-based model and some comparative models
5.6 The detailed results of the document sub-graph based model
5.7 Examples of errors on the BC5 CDR test set


The necessities of the dissertation:

In the past several decades, biomedicine and human health care have become one of the major service industries. They have been receiving increasing attention from the research community and society as a whole. For example, in 2011, biomedical research in the United States received 100 billion dollars of investment, with approximately 65% supported by industry, 30% by the government, and the remaining 5% by charities, foundations, or individual donors [137]. To date, many researchers are still working hard in the expectation that more advances will occur to support biomedical science and healthcare. Therefore, there is an inevitable need to understand and analyze the existing information and knowledge bases.

As a result, the field of biomedical research has grown rapidly, and the number of biomedical scientific publications is growing at an extremely high rate. Accessing and processing this data to keep abreast of the state of the art and to make discoveries in biomedical/healthcare scientific research is essential for several types of users, including biomedical researchers, clinicians, database curators, and bibliometricians [77]. More than 3,000 articles are published in biomedical journals every day [64]. MEDLINE®, a biomedical database of the US National Library of Medicine, is one of the most prominent and largest biomedical digital repositories. As of 2019, it already contains more than 26 million citations, with a fast-increasing number of articles in life sciences with a concentration on biomedicine. Figure 1 illustrates the growth of MEDLINE from ~1 million in 1970 to ~26 million citations in 2019. More impressively, this number nearly doubled in 14 years, from 2005 (~13.5 million) to 2019 (~26.2 million).

PubMed®² is a free resource developed and maintained by the NCBI which provides free access to MEDLINE and some other databases. According to the statistics reported in November 2019³, the cumulative total of PubMed citations has surpassed 30 million. However, even once results are returned from PubMed, the difficulty of processing this literature is ever-increasing. It comes from the fast-growing volume of biomedical literature, the scope of topical coverage, its interdisciplinary nature, and its unstructured form. For example, a search for ‘Influenza’ in PubMed returned 105,066 articles. The rapid growth in the volume and variety of biomedical scientific literature makes it an exemplary case of Big Data [169]. It is an unprecedented opportunity to explore biomedical science and an enormous challenge when facing a massive amount of unstructured and semi-structured data.

Figure 1: Growth of MEDLINE citations from 1986 to 2019. The vertical axis shows the number of citations (in millions). For clear visualization, the statistics before 2005 are presented every 5 years.

² https://www.ncbi.nlm.nih.gov/pubmed

Recent research progress in biomedicine needs to be supported by methodologies capable of assisting human experts in formulating hypotheses. Biomedical natural language processing (BioNLP) is a sub-field of natural language processing (NLP) that seeks to help scientists understand the wealth of data from results that are hidden in large-scale scientific text collections. BioNLP does this through the analysis, understanding, and production of structured data from unstructured free text in large-scale text collections. BioNLP now has a wide range of applications in biomedical literature mining and has attracted significant investment from research communities worldwide, reflecting its central role in many areas of biomedical research and healthcare science.

As a result, the market for biomedical text data analysis and BioNLP is growing rapidly. In particular, the NLP in healthcare and life sciences market is estimated to grow from USD 1030.2 million in 2016 to USD 2650.2 million by 2021⁴ in the United States.

³ https://www.nlm.nih.gov/bsd/medline_pubmed_production_stats.html

Relation extraction (RE) is a vital intermediate step in a variety of BioNLP applications. Its contributions include precision medicine [6], adverse drug reaction identification [30, 53], drug abuse event extraction [71], major life event extraction [19, 106], question answering systems [31, 120], and clinical decision support systems [159].

Figure 2: Challenges’ subtasks/tracks organized based on NLP perspectives [64]. In general, NLP tasks closer to the top of the pyramid are more difficult.

Because of these motivations, several challenge evaluations have been organized to assess and advance BioNLP research. These challenge evaluations often attract many scientists around the world to participate and publish their latest research on biomedical analysis. Huang and Lu (2015) [64] categorized the prevalent challenges by the targeted problems in NLP research, as shown in Figure 2. We observe that BioNLP shared tasks pay much of their attention to information extraction, including relation extraction/classification and named entity recognition, which are listed in the middle two parts of the pyramid. Some examples of well-known shared tasks include BioNLP, BioCreative, i2b2, ShARe/CLEF eHealth, and SemEval.

⁴ https://www.marketsandmarkets.com/Market-Reports/healthcare-lifesciences-nlp-market-131821021.html

A number of doctoral dissertations across the world have worked on topics related to relation extraction (more detailed information is given in Section 1.2.3). Some of them focused on a specific type of relation, such as disease-gene relations [66] and drug-drug relations [172]. The data types that they targeted are also very diverse (e.g., scientific literature [100] and electronic health records [96]). Many machine learning methods were proposed for relation extraction: supervised feature-based machine learning [9], semi-supervised learning [172], deep learning [96], etc.

In this Dissertation, we consider Relation Extraction as two text mining sub-tasks, i.e., Named Entity Recognition (NER) and Relation Classification (RC). The task of biomedical named entity recognition (NER) seeks to locate named entities in free-form biomedical text and classify them into a set of pre-defined categories/types such as gene/protein, phenotype, disease, and chemical, or ‘none-of-the-above’. The NER problem consists of three sub-problems: (i) defining the entity boundary, (ii) assigning the delimited entity to a pre-defined class, and (iii) named entity normalization, i.e., matching the extracted entities to a concept in the knowledge base. Of these, named entity normalization is often treated as an independent problem. The popular methods used for biomedical NER include dictionary-based methods, rule-based methods, classification-based methods, sequence labeling methods, and hybrid methods that combine other techniques [138, 167]. Relation classification (RC) is the task of discovering semantic connections between biomedical entities. Common biomedical relations include drug-drug interaction [164], chemical-disease relation [180], protein-protein interaction [83], and many others. The most typical methods for relation classification are co-occurrence approaches, rule-based methods, several machine learning methods, and hybrid methods [5, 142, 167].
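The three NER sub-problems named above (boundary detection, type assignment, normalization) can be illustrated with a minimal dictionary-based sketch. This is not the method proposed in the Dissertation; it is a toy instance of the dictionary-based family mentioned in the text. The lexicon, the `Mention` class, the `recognize` function, and the concept IDs are all hypothetical placeholders, not real MeSH or UMLS identifiers.

```python
import re
from dataclasses import dataclass
from typing import List

# Toy lexicon: lower-cased surface form -> (entity type, concept ID).
# The concept IDs are made-up placeholders, not real knowledge-base identifiers.
LEXICON = {
    "aspirin": ("Chemical", "C:0001"),
    "acetylsalicylic acid": ("Chemical", "C:0001"),
    "gastric ulcer": ("Disease", "D:0042"),
}


@dataclass
class Mention:
    start: int        # (i) entity boundary, as character offsets
    end: int
    text: str
    etype: str        # (ii) pre-defined class
    concept_id: str   # (iii) normalization target in the (toy) knowledge base


def recognize(text: str) -> List[Mention]:
    """Longest-match dictionary lookup returning non-overlapping mentions."""
    mentions: List[Mention] = []
    # Try longer surface forms first so that multi-word names take priority.
    for form in sorted(LEXICON, key=len, reverse=True):
        pattern = r"\b" + re.escape(form) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            # Skip any span that overlaps an already accepted mention.
            if any(m.start() < x.end and x.start < m.end() for x in mentions):
                continue
            etype, concept_id = LEXICON[form]
            mentions.append(Mention(m.start(), m.end(), m.group(0), etype, concept_id))
    return sorted(mentions, key=lambda x: x.start)


sentence = "Acetylsalicylic acid may cause gastric ulcer; aspirin is widely used."
for mention in recognize(sentence):
    print(mention)
```

In this sketch two different surface forms ("Acetylsalicylic acid" and "aspirin") are mapped to the same placeholder concept, which is the point of the normalization step. A pure lookup like this cannot recognize names that are missing from the lexicon.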

In line with the worldwide research trend, this dissertation differs from other research in several aspects: (i) We address both NER and RC of RE as two separate tasks. Most other works focus on only one task, NER or RC. Some research addressed RE as RC, with NER solved in a previous phase as a pre-processing step. (ii) We focus on scientific literature abstracts and capitalize on their characteristics, rather than treating them as ordinary documents. (iii) The dissertation studies and applies a variety of machine learning methods, including supervised feature-based machine learning, unsupervised machine learning, distant learning, and deep learning. (iv) The dissertation does not focus entirely on a specific type of relationship. CID is just a typical relationship used to facilitate the comparison of results. Many experiments were conducted for other relation types, and all have positive results.

Research challenges:

The biomedical research community pays much attention to developing dedicated data and resources. It is now acknowledged that biomedicine is among the fields with the most abundant publicly available resources and tools. However, the specific characteristics of biomedical data still bring many challenges for the research community [2, 167]:

— Firstly, biomedical NLP still faces many general NLP problems, i.e., problems that exist not only in the biomedical domain but also in the general field of NLP. We list here some widespread problems: the imbalanced data problem, special linguistic units such as negation and conjunction, and directed relations.

— Secondly, information extraction in the biomedical domain often suffers from errors caused by the relatively low performance of pre-processing steps. Because biomedical texts are highly specialized, generic data analysis and NLP tools are not appropriate.

— Thirdly, biomedical terms have their own diversity and characteristics, such as the lack of standard nomenclatures and the heavy use of unknown words, which make them highly variable and ambiguous compared to terms in other domains.

— The fourth problem comes from ambiguity and inconsistency, i.e., NEs with the same orthographic features may fall into different categories.

— Finally, biomedicine is an interdisciplinary field. Given the complexity of the biological domain, the growing capability of biomedical research relies increasingly on the development of methods and concepts crossing these boundaries.

Research objectives and methodology:

Motivated by the above necessities and challenges, the Dissertation pursues the following research objectives:

— [RO1] Appropriately represent the biomedical literature text to make the best use of linguistic, syntactic, and semantic information.

— [RO2] Take advantage of state-of-the-art methods and resources to propose combined architectures, and then improve them to resolve the NER and RC problems with good results.

To reach these research aims, we focus on addressing the following main research question: How to build an effective machine learning-based architecture for NER and RC systems? Two sub-questions supplement the main research question:

— [Sub-question sQ1] How to convert the biomedical literature text, annotated with named entity and relation labels, into a rich representation containing useful information that can be processed by machine learning models?

This research question is addressed throughout the Dissertation; for example, we represent the relations by using engineered features in Chapter 2, embeddings and the shortest dependency path in Chapter 4, and a graph in Chapter 5.

— [Sub-question sQ2] How to apply, combine, and improve advanced machine learning methods for building NER and RC systems?

This research question is addressed in Chapter 2, Chapter 3, Chapter 4 and Chapter 5.

The research methodology of the Dissertation combines qualitative and quantitative research:

* Qualitative research includes: (i) analyzing the ideas, proposed methods and techniques of related works; (ii) detecting problems, advantages and disadvantages of these methods; (iii) improving, combining and proposing new solutions and models to resolve problems.

* Quantitative research includes: (i) analyzing available corpora; (ii) deploying experiments; (iii) verifying the performance of proposed methods and models; and (iv) publishing scientific reports to receive verification from the research community.

Overview of our approach:

The Dissertation participates in the research trend of BioNLP in general and biomedical relation extraction in particular. Our focus is on improving the methods, exploiting information-rich data representations, and building a capable architecture for biomedical named entity recognition and relation classification, rather than on developing new machine learning algorithms.

We argue that achieving better performance in biomedical relation extraction tasks depends on improvements in machine learning and data representation. We first build an end-to-end model for named entity recognition and relation classification. This model is mostly based on several supervised feature-based learning techniques. BioNLP, like its parent field NLP, has been through a step-change in the last five years with a move from machine learning based on expert features to deep learning techniques that learn feature representations for themselves. Following this research trend, we then propose several deep architectures for improving named entity recognition and relation classification.

The main contribution of the Dissertation:

The Dissertation has three main contributions:

— Researching, improving, and proposing several data representations to make use of linguistic, syntactic, and semantic information. This contribution is reflected in the proposal of a rich feature set in Chapter 2, a combination of several information types in Chapter 3 and Chapter 4, as well as a graph-based representation in Chapter 5.

— Studying and constructing machine learning architectures to solve the NER and RC problems by combining and improving advanced machine learning methods from multiple perspectives: (i) The UET-CAM system in Chapter 2 is a joint model of an NER-NEN system and rich feature-based machine learning with distant supervision learning for RC. (ii) The D3NER system in Chapter 3 combines several types of information in a deep learning model. (iii) The MASS and RbSP models in Chapter 4 are deep learning-based models with several improvements, including an attention mechanism. (iv) The multi-fragment ensemble model is also proposed in Chapter 4. (v) Finally, Chapter 5 focuses on inter-sentence relation extraction with a novel graph-based model. Most applied methods/techniques are carefully analyzed to evaluate their contribution to system performance.

— Contributing to the research community by creating a silver-standard dataset called ‘silverCID’ for distant supervision learning. This dataset is used in Chapter 2 and Chapter 5 and is shown to have a positive influence on system performance.

Scope of the Dissertation:

The Dissertation focuses on solving the relation extraction problem in English biomedical literature text by applying natural language processing (NLP) techniques. Two sub-problems (i.e., named entity recognition and relation classification) are solved separately by applying several advanced machine learning methods in an appropriate architecture.

Biomedical named entity recognition is treated as a sequence labelling problem. Note that the nested entity problem is excluded, i.e., we do not consider cases in which named entities contain other named entities or in which several entities intersect. In a part of the Dissertation, named entity recognition is processed simultaneously with the named entity normalization phase to increase performance. The dissertation experiments work on three fundamental biomedical entity types, i.e., Chemical, Disease, and Protein/Gene. They are three of the entity types most frequently requested by PubMed users worldwide [68] and are annotated in many well-known biomedical knowledge bases (Medical Subject Headings (MESH), Unified Medical Language System (UMLS), Systematized Nomenclature of Medicine (SNOMED), and many others).
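To make the sequence labelling view concrete, the sketch below converts token-level entity spans into per-token tags and rejects nested or overlapping spans, mirroring the scope restriction stated above. The BIO scheme is assumed here as one common tagging schema; the function name `spans_to_bio` and the example sentence are illustrative and not taken from the Dissertation.

```python
from typing import List, Tuple

# A span is (first token index, last token index + 1, entity type).
Span = Tuple[int, int, str]


def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Encode non-nested entity spans as one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, etype in sorted(spans):
        if not (0 <= start < end <= len(tokens)):
            raise ValueError(f"span out of range: {(start, end, etype)}")
        # A token that already carries a tag means two spans intersect.
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError("nested or overlapping entities are out of scope")
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags


tokens = ["Lithium", "carbonate", "may", "induce", "renal", "failure", "."]
spans = [(0, 2, "Chemical"), (4, 6, "Disease")]
for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(f"{token}\t{tag}")
```

With this encoding, a sequence labelling model predicts one tag per token, and entity boundaries and types are read back from runs of `B-`/`I-` tags.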

In this Dissertation, we delineate the scope of our study of the biomedical relation classification problem according to the following characteristics:

• Only binary biomedical relations are extracted. We aim to address n-ary relations as further extensions of our model in future work.

• We focus on both intra- and inter-sentence relations.

• Both directed and undirected relations are considered in the research scope.

• Depending on the corpus that the relation classification system works on, it can be a binary classification or a multi-label classification problem.
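The scope characteristics listed above can be captured in a small data structure. The sketch below is only one possible representation, under the assumption that each entity mention records the index of the sentence it occurs in. The class names `Entity` and `BinaryRelation`, their fields, and the example IDs are hypothetical; the labels "Promotes" and "Associated" are the directed and undirected examples mentioned for the Phenebank corpus.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Entity:
    entity_id: str
    etype: str
    sentence_index: int  # index of the sentence containing the mention


@dataclass(frozen=True)
class BinaryRelation:
    label: str
    arg1: Entity
    arg2: Entity
    directed: bool

    @property
    def scope(self) -> str:
        """'intra' if both arguments share a sentence, otherwise 'inter'."""
        same = self.arg1.sentence_index == self.arg2.sentence_index
        return "intra" if same else "inter"

    def key(self) -> Tuple[str, Tuple[str, ...]]:
        """Comparison key: argument order matters only for directed relations."""
        ids: Tuple[str, ...] = (self.arg1.entity_id, self.arg2.entity_id)
        if not self.directed:
            ids = tuple(sorted(ids))
        return (self.label, ids)


a = Entity("E1", "Chemical", sentence_index=0)
b = Entity("E2", "Disease", sentence_index=2)

undirected_ab = BinaryRelation("Associated", a, b, directed=False)
undirected_ba = BinaryRelation("Associated", b, a, directed=False)
directed_ab = BinaryRelation("Promotes", a, b, directed=True)
directed_ba = BinaryRelation("Promotes", b, a, directed=True)

print(undirected_ab.key() == undirected_ba.key())  # same undirected relation
print(directed_ab.key() == directed_ba.key())      # different directed relations
print(directed_ab.scope)
```

Comparing predictions and gold annotations through `key()` lets an evaluation treat an undirected pair as the same relation regardless of argument order, while keeping the two directions of a directed relation distinct.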

In experiments, we mostly focus on the chemical-induced disease relation (also known as adverse drug reaction or side effect). This relation attracts much attention from the research community as well as the industry. It is annotated in many biomedical ontologies, e.g., SNOMED, Orthology Ontology (OWL), Human Health Exposure Analysis Resource (HHEAR), Human-Aware Science Ontology (HAScO), National Cancer Institute Thesaurus (NCIT), Radiology Gamuts Ontology (RGO), and Comparative Toxicogenomics Database (CTD)¹³. Various other relations are also considered in some experiments for further comparison. Examples include the drug-drug interaction (which includes mechanism, effect, advice and int), the locations (biotopes and geographical places) of bacteria, and many others.

The BioCreative V CDR corpus was selected as the benchmark dataset for experimentation throughout the Dissertation. In addition, depending on the aspect we wanted to verify, some other datasets were selected, including the DDI corpus, the BB3 corpus, and the Phenebank corpus.

The dissertation outline:

The Dissertation outline is illustrated in Figure 3; it contains the Preface, five Chapters and the Conclusion. The related publications are marked against their corresponding Chapter.

Chapter 1: INTRODUCTION TO BIOMEDICAL RELATION EXTRACTION provides an introduction to important concepts relevant throughout this work. The main focuses of this chapter are the problem statement, literature review, related resources and the evaluation method.

Chapter 2: AN END-TO-END PIPELINE MODEL FOR BIOMEDICAL RELATION EXTRACTION describes the architecture of our UET-CAM system that participated in the BioCreative V CDR track. It is an end-to-end architecture for chemical-induced disease relation extraction that consists of several advanced feature-based machine learning components.

Chapter 3: AN IMPROVED CRF-BILSTM MODEL FOR BIOMEDICAL NAMED ENTITY RECOGNITION improves biomedical named entity recognition by proposing a deep learning model with several embedding sources. In addition to chemical and disease entities, gene/protein entities are also considered in this chapter’s experiments.

Chapter 4: HYBRID, ATTENTION-BASED AND ENSEMBLE DEEP LEARNING MODELS FOR BIOMEDICAL RELATION CLASSIFICATION proposes some deep architectures for biomedical relation classification. Several corpora with various relation types are also used to demonstrate the flexibility and adaptability of the proposed model. The trending attention technique and the ensemble approach are also applied to propose a novel deep architecture with promising results.

Chapter 5: GRAPH-BASED INTER-SENTENCE RELATION CLASSIFICATION IN BIOMEDICAL TEXT presents our approach to inter-sentence relation classification. To exploit the graph-based representation effectively, we develop a novel shared-weight deep learning model.

Lastly, in the Conclusion, we summarize the dissertation’s main contributions and limitations, and end with an outlook on future work.

¹³ http://ctdbase.org/

[Figure 3 (diagram). Recoverable box labels: "AN END-TO-END PIPELINE MODEL FOR BIOMEDICAL RELATION EXTRACTION (Participated in BioCreative V CDR tracks)", covering NAMED ENTITY RECOGNITION and RELATION CLASSIFICATION; "AN IMPROVED CRF-BILSTM MODEL FOR BIOMEDICAL NAMED ENTITY RECOGNITION"; "Chapter 5. GRAPH-BASED INTER-SENTENCE RELATION CLASSIFICATION IN BIOMEDICAL TEXT (Extensive research on inter-sentence relation)". Recoverable publication labels: [LHQ1] (Oxford Database, 2016); (EMNLP, 2018); [LHQ5] (ACIIDS, 2019); [LHQ6] (NAACL, 2019); [LHQ7] (vise, 2020); [LHQ2] (Bioinformatics, 2018); [LHQ8] (BioNLP-NAACL, 2021); [LHQ9] (KSE, 2021).]

Figure 3: The dissertation outline. The related publications are listed in their corresponding Chapter.

1.1.1 Semantic relation extraction

First of all, we present the definition of semantic relations in Definition 1.1.

Definition 1.1 Semantic relations (or semantic relationships) are the associations that exist between the meanings of linguistic components (e.g., semantic relations at word level, entity level, phrase level or sentence level, etc.).

Semantic relation extraction (see Definition 1.2) is useful in many fact extraction applications ranging from question answering [31, 120] to identifying adverse drug reactions [53].

Definition 1.2 Relation Extraction (RE) is the task of detecting and characterizing the semantic relations between pairs of named entity mentions in the text [2]. Receiving a document (or a set of documents) as input, the relation extraction system aims to extract all pre-defined relationships mentioned in it by identifying the corresponding entities and determining the type of relationship between each pair of entities.

In this Dissertation, we focus on two sub-tasks of Relation Extraction: Named Entity Recognition (NER) and Relation Classification. The former, named entity recognition (NER, entity tagging), is an intermediate step for relation extraction. It refers to locating named entities in text and classifying them into predefined categories. In the scope of this Dissertation, NER is the problem of finding biomedical entity mentions such as diseases, chemicals, genes, proteins, or organisms in natural language biomedical literature, then tagging them with their location and type. The latter, relation classification (RC), follows NER to find the semantic relations between the corresponding entities [2]. Biomedical relation classification typically assigns each pair of biomedical entities to a relation such as drug-drug interaction, chemical-induced disease, or bacteria live-in location, or tags the pair as ‘none’ if no relationship can be found between them. We describe these two sub-problems in detail in Sections 1.1.2 and 1.1.3 below.

1.1.2 Biomedical named entity recognition

We give the definition of a named entity in Definition 1.3.

Definition 1.3 A named entity (NE) (also called entity mention) is a continuous sequence of words that designates some real-world entity [2].

The automated recognition of named entities in text has been a highly active area for over two decades and is referred to variously as ‘terminology extraction’, ‘term recognition’, ‘entity identification’, ‘entity chunking’, ‘entity extraction’ and ‘named entity recognition’. In this dissertation, we use the term ‘named entity recognition’. The task of named entity recognition (NER) seeks to locate NEs in free-form text and classify them into a set of predefined categories/types such as person, organization, location, expressions of time, quantities, monetary values, percentages, or ‘none-of-the-above’. In other words, NER is the problem of finding the mentions of entities in natural language text and labelling them with their location and type. Oftentimes this task cannot be accomplished simply by string matching against pre-compiled gazetteers, because named entities of a given entity type usually do not form a closed set, and therefore any gazetteer would be incomplete. Another reason is that the type of a named entity can be context-dependent [2]. For example, ‘Ho Chi Minh’ may refer to the person (the Vietnamese Communist revolutionary leader), to the location ‘Ho Chi Minh city’, to ‘Ho Chi Minh museum’, or to any other entity sharing the part ‘Ho Chi Minh’. To determine the entity type for this text span occurring in a particular document, its context has to be considered.

Named entity recognition is typically modeled as a sequence labeling problem. We treat each word in a sentence as an observation and the sentence as a sequence of observations, and try to assign a label to each observation of this sequence. It is defined formally in Definition 1.4.

Definition 1.4 Given a sequence of input tokens $X = (x_1, \ldots, x_n)$ and a set of labels $L$, the named entity recognition (NER) task determines a sequence of labels $Y = (y_1, \ldots, y_n)$ such that $y_i \in L$ for $1 \le i \le n$ [88].

While one may apply standard classification to predict the label $y_i$ based solely on $x_i$, in sequence labelling it is assumed that the label $y_i$ depends not only on its corresponding observation $x_i$ but also possibly on other observations and other labels in the sequence. Typically this dependency is limited to observations and labels within a close neighbourhood of the current position $i$.

The label in NER often incorporates two concepts: the type of the entity (e.g., whether the mention refers to a person, location, chemical or disease), and the position of the token within the entity. Hence, the label set should follow a formal tagging scheme (also called ‘label model’ or ‘tagging format’), which is a format for tagging tokens in a chunking task in computational linguistics, such as NER. The simplest model for the token position is the IO model, which indicates whether the token is inside (I) or outside (O) of an entity mention. While simple, this model cannot differentiate between a single mention containing several words and distinct mentions comprising consecutive terms. IOB is the well-known tagging scheme that overcomes this limitation of the IO scheme. Unlike the IO scheme, the IOB scheme tags a token as B if it marks the beginning of an entity. This model is capable of differentiating between consecutive entities and has excellent support in the literature. A more complex model in common use is IOBES (or IOBEW), a finer-grained variant of the IOB tagging scheme. In addition to I, O and B, IOBES uses E for Ending and S for Singleton (a one-word entity). While the IOBES scheme does not provide higher expressive power than the IOB model, it was shown to improve labelling models’ performance marginally [157] and has been used in several NER studies [87, 181]. Example sentences annotated using each label scheme can be found in Table 1.1.
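The difference between the IO, IOB and IOBES schemes can be illustrated with a minimal sketch. The function name `tag_tokens`, the span format and the `prefix-type` label format are assumptions made for this illustration; they are not taken from any system described in this Dissertation.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, entity type) -- illustrative format
Span = Tuple[int, int, str]


def tag_tokens(n_tokens: int, spans: List[Span], scheme: str = "IOB") -> List[str]:
    """Convert non-overlapping entity spans into per-token tags (IO, IOB or IOBES)."""
    if scheme not in {"IO", "IOB", "IOBES"}:
        raise ValueError(f"unknown scheme: {scheme}")
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        length = end - start
        for offset in range(length):
            if scheme == "IO":
                prefix = "I"
            elif scheme == "IOB":
                prefix = "B" if offset == 0 else "I"
            else:  # IOBES
                if length == 1:
                    prefix = "S"          # singleton: one-token entity
                elif offset == 0:
                    prefix = "B"          # beginning
                elif offset == length - 1:
                    prefix = "E"          # ending
                else:
                    prefix = "I"          # inside
            tags[start + offset] = f"{prefix}-{etype}"
    return tags


tokens = ["tolerability", "of", "fusidic", "acid", "treatment"]
spans = [(2, 4, "Chemical")]
for scheme in ("IO", "IOB", "IOBES"):
    print(scheme, list(zip(tokens, tag_tokens(len(tokens), spans, scheme))))
```

With two adjacent one-token entities of the same type, the IO tags are identical to those of a single two-token entity, whereas IOB and IOBES keep the two mentions apart.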

In reality, entity mentions can appear in various forms, including names, pronouns (e.g., ‘he’, ‘her’, ‘who’), and nominals (i.e., nouns, noun phrases, etc.). In many domains such as newswire and literature, NEs were often defined as proper names and quantities of interest. The most widely studied named entity types are person, organization and location, which were first defined by the sixth in a series of Message Understanding Conferences (MUC-6) [48]. These types are general enough to be useful for many application domains. Extraction of expressions of dates, times, monetary values and percentages, which was also introduced by MUC-6, is often also studied under NER, although strictly speaking these expressions are not named entities. Besides these general entity types, other types of entities are usually defined for specific domains and applications. NEs and NER in the biomedical domain are described below.

Biomedical named entities are phrases or combinations of phrases that denote important concepts in biomedicine. They can be chemicals, diseases, anatomies, pathways, genes/proteins, etc. that are named in biomedical literature, which has been growing at an unprecedented speed. Automatically extracting them, a task known as biomedical named entity recognition, involves the demarcation of entity names of a specific semantic type, e.g., proteins. It results in annotations corresponding to a name’s in-text locations as well as the predefined semantic category it has been assigned to [158]. Over the last fourteen years, there has been considerable interest in this problem, with a variety of generic and entity-specific algorithms applied to extract the names of biomedical concepts. Recent NER research in the biomedical domain has primarily focused on the entities most frequently requested by PubMed users worldwide, including Disorder (Disease, Symptom, Phenotype), Gene/Protein, Chemical/Drug, Biological Process, Medical Procedure, Living Being, Research Procedure, Cell Component, Body Part, Device or Tissue [68]. Still, few solutions have been proposed for other entities such as phenotypes and anatomy [24].

Because of the need for the development of new treatments for Crohn's disease, a pilot study was undertaken to estimate the pharmacodynamics and tolerability of fusidic acid treatment in chronic active, therapy-resistant patients. (PMID: 1420741)

Figure 1.1: An example taken from the BC5 CDR corpus with recognized names of Disease, Chemical and Species.


Table 1.1: Example sentences labeled using different tagging schemes.

Tagging scheme Example

Examples are taken from the BC5 CDR corpus.

Figure 1.1 shows an example of biomedical named entities in text, chosen from the BioCreative V Chemical-Disease relation corpus [105]. In this sentence, all disease, chemical, and species names have been demarcated. Table 1.1 compares three tagging schemes for annotating this example.

1.1.3 Biomedical relation classification

Relation classification (RC) typically follows NER in the relation extraction system. Culotta et al. (2006) [29] define relation extraction as the task of discovering semantic connections between entities. In text processing, it usually amounts to examining pairs of entities in a document and determining (from local language cues) whether a relationship exists between them.

We take the pairwise approach to relation classification. That is, after NER, we consider all pairs of recognized NEs as potential candidates and give them as input to the relation classification system. The system then assigns each candidate to a pre-defined relation type or to ‘None-of-above’ (i.e., the negative instances). In reality, there may be multi-label instances, i.e., more than one relationship may hold between an entity pair. In this Dissertation, we ignore these cases and only accept a single label for each instance. Generally, a semantic relationship can be defined among multiple entities (n-ary), but within the scope of this dissertation, we only consider binary relationships. Extracted binary relationships have the structure of a triple $\langle e_1, R, e_2 \rangle$, where $e_1$ and $e_2$ are named entities (or noun phrases) in a sentence (or abstract) from which the relationship is being extracted, and $R$ is a relation type that connects the two corresponding entities. Treating the task as a classification problem, we give the formal definition of relation classification in Definition 1.5.

Definition 1.5 The relation classification task is defined by a real-valued function $f_R$ that decides whether the corresponding entities are in a relation or not:

$$f_R(T(d)) = \begin{cases} +1 & \text{if } e_1 \text{ and } e_2 \text{ are related according to relation } R, \\ -1 & \text{otherwise,} \end{cases}$$

where $e_1$ and $e_2$ are two entities that form a candidate for relation classification; $d$ is a document that includes the corresponding entities $e_1$ and $e_2$ ($d$ can be a sentence, a paragraph or a whole document, depending on the scope of relationships); and $T(d)$ is the information that is extracted from $d$.
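The pairwise formulation can be sketched as follows: every (chemical, disease) pair of recognized entities becomes a candidate, and a decision function returns +1 or -1 for it. This is only an illustration. The entity list is toy data loosely based on the examples in this chapter, and `f_R` here is a stand-in that looks the pair up in a hypothetical annotation set, where a real system would use a trained classifier over $T(d)$.

```python
from itertools import product
from typing import List, NamedTuple, Set, Tuple


class Entity(NamedTuple):
    text: str
    etype: str


def candidate_pairs(entities: List[Entity], first_type: str,
                    second_type: str) -> List[Tuple[Entity, Entity]]:
    """Pair every entity of first_type with every entity of second_type."""
    firsts = [e for e in entities if e.etype == first_type]
    seconds = [e for e in entities if e.etype == second_type]
    return list(product(firsts, seconds))


def f_R(pair: Tuple[Entity, Entity], annotated: Set[Tuple[str, str]]) -> int:
    """Stand-in decision function: +1 if the pair is annotated as related, else -1."""
    e1, e2 = pair
    return 1 if (e1.text, e2.text) in annotated else -1


# Toy entities and toy annotations (hypothetical, for illustration only).
entities = [
    Entity("fusidic acid", "Chemical"),
    Entity("cocaine", "Chemical"),
    Entity("nausea", "Disease"),
    Entity("myocardial infarction", "Disease"),
]
annotated = {("fusidic acid", "nausea"), ("cocaine", "myocardial infarction")}

for pair in candidate_pairs(entities, "Chemical", "Disease"):
    print(pair[0].text, "->", pair[1].text, ":", f_R(pair, annotated))
```

Restricting candidates by entity type also fixes the argument order, which matters for relations whose direction always runs from one entity type to the other.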

Several aspects should be considered in a relation classification system, and they often differ across types of entities.

- There may be several relations or only one relation in a corpus. For example, the BC5 CDR [105] and BioNLP-ST 2016 BB3 [33] corpora were annotated with only one relation type, whilst the Phenebank and SemEval-2013 DDI-2013 [61] corpora have several relation types.

- Several relations are directed and order-sensitive, such as the Mechanism relation in the DDI corpus [60] and the Inheres-in relation in the Phenebank corpus. Such relations require the model to predict both the relation type and the entity order correctly. In contrast, for undirected relations, such as Associated in Phenebank, both directions can be accepted. Another case is the Chemical-Induced Disease relation in BC5 CDR [105], whose direction always runs from a chemical to a disease.

- The relation may be an intra-sentence relation (i.e., the two corresponding entities appear in the same sentence) or an inter-sentence relation (i.e., the two corresponding entities may appear in different sentences).

Biomedical relation classification concerns the detection of semantic relations between biomedical named entities or noun phrases. Recently, there has been considerable interest in biomedical relation extraction and relation classification with a variety of relationships. Common biomedical relations include drug-drug interaction [164], chemical-disease relation [180], protein-protein interaction [83] and many others. With a multitude of possible relation types, it is critical to understand how systems will behave in a variety of settings. In the biomedical domain, relation classification is useful in many fact extraction applications ranging from identifying adverse drug reactions to major life events. It is also important in tasks such as Question Answering and Knowledge Acquisition.

We give some examples of biomedical relations in Table 1.2, Figure 1.2, Figure 1.3, and Figure 1.4. Table 1.2 presents two examples among a multitude of possible relation types in the biomedical domain.

Table 1.2: Examples for different relation types.

(i) <e1>Three-dimensional digital subtraction angiographic</e1> (<e2>3D-DSA</e2>) images from diagnostic cerebral angiography were obtained

(ii) Dexamethasone: Steady-state trough concentrations of albendazole sulfoxide were about 56% higher when 8 mg <e1>dexamethasone</e1> was coadministered with each dose of <e2>albendazole</e2> (15 mg/kg/day) in eight neurocysticercosis patients.

Sentence (i) shows a Synonym-of relation, represented by an abbreviation pattern, which is very different from the predicate relation Mechanism in (ii).

Figure 1.2 includes examples from the BC5 CDR corpus [105] of an inter-sentence relation (i.e., the two corresponding entities belong to two separate sentences) and an intra-sentence relation (i.e., the two corresponding entities belong to the same sentence).

(a) Cross-sentence relation
Five of 8 patients (63%) improved during fusidic acid treatment: 3 at two weeks and 2 after four weeks. There were no serious clinical side effects, but dose reduction was required in two patients because of nausea. (PMID: 1420741)

(b) Intra-sentence relation
Eleven of the cocaine abusers and none of the controls had ECG evidence of significant myocardial injury defined as myocardial infarction, ischemia, and bundle branch block. (PMID: 1601297)

Figure 1.2: Examples of (a) inter-sentence relation and (b) intra-sentence relation. Examples are taken from the BC5 CDR corpus with a recognized chemical-induced disease relation between a chemical (highlighted in bold) and a disease (highlighted in underlined bold).

Figure 1.3 illustrates the difference between unspecific-location and specific-location relations. While a relation with specific location carries the exact positions of the two corresponding entities, an unspecific-location relation does not, i.e., all pairs of mentions of the corresponding entities should be considered as positive instances. The examples come from the BC5 CDR and DDI corpora.

Figure 1.4 shows examples extracted from the Phenebank corpus. It includes examples of directed and undirected relations. In a directed relation, the order of entities in the relation annotation must be considered; conversely, in an undirected relation, the two entities have the same role.

(a) Unspecific location
Case report: acute unintentional carbachol intoxication. INTRODUCTION: Intoxications with carbachol, a muscarinic cholinergic receptor agonist are rare (...) The mode of action was said to be comparable to that of the synthetic compound 'carbamylcholin'; that is, carbachol. He bought 25 g of carbachol as pure substance in a pharmacy, and the father was administered 400 to 500 mg. Carbachol concentrations in serum and urine on day 1 and 2 of hospital admission were analysed by HPLC-mass spectrometry. RESULTS: Minutes after oral administration, the patient developed nausea, sweating and hypotension, and finally collapsed. Bradycardia, cholinergic symptoms and asystole occurred. Initial cardiopulmonary resuscitation and immediate treatment with adrenaline (epinephrine), atropine and furosemide was successful (...)

(b) Specific location
Concurrent administration of a TNF antagonist with ORENCIA has been associated with an increased risk of serious infections and no significant additional efficacy over use of the TNF antagonists alone. (DrugBank: Abatacept)

Figure 1.3: Examples of relations with specific and unspecific location. (a) Unspecific location relation taken from the BC5 CDR corpus with recognized chemicals (highlighted in bold) and diseases (highlighted in underlined bold). The annotation points out that there are chemical-induced disease relations between the chemical carbachol and diseases, but does not give the specific location of the corresponding entities. (b) Specific location relation taken from the DDI corpus. The annotation specifies the Effect relation between two drugs (highlighted in bold) at their specific locations.

(a) Directed relation
Some patients carrying mutations in either the ATP6V0A4 or the ATP6V1B1 gene also suffer from hearing impairment of variable degree. (PMC3491836)
Directed relations:
[ATP6V0A4] PROMOTES [hearing impairment]
[ATP6V1B1] PROMOTES [hearing impairment]

(b) Undirected relation
Finally, new insight into related musculoskeletal complications (such as myopathy and tendinopathy) has also been gained (...) (PMC4432922)
Undirected relations:
[musculoskeletal complications] and [myopathy] are ASSOCIATED
[musculoskeletal complications] and [tendinopathy] are ASSOCIATED

Figure 1.4: Examples of (a) Promotes, a directed relation, and (b) Associated, an undirected relation, taken from the Phenebank corpus. (Entities are highlighted in bold.)


1.2 Literature review

1.2.1 Literature review of biomedical named entity recognition

Over the last fourteen years, there has been considerable interest in the biomedical NER problem, with a variety of generic and entity-specific algorithms applied to extract many biomedical NEs such as genes, gene products, cells, chemical compounds and diseases. Figure 1.5 gives an overview of NER approaches; some other developmental branches of machine learning, such as transfer learning, lifelong learning, and reinforcement learning, are not within the scope of this Dissertation. The specific methods that we used to construct the proposed model are highlighted. In general, similar to NER in the newswire domain, approaches to biomedical NER can be categorized as knowledge-based methods and machine learning-based methods. We also discuss hybrid approaches (combining several methods into one architecture) and joint modeling (a research trend that tries to integrate and handle different tasks as a single task).

Knowledge-based approaches:

The earliest and most straightforward solutions to biomedical NER relied on dictionary-based approaches. They use existing biomedical resources containing a comprehensive list of terms and determine whether expressions in the text match any of the biomedical terms in the provided list [158]. Many knowledge bases are used in these approaches, for example MeSH, UMLS and SNOMED. This method is still used in several studies, such as Eftimov et al. (2017) [41], who proposed a rule-based named-entity recognition method for knowledge extraction of evidence-based dietary recommendations.
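As an illustration of dictionary matching, the sketch below scans a token sequence and greedily takes the longest term found in a small lexicon. The lexicon, the function name `dictionary_ner` and the greedy longest-match strategy are assumptions for this example; real systems built on resources such as MeSH or UMLS add normalization and handling of spelling variants.

```python
from typing import Dict, List, Tuple


def dictionary_ner(tokens: List[str],
                   lexicon: Dict[Tuple[str, ...], str],
                   max_len: int = 5) -> List[Tuple[int, int, str]]:
    """Greedy longest-match lookup; returns (start, end_exclusive, type) spans."""
    lowered = [t.lower() for t in tokens]
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest window first so multi-word terms win over their parts.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + length])
            if key in lexicon:
                match = (i, i + length, lexicon[key])
                break
        if match:
            spans.append(match)
            i = match[1]
        else:
            i += 1
    return spans


# Toy lexicon (hypothetical); keys are lower-cased token tuples.
lexicon = {("fusidic", "acid"): "Chemical", ("nausea",): "Disease"}
tokens = "dose reduction was required because of nausea during fusidic acid treatment".split()
print(dictionary_ner(tokens, lexicon))
```

Any term missing from the lexicon is silently skipped, which is the coverage limitation that motivates the learning-based methods discussed later in this section.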

Rule-based methods rely on manually crafted patterns/rules to recognize NEs. Creating these rules requires human expertise and is labour-intensive. Examples of research in the biomedical field that follows this strategy include Hanisch et al. (2005) [58], who applied a staged rule-based system on the UMLS, HPO and MetaMap.

Knowledge-based methods often require human expertise and intensive labour to create such knowledge bases and patterns. Since there are millions of entity names in use, and new ones are added constantly, these methods will never be sufficiently comprehensive and cannot catch up with the growth rate of the biomedical literature.


[Taxonomy diagram. Recoverable node labels: Named Entity Recognition, Knowledge-based, Machine Learning, Hidden Markov Model, Support Vector Machine, Maximum Entropy + HMM, Deep Learning, Recurrent Neural Network, Long Short-term Memory.]

Figure 1.5: Named entity recognition approaches taxonomy. The specific methods that we applied in the proposed model are highlighted. Some other developmental branches of machine learning methods, such as transfer learning, lifelong learning, and reinforcement learning, are not within the scope of this Dissertation.

Feature-based supervised machine learning approaches:

Several recent works on biomedical NER use statistical supervised feature-based machine learning methods, which are often more robust in terms of system performance. Traditionally, to perform well and efficiently, NER models require a set of informative features (i.e., linguistic patterns) that are well-engineered and carefully selected, heuristically based on domain knowledge [17]. These methods use a large annotated corpus and the pre-defined feature set to infer optimal prediction functions by training the model, and then use it to predict the labels of new data. Supported by the availability of various annotated biomedical corpora, supervised machine learning methods have become popular, owing to the satisfactory performance they have demonstrated.

Perceptron [161] is a classic machine learning algorithm with many extended versions. Some recent studies successfully apply structured perceptrons to sequence labeling tasks, including NER [62, 126]. In this Dissertation, the perceptron is used for NER in the UET-CAM system (Chapter 2).

Conditional Random Fields (CRF) [86] is the most popular discriminative machine learning model for sequence labelling and an alternative to the previous models, as it combines the advantage of the Maximum Entropy Markov Model (MEMM) in exploiting non-independent contextual features of the entity without suffering from the label bias problem. CRF-based models that have shown especially reliable performance in biomedical NER are the linear-chain CRF [45, 88, 90, 91] and the skip-chain CRF [110]. In this Dissertation, CRF is used in the labeling phase of the D3NER model (Chapter 3).
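To make the idea of scoring a whole label sequence concrete, the following sketch shows Viterbi decoding, the standard way to find the best label sequence in a linear-chain model. The emission and transition scores are hand-set toy numbers chosen for this illustration; a CRF would learn them from annotated data, and this sketch does not reproduce the D3NER implementation.

```python
from typing import Dict, List


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]],
            labels: List[str]) -> List[str]:
    """Return the label sequence with the highest total emission + transition score."""
    # best[t][y]: score of the best path that ends with label y at position t
    best = [{y: emissions[0][y] for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B", "I"]
# All transitions score 0 except O -> I, which is invalid under IOB and heavily penalized.
transitions = {p: {y: 0.0 for y in labels} for p in labels}
transitions["O"]["I"] = -10.0
emissions = [
    {"O": 2.0, "B": 0.0, "I": 0.0},
    {"O": 0.0, "B": 1.0, "I": 1.5},
    {"O": 0.0, "B": 0.0, "I": 2.0},
]
print(viterbi(emissions, transitions, labels))
```

Picking the best label independently at each position would give O, I, I here, which is not a valid IOB sequence; decoding over the whole sequence takes the transition penalty into account and returns O, B, I.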

In addition to the structured perceptron and CRF, supervised machine learning methods that can be used for NER are abundant, with many variants, such as the Hidden Markov Model (HMM) [24], the semi-Markov model [90], MEMM [38], Support Vector Machines (SVM) [25], decision trees [136], transition-based models [118], and more.

Machine learning with feature engineering, however, is still time-consuming and very often yields incomplete, unsatisfactory feature sets. Moreover, the resulting feature sets are both domain-specific and model-specific.

Deep learning-based approaches:

In the past few years, the advent of deep neural networks, with their capability of automatic feature engineering even from noisy data, has advanced the development of NER models. Deep learning models aim to automatically induce robust representations of data by using multiple hidden layers. They have produced state-of-the-art results in many NLP tasks, including biomedical NER. A variety of deep learning methods and architectures have been used in NLP in general and in biomedical NER in particular. Among them, the most typical deep neural networks (DNNs) are Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) and their variants. All of them often require additional techniques to address the over-fitting problem and reduce the impact of the initialization.

Recurrent Neural Network (RNN) [162] performs effectively on sequential data, and many improved variants of it appear in state-of-the-art NLP systems, including NER [107, 184]. The RNN with Long Short-Term Memory (LSTM) units [63] is an advanced type of RNN that models dependencies between elements in a sequence through recurrent connections. Since the LSTM architecture can only process the input in one direction, the bidirectional LSTM (biLSTM) network improves on it by feeding the input to the LSTM network twice, in two directions: forward, from the beginning to the end of the sequence, and backward, from the end to the beginning. This design allows for the detection of dependencies from both previous and subsequent words in a sequence. Recently, LSTM has increasingly been employed for biomedical NER, yielding state-of-the-art performance at the time of publication [55, 121, 122, 181]. Realizing the potential of LSTM for the NER problem, we use LSTM in combination with CRF in the biomedical NER model in Chapter 3.

Convolutional Neural Network (CNN) [92] is good at capturing n-gram features in a flat structure and has also proved effective in NLP, including NER [28, 184].

One of the fundamental steps in a deep learning model is word representation, i.e., transforming each word into a representation vector in the first layer of the model. There are several approaches to creating a word representation, including randomly initialized embeddings, one-hot vectors, and character-level word embeddings, which represent a token’s meaning in terms of its morphological surface form [55, 177]. The most common approach to converting a word into a vector is to look it up in an embeddings matrix (i.e., a lookup table) created from pre-trained word embeddings. Word embedding is a technique for representing a word by a low-dimensional continuous vector that is pre-trained on a very large amount of unlabeled text. One of the most widely used sets of pre-trained word vectors in biomedical named entity recognition is provided by Pyysalo et al. (2013) [151]. It is a pre-trained word embedding of 200 dimensions that was induced from PubMed and PMC texts (6 million distinct words) using the word2vec skip-gram model [130]. Another well-known pre-trained embedding is FastText [10], whose 300-dimensional vectors represent each word as the sum of its skip-gram vector and character n-gram vectors to incorporate sub-word information. FastText is provided for the general domain, but it also allows us to re-train the model with our biomedical data. Since these pre-trained word embedding models learn word representations from the usage of words, words that are used in similar ways (similar contexts) end up with similar representations, naturally capturing their meaning.
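The lookup-table idea and the notion of "similar representations" can be sketched in a few lines. The three-dimensional vectors below are hand-made toy values, not taken from any pre-trained model, and the zero vector used for unknown words is one simple convention among several.

```python
import math
from typing import Dict, List

# Toy embedding table (hypothetical values); real tables have hundreds of dimensions.
embeddings: Dict[str, List[float]] = {
    "aspirin":   [0.9, 0.1, 0.0],
    "ibuprofen": [0.8, 0.2, 0.1],
    "nausea":    [0.1, 0.9, 0.2],
}
UNK = [0.0, 0.0, 0.0]  # fallback vector for out-of-vocabulary tokens


def lookup(tokens: List[str]) -> List[List[float]]:
    """Map each token to its vector; this is the first layer of the model."""
    return [embeddings.get(t.lower(), UNK) for t in tokens]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vectors = lookup(["Aspirin", "ibuprofen", "nausea", "zzz"])
print(cosine(vectors[0], vectors[1]))  # two drug names: high similarity
print(cosine(vectors[0], vectors[2]))  # drug vs. symptom: lower similarity
```

In these toy vectors the two drug names are closer to each other than either is to the symptom, which mirrors the behaviour expected of embeddings trained on word usage.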

In recent years, the use of word embeddings in deep learning-based models has gradually been replaced by newer methods that have been remarkably effective in NLP problems, including NER, namely ELMo (Embeddings from Language Models, 2018) [149] and BERT (Bidirectional Encoder Representations from Transformers, 2019) [35]. In the early stage of their release, both ELMo and BERT only provided pre-trained models for general-domain English text. Re-training them is resource-expensive: pre-training a BERT-base model on English Wikipedia (2.5 billion words) and BooksCorpus (0.8 billion words) on a TPUv2 takes about 54 hours; pre-training a BERT-base model on PubMed abstracts (4.5 billion words) and PMC full text (13.5 billion words) on eight NVIDIA V100 (32GB) GPUs takes 23 days [97]; moreover, fine-tuning a pre-trained BERT on a specific task often takes a few more hours on a GPU. In 2020, Lee et al. [97] introduced BioBERT, the first domain-specific BERT-based model pre-trained on biomedical corpora. We leave this research direction to future work.

Unsupervised and semi-supervised machine learning:

Several unsupervised and semi-supervised methods have been used to tackle the biomedical NER task. Unsupervised machine learning methods for biomedical NER are often based on phrase chunking and distributional semantics, in which entity recognition may leverage terminologies, shallow syntactic knowledge (noun phrase chunking), and corpus statistics (inverse document frequency and context vectors) [190]. These methods are outside the scope of the Dissertation.

Semi-supervised methods take advantage of both supervised and unsupervised approaches. They are applied in various manners such as self-training (bootstrapping) [174], co-training [52], transfer learning [179] and distant supervision learning [98]. We apply distant supervision learning in Chapter 2 and Chapter 4 to improve the performance of the proposed models.

Hybrid model and joint modeling:

Hybrid architectures are proposed to take advantage of several different methods by combining them into a single model. This approach integrates heuristics/rule/pattern-based methods, domain knowledge, and learning-based methods in various combinations. One state-of-the-art hybrid architecture that has been successfully applied to NER combines a deep learning network for data representation with a CRF for sequence labelling [55].

Following reports of the high performance of joint-inference models in other NLP tasks, several studies tried to join NER with other NLP tasks to improve performance. Sometimes, after NER, we need to link the recognized entity to a concept (or data entry) in an ontology or database; this task is called named entity normalization (NEN). Traditionally, NER and NEN were treated as two separate tasks, in which NEN took the output of NER as its input in a pipeline manner. Several studies [89, 116] have pointed out the limitations of this pipeline approach, i.e., it causes cascading errors from NER to NEN and limits the ability of the NER system to directly exploit the lexical information provided by the normalization step. A joint model of NER and NEN is expected to overcome these disadvantages of the traditional pipeline model. Several works tried to build such a NER-NEN joint model; examples include the transition-based model of [118], which jointly performs disease NER and NEN, and TaggerOne [89], a joint model that uses a semi-Markov structured linear classifier with a rich feature approach for NER and supervised semantic indexing for NEN. We apply this idea in the proposed model in Chapter 2 to join the NER and NEN modules in the decoding phase.

1.2.2 Literature review of biomedical relation extraction

We categorize approaches to biomedical relation classification as knowledge-based methods and machine learning methods, as illustrated in Figure 1.6, in which the specific methods that we used to construct the proposed model are highlighted. Note that there are some other developmental branches of machine learning methods, such as transfer learning, lifelong learning, and reinforcement learning, but they are not within the scope of this Dissertation.

Knowledge-based approaches:

The simplest approach for detecting potential relationships is based on co-occurrence statistics. Based on the hypothesis that two entities that are frequently mentioned together are likely to be related, this method reveals biomedical relationships by counting their co-occurrences in the same sentences or entire abstracts [20]. More accurate alternatives for relation classification are based on manually crafted rules [78, 115] and patterns [79, 146]. These methods do not require any annotated data to train a system but typically have two disadvantages: (i) they rely on manually crafted rules and patterns, which are very expensive and time-consuming to build and often require domain-expert knowledge; (ii) they are limited to extracting specific relation types. Since co-occurrence methods often have low precision and rule/pattern-based methods are labour-intensive and do not generalize, machine learning approaches are currently one of the top choices for relation extraction.
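The co-occurrence idea can be sketched as a simple counter over sentences. The input format (one pair of entity sets per sentence), the toy data and the frequency threshold of 2 are assumptions for this illustration only.

```python
from collections import Counter
from itertools import product
from typing import List, Set, Tuple


def cooccurrence_counts(sentences: List[Tuple[Set[str], Set[str]]]) -> Counter:
    """Count how often each (chemical, disease) pair appears in the same sentence."""
    counts: Counter = Counter()
    for chemicals, diseases in sentences:
        for pair in product(sorted(chemicals), sorted(diseases)):
            counts[pair] += 1
    return counts


# Toy input (hypothetical): the chemicals and diseases recognized in each sentence.
sentences = [
    ({"fusidic acid"}, {"nausea"}),
    ({"cocaine"}, {"myocardial infarction", "ischemia"}),
    ({"fusidic acid"}, {"nausea"}),
]
counts = cooccurrence_counts(sentences)
frequent = [pair for pair, c in counts.items() if c >= 2]
print(counts)
print(frequent)
```

Every pair that shares a sentence is counted regardless of what the sentence actually says, which is one reason such methods tend to have low precision.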

Feature-based supervised learning approaches:

Some literature reviews on relation extraction [5, 142] divide supervised learning approaches into two sub-categories, i.e., kernel-based and feature-based methods, based on their input to the classifier. While feature-based methods require a set of pre-defined features extracted from sentences, kernel-based techniques often take advantage of rich structural representations such as dependency trees. In this dissertation, we only focus on the feature-based methods. Feature-based methods represent each labeled instance as a feature vector, in which each element represents a feature. These feature vectors are then fed to the classifier to train the model and to predict whether the candidate entity pair is related or not. These methods are data-driven, i.e., based on domain-specific manually annotated corpora. In the biomedical domain, these approaches are widely used since they can take advantage of various annotated biomedical corpora, which are freely available, and yield promising performance.

[Taxonomy diagram. Recoverable node label: Crowd Sourcing.]

Figure 1.6: Relation extraction approaches taxonomy. The specific methods that we applied in the proposed model are highlighted.

A popular feature-based supervised machine learning algorithm is the Support Vector Machine (SVM) [27], which tries to find a linear hyperplane in an n-dimensional space with the largest distance to the nearest instances of the positive and negative classes. Feature-based SVM has been used for extracting chemical-induced disease relations [186], Live-in events [99], drug-drug interactions [156], protein-protein interactions [132], protein-organism-location relations [117] and many other biomedical relations. SVM with a rich feature set is used for relation classification in Chapter 2 of this Dissertation.
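For reference, the largest-margin hyperplane mentioned above can be written in the standard hard-margin form. The notation is an assumption made here and is not taken from the dissertation: $x_i \in \mathbb{R}^n$ are the feature vectors, $y_i \in \{-1, +1\}$ the class labels, and $(w, b)$ the hyperplane parameters.

```latex
\[
\min_{w,\,b} \; \frac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1, \qquad i = 1, \dots, m .
\]
```

Under these constraints the distance from the hyperplane to the nearest instances is $1 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin. Practical implementations typically use the soft-margin variant, which adds slack variables to tolerate instances that violate the margin.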

In addition to SVM, many other machine learning methods have been applied to biomedical relation extraction, such as Conditional Random Fields [14], Naive Bayes [102], maximum entropy [49] and logistic regression [73].

These machine learning methods for relation classification require a careful feature-engineering process. Although creating the feature sets costs considerable time and money, they are often limited to a specific model and domain.

Deep learning approaches:

Recent successes in deep learning have stimulated interest in applying neural architectures to the task of relation classification. They are extremely good at automatic feature engineering from noisy data and thus do not require a handcrafted feature set while still yielding good performance. Deep learning models often require additional techniques to address the over-fitting problem and reduce the impact of initialization. The Dissertation applies both CNN and RNN with several different improvements for classifying biomedical relations.

Convolutional Neural Networks (CNNs) [92] are among the early approaches to be applied successfully to the biomedical relation classification problem, yielding state-of-the-art results. Zhao et al. [194] used a syntax CNN for extracting drug-drug interactions. Verga et al. [176] and Zhou et al. [198] applied CNNs to chemical-induced disease relation extraction.

Recurrent Neural Networks (RNNs) [162] are another approach to capturing relations and are naturally good at modelling long-distance relations within sequential language data. Several variants of RNN have been applied to the biomedical relation classification task, including the original RNN [128], RNN with LSTM units, which are used to extend the range of context [108, 197], and the recursive neural network [111].

Deep learning-based research on relation classification in this Dissertation is mostly based on the shortest dependency path (SDP). Nodes (tokens) and dependencies in the SDP can be represented as vectors by using the methods outlined in Section 1.2.1. While tokens are often represented with word embeddings, dependencies are often converted to one-hot vectors or randomly initialized vectors. As introduced in Section 1.2.1, ELMo (2018) [149] and BERT (2019) [35] have been shown to be effective in numerous NLP studies, including relation classification (RC). However, the research on relation classification in this Dissertation is mostly based on the shortest dependency path (SDP), the use of
