It is not substantially the same as any I have submitted for a degree, diploma or
other qualification at any other university; and no part has already been, or is currently being, submitted for any degree, diploma or other qualification.
Hanoi, January 2022
Le Hoang Quynh
1.2.1 Literature review of biomedical named entity recognition
1.2.2 Literature review of biomedical relation extraction
1.2.3 Related doctoral dissertations
1.3 Related resources
1.3.1 Datasets for named entity recognition experiments
1.3.2 Datasets for relation classification experiments
1.4.1 Evaluation metrics
1.4.2 Named entity recognition evaluation
1.4.3 Relation classification evaluation
2 AN END-TO-END PIPELINE MODEL FOR BIOMEDICAL RELATION EXTRACTION
2.1 Distant supervision learning with silverCID corpus
2.2 Proposed UET-CAM system
2.2.1 Joint model of named entity recognition and normalization (DNER)
2.2.2 Coreference resolution
2.2.3 Intra-sentence relation classification with support vector machine
2.3 Experimental results and discussion
2.3.1 Choosing the combining manner of SSI and skip-gram for named entity normalization results
2.3.2 Named entity recognition and normalization results
2.3.3 CID relation classification results
2.3.4 Discussion
3 AN IMPROVED CRF-BILSTM MODEL FOR BIOMEDICAL NAMED ENTITY RECOGNITION
3.1 Introduction to deep learning for named entity recognition
3.2 Proposed D3NER model
3.2.1 Data pre-processing
3.2.2 The TPAC embeddings layer
3.2.3 Context representing biLSTM layer
4 HYBRID, ATTENTION-BASED AND ENSEMBLE DEEP LEARNING MODELS FOR BIOMEDICAL RELATION CLASSIFICATION
4.1 The shortest dependency path
4.1.1 Dependency tree
4.1.2 The shortest dependency path
4.1.3 Dependency unit
4.2 A hybrid adaptive deep learning model for biomedical relation extraction
4.2.1 Proposed MASS model
4.2.2 Experimental corpora and comparative models
4.2.3 Experimental environment and model settings
4.2.4 Experimental results and discussion
4.3 An attentive augmented deep learning model for biomedical relation extraction
4.3.1 Richer-but-smarter SDP
4.3.2 Proposed RbSP model
4.3.3 Experimental environment and model settings
4.3.4 Experimental results and discussion
4.4 A multi-fragment ensemble deep learning model for biomedical relation extraction
4.4.1 Over-fitting problem of deep learning-based models
4.4.2 Bagging with bootstrap training data
4.4.3 Proposed multi-fragment ensemble architecture
4.4.4 Experimental results and discussion
4.5 Summary
5 GRAPH-BASED INTER-SENTENCE RELATION CLASSIFICATION IN BIOMEDICAL TEXT
5.1 Inter-sentence relations classification problem
5.2 Proposed graph-based inter-sentence relation classification model
5.2.1 Model overview
5.2.2 Document sub-graph construction
5.2.3 Paths finding, merging and choosing
5.2.4 Shared-weight convolutional neural network
5.3 Experimental results and discussion
5.3.1 Experimental environment and model settings
5.3.2 Contribution of the added virtual edges in document sub-graph
5.3.3 Different sliding window size w for training and testing
5.3.4 Contribution of the model components
5.3.5 Comparison to comparative model
BC5 CDR corpus
Bacteria Biotope Task
BioCreative V Chemical-Disease relation
Disease Named Entity Recognition
Deep Neural Network
Dependency Unit
Embeddings from Language Models
False Negative
FP False Positive
FSU-PRGE The FSU PRotein GEne Corpus
GD Gradient Descent
HAScO Human-Aware Science Ontology
HHEAR Human Health Exposure Analysis Resource
HMM Hidden Markov Model
IAA Inter-annotator Agreement
IE Information Extraction
KB Knowledge-base
LSTM Long Short-term Memory
MASS Man for All SeasonS
MESH Medical Subject Headings
mf Multi-fragment
MLP Multilayer Perceptron
MUC Message Understanding Conferences
NCBI National Center for Biotechnology Information
NCIT National Cancer Institute Thesaurus
NE Named Entity
NEN Named Entity Normalization
NER Named Entity Recognition
NLP Natural Language Processing
OOV Out-Of-Vocabulary
OWL Orthology Ontology
P Precision
PMC Pubmed Central
Radiology Gamuts Ontology
Recurrent Neural Network
The Shortest Dependency Path
A Silver-standard Corpus for Chemical-induced Disease Relation Extraction
Systematized Nomenclature of Medicine
Supervised Semantic Indexing
Standard Deviation
Support Vector Machine
Shared-weight Convolutional Neural Network
True Negative
True Positive
The Token-POS tag-Abbreviation-Character Embeddings
Unified Medical Language System
Without Replacement
Growth of MEDLINE citations from 1986 to 2019
Challenges' subtasks/tracks organized based on NLP perspectives [64]
The dissertation outline
An example taken from the BC5 CDR corpus with recognized names of Disease, Chemical and Species
Examples of (a) inter-sentence relation and (b) intra-sentence relation
Examples of relations with specific and unspecific location
Examples of (a) Promotes - a directed relation and (b) Associated - an undirected relation taken from Phenebank corpus
Named entity recognition approaches taxonomy
Relation extraction approaches taxonomy
The statistics of corpora used in our experiments for relation classification
Analysis of the Direct Evidence field in the CTD databases
An example of constructing silverCID corpus
Architecture of the proposed UET-CAM system
Advanced SSI model using skip-gram information for NEN
Hybrid model of SSI and skip-gram model for NEN
Sequential back-off model of SSI and skip-gram model for NEN
An example of coreference in text
An example of using multi-pass sieve for coreference resolution
The D3NER architecture
The TPAC embedding architecture of D3NER
Example of a dependency tree
Examples of the shortest dependency paths
Examples of the dependency unit in the shortest dependency paths
The architecture of MASS model for relation classification
The multi-channel LSTM for word representation
4.6 Ablation test results for various components and information sources of MASS model
4.7 Examples of SDPs and attached child nodes
4.8 The architecture of RbSP model for relation classification
4.9 The multi-layer attention architecture to extract the augmented information from the children of a token on SDP
4.10 Ablation test results for compositional embeddings of RbSP model
4.11 Ablation test results for augmented information of RbSP model
4.12 Training loss, training accuracy, validation loss and validation accuracy of our RbSP model in BC5 CDR corpus
4.13 The range of RbSP model's results on BC5 CDR test set
4.14 The multi-fragment ensemble architecture
4.15 The changes of multi-fragment ensemble model's results with different size of training data
4.16 The changes of F1 of multi-fragment ensemble model with different vote
5.1 Examples of complicated cross-sentence relations
5.2 The proposed model for inter-sentence relation classification
5.3 Use sliding window to choose adjacent sentences for building document sub-graph
5.4 Examples of a document sub-graph
5.5 Examples of two unexpected problems while generating the instance from
5.6 Example of an abstract with many NER annotations that leads to the explosion of similar paths
5.7 Diagram illustrating a swCNN architecture
5.8 Ablation test results for virtual edges of the document sub-graph
5.9 The change of results with different size of sliding window
List of Tables
Example sentences labeled using different tagging schema
Examples for different relation types
Information about the BC5 CDR, NCBI and FSU-PRGE corpora for NER
Information about the BC5 CDR, BB3, DDI and Phenebank corpora for relation classification
Defining the test metrics
Detailed Input/Output and the objectives of UET-CAM components
Large-scale feature set used in the intra-sentence relation extraction model
Named Entity Normalization results with different combining architectures
Disease named entity recognition results on BC5 CDR corpus of UET-CAM system
Relation classification results on BC5 CDR corpus of UET-CAM system
Analysis of the contribution of methods and resources used in the UET-CAM system for capturing CID relationships
Sources of errors by our system on the CDR test set
Configurations and parameters of D3NER model
Experimental results of D3NER for 20 runs each with different random initialization on BC5 CDR and NCBI corpora
Performance of D3NER and compared state-of-the-art models on two benchmark corpora for Disease and Chemical NER
Experimental results of D3NER for 20 runs each with different random initialization on FSU-PRGE corpus (4-fold cross validation)
Performance of D3NER and compared state-of-the-art model on FSU-PRGE corpus for Gene/protein NER
Ablation test results for different embeddings of D3NER model
Impact of fine-tuning embeddings as the D3NER's hyper-parameters
D3NER confusion matrix on the CDR corpus
3.9 Examples for errors caused by D3NER on the BC5 CDR and FSU-PRGE corpora
4.1 Examples for different relation types
4.2 Configurations and parameters of MASS model
4.3 Results of MASS model on the BC5 CDR corpus
4.4 Results of MASS model on the DDI-2013 corpus
4.5 Results of MASS model on the BB3 corpus
4.6 Results of MASS model on the Phenebank corpus
4.7 Examples of MASS model's errors
4.8 Configurations and parameters of RbSP model
4.9 The RbSP model's performance on BC5 CDR corpus
4.10 Multi-fragment ensemble results on BC5 CDR corpus
4.11 The comparison of our ensemble proposed models with other comparative models on BC5 CDR corpus
4.12 The comparison of our ensemble proposed models with other comparative models on DDI corpus
5.1 Tuned hyper-parameter of proposed model
5.2 Ablation test results for added virtual edges in the document sub-graph
5.3 Results of the document sub-graph based model on BC5 CDR corpus with different size of sliding window for training and testing
5.4 Ablation test results for various components of the document sub-graph based model on BC5 CDR corpus
5.5 The performance of document sub-graph-based model and some comparative models
5.6 The detailed results of the document sub-graph based model
5.7 Examples of errors on the BC5 CDR test set
The necessities of the dissertation:
In the past several decades, biomedicine and human health care have become one of the major service industries. They have been receiving increasing attention from the research community and from society as a whole. For example, in 2011, biomedical research in the United States received 100 billion dollars of investment, with approximately 65% supported by industry, 30% by the government, and the remaining 5% by charities, foundations, or individual donors [137]. Many researchers are still working hard in the expectation that further advances will support biomedical science and healthcare. This creates an inevitable need to understand and analyze the existing information and knowledge bases.
As a result, the field of biomedical research has grown rapidly, and the number of biomedical scientific publications is growing at an extremely high rate. Accessing and processing this data to keep abreast of the state of the art and to make discoveries in biomedical and healthcare research is essential for several types of users, including biomedical researchers, clinicians, database curators, and bibliometricians [77]. More than 3000 articles are published in biomedical journals every day [64].
MEDLINE®, a biomedical database of the US National Library of Medicine, is one of the most prominent and largest biomedical digital repositories. As of 2019, it contained more than 26 million citations, with a fast-increasing number of articles in the life sciences and a concentration on biomedicine. Figure 1 illustrates the growth of MEDLINE from ~1 million citations in 1970 to ~26 million citations in 2019. More impressively, this number nearly doubled in 14 years, from 2005 (~13.5 million) to 2019 (~26.2 million).
Figure 1: Growth of MEDLINE citations from 1986 to 2019.
The vertical axis shows the number of citations (in millions). For clearer visualization, the statistics before 2005 are presented every 5 years.
PubMed® (https://www.ncbi.nlm.nih.gov/pubmed) is a free resource developed and maintained by the NCBI which provides free access to MEDLINE and some other databases. According to the statistics reported in November 2019 (https://www.nlm.nih.gov/bsd/medline_pubmed_production_stats.html), the cumulative total of PubMed citations has surpassed 30 million. However, even once results are returned from PubMed, the difficulty of processing this literature is ever-increasing. It comes from the fast-growing volume of biomedical literature, the scope of its topical coverage, its interdisciplinary nature, and its unstructured form. For example, when searching for 'Influenza' in PubMed, we obtained 105,066 articles. The rapid growth in the volume and variety of biomedical scientific literature makes it an exemplary case of Big Data [169]. It is an unprecedented opportunity to explore biomedical science and an enormous challenge when facing a massive amount of unstructured and semi-structured data.
Recent research progress in biomedicine needs to be supported by methodologies capable of assisting human experts in formulating hypotheses. Biomedical natural language processing (BioNLP) is a sub-field of natural language processing (NLP) that seeks to help scientists understand the wealth of data from results that are hidden in large-scale scientific text collections. BioNLP does this through the analysis, understanding, and production of structured data from unstructured free text in large-scale text collections. BioNLP now has a wide range of applications in biomedical literature mining and has attracted significant investment from research communities worldwide, reflecting its central role in many areas of biomedical research and healthcare science. As a result, the market for biomedical text data analysis and BioNLP is growing rapidly. In particular, the NLP in healthcare and life sciences market in the United States is estimated to grow from USD 1030.2 million in 2016 to USD 2650.2 million by 2021 (https://www.marketsandmarkets.com/Market-Reports/healthcare-lifesciences-nlp-market-131821021.html).
Relation extraction (RE) is a vital intermediate step in a variety of BioNLP applications. Its contributions range from precision medicine [6], adverse drug reaction identification [30, 53], drug abuse event extraction [71], and major life event extraction [19, 106] to question answering systems [31, 120] and clinical decision support systems [159].
Figure 2: Challenges' subtasks/tracks organized based on NLP perspectives [64].
In general, NLP tasks closer to the top of the pyramid are more difficult.
Because of these motivations, several challenge evaluations have been organized to assess and advance BioNLP research. These challenge evaluations often attract many scientists around the world to participate and publish their latest research on biomedical analysis. Huang and Lu (2015) [64] categorized the prevalent challenges by the targeted problems in NLP research, as in Figure 2. We observe that BioNLP shared tasks pay much of their attention to information extraction, including relation extraction/classification and named entity recognition, which are listed in the middle two parts of the pyramid. Some examples of well-known shared tasks include BioNLP, BioCreative, i2b2, ShARe/CLEF eHealth, and SemEval.
A number of doctoral dissertations across the world have worked on topics related to relation extraction (more detailed information is given in Section 1.2.3). Some of them focused on a specific type of relation, for example disease-gene relations [66] and drug-drug relations [172]. The data types they targeted are also very diverse (e.g., scientific literature [100] and electronic health records [96]). Many machine learning methods have been proposed for relation extraction: supervised feature-based machine learning [9], semi-supervised learning [172], deep learning [96], etc.
In this Dissertation, we consider Relation Extraction as two text mining sub-tasks, i.e., Named Entity Recognition (NER) and Relation Classification (RC). The task of biomedical named entity recognition (NER) seeks to locate named entities in free-form biomedical text and classify them into a set of pre-defined categories/types such as gene/protein, phenotype, disease, and chemical, or 'none-of-the-above'. The NER problem consists of three sub-problems: (i) defining the entity boundary, (ii) assigning the delimited entity to a pre-defined class, and (iii) named entity normalization, i.e., matching the extracted entities to a concept in the knowledge base. Of these, named entity normalization is often treated as an independent problem. The popular methods used for biomedical NER include dictionary-based methods, rule-based methods, classification-based methods, sequence labeling methods, and hybrid methods that combine other techniques [138, 167]. Relation classification (RC) is the task of discovering semantic connections between biomedical entities. Common biomedical relations include drug-drug interaction [164], chemical-disease relation [180], protein-protein interaction [83], and many others. The most typical methods for relation classification are co-occurrence approaches, rule-based methods, several machine learning methods, and hybrid methods [5, 142, 167].
In line with the worldwide research trend, this dissertation differs from other research in several aspects: (i) We try to solve both NER and RC of RE as two separate tasks. Most other works focus on only one task, NER or RC. Some research addressed RE as RC, with NER solved in a previous phase as a pre-processing step. (ii) We focus on scientific literature abstracts and capitalize on their characteristics, rather than treating them as ordinary documents. (iii) The dissertation researches and applies a variety of machine learning methods, including supervised feature-based machine learning, unsupervised machine learning, distant learning, and deep learning. (iv) The dissertation does not focus entirely on a specific type of relationship. CID is just a typical relationship used to facilitate the comparison of results. Many experiments were conducted for other relation types, and all have positive results.
Research challenges:
The biomedical research community pays much attention to developing dedicated data and resources. It has recently been acknowledged that biomedicine is the field with the most abundant publicly available resources and tools. However, the specific characteristics of biomedical data still bring many challenges for the research communities [2, 167]:
- Firstly, biomedical NLP still faces many existing NLP problems, i.e., problems that exist not only in the biomedical domain but also in the general field of NLP. We list here some widespread problems: the imbalanced data problem, special linguistic units such as negation and conjunction, and directed relations.
- Secondly, information extraction in the biomedical domain often suffers from errors caused by the relatively low performance of pre-processing steps. Because biomedical texts are highly specialized, generic data analysis and NLP tools are not appropriate.
- Thirdly, biomedical terms have their own diversity and characteristics, such as the lack of nomenclatures and the extreme use of unknown words, which make them highly variable and ambiguous compared to other domains.
- The fourth problem comes from ambiguity and inconsistency, i.e., NEs with the same orthographic features may fall into different categories.
- Finally, biomedicine is an interdisciplinary field. The complexity of the biological domain and the growing capability of biomedical research rely increasingly on the development of methods and concepts that cross these disciplinary boundaries.
Research objectives and methodology:
Motivated by the above necessities and challenges, the Dissertation aims at the following research objectives:
- [RO1] Appropriately represent the biomedical literature text to make the best use of linguistic, syntactic, and semantic information.
- [RO2] Take advantage of state-of-the-art methods and resources to propose combination architectures, and then improve them to resolve NER and RC problems with good results.
To reach these research aims, we focus on addressing the following main research question: How to build an effective machine learning-based architecture for NER and RC systems? It includes two sub-questions to supplement the main research question:
- [Sub-question sQ1] How to convert the biomedical literature text, annotated with named entity and relation labels, into a rich representation containing useful information that can be processed by machine learning models?
This research question is addressed throughout the Dissertation; for example, we represent the relations by using engineered features in Chapter 2, embeddings and the shortest dependency path in Chapter 4, and a graph in Chapter 5.
- [Sub-question sQ2] How to apply, combine, and improve advanced machine learning methods for building NER and RC systems?
This research question is addressed in Chapters 2, 3, 4 and 5.
The research methodology of the Dissertation is a combination of qualitative research and quantitative research:
* Qualitative research includes: (i) analyzing the ideas, proposed methods and techniques of related works; (ii) detecting the problems, advantages and disadvantages of these methods; (iii) improving, combining and proposing new solutions and models to resolve problems.
* Quantitative research includes: (i) analyzing the available corpora, (ii) deploying experiments, (iii) verifying the performance of the proposed methods and models, and (iv) publishing scientific reports to receive verification from the research community.
Overview of our approach:
The Dissertation participates in the research trend of BioNLP in general and biomedical relation extraction in particular. Our focus is on improving the methods, exploiting information-rich data representations, and building a capable architecture for biomedical named entity recognition and relation classification, rather than on developing new machine learning algorithms.
We argue that achieving better performance in biomedical relation extraction tasks depends on improvements in machine learning and data representation. We first build an end-to-end model for named entity recognition and relation classification. This model is mostly based on several supervised feature-based learning techniques. BioNLP, like its parent field NLP, has been through a step-change in the last five years with a move from machine learning based on expert features to deep learning techniques that learn feature representations for themselves. Following this research trend, we then propose several deep architectures for improving named entity recognition and relation classification.
The main contribution of the Dissertation:
The Dissertation has three main contributions:
- Researching, improving, and proposing several data representations to make use of linguistic, syntactic, and semantic information. This contribution is reflected in the proposal of a rich feature set in Chapter 2, a combination of several information types in Chapter 3 and Chapter 4, as well as a graph-based representation in Chapter 5.
- Studying and constructing several machine learning architectures to solve NER and RC problems, based on combining and improving advanced machine learning methods from multiple perspectives: (i) The UET-CAM system in Chapter 2 is a joint model of a NER-NEN system and rich feature-based machine learning with distant supervision learning for RC. (ii) The D3NER system in Chapter 3 combines several types of information in a deep learning model. (iii) The MASS and RbSP models in Chapter 4 are deep learning-based models with several improvements, including an attention mechanism. (iv) The multi-fragment ensemble model is also proposed in Chapter 4. (v) Finally, Chapter 5 focuses on inter-sentence relation extraction with a novel graph-based model. Most of the applied methods and techniques are carefully analyzed to evaluate their contribution to system performance.
- Contributing to the research community by creating a silver-standard dataset called 'silverCID' for distant supervision learning. This dataset is used in Chapter 2 and Chapter 5 and is shown to have a positive effect on system performance.
Scope of the Dissertation:
The Dissertation focuses on solving the relation extraction problem in English biomedical literature text by applying natural language processing (NLP) techniques. Within it, two sub-problems (i.e., named entity recognition and relation classification) are solved separately by applying several advanced machine learning methods in an appropriate architecture.
The biomedical named entity recognition problem is treated as a sequence labelling problem. Note that the nested entity problem is excluded, i.e., we do not consider cases in which named entities contain other named entities or several entities intersect. In a part of the Dissertation, named entity recognition is processed simultaneously with the named entity normalization phase to increase performance. The dissertation experiments worked on three fundamental biomedical entities, i.e., Chemical, Disease, and Protein/Gene. They are three of the entities most frequently requested by PubMed users worldwide [68] and are annotated in many well-known biomedical knowledge bases (Medical Subject Headings (MESH), Unified Medical Language System (UMLS), Systematized Nomenclature of Medicine (SNOMED), and many others).
In this Dissertation, we delineate the scope of the study of the biomedical relation classification problem according to the following characteristics:
• Only binary biomedical relations are extracted. We aim to address n-ary relations as further extensions of our model in future work.
• We focus on both intra- and inter-sentence relations.
• Both directed and undirected relations are considered in the research scope.
• Depending on the corpus that the relation classification system works on, it can be a binary classification or a multi-label classification problem.
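The characteristics listed above can be made concrete with a small data structure. The following is a minimal sketch, not the Dissertation's implementation: the class `RelationInstance`, the helpers `relation_key` and `is_intra_sentence`, and the entity names are all hypothetical, chosen only to illustrate how a binary relation might record its direction and the sentences of its two arguments.

```python
from typing import NamedTuple, Tuple


class RelationInstance(NamedTuple):
    """A binary relation between two entity mentions (hypothetical layout)."""
    arg1: str        # first entity, e.g. an identifier or surface form
    arg2: str        # second entity
    label: str       # relation type, e.g. "Promotes" or "Associated"
    directed: bool   # True if swapping the arguments changes the meaning
    sent1: int       # index of the sentence containing arg1
    sent2: int       # index of the sentence containing arg2


def relation_key(r: RelationInstance) -> Tuple[str, Tuple[str, ...]]:
    """Key for comparing relations: argument order is ignored
    only when the relation is undirected."""
    if r.directed:
        args = (r.arg1, r.arg2)
    else:
        args = tuple(sorted((r.arg1, r.arg2)))
    return (r.label, args)


def is_intra_sentence(r: RelationInstance) -> bool:
    """A relation is intra-sentence when both arguments share a sentence."""
    return r.sent1 == r.sent2


# Illustrative instances with made-up entity names.
a = RelationInstance("gene_A", "phenotype_B", "Associated", False, 0, 0)
b = RelationInstance("phenotype_B", "gene_A", "Associated", False, 0, 2)
c = RelationInstance("gene_A", "phenotype_B", "Promotes", True, 1, 1)
d = RelationInstance("phenotype_B", "gene_A", "Promotes", True, 1, 1)

same_undirected = relation_key(a) == relation_key(b)
same_directed = relation_key(c) == relation_key(d)
```

Under these assumptions, the two undirected instances compare as equal even though their arguments are swapped, while the two directed instances do not; instance `b` is an inter-sentence relation because its arguments lie in different sentences.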
In experiments, we mostly focus on the chemical-induced disease relation (also known as the adverse drug reaction and side effect). This relation attracts much attention from the research community as well as from industry. It is annotated in many biomedical ontologies, e.g., SNOMED, Orthology Ontology (OWL), Human Health Exposure Analysis Resource (HHEAR), Human-Aware Science Ontology (HAScO), National Cancer Institute Thesaurus (NCIT), Radiology Gamuts Ontology (RGO), and Comparative Toxicogenomics Database (CTD, http://ctdbase.org/). Various other relations are also considered in some experiments for further comparison. Examples include the drug-drug interaction (which includes mechanism, effect, advice and int), the locations (biotopes and geographical places) of bacteria, and many others.
The BioCreative V CDR corpus was selected as the benchmark dataset for experimentation throughout the Dissertation. In addition, depending on the desired direction of verification, some other datasets were selected, including the DDI corpus, BB3 corpus, and Phenebank corpus.
The dissertation outline:
The Dissertation outline is illustrated in Figure 3. It contains the Preface, five Chapters and the Conclusion. The related publications are marked to their corresponding Chapter.
Chapter 1: INTRODUCTION TO BIOMEDICAL RELATION EXTRACTION provides an introduction to important concepts relevant throughout this work. The main focuses of this chapter are the problem statement, literature review, related resources and the evaluation method.
Chapter 2: AN END-TO-END PIPELINE MODEL FOR BIOMEDICAL RELATION EXTRACTION describes the architecture of our UET-CAM system that participated in the BioCreative V CDR track. It is an end-to-end architecture for chemical-induced disease relation extraction that consists of several advanced feature-based machine learning components.
Chapter 3: AN IMPROVED CRF-BILSTM MODEL FOR BIOMEDICAL NAMED ENTITY RECOGNITION improves biomedical named entity recognition by proposing a deep learning model with several embedding sources. In addition to chemical and disease entities, gene/protein entities are also considered in this chapter's experiments.
Chapter 4: HYBRID, ATTENTION-BASED AND ENSEMBLE DEEP LEARNING MODELS FOR BIOMEDICAL RELATION CLASSIFICATION proposes several deep architectures for biomedical relation classification. Several corpora with various relation types are also used to demonstrate the flexibility and adaptability of the proposed model. The attention technique and the ensemble approach are also applied to propose a novel deep architecture with promising results.
Chapter 5: GRAPH-BASED INTER-SENTENCE RELATION CLASSIFICATION IN BIOMEDICAL TEXT presents our approach to inter-sentence relation classification. To exploit the graph-based representation effectively, we develop a novel shared-weight deep learning model.
Lastly, in the Conclusion, we summarize the dissertation's main contributions and limitations, and end with an outlook on future work.
[Diagram of the chapters and their related publications. Publication labels legible in the diagram include [LHQ1] (Oxford Database, 2016), [LHQ2] (Bioinformatics, 2018), [LHQ8] (BioNLP-NAACL, 2021) and [LHQ9] (KSE, 2021), together with [LHQ5], [LHQ6] and [LHQ7].]
Figure 3: The dissertation outline.
The related publications are listed in their corresponding Chapter.
1.1.1 Semantic relation extraction
First of all, we present the definition of semantic relations in Definition 1.1.
Definition 1.1 Semantic relations (or semantic relationships) are the associations that exist between the meanings of linguistic components (e.g., semantic relations at word level, entity level, phrase level or sentence level, etc.).
Semantic relation extraction (see Definition 1.2) is useful in many fact extraction applications, ranging from question answering [31, 120] to identifying adverse drug reactions [53].
Definition 1.2 Relation Extraction (RE) is the task of detecting and characterizing the semantic relations between pairs of named entity mentions in the text [2]. Receiving a (set of) document(s) as input, the relation extraction system aims to extract all pre-defined relationships mentioned in the document(s) by identifying the corresponding entities and determining the type of relationship between each pair of entities.
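To make Definition 1.2 concrete, the sketch below pairs recognized entities and asks a classifier for the relation type of each pair. It is only an illustration under stated assumptions: `Entity`, `Relation`, `extract_relations` and `toy_classifier` are hypothetical names, and the same-sentence rule stands in for a trained relation classifier; it is not the method proposed in the Dissertation.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List


@dataclass(frozen=True)
class Entity:
    text: str       # surface form of the mention
    etype: str      # entity type, e.g. "Chemical" or "Disease"
    sentence: int   # index of the sentence containing the mention


@dataclass(frozen=True)
class Relation:
    head: Entity
    tail: Entity
    rtype: str


def extract_relations(
    entities: List[Entity],
    classify: Callable[[Entity, Entity], str],
    head_type: str = "Chemical",
    tail_type: str = "Disease",
) -> List[Relation]:
    """Enumerate candidate (head, tail) pairs and keep those that the
    classifier does not label as 'none'."""
    heads = [e for e in entities if e.etype == head_type]
    tails = [e for e in entities if e.etype == tail_type]
    relations = []
    for head, tail in product(heads, tails):
        rtype = classify(head, tail)
        if rtype != "none":
            relations.append(Relation(head, tail, rtype))
    return relations


def toy_classifier(head: Entity, tail: Entity) -> str:
    """Placeholder rule (an assumption): predict 'CID' only when both
    mentions occur in the same sentence."""
    return "CID" if head.sentence == tail.sentence else "none"


# Entities as they might come out of the NER step (made-up example).
entities = [
    Entity("aspirin", "Chemical", 0),
    Entity("gastric ulcer", "Disease", 0),
    Entity("headache", "Disease", 1),
]
relations = extract_relations(entities, toy_classifier)
```

With this toy rule, only the aspirin and gastric ulcer pair is kept, because the other candidate pair spans two sentences and is labelled 'none'.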
In this Dissertation, we focus on two sub-tasks of Relation Extraction: Named Entity Recognition (NER) and Relation Classification (RC). The former, named entity recognition (NER, entity tagging), is an intermediate step for relation extraction. It refers to locating and classifying named entities in text into predefined categories. Within the scope of the Dissertation, NER is the problem of finding biomedical entity mentions such as diseases, chemicals, genes, proteins, or organisms in natural language biomedical literature text, then tagging them with their location and type. The latter, relation classification (RC), follows NER to find the semantic relations between the corresponding entities [2]. Biomedical relation classification often tries to classify the relationship between pairs of biomedical entities into relations such as drug-drug interaction, chemical-induced disease, or bacteria live-in location, or to tag them as 'none' if no relationship can be found between them. We describe these two sub-problems in detail in Sections 1.1.2 and 1.1.3 below.
1.1.2 Biomedical named entity recognition
We give the definition of a named entity in Definition 1.3.
Definition 1.3 A named entity (NE) (also called entity mention) is a continuous sequence of words that designates some real-world entity [2].
The automated recognition of named entities in text has been a highly active area for over two decades and is referred to variously as ‘terminology extraction’, ‘term recognition’, ‘entity identification’, ‘entity chunking’, ‘entity extraction’ and ‘named entity recognition’. In this dissertation, we use the term ‘named entity recognition’. The task of named entity recognition (NER) seeks to locate NEs in free-form text and classify them into a set of predefined categories/types such as person, organization, location, expressions of times, quantities, monetary values, percentages or ‘none-of-the-above’. In other words, NER is the problem of finding the mentions of entities in natural language text and labelling them with their location and type. Often this task cannot be accomplished simply by string matching against pre-compiled gazetteers, because named entities of a given entity type usually do not form a closed set and therefore any gazetteer would be incomplete. Another reason is that the type of a named entity can be context-dependent [2]. For example, ‘Ho Chi Minh’ may refer to the person who was the Vietnamese Communist revolutionary leader, to the location ‘Ho Chi Minh city’, to ‘Ho Chi Minh museum’, or to any other entity sharing the same part ‘Ho Chi Minh’. To determine the entity type for this text span occurring in a particular document, its context has to be considered.
Named entity recognition is typically modeled as a sequence labeling problem. We treat each word in a sentence as an observation and the sentence as a sequence of observations, and try to assign a label to each observation of this sequence. It is defined formally in Definition 1.4.
Definition 1.4 Given a sequence of input tokens $X = (x_1, \ldots, x_n)$ and a set of labels $L$, the named entity recognition (NER) task determines a sequence of labels $Y = (y_1, \ldots, y_n)$ such that $y_i \in L$ for $1 \le i \le n$ [88].
While one may apply standard classification to predict the label $y_i$ based solely on $x_i$, in sequence labelling it is assumed that the label $y_i$ depends not only on its corresponding observation $x_i$ but also possibly on other observations and other labels in the sequence. Typically this dependency is limited to observations and labels within a close neighbourhood of the current position $i$.
The label in NER often incorporates two concepts: the type of the entity (e.g., whether the mention refers to a person, location, chemical or a disease), and the position of the token within the entity. Hence, the label set should follow a formal tagging scheme (also called ‘label model’ or ‘tagging format’), which is a format for tagging tokens in a chunking task in computational linguistics, such as NER. The simplest model for the token position is the IO model, which indicates whether the token is inside (I) or outside (O) of an entity mention. While simple, this model cannot differentiate between a single mention containing several words and distinct mentions comprising consecutive terms. IOB is the well-known tagging scheme that overcomes this limitation of the IO scheme. Unlike the IO scheme, in the IOB scheme a token is tagged as B if it marks the beginning of an entity. This model is capable of differentiating between consecutive entities and has excellent support in the literature. A more complex model in common use is IOBES (or IOBEW), an extended variant of the IOB tagging scheme. In addition to I, O and B, IOBES uses E for Ending and S for Singleton (a one-word entity). While the IOBES scheme does not provide higher expressive power than the IOB model, it was shown to improve labelling models’ performance marginally [157] and has been used in several NER studies [87, 181]. Example sentences annotated using each label scheme can be found in Table 1.1.
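The three tagging schemes described above can be illustrated with a minimal Python sketch. The function name `tag_tokens` and the span representation (start index, exclusive end index, type) are assumptions made for this illustration only; they are not taken from the systems proposed in this Dissertation.

```python
from typing import List, Tuple

# Each entity is (start_token_index, end_token_index_exclusive, entity_type).
Entity = Tuple[int, int, str]


def tag_tokens(n_tokens: int, entities: List[Entity], scheme: str) -> List[str]:
    """Label n_tokens positions under the IO, IOB or IOBES scheme."""
    if scheme not in {"IO", "IOB", "IOBES"}:
        raise ValueError(f"unknown scheme: {scheme}")
    tags = ["O"] * n_tokens
    for start, end, etype in entities:
        for i in range(start, end):
            if scheme == "IO":
                prefix = "I"
            elif scheme == "IOB":
                prefix = "B" if i == start else "I"
            else:  # IOBES
                if end - start == 1:
                    prefix = "S"   # singleton: a one-token entity
                elif i == start:
                    prefix = "B"   # first token of a multi-token entity
                elif i == end - 1:
                    prefix = "E"   # last token of a multi-token entity
                else:
                    prefix = "I"
            tags[i] = f"{prefix}-{etype}"
    return tags


tokens = ["treatments", "for", "Crohn", "'s", "disease", "in", "patients"]
entities = [(2, 5, "DISEASE"), (6, 7, "SPECIES")]
for scheme in ("IO", "IOB", "IOBES"):
    print(scheme, list(zip(tokens, tag_tokens(len(tokens), entities, scheme))))
```

Under IO, two adjacent entities of the same type would receive identical I tags and merge into one span, which is the ambiguity that the B tag of IOB removes.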
In reality, entity mentions can appear in various forms, including names, pronouns (e.g., ‘he’, ‘her’, ‘who’), and nominals (i.e., nouns, noun phrases, etc.). In many domains such as newswire and literature, NEs were often defined as proper names and quantities of interest. The most commonly studied named entity types are person, organization and location, which were first defined by the sixth in a series of Message Understanding Conferences (MUC-6) [48]. These types are general enough to be useful for many application domains. Extraction of expressions of dates, times, monetary values and percentages, which was also introduced by MUC-6, is often also studied under NER, although strictly speaking these expressions are not named entities. Besides these general entity types, other types of entities are usually defined for specific domains and applications. The NE and NER in the biomedical domain are described below.
Biomedical named entities are phrases or combinations of phrases that denote important concepts in biomedicine. They can be chemicals, diseases, anatomies, pathways, genes/proteins, etc. that are named in biomedical literature, which has been growing at an unprecedented speed. Automatically extracting them, a task known as biomedical named entity recognition, involves the demarcation of entity names of a specific semantic type, e.g., proteins. It results in annotations corresponding to a name’s in-text locations as well as the predefined semantic category it has been assigned to [158]. Over the last fourteen years, there has been considerable interest in this problem, with a variety of generic and entity-specific algorithms applied to extract the names of biomedical concepts. Recent NER research in the biomedical domain has primarily focused on the entities most frequently requested by PubMed users worldwide, including Disorder (Disease, Symptom, Phenotype), Gene/Protein, Chemical/Drug, Biological Process, Medical Procedure, Living Being, Research Procedure, Cell Component, Body Part, Device or Tissue [68]. Still, there are few proposed solutions for other entities such as phenotypes and anatomy [24].
Because of the need for the development of new treatments for Crohn’s disease, a pilot study was undertaken to estimate the pharmacodynamics and tolerability of fusidic acid treatment in chronic active, therapy-resistant patients. (PMID: 1420741)
Figure 1.1: An example taken from the BC5 CDR corpus with recognized names of Disease, Chemical and Species.
Table 1.1: Example sentences labeled using different tagging schemes
Tagging scheme    Example
IO       new|O treatments|O for|O Crohn|I-DISEASE ’s|I-DISEASE disease|I-DISEASE, of|O fusidic|I-CHEMICAL acid|I-CHEMICAL treatment|O in|O chronic|O active|O, therapy|O -|O resistant|O patients|I-SPECIES
IOB      new|O treatments|O for|O Crohn|B-DISEASE ’s|I-DISEASE disease|I-DISEASE, of|O fusidic|B-CHEMICAL acid|I-CHEMICAL treatment|O in|O chronic|O active|O, therapy|O -|O resistant|O patients|B-SPECIES
IOBES    new|O treatments|O for|O Crohn|B-DISEASE ’s|I-DISEASE disease|E-DISEASE, of|O fusidic|B-CHEMICAL acid|E-CHEMICAL treatment|O in|O chronic|O active|O, therapy|O -|O resistant|O patients|S-SPECIES
Examples are taken from the BC5 CDR corpus.
Figure 1.1 shows an example of biomedical named entities in text, chosen from the BioCreative V Chemical-Disease Relation corpus [105]. In this sentence, all disease, chemical, and species names have been demarcated. Table 1.1 compares three tagging schemes for annotating this example.
1.1.3 Biomedical relation classification
Relation classification (RC) typically follows NER in the relation extraction system. Culotta et al. (2006) [29] define relation extraction as the task of discovering semantic connections between entities. In text processing, it usually amounts to examining pairs of entities in a document and determining (from local language cues) whether a relationship exists between them.
We take the pairwise approach for the task of relation classification. That is, after NER, we consider all pairs of recognized named entities as potential candidates and give them as input to the relation classification system. The relation classification system then classifies these candidates, assigning each to a pre-defined relation type or to ‘None-of-above’ (i.e., the negative instances). In reality, there may be multi-label instances, i.e., more than one relationship may hold between an entity pair. In this Dissertation, we ignore these cases and only accept a single label for each instance. Generally, a semantic relationship can be defined among multiple entities (n-ary), but within the scope of this dissertation, we only consider binary relationships. Extracted binary relationships have the structure of a triple $\langle e_1, R, e_2 \rangle$, where $e_1$ and $e_2$ are named entities (or noun phrases) in a sentence (or abstract) from which the relationship is being extracted, and $R$ is a relation type that connects the two corresponding entities. Treating the task as a classification problem, we give the formal definition of relation classification in Definition 1.5.
Definition 1.5 The relation classification task is defined by a real-valued function $f_R$ that decides whether the corresponding entities are in a relation or not:
$$f_R(T(d)) = \begin{cases} +1 & \text{if } e_1 \text{ and } e_2 \text{ are related according to relation } R, \\ -1 & \text{otherwise,} \end{cases}$$
where:
$e_1$ and $e_2$ are two entities that form a candidate for relation classification.
$d$ is a document which includes the corresponding entities $e_1$ and $e_2$; $d$ can be a sentence, a paragraph or a whole document, depending on the scope of relationships.
$T(d)$ is the information that is extracted from $d$.
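The pairwise approach and the ±1 decision function can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `candidate_pairs` and `make_toy_f_R` are hypothetical helper names, and the toy decision function is a plain lookup that stands in for a trained classifier.

```python
from typing import Callable, List, Set, Tuple

Entity = Tuple[str, str]  # (mention text, entity type)


def candidate_pairs(entities: List[Entity],
                    first_type: str,
                    second_type: str) -> List[Tuple[Entity, Entity]]:
    """Pair every recognized entity of first_type with every one of second_type."""
    firsts = [e for e in entities if e[1] == first_type]
    seconds = [e for e in entities if e[1] == second_type]
    return [(a, b) for a in firsts for b in seconds]


def make_toy_f_R(known: Set[Tuple[str, str]]) -> Callable[[Entity, Entity], int]:
    """Toy stand-in for a trained classifier: +1 if the pair is listed, else -1."""
    def f_R(e1: Entity, e2: Entity) -> int:
        return 1 if (e1[0], e2[0]) in known else -1
    return f_R


# Entities as they might come out of NER for one abstract (illustrative only).
entities = [("fusidic acid", "Chemical"),
            ("nausea", "Disease"),
            ("Crohn's disease", "Disease")]
pairs = candidate_pairs(entities, "Chemical", "Disease")
f_R = make_toy_f_R({("fusidic acid", "nausea")})

# Keep the triples <e1, R, e2> for which the decision function returns +1.
triples = [(e1[0], "CID", e2[0]) for e1, e2 in pairs if f_R(e1, e2) == 1]
print(len(pairs), triples)
```

Every candidate that is not accepted plays the role of a ‘None-of-above’ instance, which is why negative instances usually outnumber positive ones in this setting.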
Many aspects should be considered in a relation classification system, and they often differ between types of entities.
- There may be several relation types or only one relation type in a corpus. For example, the BC5 CDR [105] and BioNLP-ST 2016 BB3 [33] corpora were annotated with only one relation type, whilst the Phenebank and SemEval-2013 DDI-2013 [61] corpora have several relation types.
- Several relations are directed and order-sensitive, such as the Mechanism relation in the DDI corpus [60] and the Inheres-in relation in the Phenebank corpus. Such relations require the model to predict both the relation type and the entity order correctly. In contrast, for undirected relations, such as Associated in Phenebank, both directions are accepted. A further case is the Chemical-Induced Disease relation in BC5 CDR [105], whose direction is fixed: it always goes from a chemical to a disease.
- A relation can be an intra-sentence relation (i.e., the two corresponding entities appear in the same sentence) or an inter-sentence relation (i.e., the two corresponding entities may appear in different sentences).
Biomedical relation classification concerns the detection of semantic relations between biomedical named entities or noun phrases. Recently, there has been considerable interest in biomedical relation extraction and relation classification with a variety of relationships. Common biomedical relations include drug-drug interaction [164], chemical-disease relation [180], protein-protein interaction [83] and many others. With a multitude of possible relation types, it is critical to understand how systems will behave in a variety of settings. In the biomedical domain, relation classification is useful in many fact extraction applications, ranging from identifying adverse drug reactions to major life events. It is also important in tasks such as Question Answering and Knowledge Acquisition.
We give some examples of biomedical relations in Table 1.2, Figure 1.2, Figure 1.3, and Figure 1.4. Table 1.2 presents two examples among a multitude of possible relation types in the biomedical domain. Sentence (i) shows an example of the Synonym-of relation, which is represented by an abbreviation pattern. This is very different from the predicate relation Mechanism in (ii).
Table 1.2: Examples for different relation types.
(i) <e1>Three-dimensional digital subtraction angiographic</e1> (<e2>3D-DSA</e2>) images from diagnostic cerebral angiography were obtained
(ii) Dexamethasone: Steady-state trough concentrations of albendazole sulfoxide were about 56% higher when 8 mg <e1>dexamethasone</e1> was coadministered with each dose of <e2>albendazole</e2> (15 mg/kg/day) in eight neurocysticercosis patients.
Sentence (i) shows a Synonym-of relation, represented by an abbreviation pattern, which is very different from the predicate relation Mechanism in (ii).
Figure 1.2 includes examples from the BC5 CDR corpus [105] of an inter-sentence relation (i.e., the two corresponding entities belong to two separate sentences) and an intra-sentence relation (i.e., the two corresponding entities belong to the same sentence).
(a) Cross-sentence relation
Five of 8 patients (63%) improved during fusidic acid treatment: 3 at two weeks and 2 after four weeks. There were no serious clinical side effects, but dose reduction was required in two patients because of nausea. (PMID: 1420741)
(b) Intra-sentence relation
Eleven of the cocaine abusers and none of the controls had ECG evidence of significant myocardial injury defined as myocardial infarction, ischemia, and bundle branch block. (PMID: 1601297)
Figure 1.2: Examples of (a) inter-sentence relation and (b) intra-sentence relation.
Examples are taken from the BC5 CDR corpus with the recognized chemical-induced disease relation between a chemical (highlighted in bold) and a disease (highlighted in underlined bold).
Figure 1.3 illustrates the difference between relations with unspecific and specific locations. While a relation with a specific location carries the exact positions of the two corresponding entities, a relation with an unspecific location does not, i.e., all pairs of corresponding entity mentions should be considered positive instances. These examples come from the BC5 CDR and DDI corpora.
(a) Unspecific location
Case report: acute unintentional carbachol intoxication. INTRODUCTION: Intoxications with carbachol, a muscarinic cholinergic receptor agonist are rare. (...) The mode of action was said to be comparable to that of the synthetic compound 'carbamylcholin'; that is, carbachol. He bought 25 g of carbachol as pure substance in a pharmacy, and the father was administered 400 to 500 mg. Carbachol concentrations in serum and urine on day 1 and 2 of hospital admission were analysed by HPLC-mass spectrometry. RESULTS: Minutes after oral administration, the patient developed nausea, sweating and hypotension, and finally collapsed. Bradycardia, cholinergic symptoms and asystole occurred. Initial cardiopulmonary resuscitation and immediate treatment with adrenaline (epinephrine), atropine and furosemide was successful. (...)
(b) Specific location
Concurrent administration of a TNF antagonist with ORENCIA has been associated with an increased risk of serious infections and no significant additional efficacy over use of the TNF antagonists alone. (DrugBank: Abatacept)
Figure 1.3: Examples of relations with specific and unspecific location.
(a) Unspecific location relation taken from the BC5 CDR corpus with recognized chemicals (highlighted in bold) and diseases (highlighted in underlined bold). The annotation points out that there are chemical-induced disease relations between the chemical carbachol and the diseases, but does not give the specific location of the corresponding entities. (b) Specific location relation taken from the DDI corpus. The annotation specifies the Effect relation between two drugs (highlighted in bold) at their specific locations.
Figure 1.4 shows examples extracted from the Phenebank corpus. It includes examples of directed and undirected relations. In a directed relation, the order of the entities in the relation annotation must be considered, whereas in an undirected relation the two entities have the same role.
(a) Directed relation
Some patients carrying mutations in either the ATP6V0A4 or the ATP6V1B1 gene also suffer from hearing impairment of variable degree. (PMC3491836)
Directed relations:
[ATP6V0A4] PROMOTES [hearing impairment]
[ATP6V1B1] PROMOTES [hearing impairment]
(b) Undirected relation
Finally, new insight into related musculoskeletal complications (such as myopathy and tendinopathy) has also been gained. (...) (PMC4432922)
Undirected relations:
[musculoskeletal complications] and [myopathy] are ASSOCIATED
[musculoskeletal complications] and [tendinopathy] are ASSOCIATED
Figure 1.4: Examples of (a) Promotes, a directed relation, and (b) Associated, an undirected relation, taken from the Phenebank corpus.
(Entities are highlighted in bold.)
1.2 Literature review
1.2.1 Literature review of biomedical named entity recognition
Over the last fourteen years, there has been considerable interest in the biomedical NER problem, with a variety of generic and entity-specific algorithms applied to extract many kinds of biomedical named entities such as genes, gene products, cells, chemical compounds and diseases. Figure 1.5 gives an overview of NER approaches. Some other developmental branches of machine learning methods, such as transfer learning, lifelong learning, and reinforcement learning, are not within the scope of this Dissertation. The specific methods that we used to construct the proposed model are highlighted. In general, similar to NER in the newswire domain, approaches to biomedical NER can be categorized as knowledge-based methods and machine learning-based methods. We also discuss hybrid approaches (combining several methods into one architecture) and joint modeling (a research trend that tries to integrate and handle different tasks as a single task).
Knowledge-based approaches:
The earliest and most straightforward solutions to biomedical NER relied on dictionary-based approaches. They rely on existing biomedical resources containing a comprehensive list of terms and determine whether expressions in the text match any of the biomedical terms in the provided list [158]. Many knowledge bases are used in these approaches; examples include MeSH, UMLS and SNOMED. This method is still used in several studies, such as Eftimov et al. (2017) [41], who proposed a rule-based named entity recognition method for knowledge extraction of evidence-based dietary recommendations.
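As a rough illustration of dictionary-based matching, the following Python sketch performs a greedy longest-match lookup of token sequences against a small lexicon. The lexicon entries, the example sentence and the function name `dictionary_match` are invented for this illustration; real systems draw on resources such as those named above and handle spelling variants, which this sketch does not.

```python
from typing import Dict, List, Tuple


def dictionary_match(tokens: List[str],
                     lexicon: Dict[Tuple[str, ...], str]) -> List[Tuple[int, int, str]]:
    """Greedy longest-match lookup; returns (start, end_exclusive, type) spans."""
    max_len = max((len(key) for key in lexicon), default=0)
    lowered = [t.lower() for t in tokens]
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so multi-word terms win.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + length])
            if key in lexicon:
                match = (i, i + length, lexicon[key])
                break
        if match:
            spans.append(match)
            i = match[1]
        else:
            i += 1
    return spans


# Toy lexicon: "emesis" is deliberately missing.
lexicon = {("fusidic", "acid"): "Chemical", ("nausea",): "Disease"}
tokens = "Fusidic acid treatment caused nausea and emesis".split()
print(dictionary_match(tokens, lexicon))
```

The unmatched term ‘emesis’ shows the coverage problem of such methods: any name absent from the list is silently missed.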
Rule-based methods try to craft patterns/rules manually to recognize NEs. In this approach, manually creating the rules for named entity recognition requires human expertise and is labour intensive. Examples of research in the biomedical field that follows this strategy include Hanisch et al. (2005) [58], who applied a staged rule-based system on the UMLS, HPO and MetaMap.
Knowledge-based methods often require human expertise, and creating such knowledge bases and patterns is labour intensive. Since there are millions of entity names in use, and new ones are added constantly, these methods will never be sufficiently comprehensive and cannot catch up with the growth rate of the biomedical literature.
[Figure: taxonomy tree of NER approaches. Recoverable node labels: Knowledge-based; Machine Learning (Hidden Markov Model, Support Vector Machine, Maximum Entropy + HMM); Deep Learning (Recurrent Neural Network, Long Short-term Memory).]
Figure 1.5: Named entity recognition approaches taxonomy.
The specific methods that we applied in the proposed model are highlighted.
Some other developmental branches of machine learning methods, such as transfer learning, lifelong learning, and reinforcement learning, are not within the scope of this Dissertation.
Feature-based supervised machine learning approaches:
Several recent works on biomedical NER use statistical supervised feature-based machine learning methods, which are often more robust in terms of system performance. Traditionally, to perform well and efficiently, NER models require a set of informative features (i.e., linguistic patterns) that are well-engineered and carefully selected, heuristically based on domain knowledge [17]. These methods utilize a large annotated corpus and the pre-defined feature set to infer optimal prediction functions by training the model, and then use it to predict the labels of new data. Supported by the availability of various annotated biomedical corpora, supervised machine learning methods have become popular, owing to the satisfactory performance they have demonstrated.
Perceptron [161] is a classic machine learning algorithm with many extended versions. Some recent studies successfully apply structured perceptrons to sequence labeling tasks, including NER [62, 126]. In this Dissertation, the perceptron is used for NER in the UET-CAM system (Chapter 2).
Conditional Random Fields (CRF) [86] is the most popular discriminative machine learning model offered as an alternative to the earlier models for sequence labelling, as it keeps the advantage of the Maximum Entropy Markov Model (MEMM) in exploiting non-independent contextual features of the entity without suffering from the label bias problem. The CRF-based models that have shown especially reliable performance in biomedical NER are the linear-chain CRF [45, 88, 90, 91] and the skip-chain CRF [110]. In this Dissertation, CRF is used in the labeling phase of the D3NER model (Chapter 3).
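To make the role of label dependencies concrete, the sketch below implements Viterbi decoding, the standard way to find the best label sequence in a linear-chain model once per-token (emission) and label-to-label (transition) scores are available. The scores here are hand-picked toy values, not learned CRF parameters, and the function name `viterbi` is only illustrative.

```python
from typing import Dict, List


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]],
            labels: List[str]) -> List[str]:
    """Return the label sequence with the highest total additive score.

    emissions[t][y] scores label y at position t; transitions[a][b] scores
    moving from label a to label b.
    """
    best = [{y: emissions[0][y] for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B", "I"]
transitions = {
    "O": {"O": 0.0, "B": 0.0, "I": -10.0},  # O -> I is invalid under IOB
    "B": {"O": 0.0, "B": -1.0, "I": 1.0},
    "I": {"O": 0.0, "B": -1.0, "I": 0.5},
}
emissions = [
    {"O": 1.0, "B": 0.0, "I": 0.0},  # "of"
    {"O": 0.6, "B": 0.5, "I": 0.0},  # "fusidic"
    {"O": 0.0, "B": 0.0, "I": 1.0},  # "acid"
]
independent = [max(labels, key=lambda y: e[y]) for e in emissions]
print(independent, viterbi(emissions, transitions, labels))
```

Choosing each label independently yields the invalid sequence O, O, I, whereas decoding with transition scores recovers O, B, I; this is the benefit of modelling neighbouring labels jointly.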
In addition to the structured perceptron and CRF, many other supervised machine learning methods and variants can be used for NER, such as the Hidden Markov Model (HMM) [24], semi-Markov model [90], MEMM [38], Support Vector Machines (SVM) [25], decision tree [136], transition-based model [118], and more.
Machine learning with feature engineering, however, is still time-consuming and very often yields incomplete, unsatisfactory feature sets. Moreover, the resulting feature sets are both domain-specific and model-specific.
Deep learning-based approaches:
In the past few years, the advent of deep neural networks, with their capability for automatic feature engineering even from noisy data, has advanced the development of NER models. Deep learning models aim to automatically induce robust representations of data through multiple hidden layers. They have produced state-of-the-art results in many NLP tasks, including biomedical NER. A variety of deep learning methods and architectures have been used in NLP in general and in biomedical NER in particular. Among them, the most typical deep neural networks (DNNs) are the Convolutional Neural Networks (CNNs), the Recurrent Neural Networks (RNNs) and their variants. All of them often require additional techniques to address the over-fitting problem and reduce the impact of the initialization.
Recurrent Neural Network (RNN) [162] performs effectively on sequential data, and many improved variants of it appear in state-of-the-art NLP systems, including NER [107, 184]. The RNN with Long Short-Term Memory (LSTM) units [63] is an advanced type of RNN that models dependencies between elements in a sequence through recurrent connections. Since the LSTM architecture can only process the input in one direction, the bidirectional LSTM (biLSTM) network improves on the LSTM by feeding the input to the LSTM network twice, in two directions: forward, from the beginning to the end of the sequence, and backward, from the end to the beginning of the sequence. This design allows for the detection of dependencies on both previous and subsequent words in a sequence. Recently, LSTM has increasingly been employed for biomedical NER, yielding state-of-the-art performance at the time of publication [55, 121, 122, 181]. Recognizing the potential of LSTM for the NER problem, we use LSTM in combination with CRF in the biomedical NER model in Chapter 3.
Convolutional Neural Network (CNN) [92] is good at capturing n-gram features in a flat structure and has also proved effective in NLP, including NER [28, 184].
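The idea of reading a sequence in both directions can be sketched without any deep learning library. The example below deliberately swaps the LSTM unit for a plain scalar tanh recurrence, so it illustrates only the bidirectional wiring, not the gating of an LSTM; the weights `w_in` and `w_rec` are arbitrary illustrative values rather than trained parameters.

```python
import math
from typing import List, Tuple


def run_rnn(inputs: List[float], w_in: float, w_rec: float) -> List[float]:
    """Scalar tanh recurrence: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional(inputs: List[float],
                  w_in: float = 1.0,
                  w_rec: float = 0.5) -> List[Tuple[float, float]]:
    """Pair each position's forward state with its backward state."""
    forward = run_rnn(inputs, w_in, w_rec)
    # Run over the reversed sequence, then restore the original order.
    backward = run_rnn(inputs[::-1], w_in, w_rec)[::-1]
    return list(zip(forward, backward))


# A signal at position 0 only: the forward pass carries it to later
# positions, while the backward pass has not seen it there yet.
states = bidirectional([1.0, 0.0, 0.0])
for position, (fwd, bwd) in enumerate(states):
    print(position, round(fwd, 3), round(bwd, 3))
```

At the last position the forward state still reflects the first input while the backward state is zero, so concatenating the two gives each token access to both its left and right context.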
One of the fundamental steps in a deep learning model is word representation, i.e., transforming each word into a representation vector in the first layer of the model. There are several approaches to creating a word representation, including randomly initialized embeddings, one-hot vectors, and character-level word embeddings, which represent a token’s meaning in terms of its morphological surface [55, 177]. The most common approach to converting a word into a vector is to look it up in an embeddings matrix (i.e., a lookup table) built from pre-trained word embeddings. Word embedding is a technique for representing a word by a low-dimensional continuous vector (embedding) that is pre-trained on an extremely large amount of unlabeled text. One set of pre-trained word vectors that has been widely used in biomedical named entity recognition is provided by Pyysalo et al. (2013) [151]. It is a pre-trained word embedding of 200 dimensions that was induced from PubMed and PMC texts (6 million distinct words) using the word2vec skip-gram model [130]. Another well-known pre-trained embedding is FastText [10], whose 300-dimensional vectors represent a word as the sum of its skip-gram vector and character n-gram vectors in order to incorporate sub-word information. FastText is provided for the general domain, but it also allows us to re-train the model on our biomedical data. Since these pre-trained word embedding models learn word representations from the usage of words, words that are used in similar ways (similar contexts) end up with similar representations, naturally capturing their meaning.
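The lookup-table step and the notion of ‘similar representations’ can be sketched as follows. The three-dimensional vectors are invented for illustration (real pre-trained embeddings have hundreds of dimensions), and the `<UNK>` entry for out-of-vocabulary words is an assumption of this sketch rather than a detail taken from the cited resources.

```python
import math
from typing import Dict, List

# Hypothetical 3-dimensional vectors; values are invented for illustration.
embeddings: Dict[str, List[float]] = {
    "aspirin":   [0.9, 0.1, 0.0],
    "ibuprofen": [0.8, 0.2, 0.1],
    "nausea":    [0.1, 0.9, 0.2],
    "<UNK>":     [0.0, 0.0, 0.0],  # fallback for out-of-vocabulary words
}


def lookup(tokens: List[str]) -> List[List[float]]:
    """Map each token to its vector, falling back to <UNK> when missing."""
    return [embeddings.get(t.lower(), embeddings["<UNK>"]) for t in tokens]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


aspirin, ibuprofen, nausea = lookup(["Aspirin", "ibuprofen", "nausea"])
print(round(cosine(aspirin, ibuprofen), 3), round(cosine(aspirin, nausea), 3))
```

In this toy table the two drug names lie closer to each other than either does to the symptom, mirroring how words used in similar contexts receive similar vectors.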
In recent years, the use of word embeddings in deep learning-based models has gradually been replaced by methods that have proved remarkably effective in NLP problems, including NER, namely ELMo (Embeddings from Language Models, 2018) [149] and BERT (Bidirectional Encoder Representations from Transformers, 2019) [35]. In the early stage of their release, both ELMo and BERT only provided pre-trained models for general-domain English text. Re-training them is resource-intensive: pre-training a BERT-base model on English Wikipedia (2.5 billion words) and BooksCorpus (0.8 billion words) on a TPUv2 takes about 54 hours; pre-training a BERT-base model on PubMed abstracts (4.5 billion words) and PMC full text (13.5 billion words) on eight NVIDIA V100 (32GB) GPUs takes 23 days [97]; moreover, fine-tuning a pre-trained BERT on a specific task often takes a few more hours on a GPU. In 2020, Lee et al. [97] introduced BioBERT, the first domain-specific BERT-based model pre-trained on biomedical corpora. We leave this research direction to future work.
Unsupervised and semi-supervised machine learning:
Several unsupervised and semi-supervised methods have been utilized to tackle the biomedical NER task. Unsupervised machine learning methods for biomedical NER are often based on phrase chunking and distributional semantics, in which entity recognition may leverage terminologies, shallow syntactic knowledge (noun phrase chunking), and corpus statistics (inverse document frequency and context vectors) [190]. These methods are outside the scope of this Dissertation.
Semi-supervised methods take advantage of both supervised and unsupervised approaches. They are applied in various manners such as self-training (bootstrapping) [174], co-training [52], transfer learning [179] and distant supervision learning [98]. We apply distant supervision learning in Chapter 2 and Chapter 4 to improve the performance of the proposed models.
Hybrid model and joint modeling:
Hybrid architectures are proposed to take advantage of several different methods by combining them into a single model. This approach integrates heuristics/rule/pattern-based methods, domain knowledge, and learning-based methods in various combinations. One state-of-the-art hybrid architecture that has been successfully applied to NER combines a deep learning network for data representation with a CRF for sequence labelling [55].
Following reports of the high performance of joint-inference models in other NLP tasks, several studies have tried to join NER with other NLP tasks to improve performance. Sometimes, after NER, we need to link the recognized entity to a concept (or data entry) in an ontology or database; this task is called named entity normalization (NEN). Traditionally, NER and NEN were treated as two separate tasks, in which NEN took the output of NER as its input in a pipeline manner. Several studies [89, 116] have pointed out the limitations of this pipeline approach, i.e., it causes cascading errors from NER to NEN and limits the ability of the NER system to directly exploit the lexical information fed back by the normalization. A joint model of NER and NEN is expected to overcome these disadvantages of the traditional pipeline model. Several works have tried to build such a NER-NEN joint model: for example, [118] proposed a transition-based model to jointly perform disease NER and NEN, and TaggerOne [89] is a joint model that uses a semi-Markov structured linear classifier with a rich feature approach for NER and supervised semantic indexing for NEN. We exploit this idea in the proposed model in Chapter 2 to join the NER and NEN modules in the decoding phase.
1.2.2 Literature review of biomedical relation extraction
We categorize approaches to biomedical relation classification as knowledge-based methods and machine learning methods, as illustrated in Figure 1.6, in which the specific methods that we used to construct the proposed model are highlighted. Note that there are some other developmental branches of machine learning methods, such as transfer learning, lifelong learning, and reinforcement learning, but they are not within the scope of this Dissertation.
Knowledge-based approaches:
The simplest approach for detecting potential relationships is based on co-occurrence statistics. Based on the hypothesis that two entities that are frequently mentioned together are likely to be somehow related, this method reveals biomedical relationships by counting their co-occurrences in the same sentences or entire abstracts [20].
More accurate alternatives for relation classification are based on manually crafted rules [78, 115] and patterns [79, 146]. These methods do not require any annotated data to train a system but typically have two disadvantages: (i) they rely on manually crafted rules/patterns, which are very expensive and time-consuming to build and often require domain experts’ knowledge; (ii) they are limited to extracting specific relation types. Since the co-occurrence methods often have low precision and rule/pattern-based methods are labour-intensive and do not generalize, machine learning approaches are currently one of the top choices for relation extraction.
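The co-occurrence hypothesis can be illustrated with a short counting sketch. The sentence records, the dictionary keys `chemicals` and `diseases`, and the threshold of two co-occurrences are all assumptions chosen for illustration; they do not reproduce the statistics used in the cited work.

```python
from collections import Counter
from itertools import product
from typing import Dict, List


def cooccurrence_counts(sentences: List[Dict[str, List[str]]]) -> Counter:
    """Count chemical-disease pairs mentioned in the same sentence."""
    counts: Counter = Counter()
    for sentence in sentences:
        # Use sets so repeated mentions in one sentence count once.
        for pair in product(set(sentence["chemicals"]), set(sentence["diseases"])):
            counts[pair] += 1
    return counts


# Entity mentions per sentence, as an NER step might provide them.
sentences = [
    {"chemicals": ["cocaine"], "diseases": ["myocardial infarction", "ischemia"]},
    {"chemicals": ["cocaine"], "diseases": ["myocardial infarction"]},
    {"chemicals": ["fusidic acid"], "diseases": ["nausea"]},
]
counts = cooccurrence_counts(sentences)

# Hypothetical decision rule: call a pair related if it co-occurs at least twice.
related = sorted(pair for pair, count in counts.items() if count >= 2)
print(related)
```

Such a rule needs no annotated data, but any frequently co-mentioned pair passes it regardless of what the text actually asserts, which is one source of the low precision noted above.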
Feature-based supervised learning approaches:
Some literature reviews on relation extraction [5, 142] divide supervised learning approaches into two sub-categories, i.e., kernel-based and feature-based methods, based on their input to the classifier. While feature-based methods require a set of pre-defined features extracted from sentences, kernel-based techniques often take advantage of rich structural representations such as dependency trees. In this dissertation, we only focus on the feature-based methods. Feature-based methods represent each labeled instance as a feature vector, in which each element represents a feature. These feature vectors are then fed to the classifier for training the model and predicting whether the candidate entity pair is related or not. These methods are data-driven, i.e., based on domain-specific manually annotated corpora. In the biomedical domain, these approaches are widely used since they can take advantage of various annotated biomedical corpora, which are freely available and offer promising performance.
[Figure: taxonomy of relation extraction approaches; the only node label recoverable from the extracted text is "Crowd Sourcing".]
Figure 1.6: Relation extraction approaches taxonomy.
The specific methods that we applied in the proposed model are highlighted.
A popular feature-based supervised machine learning algorithm is the Support Vector Machine (SVM) [27], which tries to find a linear hyperplane, in an n-dimensional space, with the largest distance to the nearest instances of the positive and negative classes. Feature-based SVMs were used for extracting chemical-induced disease relations [186], Live-in events [99], drug-drug interactions [156], protein-protein interactions [132], protein-organism-location relations [117] and many other biomedical relations. An SVM with a rich feature set is used for relation classification in Chapter 2 of this Dissertation.
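The margin criterion behind the SVM can be made concrete with a small sketch. It does not train an SVM; it only computes, for a given hyperplane, the distance to the nearest training instance, which is the quantity an SVM seeks to maximize. The toy data and the two candidate hyperplanes are invented for illustration.

```python
import math

def decision_value(w, b, x):
    """Signed score w . x + b of instance x under the hyperplane (w, b)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def geometric_margin(w, b, instances):
    """Distance from the hyperplane to the nearest correctly classified instance.

    `instances` is a list of (feature_vector, label) pairs with label +1 or -1.
    A negative result means at least one instance is misclassified.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) / norm for x, y in instances)

# Toy two-dimensional training set (hypothetical).
instances = [([2.0, 2.0], 1), ([3.0, 3.0], 1), ([0.0, 0.0], -1), ([-1.0, 0.0], -1)]

# Two candidate separating hyperplanes, each given as (w, b).
candidate_a = ([1.0, 1.0], -2.0)
candidate_b = ([1.0, 0.0], -1.0)

margin_a = geometric_margin(candidate_a[0], candidate_a[1], instances)
margin_b = geometric_margin(candidate_b[0], candidate_b[1], instances)
print(margin_a > margin_b)  # True
```

Both candidates separate the toy data, but the first leaves a wider gap to its nearest instances, so it is the one the margin criterion prefers of the two.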
In addition to SVM, many other machine learning methods have been applied to biomedical relation extraction, such as Conditional Random Fields [14], Naive Bayes [102], maximum entropy [49] and logistic regression [73].
These machine learning methods for relation classification require a careful feature engineering process. Although the feature sets are time-consuming and costly to create, they are often limited to a specific model and domain.
Deep learning approaches:
Recent successes in deep learning have stimulated interest in applying neural architectures to the task of relation classification. They are extremely good at automatic feature engineering from noisy data and thus do not require a handcrafted feature set while still yielding good performance. Deep learning models often require additional techniques to address the over-fitting problem and reduce the impact of the initialization. The Dissertation applies both CNN and RNN with several different improvements for classifying biomedical relations.
Convolutional Neural Networks (CNNs) [92] are among the early approaches to be applied successfully to the biomedical relation classification problem, yielding state-of-the-art results. Zhao et al. [194] used a syntax CNN for extracting drug-drug interactions. Verga et al. [176] and Zhou et al. [198] applied CNNs to chemical-induced disease relation extraction.
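The core operation of a CNN over a token sequence can be sketched in a few lines: each filter slides over windows of consecutive token embeddings, and max pooling keeps its strongest response. The sketch below is a generic illustration with invented embeddings and filter weights; it is not the architecture of any of the cited models.

```python
def conv1d_max_pool(embeddings, filters, window=3):
    """Apply each filter to every window of tokens, then max-pool over positions.

    Assumes the sequence has at least `window` tokens and that each filter has
    window * embedding_dim weights.
    """
    pooled = []
    for weights in filters:
        activations = []
        for start in range(len(embeddings) - window + 1):
            # Concatenate the embeddings of `window` consecutive tokens.
            patch = [v for vec in embeddings[start:start + window] for v in vec]
            score = sum(w * v for w, v in zip(weights, patch))
            activations.append(max(0.0, score))  # ReLU non-linearity
        # Max pooling: keep the strongest response of this filter.
        pooled.append(max(activations))
    return pooled

# Four tokens with two-dimensional toy embeddings (hypothetical values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Two filters, each covering 3 tokens x 2 dimensions = 6 weights.
filters = [
    [1.0, 0.0, 0.0, 1.0, 1.0, 1.0],
    [-1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
print(conv1d_max_pool(embeddings, filters))  # [4.0, 0.0]
```

The output has one value per filter regardless of sentence length, which is what allows a fixed-size classifier layer to follow the convolution.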
Recurrent Neural Networks (RNNs) [162] are another approach to capturing relations and are naturally good at modelling long-distance relations within sequential language data. Several RNN variants have been applied to the biomedical relation classification task, including the original RNN [128], the RNN with LSTM units, which is used to extend the range of context [108, 197], and the recursive neural network [111].
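How a recurrent network carries context along a sequence can be illustrated with the plain (Elman-style) recurrence h_t = tanh(W_x x_t + W_h h_{t-1} + b). The sketch below implements only this basic step, not the LSTM unit, and uses invented one-dimensional weights.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrent step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    h = []
    for i in range(len(b)):
        total = b[i]
        total += sum(w * v for w, v in zip(w_x[i], x))
        total += sum(w * v for w, v in zip(w_h[i], h_prev))
        h.append(math.tanh(total))
    return h

def encode_sequence(inputs, w_x, w_h, b):
    """Run the recurrence over all inputs and return the final hidden state."""
    h = [0.0] * len(b)
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
    return h

# Toy parameters with a one-dimensional input and hidden state (hypothetical).
w_x, w_h, b = [[1.0]], [[0.5]], [0.0]

# The same two inputs in different orders give different final states,
# because each state depends on everything that came before it.
print(encode_sequence([[1.0], [0.0]], w_x, w_h, b))
print(encode_sequence([[0.0], [1.0]], w_x, w_h, b))
```

Because the influence of early inputs shrinks with every step in this plain form, gated units such as the LSTM are commonly used to retain context over longer ranges.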
Deep learning-based research on relation classification in this Dissertation is mostly based on the shortest dependency path (SDP). Nodes (tokens) and dependencies in the SDP can be represented as vectors by using the methods outlined in Section 1.2.1. While tokens are often represented by word embeddings, dependencies are often converted to one-hot vectors or randomly initialized vectors. As introduced in Section 1.2.1, ELMo (2018) [149] and BERT (2019) [35] have been shown to be effective in numerous NLP studies, including RC. However, the research on relation classification in this Dissertation is mostly based on the shortest dependency path (SDP), the use of