VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

LE HOANG QUYNH

A HYBRID APPROACH TO FINDING PHENOTYPE CANDIDATES IN GENETIC TEXT

Major: Computer Science
Code: 60 48 01

MASTER THESIS

Supervisor: Assoc. Prof. Ha Quang Thuy

Hanoi – 2012

A hybrid approach to finding phenotype candidates in genetic texts

Le Hoang Quynh
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi

Supervised by Associate Professor Ha Quang Thuy

A thesis submitted in fulfillment of the requirements for the degree of Master of Science in Computer Science

November 2012

ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at University of Engineering and Technology (UET/Coltech) or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at University of Engineering and Technology and National Institute of Informatics (Tokyo, Japan) or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.’

Hanoi, November 10th, 2012
Signed: Le Hoang Quynh

ABSTRACT

Named entity recognition (NER) has been extensively studied for the names of genes and gene products, but there are few proposed solutions for phenotypes. Phenotype terms are expected to play a key role in
inferring gene function in complex heritable diseases but are intrinsically difficult to analyse due to their complex semantics and scale. In contrast to previous approaches, we evaluate state-of-the-art techniques involving the fusion of machine learning on a rich feature set with evidence from extant domain knowledge sources. The techniques are validated on two gold standard collections, including a novel annotated collection of 112 abstracts derived from a systematic search of the Online Mendelian Inheritance in Man database for auto-immune diseases. Encouragingly, the hybrid model outperforms an HMM, a CRF and a pure knowledge-based method to achieve an F1 of 75.37 for BF and a micro-average F1 of 84.01 for the whole system.

Publications:

• Mai-Vu Tran, Tien-Tung Nguyen, Thanh-Son Nguyen, Hoang-Quynh Le. Automatic Named Entity Set Expansion Using Semantic Rules and Wrappers for Unary Relations. In International Conference on Asian Language Processing 2010, pages 170-173. Harbin, China, December 28-30, 2010. DOI: http://doi.ieeecomputersociety.org/10.1109/IALP.2010.73

• Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, Nguyen-Cuong Phan and Quang-Thuy Ha. An Integrated Approach Using Conditional Random Fields for Named Entity Recognition and Person Property Extraction in Vietnamese Text. In Proceedings of International Conference on Asian Language Processing 2011, pages 115-118. DOI: http://doi.ieeecomputersociety.org/10.1109/IALP.2011.37

• Nigel Collier, Mai-Vu Tran, Hoang-Quynh Le, Anika Oellrich, Ai Kawazoe, Martin Hall-May and Dietrich Rebholz-Schuhmann. A hybrid approach to finding phenotype candidates in genetic text. In The 24th conference on Computational Linguistics (COLING 2012). Accepted as long paper.

ACKNOWLEDGEMENTS

First and foremost, I would like to express my deep gratitude to my supervisor, Assoc. Prof. Ha Quang Thuy, for his patient guidance and continuous support throughout the years. He always appears when I need help, and responds to queries so helpfully and promptly. I
would like to express my gratitude to the National Institute of Informatics (NII, Tokyo, Japan) for giving me the opportunity to work at NII through the NII International Internship program. I also give my sincere thanks and appreciation to Assoc. Prof. Nigel H. Collier, my internship supervisor at NII, for his great support.

I would like to thank all my teachers at the University of Engineering and Technology (VNU), who have given me so much knowledge and experience. I also want to thank my colleagues at the Knowledge and Technology laboratory (UET, VNU) and my classmates for their enthusiastic and prompt help.

I sincerely acknowledge the Vietnam National University, NAFOSTED and the QG.10.38 project for their financial support of my master's study.

Thanks to all my friends, who are always by my side and cheer me on.

Finally, this thesis would not have been possible without the support and love of my family. Thank you, mother and father. Thanks to my brother and sister, and thanks to my nephew. And thank you, my beloved husband. Again, thank you; I love all of you so much ♥

Table of Contents

1 Introduction
1.1 Motivation and problem definition
1.2 Phenotype definition
1.3 The challenges of phenotype entity recognition
2 Related works
2.1 Useful resources
2.1.1 GENIA and JNLPBA corpora
2.1.2 The online mendelian inheritance in man
2.1.3 The human phenotype ontology
2.1.4 The mammalian phenotype ontology
2.1.5 The unified medical language system
2.1.6 KMR corpus
2.2 Related researches
2.2.1 Baseline method: Khordad et al. (2011)
3 Methods
3.1 Schema
3.2 Annotated data sources
3.3 Proposed model
3.3.1 Pre-processing
3.3.2 Machine learning labeler
3.3.3 Knowledge-based labeler
3.3.4 Merge results
4 Experimental results and evaluation
4.1 Metrics
4.2 Experiments on the KMR corpus
4.3 Experiments on the Phenominer corpus
4.4 Discussion
4.4.1 Discussion on corpora
4.4.2 Discussion on results
5 Conclusion

List of
Figures

2.1 A visual example of HPO hierarchical structure
2.2 A visual example of MP hierarchical structure
2.3 Khordad et al. (2011)’s system block diagram
3.1 An informal overview of bodily feature entity
3.2 Phenotype tagging architecture
3.3 Brat rapid annotation tool example
4.1 Column chart shows the experimental results on KMR corpus
4.2 Column chart shows the experimental results of BF entities on Phenominer corpus
4.3 Column chart shows the experimental results of GGP entities on Phenominer corpus

1.3 The challenges of phenotype entity recognition

Given the motivation and challenges of phenotype recognition, the key contributions of this thesis are: (1) to provide an operational semantics for identifying phenotype candidates in text; (2) to introduce a set of guidelines and an annotated corpus based on a selection of 19 clinically significant auto-immune diseases from the Online Mendelian Inheritance in Man (OMIM) (Hamosh et al., 2005), one of the most widely used gene-disease databases; and (3) to propose a new named entity solution that uses statistical inference and external manually crafted resources, in order to mitigate linguistic variation whilst still meeting the conceptual expectations of biologists.

The remainder of this thesis is organized as follows. Chapter 2 presents related research and useful resources. Chapter 3 describes our Phenominer corpus version 1.0 and the proposed method for phenotype candidate recognition. Chapter 4 reports the experimental results, evaluation and discussion. Finally, chapter 5 gives the conclusions.

Chapter 2
Related works

The motivation and challenges mentioned in chapter 1 have led to a variety of proposed solutions involving a wide range of resources. In this chapter, we review some useful resources in section 2.1: the GENIA and JNLPBA corpora, the online mendelian inheritance in man (OMIM), the human phenotype ontology (HPO), the mammalian
phenotype ontology (MP), the unified medical language system (UMLS), etc. Then, in section 2.2, we introduce some related research in biomedical entity recognition and describe Khordad et al. (2011) as our baseline method for BF.

2.1 Useful resources

Using available resources helps us not only to take advantage of knowledge from other research but also to reduce effort. Many resources are now used in bio-informatics. Among these, resources such as the linguistically annotated GENIA corpus (Tateisi et al., 2000; Kim et al., 2003) and OMIM (Hamosh et al., 2005) have proven to be central to NER solutions. However, due to the size of the vocabularies involved, annotated corpora by themselves do not provide a complete solution. Researchers have therefore also looked at the rich availability of formally structured biomedical knowledge (ontologies) such as the Unified Medical Language System (UMLS) (Bodenreider et al., 2002), the Human Phenotype Ontology (Robinson and Mundlos, 2010), the Mammalian Phenotype Ontology (Smith and Eppig, 2009), the Gene Ontology (Gene Ontology Consortium, 2000), etc.

2.1.1 GENIA and JNLPBA corpora

GENIA corpus version 3.0 (Kim et al., 2003) was formed from a controlled search on MEDLINE using the MeSH terms ‘human’, ‘blood cells’ and ‘transcription factors’. From this search, 2000 abstracts (20,546 sentences, more than 400,000 words) were selected. This corpus has been released with linguistically rich annotations including sentence boundaries, term boundaries, term classifications, semi-structured coordinated clauses, recovered ellipsis in terms, etc. Entities are hand-annotated into 36 classes, including DNA, RNA, cell line, cell type and protein (almost 100,000 annotations).

The JNLPBA data set came from the GENIA version 3.02 corpus. It is the training set for the Bio-Entity recognition task at JNLPBA (Kim et al., 2004). In this shared task, the 36 classes of the GENIA corpus were simplified and only the classes protein, DNA, RNA, cell line and cell type were used. The
GENIA and JNLPBA corpora are important for two major reasons: first, they provide the largest single source of annotated training data for the NE task in molecular biology; second, the breadth of their classification. According to Kim et al. (2004), although the number of classes in the GENIA/JNLPBA corpora is a fraction of the classes contained in major taxonomies, it is still the largest class set that has been attempted so far for the named entity recognition task. Moreover, the GENIA corpus can also be used for other biomedical tasks, such as POS tagging.

2.1.2 The online mendelian inheritance in man

The Online Mendelian Inheritance in Man (OMIM) (Hamosh et al., 2005) is a continuously updated catalog of human genes and genetic disorders and traits, with particular focus on the molecular relationship between genetic variation and phenotypic expression (genotype and phenotype). The full text and referenced overviews in OMIM contain information on many mendelian disorders and over 12,000 genes. Derived from the biomedical literature, OMIM is written and edited at Johns Hopkins University with input from scientists and physicians around the world. Each OMIM entry has a full text summary of a genetically determined phenotype and/or gene and has numerous links to other genetic databases such as DNA and protein sequence, PubMed references, general and locus-specific mutation databases, HUGO nomenclature, MapViewer, GeneTests, patient support groups and many others. Within an OMIM entry, there is a field called ‘Clinical Synopsis’, which is a list of the clinical features of the disorder that appear in this entry or in its references. There are over 4500 clinical synopses in OMIM; they are an important resource for research on phenotypes. OMIM is an easy and straightforward portal to the burgeoning information in human genetics, and it is now distributed electronically by the National Center for Biotechnology Information¹. Over five decades OMIM has achieved great
success: it is one of the most important information sources about human genes and genetic phenotypes (Cohen et al., 2011; Robinson and Mundlos, 2010). Nonetheless, OMIM does not use a controlled vocabulary to describe the phenotypic features in its clinical synopsis section, which makes it inappropriate for data mining. In section 2.1.3, we introduce the HPO, which was constructed using OMIM.

¹ http://www.ncbi.nlm.nih.gov/omim/
² http://www.human-phenotype-ontology.org/

2.1.3 The human phenotype ontology

The Human Phenotype Ontology (HPO)² is a standardized, controlled vocabulary that allows phenotypic information to be described in an unambiguous fashion in medical publications and databases (Robinson and Mundlos, 2010). The HPO was originally constructed using data from OMIM by merging synonyms and creating a hierarchical structure between terms according to their semantics. The hierarchical structure in the HPO represents the subclass relationship; figure 2.1 illustrates this structure using the example of ‘atrioventricular septal defect’ [HP:0010439] (example taken from Robinson and Mundlos (2010)). The HPO currently contains over 9500 unique terms (more than 15000 synonyms) describing human phenotypic features (as of 2012). Nevertheless, according to Khordad et al. (2011), the HPO is not complete and there are several problems in finding phenotype names in it: (1) some acronyms and abbreviations are not available in the HPO; (2) although the HPO contains synonyms of phenotypes, there are still some synonyms that are not included; (3) in some cases adjectives and other modifiers are added to phenotype names, making it difficult to find these phenotype names in the ontology; (4) new phenotypes are being continuously introduced to the biomedical world, and although the HPO is being constantly refined, corrected, and expanded manually, this process is not fast enough, nor can the inclusion of new phenotypes be guaranteed. Thus, although HPO is a very
useful resource, using it alone is not enough for phenotype recognition; it should be used only as an additional resource.

2.1.4 The mammalian phenotype ontology

The Mammalian Phenotype Ontology (MP) (Smith and Eppig, 2009) has been applied to mouse phenotype descriptions in MGI³, RGD⁴, OMIA⁵ and elsewhere. Use of this ontology allows comparisons of data from diverse sources, can facilitate comparisons across mammalian species, assists in identifying appropriate experimental disease models, and aids in the discovery of candidate disease genes and molecular signaling pathways. Like the HPO, the MP is a standardized, hierarchically structured vocabulary. The highest-level terms describe physiological systems, survival, and behavior. The physiological systems branch into morphological and physiological phenotype terms at the next node level. An example of the hierarchical tree for the term ‘opisthotonus’ [MP:0002880] is shown in figure 2.2 (example taken from Smith and Eppig (2009)). The MP has about 9000 unique terms (about 24000 synonyms) describing abnormal mouse phenotypes (as of 2012).

³ Mouse Genome Informatics Database: http://www.informatics.jax.org/
⁴ Rat Genome Database: http://rgd.mcw.edu
⁵ Online Mendelian Inheritance in Animals: http://omia.angis.org.au/

2.1.5 The unified medical language system

The Unified Medical Language System (UMLS) (Bodenreider et al., 2002) is a set of files and software that brings together many health and biomedical vocabularies and standards. The UMLS has three tools, which we call the Knowledge Sources: the Metathesaurus, the Semantic Network, and the SPECIALIST Lexicon and Lexical Tools.

• The Metathesaurus is a very large, multi-purpose, and multi-lingual vocabulary database that contains information about biomedical and health related concepts, their various names, and the relationships among them. It contains more than 1.8 million concepts, which come from more than 100 source vocabularies.

• The
Metathesaurus is linked to the Semantic Network: all concepts in the Metathesaurus are assigned to at least one semantic type from the Semantic Network.

• MetaMap is a well-known tool in the UMLS SPECIALIST Lexicon and Lexical Tools. It is a highly configurable application for mapping biomedical text to the UMLS Metathesaurus: MetaMap tokenizes the input text and chunks it into phrases, maps each phrase to a set of candidate UMLS concepts, and then applies a word sense disambiguation step to choose the best candidate with respect to the surrounding text.

However, the UMLS Semantic Network does not contain Phenotype as a semantic type, so it alone is not adequate to distinguish between phenotypes and other objects in text. In addition, some phenotype names do not exist in the UMLS Metathesaurus at all. Nevertheless, UMLS and its knowledge sources may be useful for phenotype recognition in some ways.

2.1.6 KMR corpus

We refer to the manually annotated corpus of Khordad et al. (2011) as the ‘KMR corpus’. It is a collection of 3784 tokens (120 sentences) with 110 annotated phenotype mentions. Sentences in the KMR corpus were taken from PubMed papers from the year 2009 in the area of human genetics. Annotation was conducted with reference to the HPO, so that a term was tagged as a phenotype if it was in the HPO, or if it was not in the HPO but its definition showed that it was caused by a genotype. It is not a well-known corpus and has only been used in the research of Khordad et al. (2011). However, annotated corpora for phenotypes are currently scarce, so it is still a valuable choice. We will use this corpus for testing and analyzing our proposed model.

Above, we have introduced only some of the most typical useful resources for our research. In addition to them, there are many other bio-informatics resources that can be used, such as the Medical Subject Headings⁶, a gene list containing millions of genes⁷, etc.

⁶ MeSH: http://www.nlm.nih.gov/mesh/meshhome.html
⁷ Created by National Center for Biotechnology Information, U.S. National Library of
Medicine

2.2 Related researches

Named Entity Recognition in the biomedical domain has been extensively studied and, as a consequence, many methods have been proposed. Some methods, like MetaMap, are generic and find many kinds of entities in the text. Other methods are specialized to recognize a particular type of entity. However, these techniques tend to emphasize finding the names of genes, gene products, cells, diseases and chemicals (Fukuda et al., 1998; Rindflesch et al., 1999; Collier et al., 2000; Kazama et al., 2002; Zhou et al., 2003; Settles, 2004; Kim et al., 2004; Leaman and Gonzalez, 2008). So far, only a small number of studies have addressed phenotypes, and they are often based primarily on available resources or rule-based methods. Whilst other authors have tried similar approaches for other entity types, none have tried both machine learning and external resource lookup for a class as rich and semantically complex as phenotypes. In this section, we describe the method proposed by Khordad et al. (2011), which is used as our baseline method for comparison in the experiments.

2.2.1 Baseline method: Khordad et al. (2011)

The system built in Khordad et al. (2011) is based on MetaMap and makes use of the UMLS Metathesaurus and the Human Phenotype Ontology. Starting from an initial basic system that uses only these pre-existing tools, five rules that capture stylistic and linguistic properties of this type of literature are proposed to enhance the performance of their NER tool. A block diagram of the processing in Khordad et al. (2011)’s system is shown in figure 2.3. The system performs the following steps:

• (1) MetaMap chunks the input text into phrases and assigns the UMLS semantic types associated with each noun phrase.

• (2) The Disorder Recognizer analyzes the MetaMap output to find phenotypes and phenotype candidates. This is the most important part of the method; it is based primarily on the idea that a phenotype must belong to certain UMLS
semantic types. The UMLS Semantic Network contains 133 Semantic Types, which are categorized into 15 more general Semantic Groups. Among these, the Semantic Group Disorders contains 12 semantic types that are close to the meaning of phenotype: Acquired Abnormality, Anatomical Abnormality, Cell or Molecular Dysfunction, Congenital Abnormality, Disease or Syndrome, Experimental Model of Disease, Finding, Injury or Poisoning, Mental or Behavioral Dysfunction, Neoplastic Process, Pathologic Function, Sign or Symptom. In this step, phrases that do not belong to this semantic group are rejected. However, a number of semantic types in this semantic group may include concepts that are not phenotypes. The problematic semantic types are: Finding, Disease or Syndrome, Experimental Model of Disease, Injury or Poisoning, Sign or Symptom, Pathologic Function, and Cell or Molecular Dysfunction. Therefore, if a phrase is assigned to one of these semantic types, it is considered a phenotype candidate and is confirmed or rejected as a phenotype in step (3); otherwise, it is accepted as a phenotype.

• (3) Phenotype candidates from the previous step are searched for in the HPO using OBO-Edit⁸. Phenotype candidates that are found in the HPO are recognized as phenotypes.

• (4) The Result Merger merges the phenotypes found by the Disorder Recognizer and OBO-Edit and produces the output, which is the final list of phenotypes found in the input text.

This model was tested on the small KMR corpus (described in section 2.1.6), annotated by its authors. The reported results are a precision of 97.58, a recall of 88.32 and an F1 of 92.71.

⁸ OBO-Edit, the OBO ontology editor: http://oboedit.org/

Figure 2.1: A visual example of HPO hierarchical structure (HP:0010439)

Figure 2.2: A visual example of MP hierarchical structure (MP:0002880)

Figure 2.3: Khordad et al. (2011)’s system block diagram

Chapter 3
Methods

In this chapter, we first analyze the two entities that
we employed in this study, gene/gene product (GGP) and bodily feature (BF), in detail (section 3.1). Then, in section 3.2, we introduce our Phenominer corpus version 1.0, which is built around 19 auto-immune diseases; this corpus can be used in phenotype recognition as well as in other biomedical problems. Finally, section 3.3 describes our proposed hybrid model for BF and GGP entity recognition; the model consists of three main parts: a machine learning labeler, a knowledge-based labeler and a merge results module.

3.1 Schema

We employed two types of entity in our study: gene/gene product (GGP) and bodily feature (BF). GGP is proposed because (1) a subset of these entities are useful for applications that explore gene-phenotype relations, and (2) it allows us to compare our results against the many biomedical NER studies of the past, e.g. Kim et al. (2004); Rebholz-Schuhmann et al. (2010). Because of space limitations we will not provide a rigidly formal definition or a taxonomic analysis (Beisswanger et al., 2008). Future work will explore the relationships between these and other entity types.

In line with BioTop (Beisswanger et al., 2008), GGP is relatively straightforward to define by the conjunction of (BioTop ID Nucleic Acid Structure) and (BioTop ID Peptide Structure).

Definition: A gene/gene product (GGP) entity is a mention of one of three major macro-molecules: DNA, RNA or protein. DNA and RNA are nucleic acid sequences containing the genetic instructions used in the development and function of an organism. Proteins are polypeptide sequences, or parts of polypeptide sequences, folded into structures that facilitate biological function.

Examples include: [cryoglobulins], [anticardiolipin antibodies], [AFM044xg3], [chromosome 17q], [CC16 protein].

As mentioned in chapter 1, in this thesis we use the definition of bodily feature (BF) as the phenotype candidate.

Definition: A bodily feature (BF) entity is a mention of a bodily quality
in an organism.

Examples include: [lack of kidney], [abnormal cell migration], [absent ankle reflexes] as well as more complex cases such as [no abnormality in his heart], [unfavorable serum lipid levels] and [susceptibility to ulcerative colitis].

Figure 3.1 is an informal overview of the bodily feature entity. It visually describes some forms of BF obtained from surveying the data: structural attributes, qualitative attributes, functional attributes and process attributes.

Figure 3.1: An informal overview of bodily feature entity

• Structural attributes indicate any presence or absence of a physical component (Anatomy or GGP). For example: [having five fingers], [lack of kidney], [Peritoneal mesothelioma], [missing one finger].

• Qualitative attributes show qualities of physical components in an organism. In simple cases, they have the form: Anatomy/GGP has (or does not have) a certain quality. Qualities can describe any measurable characteristic such as location, color, size, mass, etc., and even underspecified qualities of a human/mouse body component. Most qualitative phenotypes contain a mention of a physical component term, i.e. anatomy/GGP, but some phenotypes do not (although there is usually a hidden relation to a physical component). For example: [black hair], [not having between 13 and 18 gm/dl hemoglobin concentration], [adult female height 130-157 cm], [conjoined fingers].

• Functional attributes are related to the functions and dispositions of anatomy (Hoehndorf et al., 2010). Intuitively, the functions of an anatomical entity establish the reason (or cause) that it exists, while its dispositions determine its capabilities and potentials. For example, the endocrine pancreatic cells have a function to produce insulin, and normally have a disposition to produce insulin. In general, a functional attribute shows the lack or abnormality of an anatomical function.
For example: [facial grimacing], [sleepy facial expression], [reading disability], [hypotension], [deaf].

• Process attributes represent characteristics of the processes themselves. They include characteristics of physiological processes, metabolic processes, biological pathways, chemical reactions, gene-related processes, gene expression, etc. The expression of a process attribute sometimes has a complex structure, but following the discussion of phenotypes as processes in physiology (Hoehndorf et al., 2012) we include some mentions of processes within the scope of our annotation schema. For example: [defective DNA repair after ultraviolet radiation damage], [abnormality of metabolism], [proliferation of BAF-32 cells].

• The cases above are the most common cases of BF, but there are many other cases of BF that we cannot list or group into classes. For example, there are some non-measurable characteristics of a body component that are experienced by the patient (human or mouse) themselves, such as pain or itchiness. These characteristics cannot be objectively measured or observed by others. This kind of characteristic is complex and often has several variants; in this work, they are also considered as BF. For example: [primary sunburn], [headache], [stress].

Table 3.1: Referential semantics and scoping of mentions by entity type

                             BF           GGP
specific reference           Yes          Yes
generic reference            Yes (1)      Yes
under-specified reference    No           No
modifiers                    Yes (2, 3)   No
conjunctions                 Yes (4)      Yes (4)
processes                    Yes (5)      No
negation                     Yes (6)      No

Notes on annotation:

(1) An entity may be referred to with a generic name. Such expressions may be anaphoric (i.e., refer to other mentions in the context), and sometimes they are too vague or descriptive to be called a named entity. But because their information content is valuable, in such a case the generic name should be annotated. For example, [gene], [gene expression], [asthma phenotype].

(2) Quantitative modifiers are included, e.g. [having five
fingers], as well as spatial modifiers, e.g. [abnormality in left hand].

(3) Qualitative modifiers are included. For example, physical components: [black hair], underspecified ranges: [normal height], locational modifiers: [low set ears], and level modifiers: [quite small fingers].

(4) Where there is elision of the head, e.g. [IA/H5 virus], annotate the whole expression. Otherwise annotate each expression separately, e.g. [IA virus] and [H5 virus].

(5) We exclude, however, finite verb forms, infinite verb forms with ‘to’, verbs in a progressive or perfect aspect, verb phrases, clauses or sentences, and any phrase with a relative clause or complement clause.

(6) If the negation appears in a noun phrase with an anatomical entity then we generally allow it, e.g. [absent ankle reflexes], [no left kidney].
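
The scoping rules of Table 3.1 can also be written down as a small lookup structure, which may help when checking annotations programmatically. The following Python sketch is only an illustration under stated assumptions: the names SCOPE, in_scope and features_in_scope are hypothetical and are not part of any Phenominer tooling, and the Yes/No values are simply transcribed from Table 3.1 (the finer conditions given in the notes on annotation are not modelled).

```python
# Scoping rules transcribed from Table 3.1.
# SCOPE, in_scope and features_in_scope are hypothetical names used for
# illustration only; they do not come from the thesis or its tools.
SCOPE = {
    "BF": {
        "specific reference": True,
        "generic reference": True,
        "under-specified reference": False,
        "modifiers": True,
        "conjunctions": True,
        "processes": True,
        "negation": True,
    },
    "GGP": {
        "specific reference": True,
        "generic reference": True,
        "under-specified reference": False,
        "modifiers": False,
        "conjunctions": True,
        "processes": False,
        "negation": False,
    },
}


def in_scope(entity_type, feature):
    """Return True if Table 3.1 marks `feature` as 'Yes' for `entity_type`."""
    return SCOPE[entity_type][feature]


def features_in_scope(entity_type):
    """List, in table order, the features marked 'Yes' for `entity_type`."""
    return [name for name, allowed in SCOPE[entity_type].items() if allowed]
```

For instance, in_scope("BF", "negation") is true because negated noun phrases such as [absent ankle reflexes] are annotated as BF, whereas in_scope("GGP", "negation") is false.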