Comparing Bayesian logistic regression (Laplace and Gaussian priors) with SVMs on three text categorization test collections using large training sets and with no domain knowledge.
Information Retrieval for Bioinformatics
The growth of bioinformatics has resulted in a body of data that is highly interconnected. One reason for this connectivity is the fact that, in the biomedical literature, acceptance of a paper for publication is contingent on the submission of any biological sequences or structures determined or analyzed in the course of the research to the appropriate databases. The corresponding abstract in MEDLINE, a biomedical literature database, is then annotated with the ID of the sequence or structure. Furthermore, the detailed annotation of entries in biological databases provides links between databases. This linking allows researchers to find experimental data very easily once they have identified a paper of interest, or conversely to find an analysis of a particular sequence or structure. The National Center for Biotechnology Information (NCBI), part of the National Library of Medicine, provides online access to MEDLINE abstracts, GenBank sequences, and many other data types through its Entrez system. The ability to browse data and literature is important, but the underlying data has much greater
In the first part of the thesis, we are concerned with using domain knowledge in clustering and information retrieval tasks for bioinformatics. We propose to combine different knowledge sources in a uniform computational framework. We believe that explicitly encoded relationships between abstracts and sequences, abstracts and structures, and between sequences and structures, coupled with similarity relationships among the same types of biological objects, can be utilized to infer clusters of abstracts, sequences, and structures. By representing the network of relationships as a graph, we describe a method for clustering biological data. To accomplish this, we first construct a graph of biological sequences, structures and scientific publications with pairwise relationships. Then we employ graph partitioning techniques to infer clusters of related articles, sequences and structures.
We also present one application of our approach to the problem of finding scientific papers (MEDLINE abstracts) that describe functions of particular genes, an ad hoc information retrieval task in the TREC 2003 Genomics Track.
Sequence-Sequence Relationships
Sequence alignment allows us to relate two protein sequences based on similarities of their amino acid sequences [59]. It computes a distance between two sequences, defined as the minimum number of insertions, deletions or mutations required to transform one sequence into the other. This can be computed by dynamic programming in O(nm) time, where n and m are the lengths of the two sequences [84]. For protein sequences, the probability of a mutation depends on the pair of amino acids in question. The cost of a mutation is usually defined by a substitution matrix, which gives a cost for each pair of amino acids. In the resulting similarity score, mutations and insertions in either sequence make a negative contribution. Karlin and Altschul [41] have studied the statistical significance of these similarity scores so that they can be converted to a probability. For computing similarity of sequences in our experiments in Chapter 3, we use a publicly available tool called BLAST (Basic Local Alignment Search Tool) [4].
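The dynamic program behind the O(nm) bound can be sketched as follows. This is a minimal illustration that assumes unit costs for every insertion, deletion and mutation; it is not the BLAST heuristic, and it does not use a substitution matrix. The function name `edit_distance` is introduced here for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions or mutations turning a into b.

    Assumption: every operation costs 1 (no substitution matrix).
    """
    n, m = len(a), len(b)
    # prev[j] is the distance between the first i-1 symbols of a
    # and the first j symbols of b; row 0 is all insertions.
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m  # column 0 is all deletions
        for j in range(1, m + 1):
            mutation = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,             # delete a[i-1]
                curr[j - 1] + 1,         # insert b[j-1]
                prev[j - 1] + mutation,  # match or mutate
            )
        prev = curr
    return prev[m]
```

The two nested loops visit each of the n x m cells once, which is where the O(nm) running time comes from. Only two rows are kept in memory because each cell depends only on the current and previous rows.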
2.3.2 Structure-Structure Relationships

Techniques for comparing structures try to align the atoms in one structure with the atoms in another structure in order to minimize the distance between the closest pairs of atoms in these two structures. In this thesis, we will not compute similarities between structures. Instead, we will rely on the existing relationships between structures, which are available in public databases such as the SCOP (Structural Classification of Proteins) database [58].
Literature-Literature Relationships
Document similarity computation allows us to relate two documents. Given a pair of documents in a document collection, their similarity can be computed as follows. Each document is represented as a vector of numeric feature values. In a document vector, each component corresponds to a distinct term in the document collection, and its value is a function of how often the corresponding term appears in the document and in the whole collection (called term weighting). A commonly used term weighting scheme is TFxIDF (term frequency times inverse document frequency). With this weighting scheme, the weight $w_{ij}$ of a term $t_i$ in a document $d_j$ is given by

$w_{ij} = f_{ij} \times \log(N / n_i)$

where $f_{ij}$ is the frequency of term $t_i$ in document $d_j$, $n_i$ is the number of different documents the term $t_i$ appears in, and $N$ is the total number of documents in the collection. The IDF component of this term weighting scheme increases the weights of terms that appear in few documents. Each document vector is normalized to have length one, and the similarity of two vectors is computed as the inner product of the vectors, i.e., the cosine of the angle between the vectors [74]. Cosine ranking with TFxIDF weighting performs well in practice. Therefore, most information retrieval systems use it for relevance ranking. For computing similarity of documents, we use MG software as mentioned above.
Shared keywords constitute another kind of literature-literature relationship, e.g., MeSH terms and chemical names assigned manually by human indexers. Two documents that share such a keyword have a specific relationship. The advantage of linking literature through keywords is that they make the relationship clear to the user.
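As a small worked illustration of the weighting and cosine computation described above, the sketch below builds unit-length TFxIDF vectors for a toy collection. It assumes documents are already tokenized into lists of terms; the function names are introduced here for illustration, and the MG software used in our experiments is not involved.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one unit-length {term: weight} dict per document."""
    n_docs = len(docs)
    # n_i: number of different documents in which term i appears
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)  # f_ij
        vec = {t: f * math.log(n_docs / doc_freq[t]) for t, f in term_freq.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # A document whose terms all occur in every document has norm 0.
        vectors.append({t: w / norm for t, w in vec.items()} if norm > 0 else vec)
    return vectors


def cosine(u, v):
    """Inner product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


docs = [
    ["protein", "kinase", "calcium"],
    ["protein", "kinase", "binding"],
    ["graph", "partition"],
]
vecs = tfidf_vectors(docs)
```

Here the first two documents share two terms and therefore have a positive cosine, while the third shares none with either and has cosine zero. A term that appears in every document receives weight log(1) = 0 and so contributes nothing to the similarity.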
Structure-Sequence Relationships
A protein structure always has an associated protein sequence. In fact, the 3-D locations of atoms in a structure description are ordered by the primary sequence. Given a structure, the sequence can always be determined. On the other hand, relating a structure to a given sequence is straightforward only if the sequence has a known structure, and this is the case for only 0.4% of sequences (20,000 structures out of 3.5 million sequences). These relationships are encoded in the Protein Data Bank (PDB), a collection of 3-D structural data of biological macromolecules (proteins and nucleic acids) [11].
2.3.5 Sequence-Literature Relationships

Relationships between protein sequences and literature are explicitly encoded in SWISS-PROT. This is the Swiss Protein Database, a well-annotated protein sequence database [10]. Relationships between DNA sequences and literature are explicitly encoded in GenBank, the NIH genetic sequence database. In the genomics literature, acceptance of a paper for publication is contingent on the submission of any sequence described in the paper to GenBank. For example, 79% of primate sequences in GenBank have links to the literature, and each literature entry is linked to four sequences on average. In MEDLINE, the literature database, a maximum of thirty sequence links are recorded, but more links can be inferred by inverting the links from sequence to abstract. As an example of sequence-literature links, the paper “Molecular diversity in amino-terminal domains of human calpastatin by exon skipping” contains links to twelve sequences in GenBank, each of which is an example of human calpastatin sequenced as part of the experiments described in the paper. Each of the sequences has a corresponding link to the MEDLINE entry for the paper.
The literature entries associated with structural entries usually describe the experimental procedure used to calculate the structure. As with sequence data, deposition of the structure with the Protein Data Bank (PDB) is required for publication. In addition to describing the structure, the authors often make observations about how the structure clarifies the mechanism of the protein’s enzymatic activity. There is only one publication associated with the structure determination.
2.3.7 A Network of Literature, Sequences, and Structures
Using the six relationships defined above, a graph can be constructed with biological objects at the vertices and pairwise relationships labeling the edges. Part of this graph already exists at NCBI’s Entrez system (http://www.ncbi.nlm.nih.gov), which allows users to follow the relationships described above. Entrez is heavily used by biologists for browsing. In this thesis, we propose using the graph representation as the basis for computation. In particular, we are interested in inferring clusters of biological objects from a heterogeneous corpus of information by partitioning the graph.
Note that there are two different kinds of relationships that papers, sequences and structures might have to each other. The first is similarity, as described in the relationships above. The second is a type of relationship explicitly encoded in databases. For example, two protein sequences might be connected because they cooperate to perform a cellular activity, e.g., the multiple subunits of the replisome. Literature links not only similar proteins but also related proteins. A paper describing replication may discuss both proteins involved in the replisome. By clustering, it might be possible to construct a broad group of sequences and literature around a particular cellular function.
Given a weighted, undirected graph, the objective of graph partitioning is to partition the graph into k roughly equal parts such that the sum of the weights of the edges connecting different parts is minimized, thereby producing subgraphs that are relatively highly connected. The graph partitioning problem is NP-complete [26]. However, many heuristics have been developed that find a reasonably good partition.
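The quantity being minimized, often called the edge cut, is simple to state in code. The following sketch assumes the graph is given as a list of (u, v, weight) triples and the partition as a mapping from vertex to part number; both representations, and the name `edge_cut`, are choices made here for illustration.

```python
def edge_cut(edges, part):
    """Sum of the weights of edges whose endpoints lie in different parts.

    edges: iterable of (u, v, weight) triples for an undirected graph.
    part:  dict mapping each vertex to its part number.
    """
    return sum(w for u, v, w in edges if part[u] != part[v])


# Toy path graph a - b - c - d with edge weights 3, 1, 2.
edges = [("a", "b", 3), ("b", "c", 1), ("c", "d", 2)]
good = {"a": 0, "b": 0, "c": 1, "d": 1}  # cuts only the light edge b - c
bad = {"a": 0, "b": 1, "c": 0, "d": 1}   # cuts every edge
```

Both partitions split the four vertices into two equal parts, but the first one has an edge cut of 1 and the second an edge cut of 6, so a partitioner would prefer the first.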
Traditional graph partitioning algorithms compute a partition of a graph by operating directly on the graph, and they are usually slow. On the other hand, multilevel graph partitioning algorithms reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph [31, 42, 43]. These algorithms are generally fast and produce high-quality partitions. We chose the pmetis program provided by METIS¹, a publicly available graph partitioning software package [1]. The partitioning algorithm used by pmetis is based on the multilevel recursive bisection described in [42].
The multilevel recursive bisection method works as follows. First, the size of the graph is reduced by collapsing nodes and edges to a few thousand nodes. Then the smaller graph is partitioned into two parts. Partitioning is repeated by uncoarsening each part one level up until all k subsets are obtained. The whole partitioning is done in log(k) phases.
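To make the log(k) phase structure concrete, the sketch below shows only the recursive bisection skeleton. It is not the METIS algorithm: the coarsening and uncoarsening steps are omitted, and the `bisect` function is a deliberately naive placeholder (a breadth-first ordering split in half) standing in for a real bisection heuristic. It also assumes k is a power of two.

```python
from collections import deque


def bisect(vertices, adj):
    """Placeholder bisection (NOT the METIS heuristic): order the vertices
    breadth-first within the induced subgraph and split the order in half."""
    allowed = set(vertices)
    order, seen = [], set()
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj.get(u, ()):
                if v in allowed and v not in seen:
                    seen.add(v)
                    queue.append(v)
    half = len(order) // 2
    return order[:half], order[half:]


def recursive_bisection(vertices, adj, k):
    """Split vertices into k parts (k a power of two) in log2(k) phases."""
    parts = [list(vertices)]
    while len(parts) < k:
        # One phase: every current part is bisected once.
        parts = [half for p in parts for half in bisect(p, adj)]
    return parts


# Path graph 0 - 1 - ... - 7
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 7] for i in range(8)}
parts = recursive_bisection(range(8), adj, 4)
```

On the eight-vertex path, two phases produce four parts of two adjacent vertices each. A real multilevel partitioner would replace `bisect` with a coarsen, partition and refine cycle, but the outer doubling loop is the same.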
1 Available at http://www-users.cs.umn.edu/~karypis/metis/.
Chapter 3 Clustering in Relational Biological Data
“We understand a complicated system by understanding its simple parts, and by understanding the simple relations between those parts and their immediate neighbors.” Don Knuth
The scientific endeavor of biology has become increasingly reliant on data in electronic form, and it is therefore necessary for biologists to manage and understand large quantities of data. Publicly available data, including biological sequences, biological structures, and literature in the life sciences, have grown to such an extent that computing is essential simply to store and access them. Here we describe a clustering approach that exploits the relational structure of biological data to help with the next step: to enhance understanding of the data by combining techniques from information retrieval with those from bioinformatics. By computing over a network of biological sequences, structures and literature relationships, it is possible to infer clusters of related articles, sequences and structures. We describe the general framework and its application to several biological domains.
Clustering is the task of grouping a set of objects into different subsets such that objects belonging to the same cluster are highly similar to each other. Conventional clustering algorithms employ a distance (or similarity) measure to form the clusters [44].
On the other hand, graph partitioning algorithms exploit the structure of a graph to find highly connected objects. The rich relational structure of biological data can be represented as a graph for clustering biological data. Clustering biological data would be useful not only for exploring the data but also for discovering hidden relationships behind the raw
Here we describe a technique for clustering biological objects: sequences, structures and literature. We use METIS, a multilevel graph partitioning system, to form the clusters. This process identifies subsets of nodes that are highly connected to each other but less strongly connected to the rest of the graph. Our clustering approach finds clusters that are disjoint, as opposed to clusters that overlap. We evaluate the quality of clustering using the existing sequence and structure classifications and show that our clustering solution provides much better clustering than random partitioning. These clusters are formed based on the pairwise relationships among biological data, so we can evaluate their topical cohesiveness by examining independent metadata such as Gene Ontology (GO) annotations and terms in MEDLINE abstracts. We also evaluate the clusters by hand for relevance, and find that the clusters are highly topical.
In the next section, we start by describing the databases we used. In Section 3.3, we describe the construction of a graph from the databases. Section 3.4 describes the BioIR system we built. In Section 3.5, we present and discuss the empirical results to assess the quality of clusters. Finally, we end with a summary.
In this section, we will briefly describe the data sources we used to construct our graph.
MEDLINE: MEDLINE is a digital collection of life science literature consisting of over twelve million abstracts. It contains some additional information associated with each abstract, such as manually assigned MeSH terms and chemical names. Moreover, MEDLINE entries contain links to the sequences and structures that the article discusses. For instance, Figure 3.5 shows excerpts from a MEDLINE record that contains references to three structures in PDB. In this chapter, we used the MEDLINE abstracts from the MEDLINE 2002 distribution that are connected to either SWISS-PROT or PDB entries. This amounted to 83,609 MEDLINE abstracts. We chose to work with these abstracts because we wanted to focus on organizing information about proteins.
SWISS-PROT: The Swiss Protein Database (SWISS-PROT) is a curated protein sequence database [10]. The database contains high-quality annotation, including descriptions of each protein’s function. SWISS-PROT entries are cross-referenced to several other databases, including MEDLINE, PROSITE and the PDB. We used the whole SWISS-PROT database, Release 41 of February 2003, containing 120,704 protein sequences. Figure 3.1 illustrates a portion of a sample SWISS-PROT entry. Figure 3.2 shows the distribution of the number of references to particular MEDLINE entries (degree) for the protein sequence entries in SWISS-PROT.
PDB: The Protein Data Bank (PDB) contains 3-D structural data of biological macromolecules (proteins and nucleic acids) [11]. The PDB entries are cross-referenced to the primary citations in MEDLINE and to other databases, including ENZYME and SWISS-PROT. We used the whole PDB database downloaded in July 2003. This version of the database has 22,135 structures. Figure 3.4 shows the distribution of the number of references to particular MEDLINE entries (degree) for all structure entries in PDB as of July 2003.
Figure 2.1 in Chapter 2 illustrates the relationships between biological data types. Using the relationships between biological data objects, we construct a weighted undirected graph where nodes correspond to entries from the databases listed in Section 3.2, including MEDLINE abstracts, protein sequences from SWISS-PROT, and structures from PDB. Figure 3.5 shows excerpts from a MEDLINE record that contains references to three structures in PDB, along with the title and abstract of the paper.
Edges in the graph correspond to explicit links between entries encoded in the databases, such as the sequence annotations in MEDLINE abstracts, and to pairwise similarity relationships between objects of the same type. We use BLAST [4], a sequence alignment technique, to compute similarities between protein sequences. We employ MG¹ [95], a full-text retrieval engine, to compute similarities between MEDLINE abstracts.
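A minimal sketch of this graph construction is given below. The details are assumptions made for illustration: nodes are (database, identifier) pairs, explicit links receive a fixed weight of 1.0, similarity links carry a precomputed score, and when a pair is linked more than once the larger weight is kept. The identifiers in the example are made up, and the actual weighting used in our experiments may differ.

```python
from collections import defaultdict


def build_graph(explicit_links, similarity_links, explicit_weight=1.0):
    """Return {node: {neighbor: weight}} for a weighted undirected graph.

    explicit_links:   iterable of (u, v) cross-references between databases.
    similarity_links: iterable of (u, v, score) for objects of the same type.
    explicit_weight:  hypothetical fixed weight for explicit links.
    """
    graph = defaultdict(dict)

    def add_edge(u, v, w):
        if u == v:
            return  # ignore self-links
        # Assumption: keep the largest weight seen for a pair.
        w = max(w, graph[u].get(v, 0.0))
        graph[u][v] = w
        graph[v][u] = w

    for u, v in explicit_links:
        add_edge(u, v, explicit_weight)
    for u, v, score in similarity_links:
        add_edge(u, v, score)
    return dict(graph)


# Made-up identifiers, for illustration only.
abstract = ("MEDLINE", "abstract-1")
seq_a = ("SWISS-PROT", "seq-A")
seq_b = ("SWISS-PROT", "seq-B")
graph = build_graph(
    explicit_links=[(abstract, seq_a)],
    similarity_links=[(seq_a, seq_b, 0.8)],
)
```

Storing each edge in both directions keeps the graph undirected and makes neighbor lookups cheap, which is convenient when writing the adjacency structure out for a partitioning tool.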
1 Available at http://www.cs.mu.oz.au/mg/.
01-DEC-1992 (Rel 24, Last sequence update)
28-FEB-2003 (Rel 41, Last annotation update)
14-3-3 protein zeta/delta (Protein kinase C inhibitor protein-1)
(KCTP-1) (Factor activating exoenzyme S) (FAS).
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
Zupan L.A., Steffens D.L., Berry C.A., Landt M.L., Gross R.W.;
"Cloning and expression of a human 14-3-3 protein mediating phospholipolysis. Identification of an arachidonoyl-enzyme intermediate during catalysis.";
Rittinger K., Budman J., Xu J., Volinia S., Cantley L.C.,
"Structural analysis of 14-3-3 phosphopeptide complexes identifies a dual role for the nuclear export signal of 14-3-3 in ligand binding.";
-!- FUNCTION: ACTIVATES TYROSINE AND TRYPTOPHAN HYDROXYLASES IN THE
Brain; Neurone; Phosphorylation; Acetylation; Multigene family;
Figure 3.1: A portion of a sample SWISS-PROT entry. DR lines show cross-references to the PDB and the PROSITE entries. KW lines show the SWISS-PROT keywords, separated by ’;’.
Figure 3.2: Distribution of MEDLINE references in SWISS-PROT.
Identifying Descriptive Terms From Abstracts
We aim to identify the words that best describe the set of documents in each cluster by analyzing the MEDLINE entries of the cluster. These descriptive words can be used as index terms to identify the contents of the clusters. We identified the descriptive words as follows. We considered the words in the title and abstract of all articles in a cluster after eliminating stop words. We removed all punctuation and converted all uppercase letters to lowercase. Then we ranked the resulting words by calculating p-values with respect to the entire set of MEDLINE entries in our collection. The p-value calculation is described in subsection 3.5.4. For each cluster, we kept the twenty most significant words, i.e., the ones having the smallest p-values, in our database. We use the resulting set of twenty words to index the clusters, and build a search utility against this index using MySQL.
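The steps above can be sketched as follows. The significance test is an assumption made for this illustration: a one-sided hypergeometric tail over document counts is used as a stand-in, and it may differ from the p-value calculation of subsection 3.5.4. The function names and the toy stop-word list are likewise hypothetical, and the cluster documents are assumed to be a subset of the collection.

```python
import math
import string


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n documents are drawn from N, K of which contain the word."""
    total = math.comb(N, n)
    hits = sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return hits / total


def tokenize(text, stop_words):
    """Lowercase, strip punctuation, drop stop words; returns a set of words."""
    table = str.maketrans("", "", string.punctuation)
    return {w for w in text.lower().translate(table).split() if w not in stop_words}


def descriptive_terms(cluster_docs, all_docs, stop_words, top=20):
    """Rank the words of a cluster by p-value, smallest (most significant) first."""
    doc_sets = [tokenize(d, stop_words) for d in all_docs]
    cluster_sets = [tokenize(d, stop_words) for d in cluster_docs]
    N, n = len(doc_sets), len(cluster_sets)
    scored = []
    for word in set().union(*cluster_sets):
        k = sum(word in s for s in cluster_sets)  # cluster docs containing word
        K = sum(word in s for s in doc_sets)      # collection docs containing word
        scored.append((hypergeom_tail(k, n, K, N), word))
    return [word for _, word in sorted(scored)[:top]]


all_docs = [
    "Calmodulin binds calcium.",
    "Calmodulin activates kinase.",
    "The graph is partitioned.",
    "The graph has edges.",
    "Calcium signaling in the cell.",
    "Protein structure of the enzyme.",
]
stop_words = {"the", "is", "has", "in", "of"}
top_terms = descriptive_terms(all_docs[:2], all_docs, stop_words, top=1)
```

In this toy collection the word "calmodulin" occurs in both cluster documents and nowhere else, so it has the smallest p-value and is ranked first. Words that are also common outside the cluster, such as "calcium", rank lower.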
Experimental Results and Discussion
Biological Domains
The following biological domains were carefully examined by a domain expert, a Ph.D. candidate in Molecular Biology. These domains were chosen from the expert's areas of expertise; each domain is described below.
Calmodulin: Calmodulin is a ubiquitous intracellular receptor for calcium ions that functions by changing its shape upon binding to calcium so that it can bind to and activate or inactivate other proteins. Most proteins activated by calmodulin are so-called CaM-kinases.
Chemotaxis: This is a bacterial signaling pathway involved in chemotaxis. Repellents activate receptors that, with the assistance of CheW, activate CheA. Attractants inhibit CheA. CheA activates CheY, which causes the flagella to rotate such that the bacteria tumble. CheZ inactivates CheY.
Rhodopsin and Gt: Rhodopsin is a 7-pass transmembrane G-protein-linked receptor containing a pigment, 11-cis-retinal. Light changes the structure of the pigment, which causes rhodopsin to bind with transducin (Gt), a trimeric G-protein. Upon binding, Gt loses its alpha subunit, which diffuses and binds cGMP phosphodiesterase, activating it and eventually leading to signaling.
U1 U2 U5 U4 U6 spliceosome: The spliceosome is a protein-RNA complex responsible for splicing introns out of nascent mRNA during its maturation. U1, U2, U4, U5 and U6 are among the different snRNPs present in eukaryotic nuclei; they consist of both protein and small RNA molecules.
Ubiquitin: Ubiquitin-dependent protein degradation plays a role in many cellular processes, including transcriptional regulation, cell cycle progression and DNA repair. Ubiquitin is a highly conserved 8 kDa protein whose many cellular functions are mediated by its covalent ligation to other proteins.
Apoptosis: Apoptosis, or programmed cell death, plays a fundamental role during tissue development, injury and degeneration. The biochemical pathways of programmed cell death are also used to destroy cells with damaged DNA and cells that are infected with viruses.
p53 Signaling Pathway: p53 is a transcription factor whose main function is to prevent the cell from progressing through the cell cycle when DNA damage has occurred. p53 may either halt the cell cycle until the DNA can be repaired or cause the cell to undergo apoptosis.
Insulin Signaling Pathway: Insulin, a small protein that acts as a hormone, is secreted by the pancreas in response to increased glucose levels in the blood. Most cells of the body have receptors that bind insulin. Upon binding of insulin, the cell activates other receptors designed to absorb glucose from the bloodstream into the cell. Insulin is a necessary hormone, and insulin deficiency or resistance results in diabetes.
3.5.3 Expert Analysis

The domain expert evaluated sixteen clusters: two clusters randomly chosen for each topic of interest described in Section 3.5.2, e.g., calmodulin, apoptosis, etc. Three different types of entities (PDB entries, SWISS-PROT entries and GO terms) were examined by the domain expert to determine how many of them were relevant to the topic of interest. GO was not used to obtain the clusters in any way, so GO terms are not themselves members of the clusters. Instead, GO terms were assigned to the clusters using SWISS-PROT to GO mappings, as described in subsection 3.5.4. Overall cluster quality is also reported for each manually examined cluster.
For example, Table 3.2 illustrates the PDB structures in p53 cluster 1674. The SCOP classifications in this table show that the PDB structures in this cluster were indeed classified as p53 by the SCOP database.
PDB ID | PDB Title | SCOP Classification
1hs5 | NMR SOLUTION STRUCTURE OF DESIGNED p53 | a.53.1.1 p53
1sal | HIGH RESOLUTION SOLUTION NMR STRUCTURE OF THE OLIGOMERIZATION DOMAIN OF p53 BY MULTI-DIMENSIONAL NMR (SAD STRUCTURES) | a.53.1.1 p53
3sak | HIGH RESOLUTION SOLUTION NMR STRUCTURE OF THE OLIGOMERIZATION DOMAIN OF p53 BY MULTI-DIMENSIONAL NMR (SAC STRUCTURES) | a.53.1.1 p53

Table 3.2: Some of the PDB structures in p53 cluster 1674. The SCOP classifications of the PDB structures are displayed to show that they are indeed related to p53.

Table 3.3 shows the evaluation results judged by the domain expert. For each entity type, a relevancy score between 1 and 10 was assigned, where 10 means all entities of that particular type are highly topical and 1 means that none of them are relevant. Almost all entity types for all clusters have high scores. Therefore, we can conclude that all the sections evaluated by the expert are highly relevant to the topics considered. Biologists note that the SWISS-PROT to GO mapping is incomplete because not all SWISS-PROT sequences are fully annotated. For example, in one cluster for the topic “apoptosis”, the SWISS-PROT gene for E1B is not annotated with apoptosis even though it is involved in apoptosis. Similarly, in another cluster for the topic “apoptosis”, the SWISS-PROT annotations for the CASP-1 genes do not refer to apoptosis, but some MEDLINE entries indicate that they are involved in apoptosis. So, since the SWISS-PROT GO annotation is incomplete, the relevance scores that we obtain based on the GO terms through automated means may underestimate the relevance of the cluster contents.

Topic | Cluster | PDB | SW | GO term | Overall
calmodulin | 1794 | 10 | 10 | 8 | 10
calmodulin | 1815 | 5 | 7 | 5 | 5
rhodopsin | 1402 | 10 | 9 | 10 | 10
rhodopsin | 1400 | 3 | 5 | 10 | 7
spliceosome | 1634 | 10 | 10 | 10 | 10
spliceosome | 1648 | N/A | 8 | 6 | 7
chemotaxis | 1072 | 9 | 9 | 9 | 9
chemotaxis | 1071 | 5 | 6 | 4 | 5
apoptosis | 1670 | 10 | 10 | 10 | 10
apoptosis | 1669 | 10 | 9 | 10 | 10
ubiquitin | 1665 | 7 | 2 | 8 | 8
ubiquitin | 1666 | 7 | 10 | 9 | 8
insulin | 1473 | 10 | 10 | 10 | 9
insulin | 1472 | 10 | 7 | 10 | 9
p53 | 1674 | 10 | 8 | 10 | 9
p53 | 1722 | 7 | 7 | 7 | 8

Table 3.3: Evaluation of sample clusters by the domain expert. Scores range from 1 to 10, where 10 means all of the objects are relevant, and 1 means none of them are relevant.

Another interesting point is that the domain expert first thought that the 30S ribosomal protein in cluster 1665 was unrelated to ubiquitin and therefore assigned a score of 2. However, the immediate links to MEDLINE entries provided by our system suggested that it should be relevant. We asked the expert to reconsider whether the 30S ribosomal protein could be related to ubiquitin, and the SWISS-PROT sequences in this cluster were reevaluated. After examining some immediate neighbors (MEDLINE entries) of these entities in the graph, the domain expert found that they were indeed relevant to ubiquitin, and now believes that the GO terms assigned to these clusters (nucleus and structural constituent of ribosome) are very relevant to ubiquitin. This ‘discovery’ aspect of our system is important: it demonstrates that the clusters can bring to light relationships that are not obvious at first glance.
3.5.4 Correlation between Clusters and GO Categories: GO Term Assignment to Clusters
The Gene Ontology project produces a controlled vocabulary for genes and gene products, called GO [90].² GO provides three structured networks of defined terms to describe gene product attributes. These three GO ontologies are referred to as Biological Process, Molecular Function and Cellular Component.
To show how much correlation we obtained between clusters and GO categories, we assigned GO terms to the clusters using the SWISS-PROT to GO mappings. Before explaining how we did this, it is important to note that we did not use GO to construct the graph; we used it only to provide a biological validation.
We mapped SWISS-PROT entries to GO terms as follows. SWISS-PROT entries contain several keywords, called SWISS-PROT keywords. Figure 3.1 shows the keywords for a sample SWISS-PROT entry. The Gene Ontology site provides mappings from external classification systems to GO. These include SWISS-PROT keywords. We downloaded
² http://www.geneontology.org/

!date: 2002/12/20 10:12:15
!Mapping of SWISS-PROT KEYWORDS to GO terms.
SP_KW:ATP synthesis > GO:ATP biosynthesis ; GO:0006754
SP_KW:ATP-binding > GO:ATP binding ; GO:0005524
SP_KW:Acetoin biosynthesis > GO:acetoin biosynthesis ; GO:0045151
SP_KW:Acetylcholine receptor inhibitor > GO:acetylcholine receptor inhibitor ; GO:0030550
SP_KW:Actin-binding > GO:actin binding ; GO:0003779
SP_KW:Activator > GO:translation regulator ; GO:0045182
SP_KW:Acute phase > GO:acute-phase response ; GO:0006953
SP_KW:Acyltransferase > GO:acyltransferase ; GO:0008415
SP_KW:Albumin > GO:extracellular space ; GO:0005615
SP_KW:Alginate biosynthesis > GO:mating pheromone exporter ; GO:0042141
SP_KW:Alkaloid metabolism > GO:alkaloid metabolism ; GO:0009820
SP_KW:Alkylation > GO:protein amino acid alkylation ; GO:0008213
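As an illustration of how a keyword-to-GO mapping of the form listed above could be used, the following sketch parses the mapping lines and tallies the GO terms reached from the keywords of the SWISS-PROT entries in one cluster. It is a minimal sketch, not the procedure used in this work: the function names `parse_spkw2go` and `go_term_counts`, and the idea of ranking GO terms by how many entries in a cluster map to them, are assumptions made for the example.

```python
import re
from collections import Counter

# One mapping line looks like:
#   SP_KW:ATP-binding > GO:ATP binding ; GO:0005524
LINE_RE = re.compile(
    r"^SP_KW:\s*(?P<kw>.+?)\s*>\s*GO:\s*(?P<name>.+?)\s*;\s*(?P<id>GO:\d{7})\s*$"
)

def parse_spkw2go(lines):
    """Return {keyword: [(go_id, go_name), ...]}; lines starting with '!' are comments."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue
        match = LINE_RE.match(line)
        if match:
            entry = (match.group("id"), match.group("name"))
            mapping.setdefault(match.group("kw"), []).append(entry)
    return mapping

def go_term_counts(cluster_keywords, mapping):
    """Count, for each GO id, how many entries in the cluster map to it.

    cluster_keywords: one list of SWISS-PROT keywords per entry (hypothetical input).
    """
    counts = Counter()
    for keywords in cluster_keywords:
        go_ids = {go_id for kw in keywords for go_id, _ in mapping.get(kw, [])}
        counts.update(go_ids)  # each entry contributes at most once per GO id
    return counts

sample_lines = [
    "!Mapping of SWISS-PROT KEYWORDS to GO terms.",
    "SP_KW:ATP-binding > GO:ATP binding ; GO:0005524",
    "SP_KW:Actin-binding > GO:actin binding ; GO:0003779",
]
sample_mapping = parse_spkw2go(sample_lines)
sample_counts = go_term_counts(
    [["ATP-binding", "Actin-binding"], ["ATP-binding"]], sample_mapping
)
```

Keywords with no mapping are simply skipped here, which is one reason automatically derived GO relevance could understate the true relevance of a cluster.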
Text Categorization and Bayesian Logistic Regression
5.1 Text Categorization
5.2 Bayesian Logistic Regression
5.2.1 Choice of Hyperparameter
5.2.2 Threshold Selection
5.3 Prior Work
6 Using Domain Knowledge for Text Classification
Text categorization is the problem of classifying text documents into categories (or classes). A typical example is to classify news articles by topic based on their contents. Another problem is to automatically identify biomedical articles relevant for gene annotation in model organism databases [34].
Text categorization has attracted the attention of researchers for more than forty years. Statistical methods have become dominant in the literature lately. The basic idea is to use supervised learning to create a classifier from labeled examples. We provide an algorithm with a set of labeled documents and it learns a model or a decision rule for classifying future documents. This approach gives high accuracy classifiers. Supervised machine learning algorithms such as Naive Bayes [49], decision trees [7], nearest neighbor methods [97], support vector machines [38, 50, 100], regularized logistic regression [100, 27], and boosting [80] have been successfully employed in a wide variety of text categorization problems. Sebastiani provides an overview of machine learning techniques for text categorization [82].
A major problem with the supervised learning algorithms is that they require a large number of labeled training examples to learn model parameters accurately. However, in practice, it may be difficult to obtain enough labeled examples. On the other hand, someone with a need for text categorization often has some kind of domain knowledge informing which words are likely to be good predictors for the possible categories. Our approach to solving the problem of limited data is to incorporate domain knowledge into learning. We will address this problem in Chapter 6.
Logistic regression is a method to estimate the probability of category membership given an example using this formula:

\[ p(y_i = +1 \mid \beta, x_i) = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} = \frac{\exp\left(\sum_j \beta_j x_{i,j}\right)}{1 + \exp\left(\sum_j \beta_j x_{i,j}\right)} \]

where $y_i$ encodes the class of example $i$ (positive/relevant = +1, negative/nonrelevant = −1) and $x_{i,j}$ is the value of feature $j$ for example $i$. The model parameters $\beta$ are chosen by supervised learning, i.e., by optimizing some function defined on a set of examples for which manually judged values of $y_i$ are known. Each parameter $\beta_j$ corresponds to exactly one feature and can be viewed as a “weight” for that feature.
In our work, following the work of Genkin, Lewis and Madigan [27], we adopt a Bayesian framework and choose the $\beta$ that maximizes the posterior log-likelihood of the data,

\[ l(\beta) = -\sum_{i=1}^{n} \ln\left(1 + \exp(-\beta^T x_i y_i)\right) + \ln p(\beta), \]

where the sum runs over the $n$ training examples and $p(\beta)$ is, for each $\beta$, the prior probability that $\beta$ is the correct parameter vector. Note that the feature index $j$ starts from 0. This is because these models have an intercept term, which can also be thought of as corresponding to a feature $x_{i,0}$ which has the value 1.0 for all examples. The prior $p(\beta)$ encodes what we believe are likely values of $\beta$ before seeing the training data.
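The two formulas above can be written out directly in code. The sketch below is illustrative only and is not the BBR implementation: the function names are hypothetical, feature vectors are plain lists whose first element is the constant 1.0 for the intercept, and the prior is passed in as a function so that any choice of $\ln p(\beta)$ can be plugged in.

```python
import math

def log1p_exp(z):
    """Numerically stable ln(1 + exp(z))."""
    if z > 0:
        return z + math.log1p(math.exp(-z))
    return math.log1p(math.exp(z))

def prob_positive(beta, x):
    """p(y = +1 | beta, x) = exp(beta.x) / (1 + exp(beta.x)); x[0] is the constant 1.0."""
    score = sum(b * v for b, v in zip(beta, x))
    return math.exp(-log1p_exp(-score))

def log_posterior(beta, examples, labels, log_prior):
    """l(beta) = -sum_i ln(1 + exp(-beta.x_i * y_i)) + ln p(beta), labels in {+1, -1}."""
    log_lik = 0.0
    for x, y in zip(examples, labels):
        score = sum(b * v for b, v in zip(beta, x))
        log_lik -= log1p_exp(-score * y)
    return log_lik + log_prior(beta)

def flat_log_prior(beta):
    """Placeholder prior for the example: contributes nothing to the objective."""
    return 0.0

examples = [[1.0, 0.5], [1.0, -0.5]]  # leading 1.0 is the intercept feature
labels = [+1, -1]
value_at_zero = log_posterior([0.0, 0.0], examples, labels, flat_log_prior)
```

With all coefficients at zero every example has probability 0.5, so the data term equals $-n \ln 2$; any useful parameter vector should score higher than that.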
Logistic regression [24, 51, 61, 81, 99, 100] and, to a lesser degree, the similar probit regression [16], have been widely used in text classification. Schütze et al. carried out early text categorization experiments using logistic regression [81]. Regularization to avoid overfitting has been based on feature selection, early stopping of the fitting process, and/or a quadratic penalty on the size of regression coefficients. In particular, regularized logistic regression with a Gaussian prior (ridge logistic regression) has been widely used in text categorization [100, 51, 99]. Ridge logistic regression can be interpreted as MAP estimation where $p(\beta)$ is a product of univariate Gaussians with mean 0 and a shared variance [61]. Recently, Genkin et al. [27] showed that MAP estimation with a product of univariate Laplace priors, i.e., a lasso [91] version of logistic regression, was effective for text categorization.
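Ridge logistic regression imposes a univariate Gaussian prior with mean 0 and variance $\tau_j > 0$ on each $\beta_j$:

\[ p(\beta_j \mid \tau_j) = \frac{1}{\sqrt{2\pi\tau_j}} \exp\left(-\frac{\beta_j^2}{2\tau_j}\right) \qquad (5.1) \]

Lasso logistic regression imposes a univariate Laplace (double exponential) prior with mean 0 and scale parameter $\lambda_j > 0$ on each $\beta_j$:

\[ p(\beta_j \mid \lambda_j) = \frac{\lambda_j}{2} \exp(-\lambda_j |\beta_j|) \qquad (5.2) \]

where the variance of the distribution is $2/\lambda_j^2$.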
The mean of 0 encodes our prior belief that $\beta_j$ will be near 0. The variances $\tau_j$ or $2/\lambda_j^2$ are positive constants we must specify. A small variance represents our prior belief that $\beta_j$ is close to 0. A large variance represents a less informative prior belief.
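To make the link between these priors and the familiar ridge and lasso penalties explicit, the following worked step takes the logarithm of a product of independent zero-mean priors, one per coefficient. It assumes only the standard Gaussian density with variance $\tau_j$ and the standard Laplace density with scale parameter $\lambda_j$.

```latex
\begin{align*}
\text{Gaussian:}\quad \ln p(\beta)
  &= \sum_j \ln p(\beta_j \mid \tau_j)
   = -\sum_j \frac{\beta_j^2}{2\tau_j} \;-\; \frac{1}{2}\sum_j \ln(2\pi\tau_j), \\
\text{Laplace:}\quad \ln p(\beta)
  &= \sum_j \ln p(\beta_j \mid \lambda_j)
   = -\sum_j \lambda_j \lvert \beta_j \rvert \;+\; \sum_j \ln\frac{\lambda_j}{2}.
\end{align*}
```

The second term in each line does not depend on $\beta$, so maximizing the posterior amounts to maximizing the data log-likelihood minus a quadratic penalty (Gaussian) or an absolute-value penalty (Laplace). A smaller variance means a larger $1/(2\tau_j)$ or $\lambda_j$, and hence stronger shrinkage of $\beta_j$ toward 0.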
In the absence of prior knowledge, we assume that the variances are the same for all features.
Our experiments in this study use the BBR (Bayesian Binary Regression) package [27].¹ BBR supports two forms of priors: a separate Gaussian prior for each $\beta_j$ or a separate Laplace prior for each $\beta_j$. (The overall prior is the product of the individual priors for feature parameters.) The key difference between the two is that Gaussian priors produce dense parameter vectors with many small but nonzero coefficients, while Laplace priors produce sparse parameter vectors with most coefficients identically equal to 0.
The Gaussian and Laplace distributions each have two parameters, which are viewed as hyperparameters when these distributions are used as priors for logistic regression. The Laplace parameters are the mean $\mu_j$ and the scale parameter $\lambda_j$, corresponding to a variance of $2/\lambda_j^2$. For both distributions the mean is also the mode, and we prefer this interpretation as it emphasizes the intuition that we use the hyperparameter to specify the most likely value of $\beta_j$. Similarly, for convenience, we will talk of the variance as a hyperparameter when using both the Gaussian and Laplace priors. The variance controls how confident we are that $\beta_j$ is near its modal value, which in turn controls our susceptibility to overfitting. We therefore sometimes refer to the variance as the regularization hyperparameter.

¹ http://www.stat.rutgers.edu/~madigan/BBR/
When no domain knowledge is available, we assume that all $\mu_j$'s are 0, and that the regularization hyperparameter is the same for all features (i.e., they share the same prior). We call this the general (or common) prior. This leaves a single regularization hyperparameter to be chosen for the general prior.
In this case, rather than specifying the regularization hyperparameter for the general prior manually based on our prior beliefs, we consider a fixed set of hyperparameter values, and choose the one that maximizes the 5-fold cross-validation estimate of mean posterior log-likelihood on the training data. The prior variances considered for Laplace and Gaussian priors were
These values correspond to this set of prior standard deviations
The prior on the intercept was handled in the same way as that for other parameters in our experiments.
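The selection procedure described above can be sketched as follows. This is a minimal illustration under several assumptions, not the BBR procedure: it fits only the Gaussian-prior model, uses plain gradient ascent as the optimizer, scores each candidate by the mean log-likelihood of the held-out fold (one possible reading of the cross-validation estimate), and takes the candidate variances as an argument because the set of values used in the experiments is not reproduced here. All function names are hypothetical. As in the experiments, the intercept receives the same prior as every other parameter.

```python
import math
import random

def log1p_exp(z):
    """Numerically stable ln(1 + exp(z))."""
    return z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_map_gaussian(X, y, variance, n_iter=200):
    """MAP fit with a zero-mean Gaussian prior, by gradient ascent (assumed optimizer)."""
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    # Step size 1/L, where L bounds the curvature of the objective divided by n.
    step = 1.0 / (0.25 * max(dot(x, x) for x in X) + 1.0 / (n * variance))
    for _ in range(n_iter):
        grad = [-b / variance for b in beta]  # gradient of the log prior
        for x, label in zip(X, y):
            w = label * math.exp(-log1p_exp(label * dot(beta, x)))
            for j in range(d):
                grad[j] += w * x[j]  # gradient of the log-likelihood
        beta = [b + step * g / n for b, g in zip(beta, grad)]
    return beta

def mean_log_likelihood(beta, X, y):
    return sum(-log1p_exp(-label * dot(beta, x)) for x, label in zip(X, y)) / len(y)

def choose_variance(X, y, candidates, k=5, seed=0):
    """Return the candidate variance with the best k-fold held-out score."""
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = {}
    for variance in candidates:
        fold_scores = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in indices if i not in held]
            beta = fit_map_gaussian([X[i] for i in train], [y[i] for i in train], variance)
            fold_scores.append(
                mean_log_likelihood(beta, [X[i] for i in held_out], [y[i] for i in held_out])
            )
        scores[variance] = sum(fold_scores) / k
    return max(scores, key=scores.get), scores

# Toy data: intercept feature plus one informative feature.
X_toy = [[1.0, (i - 10) / 10.0] for i in range(20)]
y_toy = [1 if x[1] > 0 else -1 for x in X_toy]
best_variance, cv_scores = choose_variance(X_toy, y_toy, [0.01, 1.0, 100.0])
```

The candidate list `[0.01, 1.0, 100.0]` is a placeholder for the example only.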
On the other hand, the use of domain knowledge allows us to specify prior distributions with different modes or variances for the domain knowledge terms. However, we assume that all features other than domain knowledge features have the same prior: all $\mu_j$'s are 0, and the regularization hyperparameter is the same for all those features. Setting individual priors based on domain knowledge requires a mode or a domain knowledge relative weight to be specified by the user for each of those features, and a single regularization hyperparameter to be chosen for the general prior. We will explain how we set the parameters of individual priors based on domain knowledge in Section 6.2.
Logistic regression models estimate the probability that the example is a positive/relevant example. We then must convert this probability to a binary class label. The simplest approach is to define a threshold value that the estimated probability must exceed for the test example to be predicted to be relevant.
We tested two approaches to choosing a threshold for a categorization problem:

• MEE (Maximum Expected Effectiveness): Choose the threshold that maximizes the expected value of the effectiveness measure of interest on the test set, under the assumption that the estimated class membership probabilities are correct and that the corresponding binary random variables are independent [48].

• TROT (Training set Optimization of Threshold): Choose the threshold that maximizes the effectiveness measure of interest on the training set.
The effectiveness measures of interest are F1 and T11SU in our work; see Section 6.3.4 for the definitions of these effectiveness measures. The MEE results are reported only for T11SU. The MEE threshold for the T11SU effectiveness measure is $p(y_i = +1) \geq 1/3$ on a probability scale; we thresholded on the approximate value 0.333333 in our experiments. Note that MEE thresholding works the same for T11U, T11NU and T11SU. Computing the MEE threshold for F1 requires processing test examples as a batch [48]. Therefore, we avoid it. Both thresholding approaches were implemented outside BBR.
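The two thresholding approaches can be sketched as follows. The example assumes the TREC-11 linear utility T11U = 2·TP − FP, where TP and FP are the numbers of true and false positives; under that assumption, predicting an example with estimated probability $p$ as positive changes the expected utility by $2p - (1 - p) = 3p - 1$, which is nonnegative exactly when $p \geq 1/3$, matching the MEE threshold stated above. The function names are hypothetical, labels are +1 or −1, and an example is predicted positive when its probability is at least the threshold. Only observed probability values are tried as TROT candidates.

```python
def f1(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2.0 * tp / denom if denom else 0.0

def t11u(tp, fp, fn):
    """Linear utility, assuming the TREC-11 definition 2*TP - FP."""
    return 2.0 * tp - fp

def trot_threshold(probs, labels, measure):
    """TROT: the training-set threshold that maximizes the given measure."""
    best_t, best_value = None, float("-inf")
    for t in sorted(set(probs)):
        tp = fp = fn = 0
        for p, y in zip(probs, labels):
            predicted_positive = p >= t
            if predicted_positive and y == 1:
                tp += 1
            elif predicted_positive:
                fp += 1
            elif y == 1:
                fn += 1
        value = measure(tp, fp, fn)
        if value > best_value:  # ties keep the lowest threshold
            best_t, best_value = t, value
    return best_t

MEE_T11_THRESHOLD = 1.0 / 3.0

def mee_predict_t11(p):
    """MEE rule for the T11 utilities: positive iff the estimated probability >= 1/3."""
    return p >= MEE_T11_THRESHOLD

train_probs = [0.9, 0.8, 0.4, 0.2]
train_labels = [1, 1, -1, -1]
f1_threshold = trot_threshold(train_probs, train_labels, f1)
```

Unlike TROT, the MEE rule for the T11 utilities needs no training-set search, since each example can be decided independently of the others.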
6.1 Motivation
6.2 Incorporating Domain Knowledge
6.2.1 No Domain Knowledge
6.2.2 Using Domain Knowledge as Examples
Text categorization is the problem of assigning documents into predefined categories. Many machine learning methods have been successfully employed for text categorization. Recently excellent results have been obtained with regularized logistic regression [100, 27] and support vector machines (SVMs) [38, 50, 100]. Most text categorization studies used thousands to tens of thousands of training examples to learn accurate classifiers. However, in practice, it may be difficult to obtain enough labeled examples to learn model parameters accurately. It is usually expensive and time-consuming to rely on human indexers or domain experts to efficiently annotate and categorize documents. Although we may not have sufficient training examples, it is possible to find other knowledge that is related to a text categorization task. Such knowledge might come from category descriptions meant for manual indexers, reference materials on the topics of interest, lists of features chosen by a domain expert, or many other sources. Since the amount of digital data is rapidly increasing, it is likely that text, databases, or other sources of knowledge that are related to a text categorization problem are easily accessible. We refer to this readily available information as “domain knowledge” (or
Bayesian statistics provides a convenient framework for combining domain knowledge with training examples [15]. The approach produces a posterior distribution for the quantities of interest (e.g., regression coefficients). Per Bayes' theorem, the posterior distribution is proportional to the product of a prior distribution and the likelihood function. In applications with large numbers of training examples, the likelihood dominates the prior. However, with small numbers of training examples, the prior is influential and priors that reflect appropriate knowledge can provide improved predictive performance. Here we apply this approach with logistic regression as our model.
We discussed the Bayesian logistic regression approach in Section 5.2, and previous approaches to integrating domain knowledge in text classification in Section 5.3. Section 6.2 presents our Bayesian approach to incorporating domain knowledge. Section 6.3 describes our experimental methods, while Section 6.4 presents the experimental results. We find on three text categorization test collections, using three diverse sources of domain knowledge, that domain-specific priors can yield large effectiveness improvements. A summary of our findings is given in Section 6.5.
In this section we describe our methods for choosing prior distributions for incorporating prior knowledge into learning. In our experiments, we look at three diverse sources of such domain knowledge, though we refer to all these sources as “domain knowledge texts.” Such knowledge about categories may be readily available from domain experts, classification taxonomies, reference books, etc. Even the words in the category name itself might be good predictors. We consider prior knowledge for a category as a set of words related to that category, and represent it as a text for that category, and assume for simplicity there is exactly one domain knowledge text for each class. We call a set of such texts for all categories the “domain knowledge corpus.”
For a given class, we distinguish between two sets of words. Knowledge words (KWs) are all those words that occur in the domain knowledge text for the class of interest. Other words (OWs) are all words that occur in the training documents for a particular run, but are not KWs. Table 6.1 summarizes the methods discussed in this section.
The first method listed in Table 6.1 was intended to establish a baseline for measuring the effectiveness of using domain knowledge. It learns only from training data and uses no domain knowledge. We call this method “No DK.” Any machine learning algorithm can be used for this purpose. We employ Bayesian logistic regression as well as SVMs with these methods to provide comparisons. SVMs can be directly applied.

Method | Description of the method
No DK (baseline) | [OWs, intercept] - mode: 0, variance: $\sigma^2$, chosen by cross-validation on training examples
DK Examples | Like No DK, but treat the domain knowledge text for the class as X positive examples
Var | [OWs, intercept] - mode: 0, variance: $\sigma^2$
 | [KWs] - mode: 0, variance: $\sigma_j^2 = C_{DKRW} \times \sigma^2$
 | ($C_{DKRW}$, $\sigma^2$) pair chosen by cross-validation on training data
Var/TFIDF | [OWs, intercept] - mode: 0, variance: $\sigma^2$
 | [KWs] - mode: 0, variance: $\sigma_j^2 = C_{DKRW} \times \mathrm{significance}(t_j, Q) \times \sigma^2$ for term $t_j$ and topic $Q$
 | ($C_{DKRW}$, $\sigma^2$) pair chosen by cross-validation on training data
Mode | [OWs, intercept] - mode: 0, variance: $\sigma^2$
 | [KWs] - mode: $\mu_j = C_{DKRW}$, variance: $\sigma^2$
 | ($C_{DKRW}$, $\sigma^2$) pair chosen by cross-validation on training data
Mode/TFIDF | [OWs, intercept] - mode: 0, variance: $\sigma^2$
 | [KWs] - mode: $\mu_j = C_{DKRW} \times \mathrm{significance}(t_j, Q)$, variance: $\sigma^2$
 | ($C_{DKRW}$, $\sigma^2$) pair chosen by cross-validation on training data

Table 6.1: Summary of tested methods for incorporating domain knowledge into learning. $C_{DKRW}$ is a constant specifying the relative weight given to domain knowledge.
In our work, the “No DK” baseline (Table 6.1) with Bayesian logistic regression does not make any distinction between different words, i.e., all words are OWs. All words have the same form of prior (always symmetric), the same mode (always 0), and the same variance. The only hyperparameter to choose is the prior variance. Text classification research using regularized logistic regression has usually set all prior modes to 0, and all prior variances to a common value (or has used the equivalent non-Bayesian regularization). Some papers explore several values for the prior variance [51, 100], others use a single value but do not say how it was chosen [61, 99], and others choose the variance by cross-validation on the training set [27]. We tried two methods of choosing the variance: a cross-validation approach and a norm-based heuristic. Most of our experiments used the cross-validation approach (Section 5.2.1) to choose a common prior variance for OWs. However, we also conducted an experiment using the norm-based heuristic; see Section 7.5.
6.2.2 Using Domain Knowledge as Examples
Another simple baseline is to create X copies of the domain knowledge text for a class and add these copies to the training data as additional positive examples (“DK Examples” in Table 6.1), as in some relevance feedback approaches. This can be considered a straightforward way of using domain knowledge. (In our experiments, these values were used for X: 1, 5, 625, 2000, and 5000. However, values greater than 5 did not increase effectiveness. Therefore, in the result tables, we present results only for X=1 and X=5.) We run these artificial documents through the same tokenizer we use for documents; see Section 6.3.3. Term weighting is also the same as for documents (see Section 6.3.3). Specifically, the IDF weights of terms are based on the appropriate set of documents of the type being classified, not the other domain knowledge texts. Method “DK Examples” can also be run using any machine learning method. We use Bayesian logistic regression and SVMs with this method.
Note that when DK texts are used as artificial examples for training with BBR, they participate in the cross-validation process to select a hyperparameter in the same fashion as normal examples.
The last four methods in Table 6.1 use domain knowledge to define class-specific prior distributions for Bayesian logistic regression. The basic idea is that the words in a domain knowledge text for a category are likely to be positive indications that a document belongs to the corresponding category. In a logistic regression model, this means that, compared to other words, domain words are more likely to have a coefficient with a large positive magnitude. The way we encode this is by giving the words from domain knowledge texts a prior probability either whose mode is more positive than that for other words or whose variance is higher than that for other words (favoring a larger magnitude).
Each method is specified in terms of the distributional form of priors for KWs and OWs, and the parameters of the appropriate distribution. KWs are given more ability to affect classification by assigning them a larger prior mode or variance than OWs. All four methods use a heuristic constant $C_{DKRW}$, the “domain knowledge relative weight”, to control how much more influence KWs have. This constant can be set manually or, as in our experiments, chosen by cross-validation on the training set (Section 6.2.3).
We may have descriptions of prior knowledge on many different categories and we may be able to take advantage of information about words across domain knowledge descriptions to help tune the numeric values of hyperparameters. Therefore, two of our methods look not just at the domain knowledge text for the target class, but at the texts for other classes, in order to determine how significant to the target class each word in its domain knowledge text is. In other words, using domain knowledge can be conceptualized as the following two step process. First we analyze the prior knowledge corpus to decide what terms are “significant” for use as prior knowledge, and to compute the “significance” of each term. Second, the significances of the selected terms are translated into hyperparameters of the distribution for these terms. Note that we assume that there is a single domain knowledge document for each category. As a heuristic measure of significance, we use TFIDF weighting (Section 6.3.3) within the domain knowledge corpus:

\[ \mathrm{significance}(t, Q) = \mathrm{logtf}(t, d) \times \mathrm{idf}(t), \qquad (6.1) \]

where

• $d$ is the domain knowledge text for class $Q$,

• $\mathrm{logtf}(t, d) = 0$ if term $t$ does not occur in text $d$, or $1 + \log_e(\mathrm{tf}(t, d))$ if it does, where $\mathrm{tf}(t, d)$ is the number of occurrences of $t$ in $d$,

• $\mathrm{idf}(t) = \log_e((N_K + 1)/(\mathrm{df}(t) + 1))$, where $N_K$ is the total number of domain knowledge texts used to compute IDF weights, and $\mathrm{df}(t)$ is the number of those documents that contain term $t$.
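Equation 6.1 can be computed in a few lines. The sketch below is illustrative: the function name `significance_scores` and the input layout (one already-tokenized domain knowledge text per class) are assumptions for the example, and tokenization itself (Section 6.3.3) is not shown.

```python
import math
from collections import Counter

def significance_scores(dk_texts):
    """Compute significance(t, Q) = logtf(t, d) * idf(t) for every class Q.

    dk_texts: {class_name: list of tokens of that class's domain knowledge text}.
    Returns {class_name: {term: significance}} for the terms occurring in each text;
    terms absent from a text have logtf = 0 and are simply omitted.
    """
    n_k = len(dk_texts)  # number of domain knowledge texts
    df = Counter()
    for tokens in dk_texts.values():
        df.update(set(tokens))  # document frequency within the DK corpus
    scores = {}
    for cls, tokens in dk_texts.items():
        tf = Counter(tokens)
        scores[cls] = {
            term: (1.0 + math.log(count)) * math.log((n_k + 1) / (df[term] + 1))
            for term, count in tf.items()
        }
    return scores

toy_corpus = {
    "A": ["gene", "gene", "cell"],
    "B": ["cell", "protein"],
}
toy_scores = significance_scores(toy_corpus)
```

In this toy corpus the term "cell" occurs in every domain knowledge text, so its IDF and therefore its significance is 0, while "gene" is specific to class A and receives a positive score.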
We now describe the methods.
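As a compact illustration of how the four prior-based methods summarized in Table 6.1 differ, the following sketch builds a (mode, variance) pair for each feature. It is a hedged reading of the table rather than the BBR configuration used in the experiments: the function name `build_priors` and its arguments are hypothetical, `sigma2` stands for the common variance $\sigma^2$, `c_dkrw` for $C_{DKRW}$, and `significance` for precomputed values of significance($t_j$, $Q$).

```python
def build_priors(features, knowledge_words, method, sigma2, c_dkrw=1.0, significance=None):
    """Return {feature: (mode, variance)} for one class.

    Other words (OWs) and the intercept always get mode 0 and variance sigma2.
    Knowledge words (KWs) get a larger variance or a positive mode, by method.
    """
    significance = significance or {}
    priors = {}
    for feature in features:
        mode, variance = 0.0, sigma2
        if feature in knowledge_words and method != "No DK":
            if method == "Var":
                variance = c_dkrw * sigma2
            elif method == "Var/TFIDF":
                variance = c_dkrw * significance[feature] * sigma2
            elif method == "Mode":
                mode = c_dkrw
            elif method == "Mode/TFIDF":
                mode = c_dkrw * significance[feature]
            else:
                raise ValueError(f"unknown method: {method}")
        priors[feature] = (mode, variance)
    return priors

toy_priors = build_priors(
    features=["intercept", "gene", "the"],
    knowledge_words={"gene"},
    method="Mode/TFIDF",
    sigma2=1.0,
    c_dkrw=2.0,
    significance={"gene": 0.5},
)
```

One practical point this sketch does not settle is what to do under Var/TFIDF when a knowledge word has significance 0, which would produce a zero variance; how that case is handled is not specified here.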
6.3 Experimental Methods
6.3.1 Alternate Supervised Learning Approaches to Text Classification
6.3.2 Datasets
In this section, we describe our experimental methods for studying domain knowledge in logistic regression. We compare Bayesian logistic regression with and without domain knowledge, as well as with Support Vector Machines (SVMs), one of the state-of-the-art methods in text categorization. After describing SVMs, we describe which document collections and category labels were used, how documents were represented as numeric vectors, and how we measured effectiveness.
6.3.1 Alternate Supervised Learning Approaches to Text Classification
As a baseline to ensure that logistic regression was producing reasonable classifiers without domain knowledge, we trained support vector machine (SVM) classifiers on all training sets. SVMs are one of the most robust and effective approaches to text categorization [50, 82]. The basic idea of SVMs is to separate positive and negative examples accurately by maximizing the margin between the two sets of examples.
In our experiments, we used Version 5.0 of the SVM_Light software [38, 39].¹ All parameters were kept at their default values. In particular, we used a linear kernel (-t option unspecified), equal weighting of positive and negative examples (-j option unspecified), and the default regularization parameter C, which is the reciprocal of the average norm of training examples (-c option unspecified). Keeping the -c option at its default meant that SVM_Light used the default choice (C was set to 1.0 for our cosine normalized examples). A wide variety of methods have been proposed for choosing the regularization parameter and threshold value for SVMs, with no agreement on a standard method. We used one of the more common methods. Thresholding for SVMs is discussed in Section 6.3.5.

¹ http://svmlight.joachims.org/
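The remark that the default C equals 1.0 for cosine normalized examples can be checked with a short calculation. The sketch assumes that the default is the reciprocal of the average inner product of each training vector with itself, which is how the SVM_Light documentation describes it; the function name `default_c` is hypothetical and this is not SVM_Light code.

```python
import math

def default_c(examples):
    """Reciprocal of the average x.x over the training vectors (assumed default rule)."""
    average_self_product = sum(sum(v * v for v in x) for x in examples) / len(examples)
    return 1.0 / average_self_product

def cosine_normalize(x):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

raw_vectors = [[3.0, 4.0], [1.0, 0.0], [2.0, 2.0]]
unit_vectors = [cosine_normalize(x) for x in raw_vectors]
c_for_unit_vectors = default_c(unit_vectors)
```

Every cosine normalized vector has unit length, so the average is 1 and the default C is 1.0 regardless of the training set.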
Our text classification experiments used three publicly available text categorization datasets for which domain knowledge texts were publicly available. Note that, for each dataset, we chose categories that had a relatively large number of positive examples, to give us flexibility to experiment with different training sizes. Next we briefly describe these datasets.
TREC Genomics Track Biomedical Journal Articles
This is a collection of full text articles used in the TREC 2004 genomics track categorization experiments [33].² The genomics track itself featured a few atypical categorization tasks. However, all these articles are from journals that are indexed in the National Library of Medicine’s MEDLINE system. They therefore have corresponding MEDLINE records that include manually assigned MeSH (Medical Subject Headings) terms. Here we posed as our classification task predicting the presence or absence of selected MeSH headings associated with articles. We will use the abbreviation “Bio Articles” for this dataset.
Documents. We split the Bio Articles data into three 8-month segments. We used the first segment for training and the last segment for testing. The middle segment was reserved as a development set for parameter tuning and similar experimentation. However, we did not need any tuning using development data at all. Therefore, we did not use such data in our experiments. Note that we used the “DP - ” field (date of publication) from MEDLINE entries for splitting the data. The population sizes are shown in Table 6.2.

Table 6.2: Full text biomedical journal articles data set (“Bio Articles”). Training sets of various sizes were drawn from the training population of 3742 articles, and classifiers were evaluated on the test set of 4175 articles. The development set was set aside for tuning, but was not used in the experiments reported here.

² http://trec.nist.gov/data/t13_genomics.html
We experimented with both full text and abstract representations of articles. A portion of a sample full text article is shown in Figure 6.2. The corresponding MEDLINE entry is shown in Figure 6.3.
Categories. Medical Subject Headings (MeSH) are organized in 16 trees (called “categories” by the National Library of Medicine): category A for anatomical terms, category B for organisms, C for diseases, D for drugs and chemicals, etc. Each tree consists of subtrees. Within each subtree, descriptors are organized hierarchically from most general to most specific in up to eleven hierarchical levels. Each MeSH heading has been associated with one or more descriptor tree numbers that indicate its place in the hierarchy [53]. The MeSH vocabulary (MeSH 2005 version) contains 22,568 descriptors, and we treat each descriptor as a category for classification experiments. An average of 10 MeSH indexing terms are assigned to each MEDLINE citation by NLM indexers, who choose the most specific MeSH heading(s) that describe the concepts discussed. A MeSH heading consists of a main heading and optionally subheadings. For example, the MeSH descriptors in Figure 6.3 include “Recombinant Proteins/chemistry”. In this example, “Recombinant Proteins” is the main heading, and “chemistry” is the subheading. A total of 17,177 distinct main headings exist in the Bio Articles collection. In MeSH, a child tree number shares its left-most digits with its parent tree number, and differs in its three rightmost digits. For example, the MeSH tree number for “Hepatocytes” is
ARTICLE
ENZYME CATALYSIS AND REGULATION
Crystal Structure of 1,3-Glucuronyltransferase I in Complex with Active Donor Substrate UDP-GlcUA
Lars C. Pedersen, Thomas A. Darden, Masahiko Negishi