
PhD dissertation: Using Domain Knowledge for Text Mining


DOCUMENT INFORMATION

Basic information

Title: Using Domain Knowledge for Text Mining
Author: Aynur Dayanik
Advisor: Craig Nevill-Manning
Institution: Rutgers, The State University of New Jersey
Field: Computer Science
Document type: Dissertation
Year: 2006
City: New Brunswick
Pages: 156
File size: 15.1 MB

Structure

  • 1.1. Information Retrieval for Bioinformatics
  • 1.2. Text Classification
  • 2.3.1. Sequence-Sequence Relationships
  • 2.3.2. Structure-Structure Relationships
  • 2.3.3. Literature-Literature Relationships
  • 2.3.4. Structure-Sequence Relationships
  • 2.3.5. Sequence-Literature Relationships
  • 2.3.6. Structure-Literature Relationships
  • 2.3.7. A Network of Literature, Sequences, and Structures
  • 3.4.1. Identifying Descriptive Terms From Abstracts
  • 3.5. Experimental Results and Discussion
  • 3.5.2. Biological Domains
  • 3.5.3. Expert Analysis
  • 3.5.4. Correlation between Clusters and GO Categories - GO Term Assignment to Clusters
  • 5. Text Categorization and Bayesian Logistic Regression
  • 5.1. Text Categorization
  • 5.2. Bayesian Logistic Regression
  • 5.2.1. Choice of Hyperparameter
  • 5.2.2. Threshold Selection
  • 5.3. Prior Work
  • 6. Using Domain Knowledge for Text Classification
  • 6.1. Motivation
  • 6.2. Incorporating Domain Knowledge
  • 6.2.1. No Domain Knowledge
  • 6.2.2. Using Domain Knowledge as Examples
  • 6.2.3. Priors From Domain Knowledge
  • 6.3. Experimental Methods
  • 6.3.1. Alternate Supervised Learning Approaches to Text Classification
  • 6.3.2. Datasets
  • 6.4.2. Experiments on Small Training Sets
  • 6.5. Summary
  • 7. Other Studies
  • 7.1. Bio Articles Collection Abstract vs. Full Text Representation
  • 7.2. Multiple Runs for Small Training Sets
  • 7.3. TROT vs. MEE Thresholding
  • 7.4. Effectiveness of Cross-Validation in Choosing DKRW Values
  • 7.5. Hyperparameter Selection
  • 8. Conclusions and Future Work
  • References
  • Vita

Contents

Comparing Bayesian logistic regression (Laplace and Gaussian priors) with SVMs on three text categorization test collections using large training sets and with no domain knowledge.

Information Retrieval for Bioinformatics

The growth of bioinformatics has resulted in a body of data that is highly interconnected. One reason for this connectivity is the fact that, in the biomedical literature, acceptance of a paper for publication is contingent on the submission of any biological sequences or structures determined or analyzed in the course of the research to the appropriate databases. The corresponding abstract in MEDLINE, a biomedical literature database, is then annotated with the ID of the sequence or structure. Furthermore, the detailed annotation of entries in biological databases provides links between databases. This linking allows researchers to find experimental data very easily once they have identified a paper of interest, or conversely to find an analysis of a particular sequence or structure. The National Center for Biotechnology Information (NCBI), part of the National Library of Medicine, provides online access to MEDLINE abstracts, GenBank sequences, and many other data types through their Entrez system. The ability to browse data and literature is important, but the underlying data has much greater

In the first part of the thesis, we are concerned with using domain knowledge in clustering and information retrieval tasks for bioinformatics. We propose to combine different knowledge sources in a uniform computational framework. We believe that explicitly encoded relationships between abstracts and sequences, abstracts and structures, and sequences and structures, coupled with similarity relationships among the same types of biological objects, can be utilized to infer clusters of abstracts, sequences, and structures. By representing the network of relationships as a graph, we describe a method for clustering biological data. To accomplish this, we first construct a graph of biological sequences, structures and scientific publications with pairwise relationships. Then we employ graph partitioning techniques to infer clusters of related articles, sequences and structures.

We also present one application of our approach to the problem of finding scientific papers (MEDLINE abstracts) that describe functions of particular genes, an ad hoc information retrieval task in the TREC 2003 Genomics Track.

Text Classification

Sequence-Sequence and Structure-Structure Relationships

Sequence alignment allows us to relate two protein sequences based on similarities of their amino acid sequences [59]. It computes a distance between two sequences, defined as the minimum number of insertions, deletions or mutations required to transform one sequence into the other. This can be computed by dynamic programming in O(nm) time, where n and m are the lengths of the two sequences [84]. For protein sequences, the probability of a mutation depends on the two amino acids in question. The cost of a mutation is usually defined by a substitution matrix, which gives a cost for each pair of amino acids; mutations and insertions in either sequence make a negative contribution. Karlin and Altschul [41] have studied the statistical significance of these similarity scores so that they can be converted to a probability. For computing similarity of sequences in our experiments in Chapter 3, we use a publicly available tool called BLAST (Basic Local Alignment Search Tool) [4].
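To make the dynamic-programming step concrete, here is a minimal sketch of the unit-cost edit distance between two sequences. It is an illustration only, with equal costs for insertions, deletions and mutations; real protein alignment tools such as BLAST instead score each amino acid pair with a substitution matrix and use gap penalties.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions or mutations turning a into b.

    Classic O(n*m) dynamic program over an (n+1) x (m+1) table.
    """
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]   # dp[i][j]: distance a[:i] vs b[:j]
    for i in range(n + 1):
        dp[i][0] = i                              # delete all of a[:i]
    for j in range(m + 1):
        dp[0][j] = j                              # insert all of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1     # mutation cost
            dp[i][j] = min(dp[i - 1][j] + 1,            # deletion
                           dp[i][j - 1] + 1,            # insertion
                           dp[i - 1][j - 1] + cost)     # match / mutation
    return dp[n][m]

print(edit_distance("MKTAYIAK", "MKTAHIAR"))  # -> 2
```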

Techniques for comparing structures try to align the atoms in one structure with the atoms in another structure in order to minimize the distance between the closest pairs of atoms in the two structures. In this thesis, we will not compute similarities between structures. Instead, we will rely on the existing relationships between structures that are available in public databases such as the SCOP (Structural Classification of Proteins) database [58].

Literature-Literature Relationships

Document similarity computation allows us to relate two documents. Given a pair of documents in a document collection, their similarity can be computed as follows. Each document is represented as a vector of numeric feature values. In a document vector, each component corresponds to a distinct term in the document collection and has a value that is a function of how often the term corresponding to that component appears in the document and in the whole collection (called term weighting). A commonly used term weighting scheme is TFxIDF (term frequency times inverse document frequency). With this weighting scheme, the weight of a term $t_j$ in a document $d_i$, $w_{ij}$, is given by
$$w_{ij} = f_{ij} \times \log\frac{N}{n_j},$$
where $f_{ij}$ is the frequency of term $t_j$ in document $d_i$, $n_j$ is the number of different documents the term $t_j$ appears in, and $N$ is the total number of documents in the collection. The IDF component of this weighting scheme increases the weights of terms that occur in few documents. Each document vector is normalized to have length one, and the similarity of two vectors is computed as the inner product of the vectors, i.e. the cosine of the angle between the vectors [74]. Cosine ranking with TFxIDF weighting performs well in practice; therefore, most information retrieval systems use it for relevance ranking. For computing similarity of documents, we use the MG software as mentioned above.
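The following is a minimal sketch of TFxIDF weighting and cosine similarity as described above. It assumes naive whitespace tokenization and is only an illustration; the thesis uses the MG retrieval engine for this computation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build length-normalized TFxIDF vectors: w_ij = f_ij * log(N / n_j)."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]            # naive tokenizer
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: f * math.log(N / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})    # unit length
    return vectors

def cosine(u, v):
    """Inner product of unit vectors = cosine of the angle between them."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

docs = ["calcium binding protein calmodulin",
        "calmodulin activates kinases upon calcium binding",
        "bacterial chemotaxis signaling pathway"]
vecs = tfidf_vectors(docs)
print(round(cosine(vecs[0], vecs[1]), 3), round(cosine(vecs[0], vecs[2]), 3))
```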

Shared keywords constitute another kind of literature-literature relationship, e.g., MeSH terms and chemical names assigned manually by human indexers. Two documents that share such a keyword have a specific relationship. The advantage of linking literature through keywords is that they make the relationship clear to the user.

Structure-Sequence and Sequence-Literature Relationships

A protein structure always has an associated protein sequence. In fact, the 3-D locations of atoms in a structure description are ordered by the primary sequence. Given a structure, the sequence can always be determined. On the other hand, relating a given sequence to a structure is straightforward only if the sequence has a known structure, which is the case for just 0.4% of sequences (20,000 structures out of 3.5 million sequences). These relationships are encoded in the PDB (Protein Data Bank), which contains 3-D structural data of biological macromolecules (proteins and nucleic acids) [11].

Relationships between protein sequences and literature are explicitly encoded in SWISS-PROT, the Swiss Protein Database, a well-annotated protein sequence database [10]. Relationships between DNA sequences and literature are explicitly encoded in GenBank, the NIH genetic sequence database. In the genomics literature, acceptance of a paper for publication is contingent on the submission of any sequence described in the paper to GenBank. For example, 79% of primate sequences in GenBank have links to the literature, and each literature entry is linked to four sequences on average. In MEDLINE, the literature database, a maximum of thirty sequence links are recorded, but more links can be inferred by inverting the links from sequence to abstract. As an example of sequence-literature links, the paper “Molecular diversity in amino-terminal domains of human calpastatin by exon skipping” contains links to twelve sequences in GenBank, each of which is an example of human calpastatin sequenced as part of the experiments described in the paper. Each of the sequences has a corresponding link to the MEDLINE entry for the paper.

The literature entries associated with structural entries usually describe the experimental procedure used to determine the structure. As with sequence data, deposition of the structure with the Protein Data Bank (PDB) is required for publication. In addition to describing the structure, the authors often make observations about how the structure clarifies the mechanism of the protein's enzymatic activity. There is only one publication associated with the structure determination.

2.3.7 A Network of Literature, Sequences, and Structures

Using the six relationships defined above, a graph can be constructed with biological objects at the vertices and pairwise relationships labeling the edges. Part of this graph already exists in NCBI's Entrez system (http://www.ncbi.nlm.gov), which allows users to follow the relationships described above; Entrez is heavily used by biologists for browsing. In this thesis, we propose using the graph representation as the basis for computation. In particular, we are interested in inferring clusters of biological objects from a heterogeneous corpus of information by partitioning the graph.

Note that there are two different kinds of relationships that papers, sequences and structures might have to each other. The first is similarity, as described in the relationships above. The second is a type of relationship explicitly encoded in databases. For example, two protein sequences might be connected because they cooperate to perform a cellular activity, e.g., the multiple subunits of the replisome. Literature links not only similar proteins but also related proteins: a paper describing replication may discuss both proteins involved in the replisome. By clustering, it might be possible to construct a broad group of sequences and literature around a particular cellular function.

Given a weighted, undirected graph, the objective of graph partitioning is to partition the graph into k roughly equal parts such that the sum of the weights of the edges connecting different parts is minimized, thereby producing subgraphs that are relatively highly connected. The graph partitioning problem is NP-complete [26]. However, many heuristics have been developed that find a reasonably good partition.

Traditional graph partitioning algorithms compute a partition of a graph by operating directly on the graph, and they are usually slow. On the other hand, multilevel graph partitioning algorithms reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph [31, 42, 43]. These algorithms are generally fast and produce high-quality partitions. We chose the pmetis program provided by the METIS software,¹ a publicly available graph partitioning package [1]. The partitioning algorithm used by pmetis is based on the multilevel recursive bisection described in [42].

The multilevel recursive bisection method works as follows. First, the size of the graph is reduced by collapsing nodes and edges down to a few thousand nodes. Then the smaller graph is partitioned into two parts. Partitioning is repeated, uncoarsening each part one level up, until all k subsets are obtained. The whole partitioning is done in log(k) phases.
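For intuition, the sketch below shows plain recursive bisection, without the multilevel coarsening that METIS adds, using the Kernighan-Lin bisection heuristic from NetworkX. The function names and the assumption that k is a power of two are illustrative choices, not the thesis's procedure.

```python
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

def recursive_bisection(G, k):
    """Split G into k parts by repeated 2-way partitioning (k a power of two).

    Each phase bisects every current part, so k parts need log2(k) phases.
    pmetis additionally coarsens the graph before each bisection, which this
    simple sketch omits.
    """
    parts = [set(G.nodes)]
    while len(parts) < k:
        next_parts = []
        for part in parts:
            sub = G.subgraph(part)
            if sub.number_of_nodes() < 2:
                next_parts.append(part)
                continue
            a, b = kernighan_lin_bisection(sub, weight="weight", seed=0)
            next_parts.extend([set(a), set(b)])
        parts = next_parts
    return parts

def edge_cut(G, parts):
    """Total weight of edges whose endpoints fall in different parts."""
    label = {n: i for i, part in enumerate(parts) for n in part}
    return sum(d.get("weight", 1) for u, v, d in G.edges(data=True)
               if label[u] != label[v])

G = nx.karate_club_graph()
nx.set_edge_attributes(G, 1, "weight")
parts = recursive_bisection(G, 4)
print([len(p) for p in parts], "cut =", edge_cut(G, parts))
```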

1 Available at http://www-users.cs.umn.edu/~karypis/metis/.

Chapter 3 Clustering in Relational Biological Data

“We understand a complicated system by understanding its simple parts, and by understanding the simple relations between those parts and their immediate neighbors.” (Don Knuth)

The scientific endeavor of biology has become increasingly reliant on data in electronic form, and it is therefore necessary for biologists to manage and understand large quantities of data. Publicly available data, including biological sequences, biological structures, and literature in the life sciences, have grown to such an extent that computing is essential simply to store and access them. Here we describe a clustering approach that exploits the relational structure of biological data to help with the next step: to enhance understanding of the data by combining techniques from information retrieval with those from bioinformatics. By computing over a network of biological sequence, structure and literature relationships, it is possible to infer clusters of related articles, sequences and structures. We describe the general framework and its application to several biological domains.

Clustering is the task of grouping a set of objects into different subsets such that objects belonging to the same cluster are highly similar to each other. Conventional clustering algorithms employ a distance (or similarity) measure to form the clusters [44].

On the other hand, graph partitioning algorithms exploit the structure of a graph to find highly connected objects. The rich relational structure of biological data can be represented as a graph for clustering biological data. Clustering biological data is useful not only for exploring the data but also for discovering hidden relationships behind the raw data.

Here we describe a technique for clustering biological objects: sequences, structures and literature. We use METIS, a multilevel graph partitioning system, to form the clusters. This process identifies subsets of nodes that are highly connected to each other but are less strongly connected to the rest of the graph. Our clustering approach finds clusters that are disjoint, as opposed to clusters that overlap. We evaluate the quality of clustering using existing sequence and structure classifications and show that our clustering solution provides much better clustering than random partitioning. These clusters are formed based on the pairwise relationships among biological data, so we can evaluate their topical cohesiveness by examining independent metadata such as Gene Ontology (GO) annotations and terms in MEDLINE abstracts. We also evaluate the clusters by hand for relevance, and find that the clusters are highly topical.

In the next section, we start by describing the databases we used. In Section 3.3, we describe the construction of a graph from the databases. Section 3.4 describes the BioIR system we built. In Section 3.5, we present and discuss the empirical results to assess the quality of the clusters. Finally, we end with a summary.

In this section, we will briefly describe the data sources we used to construct our graph.

MEDLINE: MEDLINE is a digital collection of life science literature consisting of over twelve million abstracts. It contains some additional information associated with each abstract, such as manually assigned MeSH terms and chemical names. Moreover, MEDLINE entries contain links to the sequences and structures that the article discusses. For instance, Figure 3.5 shows excerpts from a MEDLINE record that contains references to three structures in PDB. In this chapter, we used the MEDLINE abstracts from the MEDLINE 2002 distribution that are connected to either SWISS-PROT or PDB entries. This amounted to 83,609 MEDLINE abstracts. We chose to work with these abstracts because we wanted to focus on organizing information about proteins.

SWISS-PROT: The Swiss Protein Database (SWISS-PROT) is a curated protein sequence database [10]. The database contains high-quality annotation, including descriptions of each protein's function. SWISS-PROT entries are cross-referenced to several other databases, including MEDLINE, PROSITE and the PDB. We used the whole SWISS-PROT database, Release 41 of February 2003, containing 120,704 protein sequences. Figure 3.1 illustrates a portion of a sample SWISS-PROT entry. Figure 3.2 shows the distribution of the number of references to particular MEDLINE entries (degree) for the protein sequence entries in SWISS-PROT.

PDB: The Protein Data Bank (PDB) contains 3-D structural data of biological macromolecules (proteins and nucleic acids) [11]. The PDB entries are cross-referenced to the primary citations in MEDLINE and other databases, including ENZYME and SWISS-PROT. We used the whole PDB database downloaded in July 2003; this version of the database has 22,135 structures. Figure 3.4 shows the distribution of the number of references to particular MEDLINE entries (degree) for all structure entries in PDB as of July 2003.

Figure 2.1 in Chapter 2 illustrates the relationships between biological data types. Using the relationships between biological data objects, we construct a weighted undirected graph where nodes correspond to entries from the databases listed in Section 3.2, including MEDLINE abstracts, protein sequences from SWISS-PROT, and structures from PDB. Figure 3.5 shows excerpts from a MEDLINE record that contains references to three structures in PDB, along with the title and abstract of the paper.

Edges in the graph correspond to explicit links between entries encoded in the databases, such as the sequence annotations in MEDLINE abstracts, and to pairwise similarity relationships between objects of the same type. We use BLAST [4], a sequence alignment tool, to compute similarities between protein sequences. We employ MG¹ [95], a full-text retrieval engine, to compute similarities between MEDLINE abstracts.
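A minimal sketch of how such a graph could be assembled with NetworkX follows. The node naming scheme, the toy records and the similarity scores are illustrative assumptions, not the thesis's actual data pipeline.

```python
import networkx as nx

G = nx.Graph()  # weighted, undirected

# Hypothetical database records: each MEDLINE abstract lists the SWISS-PROT
# and PDB entries it is annotated with (explicit links).
medline_links = {
    "PMID:123": {"sprot": ["P63104"], "pdb": ["1A4O"]},
    "PMID:456": {"sprot": ["P63104", "P31946"], "pdb": []},
}
for pmid, links in medline_links.items():
    for acc in links["sprot"]:
        G.add_edge(pmid, f"SP:{acc}", weight=1.0, kind="explicit")
    for pdb_id in links["pdb"]:
        G.add_edge(pmid, f"PDB:{pdb_id}", weight=1.0, kind="explicit")

# Pairwise similarity edges between objects of the same type, e.g. BLAST
# scores for sequences and cosine similarities for abstracts (toy values).
G.add_edge("SP:P63104", "SP:P31946", weight=0.8, kind="blast")
G.add_edge("PMID:123", "PMID:456", weight=0.4, kind="cosine")

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```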

1 Available at http://www.cs.mu.oz.au/mg/.

[Figure 3.1 content: excerpt of a SWISS-PROT entry for 14-3-3 protein zeta/delta (Protein kinase C inhibitor protein-1, KCTP-1; Factor activating exoenzyme S, FAS), showing date lines (01-DEC-1992, Rel. 24, last sequence update; 28-FEB-2003, Rel. 41, last annotation update), the taxonomy (Eukaryota; Metazoa; Chordata; ...; Hominidae; Homo), two literature references, a FUNCTION comment ("ACTIVATES TYROSINE AND TRYPTOPHAN HYDROXYLASES..."), and keyword (KW) lines such as Brain; Neurone; Phosphorylation; Acetylation; Multigene family.]
Figure 3.1: A portion of a sample SWISS-PROT entry. DR lines show cross-references to the PDB and PROSITE entries. KW lines show the SWISS-PROT keywords, separated by ';'.


Figure 3.2: Distribution of MEDLINE references in SWISS-PROT.

Identifying Descriptive Terms From Abstracts

We aim to identify words that best describe the set of documents in a cluster by analyzing the MEDLINE entries of the cluster. These descriptive words can be used as index terms to identify the contents of the clusters. We identified the descriptive words as follows. We considered the words in the title and abstract of all articles in a cluster after eliminating stop words. We removed all punctuation and converted all uppercase letters to lowercase. Then we ranked the resulting words by calculating p-values with respect to the entire set of MEDLINE entries in our collection; the p-value calculation is described in subsection 3.5.4. We kept the top twenty most significant words, the ones having the smallest p-values, in our database for each cluster. We use the resulting set of twenty words to index the clusters, and build a search utility against this index using MySQL.
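The thesis's exact p-value computation is given in subsection 3.5.4 and is not reproduced here; the sketch below uses a hypergeometric enrichment test, one common choice for this kind of ranking, and assumes the documents have already been tokenized, lowercased, and stripped of stop words and punctuation.

```python
from collections import Counter
from scipy.stats import hypergeom

def descriptive_terms(cluster_docs, all_docs, top_k=20):
    """Rank words by a hypergeometric enrichment p-value.

    cluster_docs / all_docs: lists of documents, each a list of tokens
    (title + abstract, stop words removed, lowercased, punctuation stripped).
    """
    M = len(all_docs)                                   # collection size
    N = len(cluster_docs)                               # cluster size
    df_all = Counter(t for d in all_docs for t in set(d))
    df_cluster = Counter(t for d in cluster_docs for t in set(d))
    scored = []
    for term, k in df_cluster.items():
        n = df_all[term]                                # docs containing term overall
        # P(X >= k) for X ~ Hypergeom(M, n, N): how surprising is this count?
        p = hypergeom.sf(k - 1, M, n, N)
        scored.append((p, term))
    return [t for _, t in sorted(scored)[:top_k]]
```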

Experimental Results and Discussion

Biological Domains

The following biological domains were carefully examined by a domain expert, a Ph.D. candidate in Molecular Biology. The domains were chosen based on the expert's areas of expertise; a description of each domain is given below.

Calmodulin: Calmodulin is a ubiquitous intracellular receptor for calcium ions that functions by changing its shape upon binding calcium so that it can bind to and activate or inactivate other proteins. Most proteins activated by calmodulin are so-called CaM-kinases.

Chemotaxis: This is a bacterial signaling pathway involved in chemotaxis. Repellents activate receptors that, with the assistance of CheW, activate CheA; attractants inhibit CheA. CheA activates CheY, which causes the flagella to rotate such that the bacteria tumble. CheZ inactivates CheY.

Rhodopsin and Gt: Rhodopsin is a 7-pass transmembrane G-protein-linked receptor containing a pigment, 11-cis-retinal. Light changes the structure of the pigment, which causes rhodopsin to bind transducin (Gt), a trimeric G-protein. Upon binding, Gt loses its alpha subunit, which diffuses and binds GMP phosphatase, activating it and eventually leading to signaling.

U1 U2 U4 U5 U6 spliceosome: The spliceosome is a protein-RNA complex responsible for splicing introns out of nascent mRNA during its maturation. U1, U2, U4, U5 and U6 are among the different snRNPs present in eukaryotic nuclei; they consist of both protein and small RNA molecules.

Ubiquitin: Ubiquitin-dependent protein degradation plays a role in many cellular processes, including transcriptional regulation, cell cycle progression and DNA repair. Ubiquitin is a highly conserved 8 kDa protein whose many cellular functions are mediated by its covalent ligation to other proteins.

Apoptosis: Apoptosis, or programmed cell death, plays a fundamental role during tissue development, injury and degeneration. The biochemical pathways of programmed cell death are also used to destroy cells with damaged DNA and cells that are infected with viruses.

p53 Signaling Pathway: p53 is a transcription factor whose main function is to prevent the cell from progressing through the cell cycle when DNA damage has occurred. p53 may either halt the cell cycle until the DNA can be repaired or else cause the cell to undergo apoptosis.

Insulin Signaling Pathway: Insulin, a small protein that acts as a hormone, is secreted by the pancreas in response to increased glucose levels in the blood. Most cells of the body have receptors that bind insulin. Upon binding of insulin, the cell activates other receptors designed to absorb glucose from the blood stream into the cell. Insulin is a necessary hormone, and insulin deficiency or resistance results in diabetes.

The domain expert evaluated sixteen clusters: two clusters randomly chosen for each topic of interest described in Section 3.5.2, e.g., calmodulin, apoptosis, etc. Three different types of entities (PDB entries, SWISS-PROT entries and GO terms) were examined by the domain expert to determine how many of them were relevant to the topic of interest. Although GO was not used to obtain the clusters in any way (GO terms are therefore not part of the clusters themselves), GO terms were assigned to the clusters using the SWISS-PROT to GO mappings, as described in subsection 3.5.4. In addition, an overall cluster quality is reported for each manually examined cluster.

For example, Table 3.2 lists the PDB structures in p53 cluster 1674. The SCOP classifications in this table show that the PDB structures in this cluster were indeed classified as p53 by the SCOP database.

Table 3.3 shows the evaluation results judged by the domain expert. For each entity type, a relevancy score between 1 and 10 was assigned, where 10 means all entities of that particular type are highly topical and 1 means that none of them are relevant. Almost all entity types for all clusters have high scores; therefore, we can conclude that all the clusters evaluated by the expert are highly relevant to the topics considered. Biologists note that the SWISS-PROT to GO mapping is incomplete because not all SWISS-PROT sequences are fully annotated. For example, in one cluster for the topic “apoptosis”, the SWISS-PROT gene for E1B is not annotated with apoptosis even though it is involved in apoptosis. Similarly, in another cluster for the topic “apoptosis”, the SWISS-PROT annotation for the CASP-1 genes does not refer to apoptosis, but some MEDLINE entries indicate that it is involved in apoptosis. So, since the SWISS-PROT GO annotation is incomplete, the relevance scores that we obtain based on the GO terms through automated means may underestimate the relevance of the cluster contents.

| PDB ID | PDB Title | SCOP Classification |
| 1hs5 | NMR SOLUTION STRUCTURE OF DESIGNED p53 | a.53.1.1 p53 |
| 1sal | HIGH RESOLUTION SOLUTION NMR STRUCTURE OF THE OLIGOMERIZATION DOMAIN OF p53 BY MULTI-DIMENSIONAL NMR (SAD STRUCTURES) | a.53.1.1 p53 |
| 3sak | HIGH RESOLUTION SOLUTION NMR STRUCTURE OF THE OLIGOMERIZATION DOMAIN OF p53 BY MULTI-DIMENSIONAL NMR (SAC STRUCTURES) | a.53.1.1 p53 |

Table 3.2: Some of the PDB structures in p53 cluster 1674. The SCOP classification of each PDB structure is displayed to show that they are indeed related to p53.

| Topic | Cluster | PDB | SW | GO term | Overall |
| calmodulin | 1794 | 10 | 10 | 8 | 10 |
| calmodulin | 1815 | 5 | 7 | 5 | 5 |
| rhodopsin | 1402 | 10 | 9 | 10 | 10 |
| rhodopsin | 1400 | 3 | 5 | 10 | 7 |
| spliceosome | 1634 | 10 | 10 | 10 | 10 |
| spliceosome | 1648 | N/A | 8 | 6 | 7 |
| chemotaxis | 1072 | 9 | 9 | 9 | 9 |
| chemotaxis | 1071 | 5 | 6 | 4 | 5 |
| apoptosis | 1670 | 10 | 10 | 10 | 10 |
| apoptosis | 1669 | 10 | 9 | 10 | 10 |
| ubiquitin | 1665 | 7 | 2 | 8 | 8 |
| ubiquitin | 1666 | 7 | 10 | 9 | 8 |
| insulin | 1473 | 10 | 10 | 10 | 9 |
| insulin | 1472 | 10 | 7 | 10 | 9 |
| p53 | 1674 | 10 | 8 | 10 | 9 |
| p53 | 1722 | 7 | 7 | 7 | 8 |

Table 3.3: Evaluation of sample clusters by the domain expert. Scores range from 1 to 10, where 10 means all of the objects are relevant and 1 means none of them are relevant.

Another interesting point is that the domain expert at first thought that the 30S ribosomal protein in cluster 1665 was unrelated to ubiquitin and therefore assigned a score of 2. However, the immediate links to MEDLINE entries provided by our system suggested that it should be relevant. We asked the expert to reconsider whether the 30S ribosomal protein could be related to ubiquitin, and the SWISS-PROT sequences in this cluster were reevaluated. After examining some immediate neighbors (MEDLINE entries) of these entities in the graph, the domain expert found that they were indeed relevant to ubiquitin, and now believes that the GO terms assigned to these clusters (nucleus and structural constituent of ribosome) are very relevant to ubiquitin. This 'discovery' aspect of our system is important: it demonstrates that the clusters can bring to light relationships that are not obvious at first glance.

3.5.4 Correlation between Clusters and GO Categories — GO Term Assignment to Clusters

The Gene Ontology project produces a controlled vocabulary for genes and gene products, called GO [90].² GO provides three structured networks of defined terms to describe gene product attributes. These three GO ontologies are referred to as Biological Process, Molecular Function and Cellular Component.

To show how much correlation we obtained between clusters and GO categories, we assigned GO terms to the clusters using the SWISS-PROT to GO mappings. Before explaining how we did this, it is important to note that we did not use GO to construct the graph; we use it only to provide a biological validation.

We mapped SWISS-PROT entries to GO terms as follows. SWISS-PROT entries contain several keywords, called SWISS-PROT keywords; Figure 3.1 shows the keywords for a sample SWISS-PROT entry. The Gene Ontology site provides mappings from external classification systems to GO, including the SWISS-PROT keywords. We downloaded this mapping; an excerpt is shown below.

² http://www.geneontology.org/

[Excerpt from the downloaded SWISS-PROT keyword to GO term mapping file (dated 2002/12/20), for example:
SP_KW:ATP-binding > GO:ATP binding ; GO:0005524
SP_KW:Actin-binding > GO:actin binding ; GO:0003779
SP_KW:Acute phase > GO:acute-phase response ; GO:0006953]
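The sketch below illustrates the mapping step: SWISS-PROT keywords of cluster members are translated to GO terms, and the most frequent terms are recorded for each cluster. The dictionaries and the most-frequent-terms heuristic are illustrative assumptions; the thesis's exact assignment procedure is described in this subsection and in 3.5.4.

```python
from collections import Counter

# Hypothetical inputs: SWISS-PROT keyword -> GO term mapping (from the
# downloaded mapping file) and the SWISS-PROT keywords of each entry.
kw_to_go = {
    "ATP-binding": "GO:0005524",
    "Actin-binding": "GO:0003779",
    "Acute phase": "GO:0006953",
}
entry_keywords = {
    "P63104": ["ATP-binding", "Brain"],
    "P31946": ["Actin-binding", "ATP-binding"],
}
clusters = {1674: ["P63104", "P31946"]}   # cluster id -> SWISS-PROT entries

def go_terms_for_clusters(clusters, entry_keywords, kw_to_go, top_n=5):
    """Assign to each cluster the GO terms most frequently mapped from the
    SWISS-PROT keywords of its member entries."""
    assignment = {}
    for cid, entries in clusters.items():
        counts = Counter(kw_to_go[kw]
                         for acc in entries
                         for kw in entry_keywords.get(acc, [])
                         if kw in kw_to_go)
        assignment[cid] = [go for go, _ in counts.most_common(top_n)]
    return assignment

print(go_terms_for_clusters(clusters, entry_keywords, kw_to_go))
```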

Text Categorization and Bayesian Logistic Regression

Text Categorization

Text categorization is the problem of classifying text documents into categories (or classes). A typical example is to classify news articles by topic based on their contents. Another problem is to automatically identify biomedical articles relevant for gene annotation in model organism databases [34].

Text categorization has attracted the attention of researchers for more than forty years, and statistical methods have recently become dominant in the literature. The basic idea is to use supervised learning to create a classifier from labeled examples: we provide an algorithm with a set of labeled documents, and it learns a model or a decision rule for classifying future documents. This approach gives high-accuracy classifiers. Supervised machine learning algorithms such as Naive Bayes [49], decision trees [7], nearest neighbor methods [97], support vector machines [38, 50, 100], regularized logistic regression [100, 27], and boosting [80] have been successfully employed in a wide variety of text categorization problems. Sebastiani provides an overview of machine learning techniques for text categorization [82].

A major problem with supervised learning algorithms is that they require a large number of labeled training examples to learn model parameters accurately. However, in practice, it may be difficult to obtain enough labeled examples. On the other hand, someone with a need for text categorization often has some kind of domain knowledge indicating which words are likely to be good predictors for the possible categories. Our approach to solving the problem of limited data is to incorporate domain knowledge into learning. We address this problem in Chapter 6.

Logistic regression estimates the probability of category membership given an example using the formula
$$p(y_i = +1 \mid \beta, x_i) = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} = \frac{\exp\bigl(\sum_j \beta_j x_{i,j}\bigr)}{1 + \exp\bigl(\sum_j \beta_j x_{i,j}\bigr)},$$
where $y_i$ encodes the class of example $i$ (positive/relevant = +1, negative/nonrelevant = −1) and $x_{i,j}$ is the value of feature $j$ for example $i$. The model parameters $\beta$ are chosen by supervised learning, i.e. by optimizing some function defined on a set of examples for which manually judged values of $y_i$ are known. Each parameter $\beta_j$ corresponds to exactly one feature and can be viewed as a “weight” for that feature.

In our work, following Genkin, Lewis and Madigan [27], we adopt a Bayesian framework and choose the $\beta$ that maximizes the posterior log-likelihood of the data,
$$l(\beta) = \left(-\sum_{i=1}^{n} \ln\bigl(1 + \exp(-\beta^T x_i y_i)\bigr)\right) + \ln p(\beta),$$
where $p(\beta)$ is, for each $\beta$, the prior probability that $\beta$ is the correct parameter vector. Note that the feature index $j$ starts from 0. This is because these models have an intercept term, which can also be thought of as corresponding to a feature $x_{i,0}$ that has the value 1.0 for all examples. The prior $p(\beta)$ encodes what we believe are likely values of $\beta$ before seeing the training data.
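As a concrete illustration, here is a minimal sketch of MAP estimation for this model with a Gaussian prior (the ridge case discussed below). It uses a generic quasi-Newton optimizer rather than BBR's own algorithm; the Laplace prior would need specialized handling because of its non-differentiability at zero.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(beta, X, y, tau):
    """Negative of l(beta): data term plus Gaussian log-prior (mean 0, variance tau).

    X has an intercept column of ones; y contains +1 / -1 labels.
    """
    margins = y * (X @ beta)
    data_term = np.sum(np.logaddexp(0.0, -margins))   # sum ln(1 + exp(-beta^T x_i y_i))
    prior_term = np.sum(beta ** 2) / (2.0 * tau)      # -ln p(beta), up to a constant
    return data_term + prior_term

def fit_map(X, y, tau=1.0):
    d = X.shape[1]
    result = minimize(neg_log_posterior, np.zeros(d), args=(X, y, tau), method="L-BFGS-B")
    return result.x

# Toy usage with an intercept column prepended to two features.
X = np.array([[1.0, 2.0, 0.0], [1.0, 1.5, 0.5], [1.0, 0.1, 1.0], [1.0, 0.0, 1.2]])
y = np.array([1, 1, -1, -1])
print(fit_map(X, y, tau=1.0))
```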

Logistic regression [24, 51, 61, 81, 99, 100] and, to a lesser degree, the similar probit regression [16], have been widely used in text classification. Schutze et al. carried out early text categorization experiments using logistic regression [81]. Regularization to avoid overfitting has been based on feature selection, early stopping of the fitting process, and/or a quadratic penalty on the size of the regression coefficients. In particular, regularized logistic regression with a Gaussian prior (ridge logistic regression) has been widely used in text categorization [100, 51, 99]. Ridge logistic regression can be interpreted as MAP estimation where $p(\beta)$ is a product of univariate Gaussians with mean 0 and a shared variance [61]. Recently, Genkin et al. [27] showed that MAP estimation with a product of univariate Laplace priors, i.e. a lasso [91] version of logistic regression, is effective for text categorization.

Ridge logistic regression imposes a univariate Gaussian prior with mean 0 and variance $\tau_j > 0$ on each $\beta_j$:
$$p(\beta_j \mid \tau_j) = \frac{1}{\sqrt{2\pi\tau_j}} \exp\!\left(-\frac{\beta_j^2}{2\tau_j}\right). \qquad (5.1)$$

Lasso logistic regression imposes a univariate Laplace (double exponential) prior with mean 0 and scale parameter $\lambda_j > 0$ on each $\beta_j$:
$$p(\beta_j \mid \lambda_j) = \frac{\lambda_j}{2} \exp(-\lambda_j |\beta_j|), \qquad (5.2)$$
where the variance of the distribution is $2/\lambda_j^2$.

The mean of 0 encodes our prior belief that $\beta_j$ will be near 0. The variances $\tau_j$ or $2/\lambda_j^2$ are positive constants we must specify. A small variance represents a prior belief that $\beta_j$ is close to 0; a large variance represents a less informative prior belief.

In the absence of prior knowledge, we assume that the variances are the same for all features.

Our experiments in this study use the BBR (Bayesian Binary Regression) package [27].¹ BBR supports two forms of priors: a separate Gaussian prior for each $\beta_j$, or a separate Laplace prior for each $\beta_j$. (The overall prior is the product of the individual priors for the feature parameters.) The key difference between the two is that Gaussian priors produce dense parameter vectors with many small but nonzero coefficients, while Laplace priors produce sparse parameter vectors with most coefficients identically equal to 0.

The Gaussian and Laplace distributions each have two parameters, which are viewed as hyperparameters when these distributions are used as priors for logistic regression.

¹ http://www.stat.rutgers.edu/~madigan/BBR/

The Laplace parameters are the mean $\mu_j$ and the scale parameter $\lambda_j$, corresponding to a variance of $2/\lambda_j^2$. For both distributions the mean is also the mode, and we prefer this interpretation as it emphasizes the intuition that we use the hyperparameter to specify the most likely value of $\beta_j$. Similarly, for convenience, we speak of the variance as a hyperparameter for both the Gaussian and Laplace priors. The variance controls how confident we are that $\beta_j$ is near its modal value, which in turn controls our susceptibility to overfitting. We therefore sometimes refer to the variance as the regularization hyperparameter.

When no domain knowledge is available, we assume that all $\mu_j$'s are 0 and that the regularization hyperparameter is the same for all features (i.e., they share the same prior). We call this the general (or common) prior. This leaves a single regularization hyperparameter to be chosen for the general prior.

In this case, rather than specifying the regularization hyperparameter for the general prior manually based on our prior beliefs, we consider a fixed set of hyperparameter values and choose the one that maximizes the 5-fold cross-validation estimate of the mean posterior log-likelihood on the training data. The same fixed grid of prior variances (and thus of prior standard deviations) was considered for the Laplace and Gaussian priors.
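A minimal sketch of this selection procedure follows. It uses scikit-learn's L2-penalized logistic regression as a stand-in for BBR (its C parameter plays the role of the prior variance up to scaling), held-out log-likelihood as the selection criterion rather than the cross-validated posterior log-likelihood, and an assumed candidate grid; none of these stand-ins are the thesis's actual settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def choose_prior_variance(X, y, candidate_variances):
    """Pick the candidate whose 5-fold CV held-out log-likelihood is best."""
    best_var, best_score = None, -np.inf
    for var in candidate_variances:
        clf = LogisticRegression(penalty="l2", C=var, max_iter=1000)
        # neg_log_loss is the negative held-out log-likelihood (higher is better)
        score = cross_val_score(clf, X, y, cv=5, scoring="neg_log_loss").mean()
        if score > best_score:
            best_var, best_score = var, score
    return best_var

# Assumed grid of candidate values (the thesis uses its own fixed set).
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
```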

The prior on the intercept was handled in the same way as that for other parameters in our experiments.

On the other hand, the use of domain knowledge allows us to specify prior distributions with different modes or variances for the domain knowledge terms. However, we assume that all features other than domain knowledge features have the same prior: all their $\mu_j$'s are 0, and the regularization hyperparameter is the same for all those features. Setting individual priors based on domain knowledge requires a mode or a domain knowledge relative weight to be specified by the user for each of those features, and a single regularization hyperparameter to be chosen for the general prior. We explain how we set the parameters of individual priors based on domain knowledge in Section 6.2.

Logistic regression models estimate the probability that an example is a positive/relevant example. We must then convert this probability to a binary class label. The simplest approach is to define a threshold value that the estimated probability must exceed for the test example to be predicted relevant.

We tested two approaches to choosing a threshold for a categorization problem:

  • MEE (Maximum Expected Effectiveness): Choose the threshold that maximizes the expected value of the effectiveness measure of interest on the test set, under the assumption that the estimated class membership probabilities are correct and that the corresponding binary random variables are independent [48].
  • TROT (Training set Optimization of Threshold): Choose the threshold that maximizes the effectiveness measure of interest on the training set.

The effectiveness measures of interest in our work are F1 and T11SU; see Section 6.3.4 for their definitions. The MEE results are reported only for T11SU. The MEE threshold for the T11SU effectiveness measure is $p(y_i = +1) \ge 1/3$ on a probability scale, i.e. approximately 0.333333, and that is the value we thresholded on in our experiments. Note that MEE thresholding works the same for T11U, T11NU and T11SU. Computing the MEE threshold for F1 requires processing test examples as a batch [48]; therefore, we avoid it. Both thresholding approaches were implemented outside BBR.
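A minimal sketch of the TROT idea: scan candidate thresholds over the training-set probability estimates and keep the one that maximizes the chosen effectiveness measure, here F1. The helper name and the use of scikit-learn's f1_score are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score

def trot_threshold(train_probs, train_labels):
    """Training set Optimization of Threshold: pick the cutoff maximizing F1.

    train_probs: estimated P(y = +1) for training examples.
    train_labels: true labels in {+1, -1}.
    """
    candidates = np.unique(train_probs)        # every distinct probability is a candidate
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        preds = np.where(train_probs >= t, 1, -1)
        f1 = f1_score(train_labels, preds, pos_label=1)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

# The MEE threshold for T11SU, by contrast, is simply the fixed value 1/3.
```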

Motivation

Text categorization is the problem of assigning documents to predefined categories. Many machine learning methods have been successfully employed for text categorization; recently, excellent results have been obtained with regularized logistic regression [100, 27] and support vector machines (SVMs) [38, 50, 100]. Most text categorization studies have used thousands to tens of thousands of training examples to learn accurate classifiers. However, in practice, it may be difficult to obtain enough labeled examples to learn model parameters accurately: it is usually expensive and time-consuming to rely on human indexers or domain experts to annotate and categorize documents. Although we may not have sufficient training examples, it is often possible to find other knowledge related to a text categorization task. Such knowledge might come from category descriptions meant for manual indexers, reference materials on the topics of interest, lists of features chosen by a domain expert, or many other sources. Since the amount of digital data is rapidly increasing, it is likely that text, databases, or other sources of knowledge related to a text categorization problem are easily accessible. We refer to this readily available information as “domain knowledge” (or “prior knowledge”).

Bayesian statistics provides a convenient framework for combining domain knowledge with training examples [15]. The approach produces a posterior distribution for the quantities of interest (e.g., regression coefficients). By Bayes' theorem, the posterior distribution is proportional to the product of a prior distribution and the likelihood function. In applications with large numbers of training examples, the likelihood dominates the prior. However, with small numbers of training examples, the prior is influential, and priors that reflect appropriate knowledge can provide improved predictive performance. Here we apply this approach with logistic regression as our model.

We discussed the Bayesian logistic regression approach in Section 5.2, and previous approaches to integrating domain knowledge in text classification in Section 5.3. Section 6.2 presents our Bayesian approach to incorporating domain knowledge. Section 6.3 describes our experimental methods, while Section 6.4 presents the experimental results. We find, on three text categorization test collections and using three diverse sources of domain knowledge, that domain-specific priors can yield large effectiveness improvements. A summary of our findings is given in Section 6.5.

In this section we describe our methods for choosing prior distributions that incorporate prior knowledge into learning. In our experiments, we look at three diverse sources of such domain knowledge, though we refer to all of these sources as “domain knowledge texts.” Such knowledge about categories may be readily available from domain experts, classification taxonomies, reference books, etc. Even the words in the category name itself might be good predictors. We treat the prior knowledge for a category as a set of words related to that category, represent it as a text for that category, and assume for simplicity that there is exactly one domain knowledge text for each class. We call the set of such texts for all categories the “domain knowledge corpus.”

For a given class, we distinguish between two sets of words. Knowledge words (KWs) are all those words that occur in the domain knowledge text for the class of interest. Other words (OWs) are all words that occur in the training documents for a particular run but are not KWs. Table 6.1 summarizes the methods discussed in this section.

The first method listed in Table 6.1 was intended to establish a baseline for effectiveness without domain knowledge. It learns only from training data and uses no domain knowledge; we call this method “No DK.” Any machine learning algorithm can be used for this purpose. We employ Bayesian logistic regression as well as SVMs with these methods to provide comparisons; SVMs can be applied directly.

| Method | Description of the method |
| No DK (baseline) | [OWs, intercept] - mode: 0, variance: σ² chosen by cross-validation on training examples |
| DK Examples | Like No DK, but treat the domain knowledge text for the class as X positive training examples |
| Var | [OWs, intercept] - mode: 0, variance: σ². [KWs] - mode: 0, variance: σ_j² = C_DKRW × σ²; the (C_DKRW, σ²) pair chosen by cross-validation on training data |
| Var/TFIDF | [OWs, intercept] - mode: 0, variance: σ². [KWs] - mode: 0, variance: σ_j² = C_DKRW × significance(t_j, Q) × σ² for term t_j and topic Q; the (C_DKRW, σ²) pair chosen by cross-validation on training data |
| Mode | [OWs, intercept] - mode: 0, variance: σ². [KWs] - mode: μ_j = C_DKRW, variance: σ²; the (C_DKRW, σ²) pair chosen by cross-validation on training data |
| Mode/TFIDF | [OWs, intercept] - mode: 0, variance: σ². [KWs] - mode: μ_j = C_DKRW × significance(t_j, Q), variance: σ²; the (C_DKRW, σ²) pair chosen by cross-validation on training data |

Table 6.1: Summary of the tested methods for incorporating domain knowledge into learning. C_DKRW is a constant specifying the relative weight given to domain knowledge.

In our work, the “No DK” baseline (Table 6.1) with Bayesian logistic regression does not make any distinction between different words, i.e., all words are OWs. All words have the same form of prior (always symmetric), the same mode (always 0), and the same variance. The only hyperparameter to choose is the prior variance. Text classification research using regularized logistic regression has usually set all prior modes to 0 and all prior variances to a common value (or has used the equivalent non-Bayesian regularization). Some papers explore several values for the prior variance [51, 100], others use a single value but do not say how it was chosen [61, 99], and others choose the variance by cross-validation on the training set [27]. We tried two methods of choosing the variance: a cross-validation approach and a norm-based heuristic. Most of our experiments used the cross-validation approach (Section 5.2.1) to choose a common prior variance for OWs. However, we also conducted an experiment using the norm-based heuristic; see Section 7.5.

6.2.2 Using Domain Knowledge as Examples

Another simple baseline is to create X copies of the domain knowledge text for a class and add these copies to the training data as additional positive examples (“DK Examples” in Table 6.1), as in some relevance feedback approaches. This can be considered a straightforward way of using domain knowledge. (In our experiments, the values used for X were 1, 5, 625, 2000, and 5000; however, values greater than 5 did not show an increase in effectiveness, so in the result tables we present results only for X=1 and X=5.) We run these artificial documents through the same tokenizer we use for documents (see Section 6.3.3). Term weighting is also the same as for documents (see Section 6.3.3); specifically, the IDF weights of terms are based on the appropriate set of documents of the type being classified, not on the other domain knowledge texts. The “DK Examples” method can also be run using any machine learning method; we use Bayesian logistic regression and SVMs with it.

Note that when DK texts are used as artificial examples for training with BBR, they participate in the cross-validation process to select a hyperparameter in the same fashion as normal examples.

The last four methods in Table 6.1 use domain knowledge to define class-specific prior distributions for Bayesian logistic regression. The basic idea is that the words in a domain knowledge text for a category are likely to be positive indications that a document belongs to the corresponding category. In a logistic regression model, this means that, compared to other words, domain words are more likely to have a coefficient with a large positive magnitude. We encode this by giving the words from domain knowledge texts a prior whose mode is more positive than that for other words, or whose variance is higher than that for other words (favoring a larger magnitude).

Each method is specified in terms of the distributional form of the priors for KWs and OWs, and the parameters of the appropriate distributions. KWs are given more ability to affect classification by assigning them a larger prior mode or variance than OWs. All four methods use a heuristic constant C_DKRW, the “domain knowledge relative weight”, to control how much more influence KWs have. This constant can be set manually or, as in our experiments, chosen by cross-validation on the training set (Section 6.2.3).

We may have descriptions of prior knowledge for many different categories, and we may be able to take advantage of information about words across domain knowledge descriptions to help tune the numeric values of the hyperparameters. Therefore, two of our methods look not just at the domain knowledge text for the target class, but also at the texts for the other classes, in order to determine how significant each word in the target class's domain knowledge text is. In other words, using domain knowledge can be conceptualized as the following two-step process. First, we analyze the prior knowledge corpus to decide which terms are “significant” enough to use as prior knowledge, and to compute the “significance” of each term. Second, the significances of the selected terms are translated into hyperparameters of the prior distributions for those terms. Note that we assume there is a single domain knowledge document for each category. As a heuristic measure of significance, we use TFIDF weighting (Section 6.3.3) within the domain knowledge corpus:
$$\mathrm{significance}(t, Q) = \mathrm{logtf}(t, d) \times \mathrm{idf}(t), \qquad (6.1)$$
where

  • d is the domain knowledge text for class Q,
  • logtf(t, d) = 0 if term t does not occur in text d, and 1 + log_e(tf(t, d)) if it does, where tf(t, d) is the number of occurrences of t in d,
  • idf(t) = log_e((N_K + 1)/(df(t) + 1)), where N_K is the total number of domain knowledge texts used to compute IDF weights, and df(t) is the number of those texts that contain term t.

We now describe the methods.
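A minimal sketch of the two-step process just described: compute significance(t, Q) from equation (6.1) over the domain knowledge corpus, then translate the scores into per-KW prior modes (the Mode/TFIDF variant) or variances (the Var/TFIDF variant). The data structures and function names are illustrative, not the thesis's implementation.

```python
import math
from collections import Counter

def significance_scores(dk_corpus, target_class):
    """significance(t, Q) = logtf(t, d) * idf(t) within the DK corpus (eq. 6.1)."""
    n_k = len(dk_corpus)                                  # number of DK texts
    df = Counter(t for text in dk_corpus.values() for t in set(text.split()))
    tf = Counter(dk_corpus[target_class].split())
    scores = {}
    for term, f in tf.items():
        logtf = 1.0 + math.log(f)
        idf = math.log((n_k + 1) / (df[term] + 1))
        scores[term] = logtf * idf
    return scores

def tfidf_priors(scores, c_dkrw, sigma2, variant="Mode/TFIDF"):
    """Translate significances into per-KW prior hyperparameters."""
    if variant == "Mode/TFIDF":
        return {t: {"mode": c_dkrw * s, "variance": sigma2} for t, s in scores.items()}
    else:  # "Var/TFIDF"
        return {t: {"mode": 0.0, "variance": c_dkrw * s * sigma2} for t, s in scores.items()}

dk_corpus = {"apoptosis": "programmed cell death caspase apoptosis",
             "ubiquitin": "ubiquitin protein degradation proteasome"}
priors = tfidf_priors(significance_scores(dk_corpus, "apoptosis"), c_dkrw=10.0, sigma2=1.0)
```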

Experimental Methods

In this section, we describe the experimental methods used to study domain knowledge in logistic regression. We compare Bayesian logistic regression with and without domain knowledge, as well as against Support Vector Machines (SVMs), one of the state-of-the-art methods in text categorization. After describing SVMs, we describe which document collections and category labels were used, how documents were represented as numeric vectors, and how we measured effectiveness.

6.3.1 Alternate Supervised Learning Approaches to Text Classification

As a baseline to ensure that logistic regression was producing reasonable classifiers without domain knowledge, we trained support vector machine (SVM) classifiers on all training sets. SVMs are one of the most robust and effective approaches to text categorization [50, 82]. The basic idea of SVMs is to separate positive and negative examples accurately by maximizing the margin between the two sets of examples.

In our experiments, we used Version 5.0 of the SVM_Light software [38, 39].¹ All parameters were kept at their default values. In particular, we used a linear kernel (-t option unspecified), equal weighting of positive and negative examples (-j option unspecified), and the default regularization parameter C, which is the reciprocal of the average norm of the training examples (-c option unspecified). Keeping the -c option at its default meant that SVM_Light used the default choice (C was set to 1.0 for our cosine-normalized examples). A wide variety of methods have been proposed for choosing the regularization parameter and threshold value for SVMs, with no agreement on a standard method; we used one of the more common methods. Thresholding for SVMs is discussed in Section 6.3.5.

¹ http://svmlight.joachims.org/

Our text classification experiments used three publicly available text categorization datasets for which domain knowledge texts were publicly available. Note that, for each dataset, we chose categories that had a relatively large number of positive examples, to give us flexibility to experiment with different training set sizes. Next we briefly describe these datasets.

TREC Genomics Track Biomedical Journal Articles

This is a collection of full-text articles used in the TREC 2004 Genomics Track categorization experiments [33].² The Genomics Track itself featured a few atypical categorization tasks. However, all of these articles are from journals indexed in the National Library of Medicine's MEDLINE system; they therefore have corresponding MEDLINE records that include manually assigned MeSH (Medical Subject Headings) terms. Here we posed as our classification task predicting the presence or absence of selected MeSH headings associated with articles. We will use the abbreviation “Bio Articles” for this dataset.

Documents. We split the Bio Articles data into three 8-month segments. We used the first segment for training and the last segment for testing. The middle segment was reserved as a development set for experimenting with parameters, tuning, etc.; however, we did not need any tuning using development data and therefore did not use it in our experiments. Note that we used the “DP” field (date of publication) from the MEDLINE entries to split the data. The population sizes are shown in Table 6.2.

² http://trec.nist.gov/data/t13_genomics.html

Table 6.2: Full-text biomedical journal articles dataset (“Bio Articles”). Training sets of various sizes were drawn from the training population of 3742 articles, and classifiers were evaluated on the test set of 4175 articles. The development set was set aside for tuning, but was not used in the experiments reported here.

We experimented with both full-text and abstract representations of the articles. A portion of a sample full-text article is shown in Figure 6.2; the corresponding MEDLINE entry is shown in Figure 6.3.

Categories. Medical Subject Headings (MeSH) are organized in 16 trees (called “categories” by the National Library of Medicine): category A for anatomical terms, category B for organisms, C for diseases, D for drugs and chemicals, etc. Each tree consists of subtrees. Within each subtree, descriptors are organized hierarchically from most general to most specific, in up to eleven hierarchical levels. Each MeSH heading has been associated with one or more descriptor tree numbers that indicate its place in the hierarchy [53]. The MeSH vocabulary (MeSH 2005 version) contains 22,568 descriptors, and we treat each descriptor as a category for our classification experiments. An average of 10 MeSH indexing terms is assigned to each MEDLINE citation by NLM indexers, who choose the most specific MeSH heading(s) that describe the concepts discussed. A MeSH heading consists of a main heading and, optionally, subheadings. For example, the MeSH descriptors in Figure 6.3 include “Recombinant Proteins/chemistry”; in this example, “Recombinant Proteins” is the main heading and “chemistry” is the subheading. A total of 17,177 distinct main headings occur in the Bio Articles collection. In MeSH, a child tree number shares its left-most digits with its parent tree number and differs in its three rightmost digits. For example, the MeSH tree number for “Hepatocytes” is

[Figure 6.2 excerpt: a portion of a sample full-text article, “Crystal Structure of 1,3-Glucuronyltransferase I in Complex with Active Donor Substrate UDP-GlcUA” by Lars C. Pedersen, Thomas A. Darden, and Masahiko Negishi, from the journal section ENZYME CATALYSIS AND REGULATION.]
