
Doctoral dissertation: Identification of secondary and tertiary motifs in DNA sequences through naive Bayesian text classification


Structure

  • CHAPTER 1: INTRODUCTION
  • CHAPTER 2: LITERATURE REVIEW
  • CHAPTER 3: METHODOLOGY
  • CHAPTER 4: RESULTS
  • CHAPTER 5: CONCLUSIONS, RECOMMENDATIONS, FURTHER RESEARCH

Contents


INTRODUCTION

Some of the most active research in statistics is Bayesian, where determining prior events remains a valid approach to the discovery process used by scientists (Jaynes, 1979; Zhang, Mukherjee, Ghosh, & Wu, 2006). The largest and most inviting field for the application of statistical analysis is biology (Jaynes, 1979; Rodriguez-Esteban, Iossifov, & Rzhetsky, 2006). For example, the human genome is made up of 23 pairs of chromosomes, each containing molecules called deoxyribonucleic acid (DNA). DNA molecules are shaped in the form of a twisted ladder, with sugar and phosphate molecular components forming the sides of the ladder and the pairs of nucleotide bases forming the ladder rungs. The nucleotide bases are made up of guanine (G), adenine (A), cytosine (C), and thymine (T) (Mangalam et al., 2001). The human genome consists of approximately 3 billion base pairs of DNA making up nearly 100,000 genes (Rockett, 2000). Of the 3 billion base pairs that make up the human genome, only about 1% code for proteins (Swope, 2001).

With the existence of varied heterogeneous remote and local data sources and the need for complex analyses of the data, several software platforms and application frameworks are needed to facilitate a better understanding of genomic data (Swope, 2001). These will need to combine graphical interfaces and complex analytical and data-mining tools with Web-based access to one or more remote data sources on the Internet. The benefit will be to allow comparison of unknown protein sequences within a database of sequences from other organisms that are better understood. Through cross-over, mutation, and fitness, nature reuses what it has learned from the design of simple organisms such as bacteria (Swope). From this knowledge, scientists are able to deduce the function of other proteins by searching for similar genes or proteins in the databases of other genes and proteins worldwide. Yet inferring biologically meaningful information from warehoused data requires sophisticated data-mining techniques (Cummings & Relman, 2000). This research focused on one such data-mining technique, as illustrated in Figure 1.

To move in silico research in the direction of improved decision support, the ability to take raw data and determine relevant patterns is important in order to design potential cures for diseases by improving the drug development lifecycle. This study describes the effectiveness of using a naïve Bayesian text classifier to identify secondary and tertiary protein motifs in DNA sequences. Bayes's rule was used to determine data categories via probability. A naïve Bayesian text classifier is a machine-learning algorithm that uses an automated means of determining metadata about data. Categories are represented by a collection of motifs and their frequencies; frequency is the number of times each motif is identified in the data used to train the classifier.
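
Stated compactly (our notation; the dissertation gives no formula at this point), and under the usual naïve assumption that motifs are conditionally independent given a category, Bayes's rule yields

$$P(c \mid m_1, \ldots, m_n) \;\propto\; P(c)\prod_{i=1}^{n} P(m_i \mid c)^{f_i},$$

where $f_i$ is the number of times motif $m_i$ occurs in the sequence being classified and each $P(m_i \mid c)$ is estimated from the stored training frequencies; the classifier assigns the sequence to the category $c$ with the highest value.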

Statement of the Problem

The problem addressed in this research focused on identification of secondary and tertiary protein motifs in DNA sequences via naïve Bayesian text classification.

Statistical hypothesis testing for decision making in in silico research essentially involves class comparison, where several experimental groups are directly analyzed. Pattern recognition, on the other hand, involves class prediction, where a range of supervised multivariate techniques is used, or class discovery, where unsupervised multivariate techniques are used. The problem with either approach is that biologists tend to want data analysis to be like “a laboratory protocol—a series of steps that, if followed faithfully, guarantee to produce the correct answer to their experimental question” (Morrison & Ellis, 2003, p. 358). This, however, is not possible given that data analysis involves detecting and displaying those patterns that are present in the data without actually knowing what the patterns are in advance. Here, trial and error play major roles in determining appropriate analyses and evaluating their results.

In pattern recognition and prediction, a different solution is required that uses logical analysis to determine prior states in the system under consideration (Jaynes, 1979). This is in part due to the inherent complexity involved in the data. Researchers frequently do not know which patterns to predict in the dataset and as a result are unable to be explicit about which patterns should be interpreted as biologically meaningful. This process involves searching for patterns that might exist in the database, which could then be interpreted post hoc for meaningfulness. In this case, the answer to the experimental question results not from the answer to an “explicit statistical question” but from the search for patterns in the dataset (Morrison & Ellis, 2003, p. 364). With the complexity of the data high, researchers are required to engage in data mining with a particular emphasis on pattern analysis. The problem with this approach is that there is no single technique that can be generally recommended to mine the data for patterns (Morrison & Ellis, 2003). This leaves the in silico practice open to different methodologies and potential solutions, where in silico means experimental techniques performed on a computer or via computational simulation.

Not knowing a priori which techniques are specific enough to find those patterns that might be in the data creates an inherent problem for researchers. The situation arises as a result of there being no single pattern that can be expected in the data. If there are many possible patterns in the data, then there must be many possible mathematical techniques for finding those patterns. For example, “choosing the technique that is suggested to be best under the widest range of possible circumstances sounds like a reasonable criterion of choice, but this is not necessarily a good idea” (Morrison & Ellis, 2003, p. 364). In this example, there is no question that the data will have their own characteristics due to various circumstances involving the experimental conditions, the way the data are quantified, and even the experimental question being asked. The difficulty with relying on a rigid protocol is that there is no guarantee that the correct answer to an experimental question will be produced, thereby causing the results to fall short of expectations (Morrison & Ellis). The study presented here describes the effectiveness of using a naïve Bayesian text classification algorithm to identify secondary and tertiary motifs in DNA sequences and addresses the lack of efficiency in identifying motifs in DNA sequences.

Background of the Problem

Shah, Passovets, Kim, Ellrott, Wang, Vokler, LoCascio, Xu, and Xu (2003) found that technology and experimental techniques by themselves are not enough to keep pace with the production rate of protein sequences in order to analyze and predict the function of proteins. These high production rates of protein sequences that result from microarray experiments are generating large volumes of complex data in an effort to identify which genes might be overexpressed or underexpressed under experimental conditions. This effort, however, is only part of the story when it comes to the practice of in silico research. The other side involves the analysis of data, where preprocessing of the raw data for quality control is used in combination with standardization to ensure data uniformity throughout the dataset. This is followed by formal quantitative analyses involving either statistical hypothesis testing or multivariate pattern recognition.

For the general computational biologist, a pattern is predicted to occur in the dataset where the averages of the observations in two experimental groups are different from each other and a single, repeatable mathematical test can be used to evaluate whether a pattern exists. The logic used in in silico research involves inductive arguments, where the specific instance of the sample is used to generalize about the population. In other words, recognizing the virtue of deductive logic in experimentation requires inductive logic to analyze the data. This approach is not without its problems. Inductive reasoning does not provide the formal proof needed as evidence in support of any one particular hypothesis. In addition, there is the issue where, no matter how much evidence is gathered in support of a particular hypothesis, it is difficult to be certain that this same evidence would not equally support any number of unknown hypotheses (Morrison & Ellis, 2003).

Given that large amounts of data are accumulated—in the range of hundreds of terabytes per day—regarding genetic information, it is important to make use of the available data to resolve pressing health issues (Guan & Bell, 2004). To accomplish this end, the use of data mining—a computer-based technique—to discover interesting, useful, and unknown patterns from massive databases is important (Guan & Bell, p. 120).

It is through effective data-mining efforts to exploit similarities between DNA sequences that advances in in silico research are possible. The major concentration for this segment of research is identifying motifs, or sequences, in groups of genes (Guan & Bell, 2004). These motifs contain identical or similar sequential patterns and are extracted using computer algorithms (Duda, Hart, & Stork, 2001).

Nature of the Study

With in silico research, two approaches to data analysis are commonly used: univariate data analysis and multivariate pattern recognition (Morrison & Ellis, 2003).

For univariate data analysis, one mathematical pattern is examined at a time. With multivariate data analysis, the emphasis is on examining common patterns. The usual hypothesis-testing framework requires the prediction of an explicit pattern in the data set; given that the data are inherently complex, it is difficult to actually predict a particular pattern that is biologically meaningful (Morrison & Ellis). The approach used in this study focused on searching for those patterns that might exist in the dataset, which might then be interpreted post hoc. In other words, the descriptive experiment tested a set of articulated hypotheses about potential DNA patterns. The tool used involved a self-learning e-mail spam algorithm that combined pattern identification with pattern classification. Answers to the experimental questions in this research involved the identification of secondary and tertiary motifs in DNA sequences through naïve Bayesian text classification resulting from a search for patterns identified in the dataset (Morrison & Ellis, 2003). The following research questions guided the study:

1. What secondary and tertiary motifs can be found in DNA sequences using machine learning?

2. What syntax infrastructure of secondary and tertiary motifs in DNA sequences can be determined using machine learning techniques?

3. What benefit is gained by using machine learning techniques to provide effective decision support for applied computational biologists?

Purpose of the Study

This study describes the effectiveness of using a naïve Bayesian text classification algorithm to identify secondary and tertiary motifs in DNA sequences. The major technical challenges associated with in silico research include poorly abstracted public domain software and the associated research techniques that result in a status quo of low productivity for the field (Spengler, 2000). This situation exists despite the hundreds of millions of dollars spent each year to adapt computational software to high-throughput architectures. For both the short and long term, it is understood that in silico research will not deliver drugs; it will, however, be able to shorten drug discovery cycles and reduce potential product risks. As a result, the meaningful value of in silico research tends to be in decision support, and expectations may well center on this area in order to develop the proper decision support tools. With this focus in mind, the study describes the effectiveness of using a naïve Bayesian text classifier to identify secondary and tertiary protein motifs in DNA sequences. This is essential to establish fewer or better-designed lab experiments based on the questions asked, not necessarily the data stored.

Theoretical Framework

Detection of increased or decreased expression in genes is usually demonstrated by establishing an arbitrary baseline for the expression level. However, there is no theoretical basis for choosing one baseline over another (Morrison & Ellis, 2003). To align the process with a more scientific approach, researchers prefer more objective statistical methods to provide evidence for their experiments. To circumvent the issue related to choosing one baseline over another, the researcher must use statistical analyses to either confirm or deny the prespecified mathematical pattern or inferential data analysis for use in an in silico study. The in silico intention is to assess whether an observed change is likely to represent a real biological pattern rather than some random accident (Morrison & Ellis). To a large degree, the in silico intent is easier to conceptualize than it is to execute.

Some would argue that it might be possible to establish a set of minimum standards falling into one of two categories—those involving supervised machine learning techniques, in which search patterns are supervised by an analyst who suggests which patterns are important, and those involving unsupervised techniques, where mathematical algorithms are used to search for patterns (Morrison & Ellis, 2003).

LITERATURE REVIEW

A literature review of in silico research is presented in this chapter. Several library-based and Web-based resources were used to identify those areas for inclusion. Examples of the online resources used include Academic Search Premier, Educational Resource Information Center, Medline, Computer Science Index, and Computer Source.

Overview of Computational Sequencing

Shah et al. (2003) concluded that over 1,000 genomes will be sequenced and their genes predicted using complex and multifaceted protein structure prediction processes in the years to come. More and more researchers have become involved in these processes with expectations of discovering new DNA patterns or motifs (Black & Doerge, 2002). Some of the challenges facing these scientists include how to quickly derive the biological functions of these genes as well as recognize important functional aspects using sequence-based techniques (Shah et al.).

Even though sequence-based homology search methods can provide functional information only at a low-resolution level, little has been inferred from this information about the mechanism by which a gene's biological function is realized (Shah et al., 2003). Shah et al. discovered that a protein could have multiple structural domains, and predicting a whole structure with multiple domains could prove computationally difficult given the current stage of technology. Despite their discovery, a major question remains unanswered: does protein functionality exist independent of the structural domain?

Shah et al. (2003) found that, with the number of computational tools available on the Internet for protein prediction and characterization, several methods have been used to solve a specific class of characterization problems. In some cases, it may be required to run different and independent tools and use their results to validate the prediction accuracy (Shah et al.). Even by combining lab techniques, however, researchers have been unable to keep up with the rate of mega information accumulation (Hatzivassiloglou, Duboue, & Rzhetsky, 2001). At times, the process has required different prediction tools to be run multiple times using different sets of parameters in order to make a sensible prediction (Shah et al.). To assist in the prediction process, a z-score (written out after the protocol list below) has been used as an indicator of the confidence of prediction (Shah et al.). Shah et al. found that the standard physical protocol used in the lab generally included the following steps:

1. Preprocessing and identification of protein domains, identification and removal of signal peptides, and protein secondary structure prediction;

2. Collection of functional/structural information of a prediction target through various database searches;

3. Protein triage and classification of target proteins into membrane proteins and soluble proteins, with or without close structural homologues;

4. Protein fold recognition for identification of native-like folds and generation of sequence-structure alignments, using threading techniques and sequence-based approaches;

5. Protein structure prediction for generation of detailed atomic structure models, based on threading alignments;

6. Structure quality assessment to evaluate the packing and backbone conformations, and the stereochemical quality of a predicted structure; and

7. Prediction result validation through comparing predicted structures and collected structural and functional information for a consistency check; an iterative prediction/refinement process will be invoked if a significant inconsistency or poor structural quality is detected. (p. 1987)
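
For reference, the z-score mentioned before the list is the ordinary standardized score; in our notation (Shah et al. do not reproduce the formula in this passage),

$$z = \frac{x - \mu}{\sigma},$$

where $x$ is a prediction's raw score and $\mu$ and $\sigma$ are the mean and standard deviation of scores over a reference set, so a larger $|z|$ marks a prediction that stands out from the background and is held with higher confidence.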

Beyond the physical steps outlined by Shah et al. (2003), statistical methods are also used to determine whether certain phenomena are significant or whether they are simply an aberration due to the variability of the experimental process, as identified by Black and Doerge (2002). Through applied statistical methods, replication has been used in the lab to increase the ability of an experiment to detect genes undergoing differential expression and improve the chance of observing genes which are differentially expressed.

The key is to determine how much replication is actually required. In some cases, experiments have suffered from a lack of replication, resulting in difficulty gaining information about the sample in the experiment, which in turn makes it difficult to determine whether observed expressions are real or simply due to chance. To a large degree, it has been important not only to determine which genes might undergo significant differential expression but also to investigate whether true differential expression can be detected (Black & Doerge).

Supervised and Unsupervised Learning

From a traditional machine learning and genetic research approach, Hatzivassiloglou et al. (2001) concluded that the processing features involved focus not just on words around a particular term, but also on positional and morphological information. They noted that it was important that learning features, not just words around a particular term, be used, since labeling genes and proteins for the learning phase requires laborious efforts by domain experts. In order to identify specific types of content where a system could classify a term as a gene, protein, or mRNA, they used positional and morphological information in their learning algorithm to generalize about the occurrences. Hatzivassiloglou et al. used this as an example of unsupervised learning, where only the raw text and no human input was made available to the system.

Even though unsupervised learning is a popular tool, Clare and King (2002) found supervised learning to be the most frequent method used to analyze microarray expression data. A drawback to this approach, however, is the need for labeled training examples, which usually could only be obtained by manual annotation. For Danilova, Lyubetsky, and Gelfand (2003), recognition of common regulatory signals in sets of DNA sequence fragments was determined to be an old problem related to the field of computational molecular biology. Yet Hatzivassiloglou et al. (2001) found the best method to circumvent this drawback was through the use of naïve Bayesian learning. The method aimed to assign a term occurrence to the class that maximized the learning algorithm's probability estimate for that occurrence.

In search of an optimal solution, Horvitz, Breese, and Henrion (1988) identified the need to extend techniques developed in decision science to artificial intelligence. Since decisions underlie any action that a problem solver may take in determining objectives, needed action, timetables, obstacles, and alternative objectives, there is a “ubiquity of uncertainty associated with incomplete models” (Horvitz et al., 1988, p. 1). Through probability, there exists a language for describing uncertainty, and efforts in this area extend this language to make statements about what alternative actions are and how alternative actions are valued. Horvitz et al. concluded that, from the decision-theoretic perspective, reasoning about complex decisions under uncertainty cannot avoid making assumptions about prior belief and independence, whether implicit or explicit.

Machine learning has played a role as an expert system, where the emphasis has been on inference and decision making under uncertainty. By expert system, Horvitz et al. (1988) meant a reasoning system that performed at a level comparable to or better than a human did within a particular area of expertise. Even so, with the development of heuristic inference methods, there is less of a concern with normative optimality, methods for decision, and inference under uncertainty. As a result, machine learning researchers focused on the role of representing and reasoning with large amounts of subject matter expertise. Yet inconsistencies existed in popular heuristic schemes such as the “rule-based approach to reasoning under uncertainty” (Horvitz et al., p. 38).

Baldi and Pollastri (2004) identified a fundamental problem associated with the systematic design, training, and application of neural network architectures to real-life problems. Through the use of neural network machine learning, the prediction process improved when a principled methodology to design complex recurrent systems was trained to address real-world problems effectively. As a result, they argued that, through a recursive approach, neural network machine learning techniques have the ability to process data structures with two- and three-dimensional graphical supports. This is particularly important in that the recursive neural network machine learning techniques were found to be related to graphical models as well as specifically related to Bayesian networks (Baldi & Pollastri).

METHODOLOGY

This chapter describes the research design and approach used in the study; the target population, setting, procedure, and sample are also discussed. A detailed description of the treatment is included in this chapter to emphasize its relevance for the study. How the data were collected and analyzed is also provided in this chapter.

Research Design and Approach

This study is quantitative, with a naïve Bayesian text classification algorithm used as the primary data collection technique. The complete algorithm is described later in this chapter. An experimental design was used because it applies to the ability to test purposeful changes made to input variables of a process or system in order to observe and identify the reasons for changes that result in the output response (Montgomery, 2005). Similar studies using algorithms by Lowd and Domingos (2005) and Zhang, Zhang, and Yang (2003) addressed the issue of construct validity. Based on the results of those studies, this researcher was confident the algorithm would be able to classify proteins similarly. The study here describes the effectiveness of a naïve Bayesian text classifier to identify secondary and tertiary protein motifs in DNA sequences using an experimental design so the results might be generalized to other protein sequences in terms of external validity. Evidence that the conclusions drawn have validity and applicability across related in silico research is provided through the study (Shadish, Cook, & Campbell, 2002). Other research designs were considered—hybrid experimental and quasi-experimental designs; however, they were determined to be less optimal for the type of research proposed in terms of testing and random assignment, respectively.

Target Population, Setting, and Sampling

The target population for this research was a list of 44 eukaryotic sequenced genomes available from publicly available sources, as summarized in Appendix A. The listing comprised the sampling frame as of the time this study was conducted. Access to the sequenced data was made via the public Web sites that link to the respective DNA database repositories. The researcher used a simple random sampling process to select the data sets for the study. The sequenced data sets were assigned a number from 1 to 44.
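
The draw itself is simple to script; the fragment below is a minimal illustrative sketch in Perl (the dissertation does not show its sampling code, so the use of List::Util and the variable names are our assumptions):

use List::Util qw(shuffle);

# Shuffle the 44 numbered genome data sets and keep the first $n
# as the simple random sample; $n = 41 per the calculation below.
my $n      = 41;
my @ids    = shuffle( 1 .. 44 );
my @sample = @ids[ 0 .. $n - 1 ];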

An indexing algorithm was applied to a total of 41 DNA data sets for the study. This number of data sets allowed for the minimum accepted number needed when using a random sampling protocol (Ewens & Grant, 2001). To obtain the minimum accepted number needed, Siegle (2004) used a process to determine the needed sample size of a randomly chosen sample for a given population. For a chi-square with 1 degree of freedom, a confidence level of 95%, a margin of error of 4%, and a population of 44, the sample size should be 41 (Siegle).
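
Siegle's figure is consistent with the widely used Krejcie and Morgan sample-size formula; the worked calculation below is our reconstruction under that assumption rather than an equation quoted from the dissertation:

$$n = \frac{\chi^2\,N\,p(1-p)}{d^2(N-1) + \chi^2\,p(1-p)} = \frac{3.841 \times 44 \times 0.25}{(0.04)^2 \times 43 + 3.841 \times 0.25} \approx 41.1,$$

with $\chi^2 = 3.841$ (1 degree of freedom at 95% confidence), $p = 0.5$, $d = 0.04$, and $N = 44$, which agrees with the 41 data sets used.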

Algorithm and Treatment

The self-learning e-mail spam algorithm used the hash associative array %motif to store the motif counts for each motif and for each category (Graham-Cumming, 2005). The hash was stored to disk using a Perl construct called a "tie" that, when used with the DB_File module, resulted in the hash being stored automatically in a file called "motif.db" so that its contents could persist between invocations:

#!/usr/local/bin/perl

use DB_File;

my %motif;
tie %motif, 'DB_File', 'motif.db';

(Graham-Cumming, p. 3)

The hash keys are strings of the form motif-category (Graham-Cumming, 2005). For example, if the motif "AAAAAAAAAA" appears in the category "primary" with a count of three, there will be a hash entry with key "AAAAAAAAAA-primary" and a value of 3. “This data structure contains enough information to compute the probability of a sequence and do a naïve Bayesian classification” (Graham-Cumming).

The subroutine parse_file read the sequence to be classified or trained and filled in a hash called %motif_counts that mapped motifs to the count of the number of times each motif appeared in the sequence (Graham-Cumming, 2005). The subroutine used a simple regular expression to extract every 3- to 44-letter motif that is followed by white space; in a real classifier, this motif splitting could be made more complex by accounting for unique motifs:

sub parse_file
{
    my ( $file ) = @_;
    my %motif_counts;

    # The listing was truncated in extraction from this point; the body
    # below is reconstructed from the description above.
    open FILE, "<$file" or die "Cannot open $file: $!";
    while ( my $line = <FILE> ) {
        # count every 3- to 44-letter motif followed by white space
        $motif_counts{$1}++ while $line =~ /([A-Za-z]{3,44})(?=\s)/g;
    }
    close FILE;

    return %motif_counts;
}
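
Since the category-motif counts are all the classifier needs, the classification step can be sketched as well. The subroutine below is a rough illustration only, not Graham-Cumming's published code; it assumes the tied %motif hash and the parse_file subroutine defined above, scores each category with log-probabilities, and applies simple add-one smoothing:

sub classify
{
    my ( $file, @categories ) = @_;
    my %motif_counts = parse_file( $file );

    my ( $best, $best_score );
    foreach my $category ( @categories ) {
        # total motif count stored for this category
        my $total = 0;
        foreach my $key ( keys %motif ) {
            $total += $motif{$key} if $key =~ /-\Q$category\E$/;
        }

        # log P(sequence | category) with add-one smoothing
        my $score = 0;
        foreach my $m ( keys %motif_counts ) {
            my $count = $motif{"$m-$category"} || 0;
            $score += $motif_counts{$m} * log( ( $count + 1 ) / ( $total + 1 ) );
        }

        ( $best, $best_score ) = ( $category, $score )
            if !defined $best_score || $score > $best_score;
    }
    return $best;
}

A call such as classify( 'sequence.txt', 'primary', 'secondary', 'tertiary' ) would then return the most probable category for the sequence; a prior term log P(c) could be added to $score if the training categories are unbalanced.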

