Doctoral dissertation: Predicting gene structure in eukaryotic genomes


PREDICTING GENE STRUCTURE IN EUKARYOTIC GENOMES

by Jonathan Edward Allen

A dissertation submitted to The Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy

Baltimore, Maryland

September, 2006


UMI Number: 3240661


Abstract

Obtaining the complete set of proteins for each eukaryotic organism is an important step in the quest to understand how life evolves and functions. The complex physiology of eukaryotic cells, however, makes direct observation of proteins and their parent genes difficult to achieve. An organism’s genome provides the raw data that contains the set of instructions for generating the complete set of proteins, providing the potential to obtain a complete list of proteins without having to rely exclusively on direct observations in the cell. Computational gene prediction systems, therefore, play an important role in compiling sets of putative proteins for each sequenced genome.

This dissertation addresses the problem of computational gene prediction in eukaryotic genomes, presenting a framework for predicting precise single isoform protein coding genes in long contiguous stretches of DNA. The framework is extended to predict overlapping alternatively spliced exons in known protein coding regions. A main contribution of this work is to apply classifier stacking with sequential inference, for the first time, to the gene finding problem and to develop a phylogenetic generalized hidden Markov model for the alternative splice site prediction problem. First a linear weighting scheme is developed, which is extended to a statistical prediction model. The statistical model is then transformed to a new sequential inference model to predict alternatively spliced exons.

Prediction accuracy of the single isoform gene prediction methods is tested on three eukaryotic genomes: Arabidopsis thaliana, Oryza sativa and human. Application of the gene prediction methods is examined in other eukaryotic genomes. The alternatively spliced exon prediction model is tested in four Drosophila species under a variety of input conditions. Incorporating multiple sources of gene structure evidence is shown to substantially improve single isoform gene prediction accuracy, with performance beginning to rival the accuracy of expert human annotators. Results from the alternative exon prediction experiments demonstrate the potential to reliably predict new alternatively spliced forms of known genes. The use of cross-species sequence conservation information is shown to enhance the precision of alternatively spliced exon prediction.

Adviser: Steven L. Salzberg

Readers: Steven L. Salzberg and Jason M. Eisner


Acknowledgments

I would like to thank my adviser, Steven L. Salzberg, for his guidance, patience and support and for giving me the freedom to pursue challenging research problems. Dr. Salzberg has made many helpful suggestions, which improved the quality of my work over the last several years. Thank you to Jason Eisner for informing me of important related work in machine learning and natural language processing. I would also like to thank other members of Dr. Salzberg’s group, including Mihaela Pertea and William H. Majoros, with whom I had many enlightening discussions on gene finding work. I also benefited from the many useful discussions on bioinformatics topics with other members of the group, including Pawel Gajer, Maria D. Ermolaeva, Arthur L. Delcher and Mihai Pop. Thanks to the many people at The Institute for Genomic Research who were very helpful in providing useful data to work on, including Brian Haas, Bernard Suh, Chunhui Yu, Sam Angioli, Ahwui Wang, Robin Buell, Malcolm Gardner, Jane Carlton, Elodie Ghedon and Brendon Loftus.

I would also like to thank Harold Gainer for advice and for providing me with an interesting biological problem to work on, and I thank S. Rao Kosaraju for his positive supervision on this project. Thank you to Marvin Cook for many productive study sessions, which helped me get more out of many of the courses we took together.

Thank you to my wife, Safia Ahmed Omar, for her love and support and for helping me to keep my life in proper perspective. Thanks to my family, Leah Lewis, Karen Kramer, Wise D. Allen and especially my parents, Wise and Joan Allen. Without their support and encouragement my educational pursuits would not have been possible.

Contents

4 Prediction of Alternatively Spliced Exons

4.4.1 Graphical Model

4.4.2 A Generalized Hidden Markov Model

4.4.3 A Phylogenetic Generalized Hidden Markov Model

5 Automated Gene Structure Annotation

5.2 Gene Structure Annotation Applications

5.2.1 Arabidopsis thaliana

5.2.3 Gene Structure Comparison

5.3 Human

5.3.1 Testing on the ENCODE Regions

5.3.2 Evaluation of Evidence Tracks

5.3.3 Performance Comparison

6 Alternative Exon Prediction Performance

List of Tables

The set of class labels that describe a local sequence interval used to construct gene models on the positive strand, denoted by the “+” symbol. The non-coding label applies to both strands. Labels reflect partial and complete exons. Each entry asserts whether the condition in that column must be true (1) or false (0). 10 additional class labels are used to represent strand specific labels on the negative strand.

Performance of the gene predictors on 1783 genes. SC = Statistical Combiner; SC-g = SC combining gene prediction programs only; LC2 = Linear Combiner using sequence alignments; LC1 = Linear Combiner using gene prediction programs only; GA = GlimmerM; GM = GeneMark.hmm; GS = Genscan+. The columns are: number of whole genes correctly predicted (Correct Gene); number of genes completely missed (Missed Gene); correctly predicted exons out of the 7510 total (Correct Exons); number of exons completely missed (ME); predicted exons that overlap a gene region but do not overlap a true exon (Inserted Exons); percentage of protein coding nucleotides correctly detected (Nucl Sn).

Breakdown of combiner predictions when matching exactly 3, 2, 1 or 0 gene prediction programs. The first column (Combiner) refers to the four combiners. The second column (# of GP) refers to the number of matching gene prediction programs. The third and fourth columns count the number of times the combiner prediction is correct (CG) and not entirely correct (WG). The fifth column is the percentage of correct predictions.

The number of gene models each gene finder exclusively predicts correctly in the test set.

[...] LC2-3 = LC2 using three gene prediction programs; LC1-3 = LC1 using three gene prediction programs; TS = Twinscan; GM2 = newer GlimmerM output. The three prediction programs used by SC-3, LC2-3 and LC1-3 are Twinscan, GeneMark.hmm and newer GlimmerM (GM2).

Performance comparison of JIGSAW and SC-5 (from Table 5.4).

JIGSAW performance in Oryza sativa. Sn (sensitivity) = percentage of test set correctly predicted. Sp (specificity) = percentage of predictions that are correct. Performance measured on three criteria: Genes, Exons and Nucleotides (Nucl). All results shown as percentages.

Gene structure comparison. Each entry contains two values “A/M”, with A being the average and M being the median value. The Exon / Intron column is median exon length divided by median intron length.

JIGSAW using gene finders and non-Human EST data. Results show sensitivity (Sn) and specificity (Sp) measured on Genes, Exons and Nucleotides (Nucl). All results shown as percentages.

Results of applying JIGSAW with all available evidence. *KnownGene predicts multiple transcripts per gene locus with a transcript specificity of 47%.

Comparison of EGASP prediction performance for exons and protein coding nucleotides among the different prediction methods. Sensitivity (Sn) and Specificity (Sp) are given. The F-score is shown for the nucleotide predictions.

EGASP prediction performance for Genes and gene transcripts (Gene Trans) measuring sensitivity (Sn) and specificity (Sp). The F-score is given for the Gene predictions. Transcript to Gene Ratio shows the number of transcripts [...]

Percentage of D. melanogaster annotated di-nucleotides conserved in D. simulans (Dsim), D. yakuba (Dyak) and D. erecta (Dere). Di-nucleotides are separated according to splicing type, Acceptor (Acc) and Donor (donor), and splicing event type, alternative (Alt) or constitutive (Con). Pseudo splice sites are included for reference.

Percentage of D. melanogaster annotated exons missing at least one splice site in D. simulans (Dsim), D. yakuba (Dyak) and D. erecta (Dere). Percentages are organized by exon type: constitutive exons (CS), cassette exons (CE), exons with multiple splice sites (MS) and exons with intron retention (IR). The second number associated with the MS and IR rows is the percentage of exons where the non-conserved splice site is constitutive (used in all isoforms).

Results are shown for 8 versions of ExAlt using different combinations of informant species plus Genscan. The informant species are D. simulans (sim), D. yakuba (yak) and D. erecta (ere). ExAlt-ab initio uses no informant species. Sensitivity (Sn) and Specificity (Sp) are shown for Exons (both acceptor and donor are correct), Splice site (an acceptor or donor site is correct) and protein-coding nucleotide predictions (Nucl).

[...] at most 1 exon predicted per test sequence (ExAlt-Frame-Single). Rows 6-8 show ExAlt performance using no gene structure information with default parameters (ExAlt-Default), no informant species (ExAlt-Default-ab initio) and at most 1 exon prediction per test sequence (ExAlt-Default-Single).

Exon prediction accuracy from Table 6.4 separated by exon splicing event.

ExAlt results on the initial training and testing set in percentages. Included next to each measurement is the difference in percentage points compared to performance in the held out set in Table 6.4.

List of Figures

1.1 A schematic of double stranded DNA. Each nucleotide is represented by an “L” shaped box and includes a 5 carbon sugar molecule (S), a phosphate residue (P) and one of four nitrogenous bases (a, c, t and g). Dashed lines indicate hydrogen bonds between two bases; c-g pairs form three hydrogen bonds and a-t pairs form two hydrogen bonds. The 5’ and 3’ labels denote the orientation [...]

1.2 Example of protein coding gene structure. Gene contains two exons and one intron. The initial exon includes an untranslated region (5’ UTR) and the translated region. The terminal exon consists of a translated region ending with a stop codon followed by an untranslated region (3’ UTR). Transcription begins at the transcription start site, which is preceded by the promoter region.

1.3 Finite State Automaton for recognizing protein coding genes. For clarity, boxes are used to group a collection of nodes in the model. Transitions leaving a box indicate a transition leaving each respective node in the box. Transitions arriving at a box indicate transitions arriving at each respective node in the box. The major components of a protein coding gene are included in the FSA: start codon, stop codon, amino acid codons (codon$_0$, codon$_1$ and codon$_2$), introns (intron$_0$, intron$_1$ and intron$_2$) and intergenic sequence. Each state in the FSA is unique, and is labeled with the nucleotide (given in lower case) read when taking a transition into that state. Additional states can be added to recognize additional gene structure features, such as promoters and untranslated regions.

1.4 Generalized hidden Markov model of protein coding genes using the components described in Figure 1.3.

2.1 Gene structure evidence from the UCSC human genome annotation database (chromosome 20). Each row shows evidence generated from a distinct source.

2.2 The number of correct and incorrect (number in parentheses) whole gene model predictions shared among the three prediction programs: GlimmerM (GA), Genscan+ (GS) and GeneMark.hmm (GM), from a test set of 1783 genes. “Incorrect gene” refers to cases where all coding exons in the gene are in perfect agreement among the gene finders but not with the true gene.

2.3 Partitioned output from three evidence types: splice site predictions, gene predictions and sequence alignments. The five sources of evidence listed in order from top to bottom are: output from a splice site prediction program (SP); a gene prediction program (GP1) with exon confidence scores 0.9 and 0.89; a gene prediction program (GP2) with no confidence scores; 89% and 45% identity alignments from a protein database, which make up a single evidence source; and 32% and 20% identity alignments from an EST database. The genome sequence is divided into intervals $I_1, \ldots$ defined by each potential boundary $x_1, x_2, \ldots$ The predicted splice site at $x_5$ is associated with $I_5$.

2.4 An example of four overlapping candidate gene models G1-4. The exons are assumed to be part of the same reading frame. In this example, if the evidence only predicts G1 and G2, the Linear Combiner scores G3 or G4 if either model [...]

2.5 Gene prediction in genomic sequence $S$ of length $N$ on overlapping subsequences $S_0$, $S_1$, $S_2$ and $S_3$. Each subsequence $S_i$ is of length $K$ and overlaps the adjacent subsequences $S_{i-1}$ and $S_{i+1}$ with length $V$. The last subsequence $S_3$ may be of [...]

2.6 Merging overlapping gene predictions. The first two examples (Example 1 and Example 2) show cases where two overlapping exons are merged into a single gene. Example 3 shows an example where the two exons cannot be merged into a single gene list. (Exon labels are described in the text.)

3.1 Gene prediction model for predicting genes on both strands simultaneously.

3.2 Representation of four sources of gene structure evidence mapping to genome sequence $S$: two gene prediction programs (GP1 and GP2), a cDNA alignment with 86% identity to $S$ and an EST alignment with 95% identity to $S$. Examples of the six features, start (sta), stop (stp), coding (cod), intron (inr), donor (don) and acceptor (acc), encoded in feature vectors are shown. The predicted exon boundaries are labeled $k_0, k_1, \ldots$

3.3 Example sequence parse with evidence. Sequence $S$ is partitioned into segments $t_0$, $t_1$ and $t_2$ with state assignments $q_0$, $q_1$ and $q_2$ respectively. $k$ marks a position in $S$ and the dashed box highlights the evidence overlapping the first interval from position $b_0$ to $e_0$.

3.4 Schematic of the JIGSAW training procedure. Feature vectors are collected from $m$ examples and separated according to each of the six gene feature types. Decision trees are induced for each of the separated training sets.

3.5 The plot at the top of the figure shows the accuracy of predictions based on alignments to non-human sequences that overlap a gene finder prediction. Each point is a pair of alignments observed in training and their percent identity to the genomic sequence. ’+’ points are classified as “accurate” and ’x’ points are classified as “inaccurate.” The two lines correspond to the internal nodes in the decision tree shown at the bottom of the figure.

4.1 Three forms of alternative splicing predicted by ExAlt: Intron Retention (IR), Cassette Exon (CE) and Multiple Splice sites (MS).

4.2 Exon splicing stages. (Image based on [98].) See text for a description of the [...]

Model from Figure 4.6 expanded to include alternative splicing in terminal and initial exons and candidate single exon genes. Blue and beige states reflect possible protein coding exons (beige) or partial protein coding exons (blue) and represent three states for each state shown, one for each of the three coding phases. Special states “Beg” and “End” show the respective begin and end [...]

Classification accuracy of the four sequence models: constitutive and alternative intron intervals and constitutive and alternative exon intervals are shown in respective order. P-values are respectively 0.05, 0.1, 0.1 and 0.05.

Histogram of donor (first row) and acceptor (second row) splice site scores using the WAM trained from examples of exons with constitutive splice sites (first column) and exons with alternative splice sites (second column). Y-axis shows relative frequency, x-axis shows log-odds score. False Donors (Acceptors), colored red, is the distribution of scores for GT (AG) di-nucleotides presumed to be non donor (acceptor) sites based on the published annotation. Constitutive Donors (Acceptors), colored green, is the distribution of scores for donor sites with no alternative splice site. Alternative Donors (Acceptors), colored blue, is the distribution of scores for donor sites with alternative splice sites.

Phylogenetic tree used by ExAlt. Each branch $i$ has its own branch length.

Overlapping donor signal windows.

Distribution of D. melanogaster / D. erecta pairwise percent identity values for protein coding exons categorized by constitutive exons (red) and alternatively spliced exons (green). Each point in the graph represents the percentage of cases (y-axis) with at least the percent identity specified by the value on the [...]

Distribution of D. melanogaster / D. erecta pairwise percent identity values for introns divided into constitutive (red) and alternatively spliced (green).

Distribution of percent identity ratio scores for protein coding exons versus the adjacent intron in pairwise comparisons of D. simulans (red) and D. erecta (green).

Chapter 1

Introduction

A genome is the entire collection of genetic material found in each cell of an organism. Genetic material consists of the genes and related DNA elements, which store information used to synthesize the macromolecules critical to cell life. Since the advent of high throughput genome sequencing methods, vast quantities of genomic data have become publicly available for analysis. Two types of large scale sequencing projects are contributing to the availability of biological sequence data. One effort lies in the work to sequence whole genomes. Examples include the genomes of mammals such as Homo sapiens (human) [75, 173] and Mus musculus (mouse) [28], as well as many other species across the animal kingdom, including Drosophila melanogaster (fruit fly) [2], Plasmodium falciparum (the parasite causing Malaria) [113] and hundreds of bacteria. The quality of genome sequence data ranges from nearly completely sequenced genomes, where every nucleotide in each chromosome is accounted for with very low error rates, to low coverage random shotgun sequencing, which yields short genomic fragments and retains a larger number of sequencing errors.

The second major effort contributing to new biological sequence data is the high throughput sequencing of functional elements in the cell and in particular molecules transcribed from the genome. These gene “expression” studies provide important physical evidence of physiological function associated with specific regions in the genome. As with the genome sequence data, quality and coverage vary depending on the techniques used. Full length cDNA projects capture complete transcripts with minimal error rates, while “tagging” approaches capture parts of transcripts (potentially with more errors). Examples include Expressed Sequence Tags (ESTs) [1], Serial Analysis of Gene Expression (SAGE) [172] and Cap Analysis Gene Expression (CAGE) [154].

This thesis addresses the challenge of developing computational methods to predict precise protein coding gene structure using these two key data sources: genome sequence and gene expression sequence. Three new algorithms are introduced to address the gene structure prediction problem and are described in Chapters 2, 3 and 4 respectively. Chapter 5 presents the results and procedures for automated genome annotation using the gene prediction algorithms described in Chapter 2 and Chapter 3. Chapter 6 gives results for alternative splice site prediction using the algorithms described in Chapter 4. Chapter 7 contains concluding remarks and directions for future research. The remainder of this chapter introduces the problem of predicting gene structure, reports on previous work in the field and introduces the computational framework that serves as a basis for the algorithms developed in the later chapters.

1.1 Background

A gene is the entire deoxyribonucleic acid (DNA) sequence required for synthesis of a functional protein or functional ribonucleic acid (RNA) molecule [98]. The term “functional” is defined to mean a naturally occurring molecule that affects cell physiology [98]. The genes are contained within long double helical strands of DNA called chromosomes. A chromosome is a double stranded sequence of connected nucleotides, each comprised of a 5 carbon sugar, a phosphate residue and one of four nitrogenous bases attached to the sugar: Adenine (A), Guanine (G), Cytosine (C) and Thymine (T) [175]. A schematic is shown in Figure 1.1. Individual nucleotides are fused together by phosphate bonds to make a single strand of DNA. Each base interacts primarily with just one of the other three bases: Adenine bonds with Thymine and Guanine bonds with Cytosine. Two DNA strands bind together according to these base pairing interactions to form the double helix structure of double stranded DNA. The double strand interaction is shown in Figure 1.1. Each strand of DNA has an orientation defined by the relative positions of the carbon atoms in the DNA’s sugar-phosphate backbone.

[Figure 1.2 image. Labeled features: transcription factor binding sites, promoter region, TATA, transcription start site, 5’ UTR, start codon, donor, branch site, pyrimidine rich region, acceptor, stop codon, 3’ UTR, poly A signal; exon, intron, exon.]

Figure 1.2: Example of protein coding gene structure. Gene contains two exons and one intron. The initial exon includes an untranslated region (5’ UTR) and the translated region. The terminal exon consists of a translated region ending with a stop codon followed by an untranslated region (3’ UTR). Transcription begins at the transcription start site, which is preceded by the promoter region.

The carbon atoms are labeled 1’ to 5’, and the two strands of DNA that form the double helix are anti-parallel: one strand is oriented in the 3’ to 5’ direction and the other strand is oriented in the 5’ to 3’ direction. This reflects the relative positioning of the carbon atoms on the sugar molecule with respect to each strand. When referencing a location in a chromosome, the term “base pair” is used to refer to the two nucleotides, one from each strand, which pair together. For example, the size of the human genome is cited as approximately 3,000 million base pairs [75], with the actual genome being twice this number in terms of individual nucleotides. Because of the fixed set of base pairing rules (A bonds with T and G bonds with C), one strand of DNA can be recovered by examining the contents of the other strand. Therefore, the genomic sequence is represented throughout this thesis using a single string of letters representing the nucleotides assembled in the 5’ to 3’ direction. The terms “positive” and “negative” strand are used to distinguish between the two strands when needed.
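The statement that one strand can be recovered from the other can be illustrated with a short sketch. The function name `reverse_complement` and the example string are illustrative assumptions, not part of the thesis; the only facts used are the base pairing rules and the convention of writing each strand in the 5’ to 3’ direction.

```python
# Base pairing rules: A bonds with T and G bonds with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, also written in the 5' to 3' direction.

    The strands are anti-parallel, so the complement is read in reverse.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


positive = "ATGGCC"                      # hypothetical positive-strand fragment
negative = reverse_complement(positive)  # the paired negative strand
print(negative)                          # GGCCAT
```

Applying the function twice returns the original string, which is why a single string of letters is enough to represent both strands.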

Figure 1.2 shows the distinct components of protein coding gene structure located in the chromosome, which are responsible for facilitating gene transcription into messenger RNA (mRNA). The gene begins with a promoter region, which contains nucleotide sequence that binds to the transcription complex, marking the beginning of a transcribed gene region, which occurs on a single strand. The promoter region includes a TATA sequence and a conserved protein binding sequence motif found upstream. In addition to the proximal promoter, distal transcription ‘enhancer’ and ‘silencer’ elements can occur over 50 kilobases upstream of the transcription start site, and the presence or absence of transcription factors binding to these sites facilitates initiation of transcription. The exon intervals consist of codons, which are DNA triplets that each encode one of the 20 amino acids, which ultimately define the derived protein sequence. The transcription start site is upstream of the ATG “start” codon, which marks the initiation of translation from codons to amino acids, with initiation further facilitated by the surrounding consensus (Kozak) sequence. The exons in the gene are potentially interrupted by introns, which are removed from the final functional mRNA form. Each intron begins with a ‘donor’ di-nucleotide site, commonly GT, and ends with an ‘acceptor’ di-nucleotide site, commonly AG. A “stop” codon marks the termination of translation and is followed by an untranslated region. Upon termination of transcription, the 3’ end of the pre-mRNA is cleaved and spliced to a poly A tail (an Adenine nucleotide polymer).

The bottom image in Figure 1.2 shows the final processed mRNA, which includes the leader 5’ untranslated region (UTR) and the 3’ UTR and which is exported outside the nucleus for translation by a complex collection of molecules called the ribosome. The key component of translation, which is of concern in predicting gene structure, is identifying the correct “reading frame”. The processed mRNA shown in Figure 1.2 is translated from start codon to stop codon, which implies a single reading frame, and since the codons are nucleotide triplets the length of the mRNA from start codon to stop codon should be a multiple of three. Note that when the gene is transcribed into mRNA, wherever the gene specifies a Thymine, the equivalent ribonucleic acid used to form the nascent mRNA is Uracil (chemically similar to but distinct from Thymine). For clarity, Thymine is referenced throughout the thesis with the understanding that it is always replaced by Uracil in RNA.
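The reading frame constraint can be checked mechanically. The following is a minimal sketch, assuming the standard genetic code's three stop codons (TAA, TAG, TGA), which the text does not list; the function names and the example sequence are hypothetical.

```python
# Stop codons of the standard genetic code (an assumption; not listed in the text).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_single_reading_frame(cds: str) -> bool:
    """Check the start-to-stop constraints described above for a spliced
    coding sequence: ATG start, stop codon at the end, length multiple of 3."""
    cds = cds.upper()
    return len(cds) % 3 == 0 and cds.startswith("ATG") and cds[-3:] in STOP_CODONS


def codons(cds: str) -> list:
    """Split a coding sequence into consecutive nucleotide triplets."""
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def transcribe(dna: str) -> str:
    """Write the sequence as RNA by replacing Thymine with Uracil."""
    return dna.upper().replace("T", "U")


example = "ATGGCCTGTTAG"                  # hypothetical coding sequence
print(has_single_reading_frame(example))  # True
print(codons(example))                    # ['ATG', 'GCC', 'TGT', 'TAG']
print(transcribe(example))                # AUGGCCUGUUAG
```

A sequence that fails the length check cannot be translated in a single frame from its start codon to its stop codon, which is why gene finders track codon position explicitly.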

Contents of a chromosome typically include non-protein coding RNA genes, which in some cases use the same transcription and splicing signals as protein coding genes, minus the codons used to assemble an amino acid sequence. While non-protein coding gene prediction remains an important and challenging problem, it is somewhat distinct from protein coding gene finding. The two ubiquitous non-protein coding genes, transfer RNA (tRNA) and ribosomal RNA (rRNA), are highly conserved in secondary structure among closely related species; thus computational approaches distinct from those used in eukaryotic protein coding gene prediction have been highly effective in finding these genes. An abundance of other non-coding RNA species are known, such as microRNA. As with tRNA and rRNA, microRNA gene finding searches for specific sequence conservation patterns [65]. For a review of non-protein coding gene prediction see [41].

This thesis deals specifically with predicting gene structure in protein coding genes, which remains a difficult problem to solve. Two specific aspects of protein coding gene structure prediction are addressed: single isoform gene structure prediction and alternative exon isoform prediction. The thesis claim is two-fold: first, using a diverse array of evidence sources yields more accurate predictions than single evidence prediction sources, and second, explicit modeling of alternative splicing accurately identifies multiple overlapping functional genes. Classifier stacking with sequential inference is shown to be an effective method for improving accuracy in gene structure prediction.

The remainder of this chapter defines the single isoform gene structure prediction problem, which serves as a basis for problems addressed in later chapters, and gives background on previous work in the area. Two recent gene finding reviews provide an overview of progress in gene prediction over the last ten years [17, 18]. The purpose of the overview given here is to describe in more detail the specific computational formulation used in previous gene finding programs, which provides a sequential inference framework for the algorithms described in later chapters.

1.1.1 Problem Definition

A genomic sequence is a double stranded DNA molecule with genes potentially occurring on both strands. The single isoform computational protein coding gene finding problem is defined as follows: given an input genomic sequence, output a label for each nucleotide in the sequence describing its functional relationship to the protein coding gene structure (as shown in Figure 1.2). The problem is to predict the location of each gene in the input sequence and, for each gene, identify the start codon, stop codon, each amino acid codon, the splice sites, the introns and the intergenic sequence. This can be formulated as a structured classification problem, in which class labels are assigned to each nucleotide in the sequence, subject to the global constraints of protein coding gene structure. Programs that address this problem are referred to as computational ‘gene finders’. The term ab initio is used to refer to gene finders that rely exclusively on information extracted from a single genome, rather than information obtained from other sources.
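To make the output of the problem concrete, the sketch below turns the exon coordinates of one gene into one label per nucleotide. The three-letter label alphabet (N for intergenic, E for exon, I for intron), the half-open coordinate convention and the function name are simplifying assumptions made for illustration; the label sets actually used in this thesis are richer.

```python
def label_nucleotides(seq_length, exons):
    """Label every position of a sequence given one gene's coding exons.

    exons: list of (start, end) intervals, 0-based and half-open, on the
    positive strand. Positions between the first and last exon that are not
    in an exon are introns; everything outside the gene is intergenic.
    """
    labels = ["N"] * seq_length            # default: intergenic
    if not exons:
        return "".join(labels)
    exons = sorted(exons)
    gene_start, gene_end = exons[0][0], exons[-1][1]
    for pos in range(gene_start, gene_end):
        labels[pos] = "I"                  # inside the gene: intron unless overwritten
    for start, end in exons:
        for pos in range(start, end):
            labels[pos] = "E"              # exon positions overwrite intron labels
    return "".join(labels)


# Hypothetical 12-nucleotide sequence with exons at [2, 5) and [8, 10).
print(label_nucleotides(12, [(2, 5), (8, 10)]))  # NNEEEIIIEENN
```

A gene finder solves the inverse task: it must choose the label string from the nucleotide sequence alone, subject to the constraint that the labels form a valid gene structure.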

1.2 Computational Framework for Gene Prediction

The earliest successful gene finding methods used the syntax of gene structure to label the DNA sequence [36]. This syntax is defined using a Finite State Automaton (FSA) to recognize valid DNA sequences that contain protein coding genes. The FSA in Figure 1.3 recognizes genes on a single DNA strand, beginning with a start codon, ending with a stop codon and possibly interrupted by one or more introns. In keeping with the convention of the gene finder programs covered in this review, only the exons, introns, intergenic sequence and the GT/AG consensus splice sites are shown. States can be added to recognize other features in the gene (e.g. promoter and polyadenylation signal sequence), as well as states to recognize genes on the opposite strand of a double stranded DNA molecule. States in Figure 1.3 are labeled with nucleotides given in lower case to maintain compactness in the figure, with the equivalent nucleotides referenced in the text as capital letters (for clarity).

The FSA is a 5-tuple $M_{FSA} = (Q, q_0, q_A, \Sigma, \delta)$:

• $Q$ is a set of states

• $q_0 \in Q$ is a unique initial state

• $q_A \in Q$ is the final (accepting) state

• $\Sigma = \{A, C, G, T\}$ is the alphabet (DNA bases)

• $\delta : Q \times (\Sigma \cup \{\$\}) \rightarrow Q$ is a transition from state $q'$ to $q$ reading nucleotide $n \in \Sigma$ from an input sequence

For notational convenience, all input sequences end with the end of string symbol $\$$ and all transitions leading to $q_A$ are defined $q_A = \delta(q, \$)$ for all $q \in Q$. Furthermore, assume there is a transition from the start state $q_0$ to each state shown in Figure 1.3 and a transition from each state in Figure 1.3 to the accepting state $q_A$. This definition allows for the recognition of partial genes, which can occur when analyzing a short genomic contig that contains only part of a gene. Given an input genomic sequence $S$, the goal is to find a series of transitions from the initial state $q_0$ to the accepting state $q_A$, where each nucleotide in $S$ is read by a transition. If such a series of transitions exists, $S$ is recognized by the FSA. More formally, let $S[i]$ be the nucleotide in $S$ at position $i$, taking transition $q_i = \delta(q_{i-1}, S[i-1])$ from state $q_{i-1}$ to $q_i$ and reading nucleotide $S[i-1]$ ($i \geq 1$). For a sequence of length $L$ ($L-1$ nucleotides plus the end of sequence symbol $\$$), the sequence parse is the sequence of states $\phi = q_0, q_1, q_2, \ldots, q_{L-1}, q_A$ with transitions $q_1 = \delta(q_0, S[0])$, $q_2 = \delta(q_1, S[1])$, $\ldots$, $q_A = \delta(q_{L-1}, \$)$ reading input $S = S[0], S[1], \ldots, S[L-2], \$$.

Different portions of the model in Figure 1.3 represent different parts of gene structure. Three codon types are labeled in Figure 1.3, codon$_0$, codon$_1$ and codon$_2$, which represent the relative position of the codon when interrupted by an intron. Since the codon is a nucleotide triplet, when the first base of an exon following an intron represents the first position of the triplet, it is said to be in phase 0. When the base is the second nucleotide in the triplet it is in phase 1, and when the first base of the exon is the last nucleotide of the codon triplet, it is in phase 2. Note that the start codon by definition always begins in phase 0. In general, the term phase is defined to be the index (0, 1 or 2) into a codon.

Consider the example of deciding whether sequence

S = ATGGCCTGTATAACTAGAACTCTAG$

is valid input for the single isoform gene structure prediction problem. One valid parse for S is to transition from the start state q_0 to the start codon state labeled A, followed by the states labeled T and G in Figure 1.2, then to take the transition into the states in codon_0 (labeled G, C and C), to codon_1 (labeled T), into the intron (labeled G, T, A, T, A, A, C, T, A and G), to the rest of the codon (labeled A and A), to codon_0 (labeled C, T and C) and finally to the stop codon (labeled T, A, G). It is also possible for the sequence to be accepted by taking some other series of transitions in the FSA. The sequence S could be an intron, and a valid sequence of transitions can be taken using the intron associated states (labeled intron_0, intron_1 and intron_2). Therefore, the model is ambiguous, since multiple parses can be used to read the same sequence.
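The example parse can be checked mechanically. The following is a minimal sketch, not part of the FSA formalism itself: the segment labels ("coding", "intron") and the helper names are illustrative assumptions, and the trailing $ is omitted. It verifies that the segments tile S, that the intron carries the GT/AG consensus, and it reports the phase at which each coding segment begins.

```python
# Example sequence from the text, without the end-of-string symbol $.
S = "ATGGCCTGTATAACTAGAACTCTAG"

# The parse described in the text, as (label, segment) pairs.
# The labels are hypothetical names chosen for this sketch.
PARSE = [
    ("coding", "ATG"),         # start codon
    ("coding", "GCC"),         # a complete codon
    ("coding", "T"),           # first base of the codon split by the intron
    ("intron", "GTATAACTAG"),  # intron with GT ... AG consensus
    ("coding", "AA"),          # remaining two bases of the split codon
    ("coding", "CTC"),         # a complete codon
    ("coding", "TAG"),         # stop codon
]


def segment_phases(parse):
    """Return (label, segment, phase) triples; phase is None for introns."""
    result = []
    coding_bases = 0  # number of coding bases read so far
    for label, seg in parse:
        if label == "intron":
            result.append((label, seg, None))
        else:
            result.append((label, seg, coding_bases % 3))
            coding_bases += len(seg)
    return result


def is_consistent(sequence, parse):
    """Check tiling, GT/AG introns, and that the spliced length is a multiple of 3."""
    tiles = "".join(seg for _, seg in parse) == sequence
    introns_ok = all(
        seg.startswith("GT") and seg.endswith("AG")
        for label, seg in parse
        if label == "intron"
    )
    coding = "".join(seg for label, seg in parse if label != "intron")
    return tiles and introns_ok and len(coding) % 3 == 0 and coding.startswith("ATG")
```

In this parse the segment following the intron ("AA") begins in phase 1, because seven coding bases precede it.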

To resolve the ambiguity in the model, the earliest gene finding approaches, which formed the basis for more recent gene finding programs, searched for the most probable parse using statistics collected from training sets. Examples include the programs HMMgene [89] and VEIL [66]. Given input sequence S, the goal is to find the optimal parse φ* = argmax_φ P(φ, S). The joint probability of the parse and the sequence is given by the equation P(φ, S) = P(S|φ) × P(φ). The parse φ that maximizes the probability of generating S is

\[ \phi^{*} = \arg\max_{\phi} P(S \mid \phi) \times P(\phi). \]

P(φ) is a parameter for modeling a priori knowledge of the model structure. P(S|φ) is the sequence modeling parameter, which defines the probability of the model emitting sequence S. In general, it is not practical to estimate parameters for the model P(S|φ) × P(φ), since the parameter space is exponential in size with respect to model structure, |Q|^L (|Q| being the number of states in the model), and sequence, 4^L. The joint probability of the parse and the sequence is therefore approximated using a hidden Markov model (HMM), which assumes each nucleotide is dependent only on the current state and the previous state. Thus, we can re-formulate the previous model M_FSA shown in Figure 1.3 to predict sequence using statistics compiled from a training set of previously known protein coding genes.

The transition function δ from M_FSA is replaced by parameters to predict the probability of generating nucleotide n taking the transition from state q′ to q. In the previously defined FSA each transition reads exactly one type of nucleotide. For example, in Figure 1.3 the transition within the start codon from state “a” to state “g” reads nucleotide G. In the new model, a transition from state q′ to state q outputs all values n ∈ Σ with probability P(n|q), where Σ = {A, C, G, T, $}.

The model M_FSA is redefined to be the 5-tuple M_H = (Q, q_0, q_A, Σ, θ), which is a hidden Markov model (HMM), similar to the FSA but replacing the transition function δ with transition and output probability parameters θ. For a given state transition sequence the probability of the parse is

q to maximize the following:

\[ D(k, q) = \max_{q'} \left\{ D(k-1, q') \times P(n_k \mid q) \times P(q \mid q') \right\}. \]

Each cell in the matrix D keeps a pointer to the previous state that led to maximizing the score in the current state. The optimal parse is recovered by tracing back through the links beginning at state q_A. The runtime cost is O(L × |Q|^2), where L is the sequence length and |Q| is the number of states in Q. Looking at each nucleotide independently of the originating sequence, P(n_k|q) acts as a classifier, with q* = argmax_q P(n_k|q) being the classification assigned to the fixed nucleotide n_k, enumerating over all possible state labels q. Classification decisions are only made once the dynamic programming matrix is filled and the highest scoring parse is recovered.
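The recurrence for D(k, q) and the traceback can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any gene finder cited here: it works in log space, uses an explicit initial distribution in place of the q_0 and q_A states, and the two-state toy model (N for an AT-rich state, C for a GC-rich state) with its probabilities is invented for the example.

```python
import math


def viterbi(seq, states, init, trans, emit):
    """Return the most probable state path for seq and its log probability."""
    log = math.log
    # D[k][q]: best log score of any path ending in state q after reading seq[k]
    D = [{q: log(init[q]) + log(emit[q][seq[0]]) for q in states}]
    back = []  # back[k-1][q]: best predecessor of state q at position k
    for k in range(1, len(seq)):
        row, ptr = {}, {}
        for q in states:
            prev = max(states, key=lambda p: D[k - 1][p] + log(trans[p][q]))
            row[q] = D[k - 1][prev] + log(trans[prev][q]) + log(emit[q][seq[k]])
            ptr[q] = prev
        D.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda q: D[-1][q])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, D[-1][last]


# Hypothetical toy model: all values below are assumptions for illustration.
STATES = ["N", "C"]
INIT = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "C": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}
```

The two nested loops over states inside the loop over positions give the O(L × |Q|^2) cost stated above. With the toy parameters, a run of six G's is long enough to justify two state changes, whereas a single G is not.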


Figure 1.4: Generalized hidden Markov model of protein coding genes using the components described in Figure 1.3.

1.2.1 Generalized Hidden Markov Models

Improvements were made to statistical gene finders by observing that the distribution of exon lengths is not geometric, a fact implicitly assumed in the hidden Markov model [22, 92]. An explicit exon length probability distribution is modeled using a generalized hidden Markov model (GHMM) (also referred to as a semi hidden Markov model [23] or an explicit duration hidden Markov model [141]). Examples of this approach were first introduced in the programs Genscan [22] and Genie [91]. The GHMM M_H′ equivalent to the hidden Markov model M_H is shown in Figure 1.4. Figure 1.4 expands on the previous HMM, allowing states to output more than one nucleotide rather than a single nucleotide. Sequence “signals” are defined to be fixed length features in the sequence and refer to the start codon, stop codon and the consensus splice sites. Each state is an interval between signals in the sequence. For example, the “Initial” state is an exon, which begins with a start codon (ATG) and ends just upstream of the donor site (a consensus GT signal). The equivalent in Figure 1.2 is to begin in the start codon (ATG), transition into one of the “codon” states and end in one of the three intron states. The intron state depends on whether the exon ends precisely at a codon boundary or whether the codon continues into the next exon. This is represented in Figure 1.4 by the three transitions into states Intron 1, Intron 2 and Intron 3. The equivalent parse for sequence S = ATGGCCTGTATAACTAGAACTCTAG$ using the previous example, but with the GHMM, is q_0, q_1 = Initial, q_2 = Intron 3, q_3 = Terminal, q_A. The alternative parse labeling the sequence S as an intron sequence is now represented with one of the three “Intron” states.

State q previously output a single nucleotide; now let the state output a string of nucleotides y = y_0, ..., y_{l−1}, where l(y) is the number of nucleotides. More formally, let y = S[j, k] denote the subsequence in S from position j to k inclusive and let y_i = S[a_{i−1} + 1, a_i] be the subsequence in S output after taking transition q_{i−1} to q_i, where a_i = a_{i−1} + l(y_i) and a_0 = 0 (i ≥ 1). The model is defined as M_H′ = (Q, q_0, q_A, Σ, R, O, L) where

• R is the set of transition probabilities, with the probability of taking transition q′ to q

input S = y_1, y_2, ..., y_{n−1}, $

\[ P(S = y_1, \ldots, y_n, \phi = q_0, q_1, \ldots, q_{n-1}, q_A) = \prod_{i=1}^{n} P_O(y_i \mid q_i) \times P_R(q_i \mid q_{i-1}) \times P_L(l(y_i)) \tag{1.1} \]

where N = Σ_{i=1}^{n} l(y_i). Each state outputs a sequence of explicit length l(y_i), with each y_i defined to be a non-overlapping subsequence in S. If the output lengths are fixed, the model can be converted into the previously defined HMM. Notice that there is still an implied geometric distribution in the model; however, it now applies to modeling the number of exons per gene in a multi-exon gene.
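To make the three factors of Equation 1.1 concrete, the sketch below scores one GHMM parse in log space. It is an illustration under assumptions, not the scoring code of any cited program: the uniform emission model, the transition table and the length tables are toy values invented here, the length distribution is indexed by state, the splice site signals are folded into the intron interval, and the final transition into q_A is left out.

```python
import math


def ghmm_log_prob(parse, emit, trans, length, start="q0"):
    """Sum log P_O(y_i | q_i) + log P_R(q_i | q_{i-1}) + log P_L(l(y_i)) over the parse."""
    total, prev = 0.0, start
    for state, y in parse:
        total += math.log(emit(state, y))             # content of the interval
        total += math.log(trans[(prev, state)])       # state transition
        total += math.log(length[state][len(y)])      # explicit interval length
        prev = state
    return total


def uniform_emit(state, y):
    """Toy emission model: every nucleotide has probability 1/4 in every state."""
    return 0.25 ** len(y)


# Hypothetical parameters for the example parse of S used in the text.
TRANS = {
    ("q0", "Initial"): 1.0,
    ("Initial", "Intron 3"): 1.0,
    ("Intron 3", "Terminal"): 1.0,
}
LENGTH = {"Initial": {7: 0.5}, "Intron 3": {10: 0.2}, "Terminal": {8: 0.25}}
PARSE = [
    ("Initial", "ATGGCCT"),
    ("Intron 3", "GTATAACTAG"),
    ("Terminal", "AACTCTAG"),
]
```

Because each state contributes one length term for its whole interval, any length distribution can be plugged into `LENGTH`, which is the point of the generalization.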

An analog to the Viterbi algorithm, which will be described in more detail in Chapter 2 and Chapter 3, is used to find the most probable sequence parse. The drawback to this approach is that the inference algorithm is slower than the Viterbi algorithm. When naively implemented, the runtime cost is O(N^2 · |Q|^2), where N is the length of the sequence and |Q| is the number of states in the model. In the earliest days of genomic sequencing projects it was not uncommon to handle thousands of short sequences, each containing at most one gene or even only part of a gene. With improvements to the sequencing technology and the development of assembly algorithms [133], it is now common to have access to sequences that range in length from 10,000 bases to millions of bases. To improve the efficiency of the inference algorithm, the intergenic states and the intron states are assumed to have lengths that are geometrically distributed [23, 92].
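The geometric length assumption follows directly from a state with a self-transition: staying in the state for exactly d nucleotides means taking the self-transition d − 1 times and then leaving. A short sketch, with the self-transition probability `p_stay` as an assumed parameter name:

```python
def geometric_length_prob(d, p_stay):
    """Probability that a state with self-transition probability p_stay emits exactly d bases."""
    if d < 1:
        raise ValueError("length must be at least 1")
    return p_stay ** (d - 1) * (1.0 - p_stay)


def expected_length(p_stay):
    """Mean of the geometric length distribution."""
    return 1.0 / (1.0 - p_stay)
```

The distribution is monotonically decreasing in d, so the most probable length is always 1, whatever the mean. This is why it is a poor fit for exons but an acceptable approximation for long intron and intergenic intervals.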

More recent work has shown that, in fact, modeling intron length can improve prediction performance. This approach was implemented in the program Augustus [164] and also incorporated in the gene prediction program Twinscan [166]. The efficiency trade off is handled by defining length cutoffs, in which a geometric length model is used for very long intron sequences. The GHMM formalism has been widely used among many gene finders including SNAP [86], GlimmerHMM [103], GeneZilla [103], Phat [26], Fgenesh [146] and GeneMark.hmm [102], with each program implementing different statistical sequence modeling methods. In Equation 1.1 the details of sequence modeling in the GHMM are abstracted in the term P(y_i|q_i). Useful sequence modeling techniques currently used are briefly reviewed. Not all gene finders use hidden Markov models; notable exceptions are GeneID [59] and GlimmerM [130], which use a dynamic programming based inference algorithm that serves as the basis for the algorithm described in more detail in Chapter 2. Both programs still use the statistical sequence modeling methods described in the next section. The principal difference is the lack of a formal graphical model for prediction and the lack of state transition probabilities.

1.2.2 Statistical Sequence Modeling

Sequence emission probabilities are divided into two categories: the “signals”, which refer to the start codon, stop codon and consensus splice sites, and “content sensors”, which model the variable length portions of the sequence and include the exon, intron and intergenic intervals [106]. In the simplest form, each nucleotide is emitted from the model with the first order dependencies encoded in the hidden Markov model shown in Figure 1.3. Using the generalized hidden Markov model, the statistical sequence model can capture higher order dependencies of the form:

\[ P(y_i \mid q_i) = \prod_{m=j}^{k} P(S[m] \mid S[j, m-1]) \]

where y_i is a subsequence in S from j to k and the probability of emitting nucleotide S[m] is dependent on the previous nucleotide subsequence S[j, m − 1]. Since the parameter space is exponential with respect to sequence length, gene finding software uses fixed order Markov models. In principle, higher order dependencies can be incorporated into the hidden Markov model framework shown in Figure 1.3 by adding states; however, with the increased number of states comes an increase in running time and memory requirements. In practice, gene finding programs such as Genscan and Genie use 5th order 3-periodic inhomogeneous Markov models for the protein coding intervals. If state q_i represents a coding exon, which in Figure 1.4 is one of the six states Internal 1, Internal 2, Internal 3, Initial, Terminal or Single, the probability of the exon interval in S from index j to k is:

\[ \prod_{m=j,\; r = m \bmod 3}^{k} P^{r}(S[m] \mid S[m-1], S[m-2], \ldots, S[m-o]) \]

where r refers to the codon phase, which is the relative position (0, 1 or 2) in the codon, and o is the order of the Markov model. Higher order models are at least as accurate as a lower order model when there is sufficient data to accurately estimate the parameters of the model [147]. With newly sequenced genomes and limited training sets, the availability of large training sets cannot be assumed.
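A 3-periodic Markov model of order o amounts to keeping a separate table of context counts for each codon phase. The sketch below is a simplified illustration, not the estimator used by Genscan or Genie: it assumes every training sequence begins in phase 0, skips the first o bases when scoring, and smooths with an add-one pseudocount, all of which are choices made here for brevity.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def train_periodic_markov(coding_seqs, order):
    """Count next-base occurrences keyed by (phase, preceding `order` bases)."""
    counts = {}
    for seq in coding_seqs:
        for m in range(order, len(seq)):
            key = (m % 3, seq[m - order:m])  # phase r = m mod 3 selects the table
            counts.setdefault(key, Counter())[seq[m]] += 1
    return counts


def log_prob(seq, counts, order, pseudocount=1.0):
    """Log probability of seq (positions order..end) under the phase-specific tables."""
    total = 0.0
    for m in range(order, len(seq)):
        ctx = counts.get((m % 3, seq[m - order:m]), {})
        seen = sum(ctx.values())
        numerator = ctx.get(seq[m], 0) + pseudocount
        total += math.log(numerator / (seen + pseudocount * len(ALPHABET)))
    return total
```

With order 5 each phase has 4^5 contexts with four outcomes each, which illustrates why the amount of training data limits the usable order.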

One approach to get around this problem is interpolated Markov models (IMMs), which have been used in the gene finders GlimmerM [147], GlimmerHMM [103], GeneZilla [103] and EuGene [152]. GlimmerM, for example, uses nine different 3-periodic Markov models labeled 0 through 8, where the number denotes the order of the model (e.g. 1st or 2nd order Markov chains). The goal is to use the highest order model whenever possible. Since in many cases the amount of training data is limited, the probability estimates for the higher order models are less reliable. In these cases the IMM looks at the lower order models to compensate for the lack of training data. A distinct IMM is defined for each coding phase r. The choice of models for a given phase is determined by an interpolation of the probabilities of Markov models from order 0 to 8, defined recursively as:

\[ \mathrm{IMM}^{r}(S, m, k) = w(S, m-1, k) \times P_r(S[m] \mid S[m-1], \ldots, S[m-k]) + (1 - w(S, m-1, k)) \times \mathrm{IMM}^{r}(S, m, k-1) \]

with mono-nucleotide frequencies used when k = 0. The weighting function w(S, m, k) compares the nucleic acid distribution of the kth order Markov model and the (k−1)th order Markov model using a χ² test. If the function c(S[x, y]) returns the number of times the substring S[x, y] is observed in a training set and · is the string concatenation operator, then

\[ v = \sum_{b \in \{A,C,G,T\}} \frac{\left( c(b \cdot S[m-1-k, m]) - c(b \cdot S[m-2-k, m]) \right)^{2}}{c(b \cdot S[m-2-k, m])} \]

When the higher order model does not significantly differ from the lower order model by a fixed threshold confidence value, the function w(S, m, k) returns 1; otherwise

\[ 1 - (c/400) \sum_{b \in \{A,C,G,T\}} c(S[m-1-k, m-1] \cdot b) \]

is returned, where c is the χ² confidence value.
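The recursive interpolation can be sketched independently of how the weights are computed. In the sketch below the per-order probability and the weighting function are passed in as callables; the toy tables standing in for them are assumptions made for the example and do not reproduce the χ²-based weights of GlimmerM. The phase superscript r is dropped, so one such function would be used per coding phase.

```python
def imm_prob(seq, m, k, prob, weight):
    """Interpolated probability of seq[m] using Markov models of order k down to 0.

    prob(seq, m, k)   -> probability of seq[m] under the order-k model
    weight(seq, m, k) -> interpolation weight in [0, 1] for the order-k model
    """
    k = min(k, m)  # cannot condition on more bases than precede position m
    if k == 0:
        return prob(seq, m, 0)  # mono-nucleotide frequency
    w = weight(seq, m - 1, k)
    return w * prob(seq, m, k) + (1.0 - w) * imm_prob(seq, m, k - 1, prob, weight)


# Hypothetical stand-ins: fixed probability and weight per model order.
TOY_PROB = {0: 0.25, 1: 0.4, 2: 0.8}
TOY_WEIGHT = {1: 1.0, 2: 0.5}


def toy_prob(seq, m, k):
    return TOY_PROB[k]


def toy_weight(seq, m, k):
    return TOY_WEIGHT[k]
```

With these toy values the order-2 estimate (0.8) receives half the weight and the remainder falls through to the order-1 estimate (0.4), which is fully trusted, giving 0.5 × 0.8 + 0.5 × 0.4 = 0.6.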

Intergenic and intron sequences do not display the same coding phase pattern found in exons, and therefore it is more appropriate to use fixed-order homogeneous Markov models. The sequence signals (“start codon”, “stop codon” and splice sites) are typically modeled as first order inhomogeneous models, using a fixed sequence window. These are also called conditional probability matrices [148] or weight array models [189], which take the form of

\[ P^{0}(S[k]) \prod_{i=k+1}^{k+w} P^{i-k}(S[i] \mid S[i-1]) \]

where w is the length of the sequence window and the term i − k refers to the relative position in the window. Note that 0th order models are referred to as position weight matrices (PWMs).
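A weight array model can be estimated from a set of aligned signal windows by counting, at each window position, how often each nucleotide follows each preceding nucleotide. The following is a minimal sketch under assumptions: the training windows are equal-length strings, an add-one pseudocount is used for smoothing, and the four toy donor-like sites are invented for the example.

```python
import math
from collections import Counter


def train_wam(sites, pseudocount=1.0):
    """Estimate position-specific first order counts from aligned windows."""
    width = len(sites[0])
    first = Counter(s[0] for s in sites)  # 0th order counts for the first position
    pair = [Counter((s[i - 1], s[i]) for s in sites) for i in range(1, width)]
    prev = [Counter(s[i - 1] for s in sites) for i in range(1, width)]
    return first, pair, prev, len(sites), pseudocount


def wam_log_prob(window, model):
    """Log probability of a window: first position, then position-specific transitions."""
    first, pair, prev, n, pc = model
    total = math.log((first[window[0]] + pc) / (n + 4 * pc))
    for i in range(1, len(window)):
        numerator = pair[i - 1][(window[i - 1], window[i])] + pc
        denominator = prev[i - 1][window[i - 1]] + 4 * pc
        total += math.log(numerator / denominator)
    return total


# Hypothetical aligned training windows.
SITES = ["AGGT", "AGGT", "CGGT", "AGGA"]
```

Dropping the conditioning on the previous base (keeping only per-position base counts) would turn this into a position weight matrix.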

It has been shown that local dependencies alone cannot be assumed to accurately model certain features in the sequence [24]. For example, the base pairing of small nuclear RNA to the donor splice site (the splicing process is reviewed in Chapter 4) can be compensatory, so that if mutations in one region limit the strength of the chemical bonds, compensatory base pairing is observed in another location [22]. Other “semi-local” effects are observed, for example modeling codons but ignoring the third position of the codon when it does not impact the encoded amino acid. These ideas are implemented in the gene finder GLIMMER [83] as interpolated context models (ICMs) and as maximal dependence decomposition (MDD) trees in Genscan. The basic idea is to define a model for each relative position of interest. For example, in the 3-periodic Markov models, three models are defined, one for each position in the codon. In the conditional probability matrices (or WAMs), each column in the matrix constitutes a separate model. The version implemented in Genscan [23] and used as part of the splice site prediction program GeneSplicer [129] is a binary decision tree. A path from the root to a leaf defines a context, where each node is a nucleotide occurring at a position in the sequence. A leaf in the tree contains all training examples that have nucleotides at the positions specified at the nodes in the path from the root to the leaf. Rather than the traditional kth order Markov model P^r(S[i] | S[i − 1], ..., S[i − 1 − k]), the result is P^r(S[i] | T(S, i − 1, k)), where the function T is defined by the decision tree and the probability of the base at position i is dependent potentially on the identity of bases at non-adjacent locations within the fixed window.

Let X_i be the discrete random variable taking on the value of one of the four nucleotides (A, C, G, T) with some probability at position i in the window. Intuitively, the goal is to measure the degree of statistical independence between each position i and position j (for i ≠ j) in the sequence, which can be done using the χ² test comparing the distributions X_i and X_j. The interpolated context models (ICMs) used in GLIMMER [33] use a 4-ary tree rather than a binary tree. The ICMs model codons rather than splice signals, meaning there is greater variation in nucleotide type at each position. In the case of a donor site model, for example, where the subsequence forms a base-pairing with the snRNA U1 (see Chapter 4 for further description of the splicing process), there is frequently one base at each position that occurs far more frequently than the others. In this case it makes sense to divide the data according to those examples that match the consensus base and those that do not, rather than storing a separate entry for all four nucleotide types. A second difference between the MDD and ICM approach is in the selection criteria for partitioning the data. The ICM model uses mutual information rather than the χ² statistic. There are three stopping criteria for the tree building procedure: 1) the distribution of nucleotide values at position j appears to be independent of i, 2) the depth of the tree is j and 3) the size of the remaining training set falls below a fixed threshold (to prevent unreliable probability estimates). Pseudo code for the tree building procedure is shown in Algorithm 1.

Algorithm 1 Tree building procedure for maximal dependence decomposition. The “partition” function divides the training data into two sets D_1 and D_2, with D_1 containing the examples with the consensus nucleotide C_{i′} at position i′ and D_2 containing the remaining examples.

    build-tree(D)
        i′ = argmax_i Σ_{j≠i} χ²(C_i, X_j)
        (D_1, D_2) = partition(D, C_{i′})
        if not valid stopping criteria on D_1 then
            build-tree(D_1)
        if not valid stopping criteria on D_2 then
            build-tree(D_2)
        return (D_1, D_2)

Each leaf in the tree includes a probability in the form of a position weight matrix, which estimates the probability distribution of each nucleotide at each position in the window using a 0th order Markov model. Alternate tree construction criteria have been tested and used in the gene finding program SLAM; for a review see [193].
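Algorithm 1 can be turned into a small runnable sketch. What follows is an illustration under assumptions rather than the procedure implemented in Genscan or GeneSplicer: the χ² statistic is computed directly from a 2 × 4 contingency table, the independence criterion is reduced to a plain cutoff `min_chi2` on the summed statistic (a real implementation would use a significance threshold), and the parameter names `min_size` and `max_depth` are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"


def consensus(sites, i):
    """Most frequent nucleotide at window position i."""
    return Counter(s[i] for s in sites).most_common(1)[0][0]


def chi_square(sites, i, j):
    """Chi-square statistic between 'matches consensus at i' and the base at j."""
    c = consensus(sites, i)
    table = Counter((s[i] == c, s[j]) for s in sites)
    n = len(sites)
    stat = 0.0
    for match in (True, False):
        row = sum(table[(match, b)] for b in ALPHABET)
        for b in ALPHABET:
            col = table[(True, b)] + table[(False, b)]
            expected = row * col / n
            if expected > 0:
                stat += (table[(match, b)] - expected) ** 2 / expected
    return stat


def build_tree(sites, used=(), min_size=4, max_depth=3, min_chi2=0.0):
    """Recursively split sites on the position with the strongest dependencies."""
    width = len(sites[0])
    free = [i for i in range(width) if i not in used]
    # Stopping criteria: too few examples, tree too deep, or no position left.
    if len(sites) < min_size or len(used) >= max_depth or not free:
        return {"leaf": sites}
    score = {
        i: sum(chi_square(sites, i, j) for j in range(width) if j != i)
        for i in free
    }
    best = max(free, key=score.get)
    if score[best] <= min_chi2:  # positions look independent: stop splitting
        return {"leaf": sites}
    c = consensus(sites, best)
    d1 = [s for s in sites if s[best] == c]   # examples matching the consensus
    d2 = [s for s in sites if s[best] != c]   # remaining examples
    nxt = used + (best,)
    return {
        "position": best,
        "consensus": c,
        "match": build_tree(d1, nxt, min_size, max_depth, min_chi2),
        "other": build_tree(d2, nxt, min_size, max_depth, min_chi2),
    }
```

Each leaf holds the training subset from which a position weight matrix would then be estimated.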

1.2.3 Parameter Estimation

There have been some attempts to use unsupervised learning methods to estimate parameters of hidden Markov models using the Baum-Welch algorithm [66]. However, given the availability of labeled examples of gene structure, essentially all of the published gene finding methods make use of supervised learning. A maximum likelihood approach is used to iterate over a labeled training set. Example genes are taken from initial expression studies or from a homologue to a sequenced gene of a closely related species. Problems, however, can still arise when analyzing novel genomes with limited amounts of labeled training data. Two studies show how bootstrapping methods can provide a useful complementary approach to supervised learning. The first bootstrapping method was introduced in the gene finder SNAP [86], which begins with model parameters estimated from labeled training data from another organism. A second bootstrapping method was introduced in the latest version of GeneMark.hmm [101], which initializes model parameters to default initial values. Once initial parameters are set, both programs predict genes in the genome using an inference algorithm. At each iteration, the newly predicted gene set is taken to be the labeled training set and used to estimate a new set of parameters. Once the newly predicted gene set matches the previous training set within some fixed threshold, the training procedure terminates and outputs the gene prediction model with the current parameter estimates.
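The bootstrapping loop can be written generically. The sketch below is not the training code of SNAP or GeneMark.hmm: the `predict` and `estimate` callables, the use of set overlap (Jaccard index) as the agreement measure, and the `threshold` and `max_iter` defaults are all assumptions chosen for illustration. The toy functions underneath exist only to exercise the loop.

```python
def bootstrap_train(genome, params, predict, estimate, threshold=0.99, max_iter=20):
    """Alternate prediction and re-estimation until the predicted gene set stabilizes.

    predict(genome, params) -> set of predicted genes
    estimate(genes)         -> new parameters from the predicted set
    Returns the final parameters and the number of prediction rounds used.
    """
    previous = None
    for iteration in range(1, max_iter + 1):
        predicted = predict(genome, params)
        if previous is not None:
            union = previous | predicted
            agreement = len(previous & predicted) / len(union) if union else 1.0
            if agreement >= threshold:
                return params, iteration
        params = estimate(predicted)  # predicted set becomes the training set
        previous = predicted
    return params, max_iter


# Toy stand-ins: "genes" are numbers above a cutoff, the parameter is the cutoff.
def toy_predict(genome, cutoff):
    return {x for x in genome if x >= cutoff}


def toy_estimate(genes):
    return sum(genes) / len(genes) - 3
```

On the toy input `[1, 2, 3, 10, 11, 12]` with an initial cutoff of 0, the predicted set shrinks to `{10, 11, 12}` in the second round and is unchanged in the third, at which point the loop stops.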


1.2.4 Integration of Extrinsic Evidence

When trying to come up with an exhaustive list of genes for an organism it is important to make use of all the available evidence, since the presence of a gene may be detected using one method and missed by another. An extrinsic evidence source refers to gene structure evidence originating from a source other than the information contained within the organism’s genomic DNA. The most reliable evidence still comes from isolating RNA in the cell and establishing its identity through various sequencing methods. This is the ideal form of evidence because the gene is actually being observed “in action”. This experimental evidence used for gene structure prediction comes in the form of full length cDNAs and ESTs. Even after an mRNA is isolated and sequenced it can still be a non-trivial task to identify the originating location in the organism’s genome. Problems can occur for several reasons. First, polyploidy or population-level mutations mean that differences are observed within the sequenced population, leading to differences between the reference genome and the isolated RNA. A second problem is genome duplication. Many genomes have had major genome duplication events (see Arabidopsis thaliana for example [74]), leading to duplicate regions in the genome, which give the appearance of a single RNA sequence having originated from multiple locations. Finally, although DNA sequencing is fairly accurate, sequencing errors can occur when using high throughput methods such as EST sequencing.

In addition to using evidence of expressed genes from the same organism, it is also useful to consider evidence of gene expression from other organisms. The term ortholog describes two genes occurring in two separate species that have a common ancestor [47]. Mapping the evidence from a gene expressed in the related organism to the organism under study requires searching for the matching orthologous gene. The problem becomes more difficult as the possibility for greater differences between the two related genes increases.

Another source of evidence is gene expression from paralogs. Paralogs are genes that occur within the same species and share a common gene ancestor [47]. A fourth approach is to use sequence similarity as evidence for the existence of a gene. Use of an arbitrary similarity measure is justified by the hypothesis that, regardless of the evolutionary history, when a sequence shows substantial similarity to a known gene, the sequence is more likely to share a similar function. One example of this approach is the searching of protein domain databases for similarities with genomic sequence used in Genie [93]. The evolutionary argument for the existence of a gene can be confounded by the occurrence of pseudogenes. Pseudogenes are genes that are no longer functional, having arisen from a duplication event in which one gene retains function and the pseudogene accumulates nucleotide mutations rendering it non-functional [192]. Despite the accumulation of mutations, much of the pseudogene retains sequence common to the functioning paralog, making it difficult to distinguish the pseudogene from the functional gene. Another important extrinsic source of evidence is the use of protein sequences, which are usually derived from translating sequenced mRNA. Comparing amino acid sequences can potentially reveal more distant evolutionary relationships, since an accumulation of mutations at the DNA level may not affect the encoded protein. The algorithms reviewed here incorporate evidence of cDNAs, proteins and genomic DNA from other organisms into gene finding programs used to predict gene structure.

Many of the early programs used sequence conservation to identify candidate orthologous exons, which were assembled into gene structures using dynamic programming algorithms. Examples are found in CEM [9], ROSETTA [12] and SGP-1 [179]. Much of the recent gene finding work has focused on incorporating extrinsic sources of evidence into the HMM and GHMM framework.

One probabilistic approach to incorporating extrinsic evidence is to use a pair hidden Markov model (pair-HMM), an approach taken in the program Doublescan [109]. Recall the single species hidden Markov model formulation P(S[k]|q) × P(q|q′), which is used to model the probability of emitting nucleotide S[k] at position k in sequence S. The alphabet Σ, previously defined to be the set of all possible nucleic acids, is now expanded to include an “insertion” symbol, and the probabilistic formulation is changed to P(S_1[k_1], S_2[k_2] | q) × P(q|q′), where S_1[k_1], S_2[k_2] are a pair of nucleotides, one occurring in sequence S_1 and the other occurring in sequence S_2, or a nucleotide paired with an insertion symbol. The indices k_1 and k_2 are the positions in the two respective sequences S_1 and S_2. An index k_1 or k_2 is only incremented when a state emits the nucleotide in the corresponding sequence and not when an insertion symbol is emitted; the insertion symbol can be thought of as emitting a “null” symbol [77], which is not observed. The two sequences are simultaneously aligned, scoring the degree of sequence conservation between the two sequences while scoring the likelihood of the gene structures predicted to occur in the two sequences. The program Projector [110] is an extension of the pair-HMM implemented in Doublescan. In Projector, rather than assuming two anonymous DNA sequences as input, an annotated gene structure for one sequence is included as input. The idea is that if a gene is known to exist in one species and there is evidence for the precise gene structure boundaries in the sequence, a rigorous probabilistic approach can be used to accurately identify the precise gene structure in the related species.

Generalized pair hidden Markov models have also been used to take advantage of modeling explicit sequence interval lengths, just as the GHMMs were used in the ab initio case. This is the approach taken by SLAM [3] and more recently TWAIN [104]. In this approach the generalized hidden Markov model is transformed to output pairs of sequences (including insertion symbols) of explicit variable length of the form

\[ \prod_{i=1}^{x} P(y'_i \mid q_i) \times P(q_i \mid q_{i-1}) \times P(l(y'_i)) \]

where x is the number of hidden states taken to output both pairs of sequence and each state q_i emits a pair of sequences y′_i.

In addition to aligning two genomic sequences, predictions are sometimes based on aligning cDNA and protein sequences to genomic DNA. Protein to genomic sequence alignments are specifically addressed in SLAM and GeneWise [114]. Both methods incorporate concepts introduced in profile hidden Markov models [40], which use “insertion”, “deletion” and “match” states to model the probability of conservation patterns occurring independent of the sequence itself. A sub-model P(s|s′) of the sequence pair model P(y′_i|q_i) is introduced to estimate the probability of arriving in state s given the previous state s′, where s and s′ are each one of the three states defined by the alignment sequence conservation information: “insertion”, “deletion” or “match/mismatch”.

The computational cost to output two sequences in parallel rises considerably from the single sequence case. The dominating cost becomes the product of the lengths of the two input sequences, which in the context of predicting the locations of genes in long contiguous stretches of genomic sequence is prohibitive. Therefore, in both the pair-HMM and the pair-GHMM methods it is necessary to first generate approximate alignments to limit the search space of possible sequence alignments. Overall, runtime in the pair-GHMM is O(D^2 |Q|^2 N_1 N_2), where D is a maximum length on the emitted sequences, |Q| is the number of states and N_1 and N_2 are the lengths of the two respective evaluated sequences S_1 and S_2 [123]. Limiting the lengths N_1 and N_2 and imposing a limit on the number of consecutive insertions/deletions permitted between two sequences are employed in each of the pair-HMM and pair-GHMM programs to allow for annotating large genomes. Despite the improved prediction accuracy of the generalized pair-HMM over the pair-HMM in simultaneously predicting gene structure in two genomes, an advantage of the pair-HMM approach is that it is computationally feasible to predict intron loss/gain [109]. Intron loss/gain is a prevalent feature of evolution [144] and occurs when an intron is either inserted into an ancestral exon or removed from an ancestral sequence (with the resulting gene structure retained in subsequent generations). The result is that, when predicting the orthologous genes in the two related species, it is possible for one sequence to contain one long contiguous exon with the matching exon in the other sequence being interrupted by an intron. Similarly, exon loss/gain is known to occur, and in these cases an intron occurs in one sequence where the matching intron from the other sequence is interrupted by an exon.

Another computational approach to using extrinsic evidence is to assume that a sequence alignment is given as input. Genie [93] is the earliest published report to use extrinsic evidence within the GHMM framework, but it was followed by many other programs including Fgenesh+ [146], Genomescan [184] and Twinscan [87], each of which uses the GHMM framework, and the closely related approach of SGP-2 [125]. In Genie, an exon coding interval is scored based on a statistical codon model plus the scores computed from matched protein domain profiles [159] and blastx [56] alignments. Protein information is used when a protein match aligns to two adjacent regions, implying an intron. When such a case arises, Genie requires that the upstream exon be paired with a downstream exon overlapping the matching protein alignment.

Genomescan uses blastx to find matches to sequences in a protein database and the

Title	Predicting Gene Structure in Eukaryotic Genomes
Author	Jonathan Edward Allen
Advisors	Steven L. Salzberg, Jason M. Eisner
Institution	The Johns Hopkins University
Degree	Doctor of Philosophy
Type	dissertation
Year of publication	2006
City	Baltimore

Pages	215