PREDICTING GENE STRUCTURE IN EUKARYOTIC GENOMES
by Jonathan Edward Allen
A dissertation submitted to The Johns Hopkins University in conformity with the
requirements for the degree of Doctor of Philosophy
Baltimore, Maryland
September, 2006
© Jonathan Edward Allen 2006. All rights reserved.
UMI Number: 3240661
Obtaining the complete set of proteins for each eukaryotic organism is an important step in the quest to understand how life evolves and functions. The complex physiology of eukaryotic cells, however, makes direct observation of proteins and their parent genes difficult to achieve. An organism’s genome provides the raw data that contains the set of instructions for generating the complete set of proteins, providing the potential to obtain a complete list of proteins without having to rely exclusively on direct observations in the cell. Computational gene prediction systems, therefore, play an important role in compiling sets of putative proteins for each sequenced genome.
This dissertation addresses the problem of computational gene prediction in eukaryotic genomes, presenting a framework for predicting precise single isoform protein coding genes in long contiguous stretches of DNA. The framework is extended to predict overlapping alternatively spliced exons in known protein coding regions. A main contribution of this work is to apply classifier stacking with sequential inference, for the first time, to the gene finding problem and to develop a phylogenetic generalized hidden Markov model for the alternative splice site prediction problem. First a linear weighting scheme is developed, which is extended to a statistical prediction model. The statistical model is then transformed to a new sequential inference model to predict alternatively spliced exons.
Prediction accuracy of the single isoform gene prediction methods is tested on three eukaryotic genomes: Arabidopsis thaliana, Oryza sativa and human. Application of the gene prediction methods is examined in other eukaryotic genomes. The alternatively spliced exon prediction model is tested in four Drosophila species under a variety of input conditions.
Incorporating multiple sources of gene structure evidence is shown to substantially improve single isoform gene prediction accuracy, with performance beginning to rival the accuracy of expert human annotators. Results from the alternative exon prediction experiments demonstrate the potential to reliably predict new alternatively spliced forms of known genes. The use of cross-species sequence conservation information is shown to enhance the precision of alternatively spliced exon prediction.
Adviser: Steven L. Salzberg
Readers: Steven L. Salzberg and Jason M. Eisner
I would like to thank my adviser Steven L. Salzberg for his guidance, patience and support and for giving me the freedom to pursue challenging research problems. Dr. Salzberg has made many helpful suggestions, which improved the quality of my work over the last several years. Thank you to Jason Eisner for informing me of important related work in machine learning and natural language processing. I would also like to thank other members of Dr. Salzberg’s group, including Mihaela Pertea and William H. Majoros, with whom I had many enlightening discussions on gene finding work. I also benefited from the many useful discussions on bioinformatics topics with other members of the group, including Pawel Gajer, Maria D. Ermolaeva, Arthur L. Delcher and Mihai Pop. Thanks to the many people at The Institute for Genomic Research who were very helpful in providing useful data to work on, including Brian Haas, Bernard Suh, Chunhui Yu, Sam Angioli, Ahwui Wang, Robin Buell, Malcolm Gardner, Jane Carlton, Elodie Ghedon and Brendon Loftus.
I would also like to thank Harold Gainer for advice and for providing me with an interesting biological problem to work on, and I thank S. Rao Kosaraju for his positive supervision on this project. Thank you to Marvin Cook for many productive study sessions, which helped me get more out of many of the courses we took together.
Thank you to my wife Safia Ahmed Omar for her love and support and for helping me to keep my life in proper perspective. Thanks to my family Leah Lewis, Karen Kramer, Wise D. Allen and especially my parents Wise and Joan Allen. Without their support and encouragement my educational pursuits would not have been possible.
4 Prediction of Alternatively Spliced Exons
4.4.1 Graphical Model
4.4.2 A Generalized Hidden Markov Model
4.4.3 A Phylogenetic Generalized Hidden Markov Model
5 Automated Gene Structure Annotation
5.2 Gene Structure Annotation Applications
5.2.1 Arabidopsis thaliana
5.2.3 Gene Structure Comparison
5.3 Human
5.3.1 Testing on the ENCODE Regions
5.3.2 Evaluation of Evidence Tracks
5.3.3 Performance Comparison
6 Alternative Exon Prediction Performance
List of Tables
The set of class labels that describe a local sequence interval used to construct gene models on the positive strand, denoted by the “+” symbol. The non-coding label applies to both strands. Labels reflect partial and complete exons. Each entry asserts whether the condition in that column must be true (1) or false (0). 10 additional class labels are used to represent strand specific labels on the negative strand.
Performance of the gene predictors on 1783 genes. SC = Statistical Combiner; SC-g = SC combining gene prediction programs only; LC2 = Linear Combiner using sequence alignments; LC1 = Linear Combiner using gene prediction programs only; GA = GlimmerM; GM = GeneMark.hmm; GS = Genscan+. The columns are: number of whole genes correctly predicted (Correct Gene); number of genes completely missed (Missed Gene); correctly predicted exons out of the 7510 total (Correct Exons); number of exons completely missed (ME); predicted exons that overlap a gene region but do not overlap a true exon (Inserted Exons); percentage of protein coding nucleotides correctly detected (Nucl Sn).
Breakdown of combiner predictions when matching exactly 3, 2, 1 or 0 gene prediction programs. The first column (Combiner) refers to the four combiners. The second column (# of GP) refers to the number of matching gene prediction programs. The third and fourth columns count the number of times the combiner prediction is correct (CG) and not entirely correct (WG). The fifth column is the percentage of correct predictions.
The number of gene models each gene finder exclusively predicts correctly in the test set.
... LC2-3 = LC2 using three gene prediction programs; LC1-3 = LC1 using three gene prediction programs; TS = Twinscan; GM2 = newer GlimmerM output. The three prediction programs used by SC-3, LC2-3 and LC1-3 are Twinscan, GeneMark.hmm and newer GlimmerM (GM2).
Performance comparison of JIGSAW and SC-5 (from Table 5.4).
JIGSAW performance in Oryza sativa. Sn (sensitivity) = percentage of test set correctly predicted. Sp (specificity) = percentage of predictions that are correct. Performance is measured on three criteria: Genes, Exons and Nucleotides (Nucl). All results are shown as percentages.
Gene structure comparison. Each entry contains two values “A/M” with A being the average and M being the median value. The Exon / Intron column is median exon length divided by median intron length.
JIGSAW using gene finders and non-Human EST data. Results show sensitivity (Sn) and specificity (Sp) measured on Genes, Exons and Nucleotides (Nucl). All results are shown as percentages.
Results of applying JIGSAW with all available evidence. *KnownGene predicts multiple transcripts per gene locus with a transcript specificity of 47%.
Comparison of EGASP prediction performance for exons and protein coding nucleotides among the different prediction methods. Sensitivity (Sn) and Specificity (Sp) are given. The F-score is shown for the nucleotide predictions.
EGASP prediction performance for Genes and gene transcripts (Gene Trans) measuring sensitivity (Sn) and specificity (Sp). The F-score is given for the Gene predictions. Transcript to Gene Ratio shows the number of transcripts ...
Percentage of D. melanogaster annotated di-nucleotides conserved in D. simulans (Dsim), D. yakuba (Dyak) and D. erecta (Dere). Di-nucleotides are separated according to splicing type: Acceptor (Acc) and Donor (donor) and splicing event type: alternative (Alt) or constitutive (Con). Pseudo splice sites are included for reference.
Percentage of D. melanogaster annotated exons missing at least one splice site in D. simulans (Dsim), D. yakuba (Dyak) and D. erecta (Dere). Percentages are organized by exon type: constitutive exons (CS), cassette exons (CE), exons with multiple splice sites (MS) and exons with intron retention (IR). The second number associated with the MS and IR rows is the percentage of exons where the non-conserved splice site is constitutive (used in all isoforms).
Results are shown for 8 versions of ExAlt using different combinations of informant species plus Genscan. The informant species are D. simulans (sim), D. yakuba (yak) and D. erecta (ere). ExAlt-ab initio uses no informant species. Sensitivity (Sn) and Specificity (Sp) are shown for Exons (both acceptor and donor are correct), Splice sites (an acceptor or donor site is correct) and protein-coding nucleotide predictions (Nucl).
... at most 1 exon predicted per test sequence (ExAlt-Frame-Single). Rows 6-8 show ExAlt performance using no gene structure information with default parameters (ExAlt-Default), no informant species (ExAlt-Default-ab initio) and at most 1 exon prediction per test sequence (ExAlt-Default-Single).
Exon prediction accuracy from Table 6.4 separated by exon splicing event.
ExAlt results on the initial training and testing set in percentages. Included next to each measurement is the difference in percentage points compared to performance in the held out set in Table 6.4.
List of Figures
1.1 A schematic of double stranded DNA. Each nucleotide is represented by an “L” shaped box and includes a 5 carbon sugar molecule (S), a phosphate residue (P) and one of four nitrogenous bases (a, c, t and g). Dashed lines indicate hydrogen bonds between two bases; c-g pairs form three hydrogen bonds and a-t pairs form two hydrogen bonds. The 5’ and 3’ labels denote the orientation ...
1.2 Example of protein coding gene structure. The gene contains two exons and one intron. The initial exon includes an untranslated region (5’ UTR) and the translated region. The terminal exon consists of a translated region ending with a stop codon followed by an untranslated region (3’ UTR). Transcription begins at the transcription start site, which is preceded by the promoter region.
1.3 Finite State Automaton for recognizing protein coding genes. For clarity, boxes are used to group a collection of nodes in the model. Transitions leaving a box indicate a transition leaving each respective node in the box. Transitions arriving at a box indicate transitions arriving at each respective node in the box. The major components of a protein coding gene are included in the FSA: start codon, stop codon, amino acid codons ($\mathrm{codon}_0$, $\mathrm{codon}_1$ and $\mathrm{codon}_2$), introns ($\mathrm{intron}_0$, $\mathrm{intron}_1$ and $\mathrm{intron}_2$) and intergenic sequence. Each state in the FSA is unique, and is labeled with the nucleotide (given in lower case) read when taking a transition into that state. Additional states can be added to recognize additional gene structure features, such as promoters and untranslated regions.
1.4 Generalized hidden Markov model of protein coding genes using the components described in Figure 1.3.
2.1 Gene structure evidence from the UCSC human genome annotation database (chromosome 20). Each row shows evidence generated from a distinct source.
2.2 The number of correct and incorrect (number in parentheses) whole gene model predictions shared among the three prediction programs: GlimmerM (GA), Genscan+ (GS) and GeneMark.hmm (GM) from a test set of 1783 genes. “Incorrect gene” refers to cases where all coding exons in the gene are in perfect agreement among the gene finders but not with the true gene.
2.3 Partitioned output from three evidence types: splice site predictions, gene predictions and sequence alignments. The five sources of evidence listed in order from top to bottom are: output from a splice site prediction program (SP); a gene prediction program (GP1) with exon confidence scores 0.9 and 0.89; a gene prediction program (GP2) with no confidence scores; 89% and 45% identity alignments from a protein database, which make up a single evidence source; and 32% and 20% identity alignments from an EST database. The genome sequence is divided into intervals $I_1, \ldots$ defined by each potential boundary $x_1, x_2, \ldots$ The predicted splice site at $x_5$ is associated with $I_5$.
2.4 An example of four overlapping candidate gene models G1-4. The exons are assumed to be part of the same reading frame. In this example, if the evidence only predicts G1 and G2, the Linear Combiner scores G3 or G4 if either model ...
2.5 Gene prediction in genomic sequence $S$ of length $N$ on overlapping subsequences $S_0$, $S_1$, $S_2$ and $S_3$. Each subsequence $S_i$ is of length $K$ and overlaps the adjacent subsequences $S_{i-1}$ and $S_{i+1}$ with length $V$. The last subsequence $S_3$ may be of ...
2.6 Merging overlapping gene predictions. The first two examples (Example 1 and Example 2) show cases where two overlapping exons are merged into a single gene. Example 3 shows an example where the two exons cannot be merged into a single gene list. (Exon labels are described in the text.)
3.1 Gene prediction model for predicting genes on both strands simultaneously.
3.2 Representation of four sources of gene structure evidence mapping to genome sequence $S$: two gene prediction programs (GP1 and GP2), a cDNA alignment with 86% identity to $S$ and an EST alignment with 95% identity to $S$. Examples of the six features, start (sta), stop (stp), coding (cod), intron (inr), donor (don) and acceptor (acc), encoded in feature vectors are shown. The predicted exon boundaries are $k_0, k_1, \ldots$
3.3 Example sequence parse with evidence. Sequence $S$ is partitioned into segments $t_0$, $t_1$ and $t_2$ with state assignments $q_0$, $q_1$ and $q_2$ respectively. $k$ marks a position in $S$ and the dashed box highlights the evidence overlapping the first interval from position $b_0$ to $e_0$.
3.4 Schematic of the JIGSAW training procedure. Feature vectors are collected from $m$ examples and separated according to each of the six gene feature types. Decision trees are induced for each of the separated training sets.
3.5 The plot at the top of the figure shows the accuracy of predictions based on alignments to non-human sequences that overlap a gene finder prediction. Each point is a pair of alignments observed in training and their percent identity to the genomic sequence. ’+’ points are classified as “accurate” and ’x’ points are classified as “inaccurate.” The two lines correspond to the internal nodes in the decision tree shown at the bottom of the figure.
4.1 Three forms of alternative splicing predicted by ExAlt: Intron Retention (IR), Cassette Exon (CE) and Multiple Splice sites (MS).
4.2 Exon splicing stages. (Image based on [98].) See text for a description of the ...
Model from Figure 4.6 expanded to include alternative splicing in terminal and initial exons and candidate single exon genes. Blue and beige states reflect possible protein coding exons (beige) or partial protein coding exons (blue) and represent three states for each state shown, one for each of the three coding phases. Special states “Beg” and “End” show the respective begin and end ...
Classification accuracy of the four sequence models: constitutive and alternative intron intervals and constitutive and alternative exon intervals are shown in respective order. P-values are respectively 0.05, 0.1, 0.1 and 0.05.
Histogram of donor (first row) and acceptor (second row) splice site scores using the WAM trained from examples of exons with constitutive splice sites (first column) and exons with alternative splice sites (second column). The y-axis shows relative frequency and the x-axis shows log-odds score. False Donors (Acceptors), colored red, is the distribution of scores for GT (AG) di-nucleotides presumed to be non donor (acceptor) sites based on the published annotation. Constitutive Donors (Acceptors), colored green, is the distribution of scores for donor sites with no alternative splice site. Alternative Donors (Acceptors), colored blue, is the distribution of scores for donor sites with alternative splice sites.
Phylogenetic tree used by ExAlt. Each branch $i$ has an associated branch length.
Overlapping donor signal windows.
Distribution of D. melanogaster / D. erecta pairwise percent identity values for protein coding exons categorized by constitutive exons (red) and alternatively spliced exons (green). Each point in the graph represents the percentage of cases (y-axis) with at least the percent identity specified by the value on the ...
Distribution of D. melanogaster / D. erecta pairwise percent identity values for introns divided into constitutive (red) and alternatively spliced (green).
Distribution of percent identity ratio scores for protein coding exons versus the adjacent intron in pairwise comparisons of D. simulans (red) and D. erecta (green).
Chapter 1
Introduction
A genome is the entire collection of genetic material found in each cell of an organism.
Genetic material consists of the genes and related DNA elements, which store information used to synthesize the macromolecules critical to cell life. Since the advent of high-throughput genome sequencing methods, vast quantities of genomic data have become publicly available for analysis. Two types of large scale sequencing projects are contributing to the availability of biological sequence data. One effort lies in the work to sequence whole genomes. Examples include the genomes of mammals such as Homo sapiens (human) [75, 173] and Mus musculus (mouse) [28], as well as many other species, including Drosophila melanogaster (fruit fly) [2], Plasmodium falciparum (the parasite causing malaria) [113] and hundreds of bacteria. The quality of genome sequence data ranges from nearly completely sequenced genomes, where every nucleotide in each chromosome is accounted for with very low error rates, to low coverage random shotgun sequencing, which yields short genomic fragments and retains a larger number of sequencing errors.
The second major effort contributing to new biological sequence data is the high-throughput sequencing of functional elements in the cell, in particular molecules transcribed from the genome. These gene “expression” studies provide important physical evidence of physiological function associated with specific regions in the genome. As with the genome sequence data, quality and coverage vary depending on the techniques used. Full length cDNA projects capture complete transcripts with minimal error rates, while “tagging” approaches capture parts of transcripts (potentially with more errors). Examples include Expressed Sequence Tags (ESTs) [1], Serial Analysis of Gene Expression (SAGE) [172] and Cap Analysis Gene Expression (CAGE) [154].
This thesis addresses the challenge of developing computational methods to predict precise protein coding gene structure using these two key data sources: genome sequence and gene expression sequence. Three new algorithms are introduced to address the gene structure prediction problem and are described in Chapters 2, 3 and 4 respectively. Chapter 5 presents the results and procedures for automated genome annotation using the gene prediction algorithms described in Chapter 2 and Chapter 3. Chapter 6 gives results for alternative splice site prediction using the algorithms described in Chapter 4. Chapter 7 contains concluding remarks and directions for future research. The remainder of this chapter provides an introduction to the problem of predicting gene structure, reports on previous work in the field and introduces the computational framework, which serves as a basis for the algorithms developed in the later chapters.
1.1 Background
A gene is the entire deoxyribonucleic acid (DNA) sequence required for synthesis of a functional protein or functional ribonucleic acid (RNA) molecule [98]. A “functional” molecule is defined to be a naturally occurring molecule that affects cell physiology [98]. The genes are contained within long double helical strands of DNA called chromosomes. A chromosome is a double stranded sequence of connected nucleotides, each comprised of a 5 carbon sugar, a phosphate residue and one of four nitrogenous bases attached to the sugar: Adenine (A), Guanine (G), Cytosine (C) and Thymine (T) [175]. A schematic is shown in Figure 1.1. Individual nucleotides are joined by phosphate bonds to make a single strand of DNA. Each base interacts primarily with just one of the other three bases: Adenine bonds with Thymine and Guanine bonds with Cytosine. Two DNA strands bind together according to these base pairing interactions to form the double helix structure of double stranded DNA. The double strand interaction is shown in Figure 1.1. Each strand of DNA has an orientation defined by the relative positions of the carbon atoms in the DNA’s sugar-phosphate backbone.
Trang 18transeription start site poly A signal"pyrimidine rich
promoter region \ start codon donor \ acospto stop pan |
tata \
transcription factor © S'UTR ‡ {branch ste 3'UTR
binding sites exon intron exon
Figure 1.2: Example of protein coding gene structure Gene contains two exons and oneintron The initial exon includes an untranslated region (5’ UTR) and the translated region .The terminal exon consists of a translated region ending with a stop codon followed by anuntranslated region (3’ UTR) Transcription begins at the transcription start site, which ispreceded by the promoter region
The carbon atoms are labeled 1’ - 5’ and the two strands of DNA that form the double helix are anti-parallel: one strand is oriented in the 3’ to 5’ direction and the other strand is oriented in the 5’ to 3’ direction. This reflects the relative positioning of the carbon atoms on the sugar molecule with respect to each strand. When referencing a location in a chromosome, the term “base pair” is used to refer to the two nucleotides, one from each strand, which pair together. For example, the size of the human genome is cited as approximately 3,000 million base pairs [75], with the actual genome being twice this number in terms of individual nucleotides.
Because of the fixed set of base pairing rules (A bonds with T and G bonds with C), one strand of DNA can be recovered by examining the contents of the other strand. Therefore, the genomic sequence is represented throughout this thesis using a single string of letters representing the nucleotides assembled in the 5’ to 3’ direction. The terms “positive” and “negative” strand are used to distinguish between the two strands when needed.
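The recovery of one strand from the other can be illustrated with a minimal Python sketch. The function name `reverse_complement` and the dictionary `COMPLEMENT` are illustrative choices, not part of the thesis; the sketch assumes an input string over the alphabet A, C, G, T written in the 5’ to 3’ direction.

```python
# Base pairing rules: A bonds with T and G bonds with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand):
    """Return the opposite strand, also written in the 5' to 3' direction.

    The strands are anti-parallel, so the complemented bases are read in
    reverse order.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand))


if __name__ == "__main__":
    positive = "ATGGCC"
    negative = reverse_complement(positive)
    print(positive, negative)
```

Applying the function twice returns the original string, which reflects the fact that either strand fully determines the other.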
Figure 1.2 shows the distinct components of protein coding gene structure located in the chromosome, which are responsible for facilitating gene transcription into messenger RNA (mRNA). The gene begins with a promoter region, which contains nucleotide sequence that binds to the transcription complex, marking the beginning of a transcribed gene region, which occurs on a single strand. The promoter region includes a TATA sequence and a conserved protein binding sequence motif found upstream. In addition to the proximal promoter, distal transcription ‘enhancer’ and ‘silencer’ elements can occur over 50 kilobases upstream of the transcription start site, and the presence or absence of transcription factors binding to these sites facilitates initiation of transcription. The exon intervals consist of codons, which are DNA triplets that each encode one of the 20 amino acids, which ultimately define the derived protein sequence. The transcription start site is upstream of the ATG “start” codon, which marks the initiation of translation from codons to amino acids, with initiation further facilitated by the surrounding consensus (Kozak) sequence. The exons in the gene are potentially interrupted by introns, which are removed from the final functional mRNA form. Each intron begins with a ‘donor’ di-nucleotide site, commonly GT, and ends with an ‘acceptor’ di-nucleotide site, commonly AG. A “stop” codon marks the termination of translation and is followed by an untranslated region. Upon termination of transcription, the 3’ end of the pre-mRNA is cleaved and a poly A tail (an Adenine nucleotide polymer) is attached.
The bottom image in Figure 1.2 shows the final processed mRNA, which includes the leader 5’ untranslated region (UTR) and the 3’ UTR. The processed mRNA is exported outside the nucleus for translation by a complex collection of molecules called the ribosome. The key component of translation that is of concern in predicting gene structure is identifying the correct “reading frame”. The processed mRNA shown in Figure 1.2 is translated from start codon to stop codon, which implies a single reading frame, and since the codons are nucleotide triplets, the length of the mRNA from start codon to stop codon should be a multiple of three. Note that when the gene is transcribed into mRNA, wherever the nucleotide in the gene specifies a Thymine, the equivalent ribonucleic acid used to form the nascent mRNA is Uracil (chemically similar to but distinct from Thymine). For clarity, Thymine is referenced throughout the thesis with the understanding that the base is always replaced by Uracil in RNA.
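The reading frame constraint and the Thymine to Uracil substitution can be made concrete with a short sketch. The helper names below are hypothetical, and the stop codon set (TAA, TAG, TGA) is the standard genetic code rather than something stated in the text above; the sketch assumes the input is the spliced coding sequence from start codon to stop codon.

```python
# Stop codons of the standard genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_codons(coding_dna):
    """Split a coding sequence into nucleotide triplets."""
    if len(coding_dna) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of three")
    return [coding_dna[i:i + 3] for i in range(0, len(coding_dna), 3)]


def is_single_reading_frame(coding_dna):
    """Check for ATG ... stop with no earlier in-frame stop codon."""
    if len(coding_dna) < 6 or len(coding_dna) % 3 != 0:
        return False
    triplets = split_codons(coding_dna)
    return (
        triplets[0] == "ATG"
        and triplets[-1] in STOP_CODONS
        and not any(t in STOP_CODONS for t in triplets[1:-1])
    )


def transcribe(coding_dna):
    """Write the sequence as RNA by replacing Thymine with Uracil."""
    return coding_dna.replace("T", "U")


if __name__ == "__main__":
    cds = "ATGGCCTGA"
    print(split_codons(cds), is_single_reading_frame(cds), transcribe(cds))
```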
Contents of a chromosome typically include non-protein coding RNA genes, which in some cases use the same transcription and splicing signals as protein coding genes, minus the codons used to assemble an amino acid sequence. While non-protein coding gene prediction remains an important and challenging problem, it is somewhat distinct from protein coding gene finding. The two ubiquitous non-protein coding genes, transfer RNA (tRNA) and ribosomal RNA (rRNA), are highly conserved in secondary structure among closely related species; thus computational approaches distinct from those used in eukaryotic protein coding gene prediction have been highly effective in finding these genes. An abundance of other non-coding RNA species are known, such as microRNA. As with tRNA and rRNA, microRNA gene finding searches for specific sequence conservation patterns [65]. For a review of non-protein coding gene prediction see [41].
This thesis deals specifically with predicting gene structure in protein coding genes, which remains a difficult problem to solve. Two specific aspects of protein coding gene structure prediction are addressed: single isoform gene structure prediction and alternative exon isoform prediction. The thesis claim is two-fold: first, using a diverse array of evidence sources yields more accurate predictions than single evidence prediction sources and second, explicit modeling of alternative splicing accurately identifies multiple overlapping functional genes. Classifier stacking with sequential inference is shown to be an effective method for improving accuracy in gene structure prediction.
The remainder of this chapter defines the single isoform gene structure prediction problem, which serves as a basis for problems addressed in later chapters, and gives background on previous work in the area. Two recent gene finding reviews provide an overview of progress in gene prediction over the last ten years [17, 18]. The purpose of the overview given here is to describe in more detail the specific computational formulation used in previous gene finding programs, which provides a sequential inference framework for the algorithms described in later chapters.
1.1.1 Problem Definition
A genomic sequence is a double stranded DNA molecule with genes potentially occurring on both strands. The single isoform computational protein coding gene finding problem is defined as follows: given an input genomic sequence, output a label for each nucleotide in the sequence describing its functional relationship to the protein coding gene structure (as shown in Figure 1.2). The problem is to predict the location of each gene in the input sequence and, for each gene, identify the start codon, stop codon, each amino acid codon, the splice sites, the introns and the intergenic sequence. This can be formulated as a structured classification problem, in which class labels are assigned to each nucleotide in the sequence, subject to the global constraints of protein coding gene structure. Programs that address this problem are referred to as computational ‘gene finders’. The term ab initio is used to refer to gene finders that rely exclusively on information extracted from a single genome, rather than information obtained from other sources.
1.2 Computational Framework for Gene Prediction
The earliest successful gene finding methods used the syntax of gene structure to label the DNA sequence [36]. This syntax is defined using a Finite State Automaton (FSA) to recognize valid DNA sequences that contain protein coding genes. The FSA in Figure 1.3 recognizes genes on a single DNA strand, beginning with a start codon, ending with a stop codon and possibly interrupted by one or more introns. In keeping with the convention of the gene finder programs covered in this review, only the exons, introns, intergenic sequence and the GT/AG consensus splice sites are shown. States can be added to recognize other features in the gene (e.g. promoter and polyadenylation signal sequence) as well as states to recognize genes on the opposite strand of a double stranded DNA molecule. States in Figure 1.3 are labeled with nucleotides given in lower case to maintain compactness in the figure, with the equivalent nucleotides referenced in the text as capital letters (for clarity).
The FSA is a 5-tuple $M_{FSA} = (Q, q_0, q_A, \Sigma, \delta)$:
• $Q$ is a set of states
• $q_0 \in Q$ is a unique initial state
• $q_A \in Q$ is the final (accepting) state
• $\Sigma = \{A, C, G, T\}$ is the alphabet (DNA bases)
• $\delta : Q \times \{\Sigma \cup \$\} \rightarrow Q$ is a transition from state $q'$ to $q$ reading nucleotide $n \in \Sigma$ from an input sequence
For notational convenience, all input sequences end with the end of string symbol $\$$, and all transitions leading to $q_A$ are defined as $q_A = \delta(q, \$)$ for all $q \in Q$. Furthermore, assume there is a transition from the start state $q_0$ to each state shown in Figure 1.3 and a transition from each state in Figure 1.3 to the accepting state $q_A$. This definition allows for the recognition of partial genes, which can occur when analyzing a short genomic contig that contains only part of a gene. Given an input genomic sequence $S$, the goal is to find a series of transitions from the initial state $q_0$ to the accepting state $q_A$, where each nucleotide in $S$ is read by a transition. If such a series of transitions exists, $S$ is recognized by the FSA. More formally, let $S[i]$ be the nucleotide in $S$ at position $i$, with transition $q_i = \delta(q_{i-1}, S[i-1])$ taken from state $q_{i-1}$ to $q_i$ while reading nucleotide $S[i-1]$ ($i \geq 1$). For a sequence of length $L$ ($L-1$ nucleotides plus the end of sequence symbol $\$$), the sequence parse is the sequence of states $\phi = q_0, q_1, q_2, \ldots, q_{L-1}, q_A$ with transitions $q_1 = \delta(q_0, S[0])$, $q_2 = \delta(q_1, S[1])$, $\ldots$, $q_A = \delta(q_{L-1}, \$)$ reading input $S = S[0], S[1], \ldots, S[L-2], \$$.
Different portions of the model in Figure 1.3 represent different parts of gene structure. Three codon types are labeled in Figure 1.3, $\mathrm{codon}_0$, $\mathrm{codon}_1$ and $\mathrm{codon}_2$, which represent the relative position of the codon when interrupted by an intron. Since the codon is a nucleotide triplet, when the first base of an exon following an intron is the first position of the triplet, the exon is said to be in phase 0. When that base is the second nucleotide in the triplet it is in phase 1, and when the first base of the exon is the last nucleotide of the codon triplet, it is in phase 2. Note that the start codon by definition always begins in phase 0. In general, the term phase is defined to be the index (0, 1 or 2) into a codon.
Consider the example of deciding whether sequence
S = ATGGCCTGTATAACTAGAACTCTAG$
is valid input for the single isoform gene structure prediction problem. One valid parse for S is to transition from the start state $q_0$ to the start codon state labeled A followed by the states labeled T and G in Figure 1.2 and take the transition into the states in $codon_0$ (labeled G, C and C), to $codon_1$ (labeled T), into the intron (labeled G, T, A, T, A, A, C, T, A and G), and transition to the rest of the codon (labeled A and A), to $codon_0$ (labeled C, T and C), to the stop codon (labeled T, A, G). It is also possible for the sequence to be accepted by taking some other series of transitions in the FSA. The sequence S could be an intron and a valid sequence of transitions can be taken using the intron associated states (labeled $intron_0$, $intron_1$ and $intron_2$). Therefore, the model is ambiguous since multiple parses can be used to read the same sequence.
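The gene parse described for the example sequence can be checked mechanically. The following is a minimal Python sketch, not part of the original framework: the helper name `splice_example` and the 0-based intron coordinates are assumptions read off the parse given in the text (intron labeled G, T, A, T, A, A, C, T, A, G).

```python
# Example sequence from the text (end-of-string symbol $ omitted).
S = "ATGGCCTGTATAACTAGAACTCTAG"

def splice_example(seq, intron_start, intron_end):
    """Remove one intron given 0-based, end-exclusive coordinates.

    Returns the spliced coding sequence, the intron, and the phase
    (index 0, 1 or 2 into a codon) of the first base after the intron.
    """
    exon1 = seq[:intron_start]
    intron = seq[intron_start:intron_end]
    exon2 = seq[intron_end:]
    phase = len(exon1) % 3  # codon index of the first base of exon2
    return exon1 + exon2, intron, phase

# Intron coordinates are an assumption taken from the parse in the text.
cds, intron, phase = splice_example(S, 7, 17)
codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
print(codons, intron, phase)
```

Under these assumed coordinates the intron carries the GT/AG consensus and the exon following the intron begins in phase 1, since the first exon contributes seven coding bases.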
To resolve the ambiguity in the model, the earliest gene finding approaches, which formed the basis for more recent gene finding programs, searched for the most probable parse using statistics collected from training sets. Examples include the programs HMMgene [89] and VEIL [66]. Given input sequence $S$, the goal is to find the optimal parse: $\phi' = \arg\max_{\phi} P(\phi, S)$. The joint probability of the parse and the sequence is given by the equation $P(\phi, S) = P(S|\phi) \times P(\phi)$. The parse $\phi$ that maximizes the probability of generating $S$ is
$$\phi' = \arg\max_{\phi} P(S|\phi) \times P(\phi).$$
$P(\phi)$ is a parameter for modeling a priori knowledge of the model structure. $P(S|\phi)$ is the sequence modeling parameter, which defines the probability of the model emitting sequence $S$. In general, it is not practical to estimate parameters to the model $P(S|\phi) \times P(\phi)$ since the parameter space is exponential in size with respect to model structure, $|Q|^L$ ($|Q|$ being the number of states in the model), and sequence, $4^L$. The joint probability of the parse and the sequence is approximated using a hidden Markov model (HMM) to assume each nucleotide is dependent only on the current state and the previous state. Thus, we can re-formulate the previous model $M_{FSA}$ shown in Figure 1.3 to predict sequence using statistics compiled from a training set of previously known protein coding genes.
The transition function $\delta$ from $M_{FSA}$ is replaced by parameters to predict the probability of generating nucleotide $n$ taking transition from state $q'$ to $q$. In the previously defined FSA each transition reads exactly one type of nucleotide. For example, in Figure 1.3 the transition within the start codon from state “a” to state “g” reads nucleotide G. In the new model, a transition from state $q'$ to state $q$ outputs all values $n \in \Sigma$ with probability $P(n|q)$ where $\Sigma = \{A, C, G, T, \$\}$.
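The model $M_{FSA}$ is redefined to be the 5-tuple $M_H = (Q, q_0, q_A, \Sigma, \theta)$, which is a hidden Markov model (HMM), similar to the FSA but replacing the transition function $\delta$ with transition and output probability parameters $\theta$. For a given state transition sequence the probability of the parse is
$$P(\phi, S) = \prod_{k=1}^{L} P(n_k|q_k) \times P(q_k|q_{k-1}).$$
The most probable parse is found with the Viterbi algorithm, a dynamic programming procedure that fills a matrix $D$, choosing for each position $k$ and state $q$ the previous state $q'$ to maximize the following:
$$D(k, q) = \max_{q'} \{D(k-1, q') \times P(n_k|q) \times P(q|q')\}.$$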
Each cell in the matrix $D$ keeps a pointer to the previous state which led to maximizing the score in the current state. The optimal parse is recovered by tracing back through the links beginning at state $q_A$. The runtime cost is $O(L \times |Q|^2)$ where $L$ is the sequence length and $|Q|$ is the number of states in $Q$. Looking at each nucleotide independently of the originating sequence, $P(n_k|q)$ acts as a classifier with $q^* = \arg\max_q P(n_k|q)$ being the classification assigned to the fixed nucleotide $n_k$, enumerating over all possible state labels $q$. Classification decisions are only made once the dynamic programming matrix is filled and the highest scoring parse is recovered.
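The dynamic programming recurrence over the matrix D can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation used by any of the cited programs: it works in log space, replaces the transitions out of the start state with an assumed initial distribution `start_p`, and uses a hypothetical two-state toy model in place of the gene model of Figure 1.3. All probabilities are assumed to be non-zero.

```python
import math

def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return the most probable state path for seq under a first-order HMM.

    D[k][q] is the log probability of the best parse of seq[:k+1]
    that ends in state q; back[k][q] is the pointer to the previous state.
    """
    D = [{q: math.log(start_p[q]) + math.log(emit_p[q][seq[0]]) for q in states}]
    back = [{}]
    for k in range(1, len(seq)):
        D.append({})
        back.append({})
        for q in states:
            # Best previous state q' for reaching q at position k.
            prev = max(states, key=lambda qp: D[k - 1][qp] + math.log(trans_p[qp][q]))
            D[k][q] = (D[k - 1][prev] + math.log(trans_p[prev][q])
                       + math.log(emit_p[q][seq[k]]))
            back[k][q] = prev
    # Trace back through the pointers from the best final state.
    path = [max(states, key=lambda q: D[-1][q])]
    for k in range(len(seq) - 1, 0, -1):
        path.append(back[k][path[-1]])
    return path[::-1]

# Hypothetical toy model (GC-rich vs AT-rich regions), not a gene model.
states = ["gc", "at"]
start_p = {"gc": 0.5, "at": 0.5}
trans_p = {"gc": {"gc": 0.9, "at": 0.1}, "at": {"gc": 0.1, "at": 0.9}}
emit_p = {"gc": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
          "at": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
path = viterbi("GGGGAAAA", states, start_p, trans_p, emit_p)
print(path)
```

The two nested loops over states inside the loop over positions give the $O(L \times |Q|^2)$ cost stated in the text.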
Figure 1.4: Generalized hidden Markov model of protein coding genes using the components described in Figure 1.3.
1.2.1 Generalized Hidden Markov Models
Improvements were made to statistical gene finders by observing that exon lengths are not geometrically distributed, a fact implicitly assumed in the hidden Markov model [22, 92]. An explicit exon length probability distribution is modeled using a generalized hidden Markov model (GHMM) (also referred to as a semi hidden Markov model [23] or an explicit duration hidden Markov model [141]). Examples of this approach were first introduced in the programs Genscan [22] and Genie [91]. The GHMM model $M_{H'}$ equivalent to the hidden Markov model $M_H$ is shown in Figure 1.4. Figure 1.4 expands on the previous HMM, allowing states to output more than one nucleotide rather than a single nucleotide. Sequence “signals” are defined to be fixed length features in the sequence and refer to the start codon, stop codon and the consensus splice sites. Each state is an interval between signals in the sequence. For example, the “Initial” state is an exon, which begins with a start codon (ATG) and ends just upstream of the donor site (a consensus GT signal). The equivalent in Figure 1.2 is to begin in the start codon (ATG) and transition into one of the “codon” states and end in one of the three intron states. The intron state depends on whether the exon ends precisely at a codon boundary or whether the codon continues into the next exon. This is represented in Figure 1.4 by the three transitions into states: Intron 1, Intron 2 and Intron 3. The equivalent parse for sequence S = ATGGCCTGTATAACTAGAACTCTAG$ using the previous example, but with the GHMM model, is $q_0$, $q_1$ = Initial, $q_2$ = Intron 3, $q_3$ = Terminal, $q_4 = \$$. The alternative parse labeling the sequence S as an intron sequence is now represented with one of the three “Intron” states.
State $q$ previously output a single nucleotide; now let the state output a string of nucleotides $y = y_0, \ldots, y_{l(y)-1}$ where $l(y)$ is the number of nucleotides. More formally, let $y = S[j, k]$ denote the subsequence in $S$ from position $j$ to $k$ inclusive and $y_i = S[a_{i-1}+1, a_i]$ be the subsequence in $S$ output after taking transition $q_{i-1}$ to $q_i$, where $a_i = a_{i-1} + l(y_i)$ and $a_0 = 0$ ($i \geq 1$). The model is defined $M_{H'} = (Q, q_0, q_A, \Sigma, R, O, L)$ where
• $R$ is the set of transition probabilities, giving the probability $P_R(q|q')$ of taking transition $q'$ to $q$
• $O$ is the set of output probabilities $P_O(y|q)$
• $L$ is the set of length probabilities $P_L(l(y))$
For input $S = y_1, y_2, \ldots, y_n$ the probability of the parse is
$$P(S = y_1, \ldots, y_n | \phi = q_0, q_1, \ldots, q_{n-1}, q_A) = \prod_{i=1}^{n} P_O(y_i|q_i) \times P_R(q_i|q_{i-1}) \times P_L(l(y_i)) \qquad (1.1)$$
where $N = \sum_{i=1}^{n} l(y_i)$. Each state outputs a sequence of explicit length $l(y_i)$ with each $y_i$ defined to be a non-overlapping subsequence in $S$. If the output lengths are fixed, the model can be converted into the previously defined HMM. Notice that there is still an implied geometric distribution in the model; however, it now applies to modeling the number of exons per gene in a multi-exon gene.
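Equation 1.1 can be evaluated directly for a fixed parse. The sketch below is a hedged illustration only: the function name `ghmm_parse_log_prob`, the per-state length functions, the uniform emission model and all numeric values are assumptions made for the example, not parameters from any gene finder discussed here.

```python
import math

def ghmm_parse_log_prob(parse, trans_p, length_p, emit_logp, start="q0"):
    """Log of Equation 1.1 for one parse.

    parse is an ordered list of (state, subsequence) pairs. Each factor is
    P_O(y_i | q_i) * P_R(q_i | q_{i-1}) * P_L(l(y_i)).
    """
    total = 0.0
    prev = start
    for state, y in parse:
        total += emit_logp(state, y)                  # log P_O(y_i | q_i)
        total += math.log(trans_p[prev][state])       # log P_R(q_i | q_{i-1})
        total += math.log(length_p[state](len(y)))    # log P_L(l(y_i))
        prev = state
    return total

# Hypothetical parameters: uniform emissions and uniform lengths on 1..50.
def uniform_emit(state, y):
    return len(y) * math.log(0.25)

uniform_len = lambda n: 0.02
trans_p = {"q0": {"Initial": 1.0},
           "Initial": {"Intron 3": 0.5},
           "Intron 3": {"Terminal": 1.0}}
length_p = {"Initial": uniform_len, "Intron 3": uniform_len, "Terminal": uniform_len}

# GHMM parse of the running example sequence from the text.
parse = [("Initial", "ATGGCCT"), ("Intron 3", "GTATAACTAG"), ("Terminal", "AACTCTAG")]
lp = ghmm_parse_log_prob(parse, trans_p, length_p, emit_logp=uniform_emit)
print(lp)
```

Keeping the three factors as separate callables mirrors the separation of $P_O$, $P_R$ and $P_L$ in the equation, so an explicit exon length distribution can be swapped in without touching the emission model.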
An analog to the Viterbi algorithm, which will be described in more detail in Chapter 2 and Chapter 3, is used to find the most probable sequence parse. The drawback to this approach is that the inference algorithm is slower than the Viterbi algorithm. When naively
implemented, the runtime cost is $O(N^2 \cdot |Q|^2)$ where $N$ is the length of the sequence and $|Q|$ is the number of states in the model. In the earliest days of genomic sequencing projects it was not uncommon to handle thousands of short sequences, each of which may contain at most one gene or even only part of a gene. With improvements to the sequencing technology and the development of assembly algorithms [133], it is now common to have access to sequences which range in length from 10,000 bases to millions of bases. To improve the efficiency of the inference algorithm, the intergenic states and the intron states are assumed to have lengths which are geometrically distributed [23, 92].
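The geometric length assumption follows from a state with a self-transition: staying with probability p for each additional base gives $P(l) = p^{l-1}(1-p)$. The short sketch below illustrates this; the value of `p_stay` is a hypothetical choice, not an estimate from any genome.

```python
def geometric_length_pmf(length, p_stay):
    """P(length) for a state with self-transition probability p_stay."""
    return (p_stay ** (length - 1)) * (1.0 - p_stay)

p_stay = 0.99  # hypothetical self-transition probability
lengths = range(1, 5001)
pmf = [geometric_length_pmf(n, p_stay) for n in lengths]
mean_length = sum(n * p for n, p in zip(lengths, pmf))
print(sum(pmf), mean_length)
```

The distribution is monotonically decreasing, so the most probable length is always a single base, which is why it is a poor fit for exon lengths yet an acceptable approximation for long intron and intergenic intervals.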
More recent work has shown that, in fact, modeling intron length can improve prediction performance. This approach was implemented in the program Augustus [164] and also incorporated in the gene prediction program Twinscan [166]. The efficiency trade off is handled by defining length cutoffs, in which a geometric length model is used for very long intron sequences. The GHMM formalism has been widely used among many gene finders including SNAP [86], GlimmerHMM [103], GeneZilla [103], Phat [26], Fgenesh [146] and GeneMark.hmm [102], with each program implementing different statistical sequence modeling methods. In Equation 1.1 the details of sequence modeling in the GHMM are abstracted in the term $P(y_i|q_i)$. Useful sequence modeling techniques currently used are briefly reviewed.
Not all gene finders use hidden Markov models; notable exceptions are GeneID [59] and GlimmerM [130], which use a dynamic programming based inference algorithm that serves as the basis for the algorithm described in more detail in Chapter 2. Both programs still use the statistical sequence modeling methods described in the next section. The principal difference is the lack of a formal graphical model for prediction and the lack of state transition probabilities.
1.2.2 Statistical Sequence Modeling
Sequence emission probabilities are divided into two categories, the “signals”, which refer to the start codon, stop codon and consensus splice sites, and “content sensors”, which model the variable length portions of the sequence and include the exon, intron and intergenic intervals [106]. In the simplest form, each nucleotide is emitted from the model with the first order dependencies encoded in the hidden Markov model shown in Figure 1.3. Using the generalized hidden Markov model, the statistical sequence model can capture higher order dependencies of the form:
$$P(y_i|q_i) = \prod_{m=j}^{k} P(S[m] | S[j, m-1])$$
where $y_i$ is a subsequence in $S$ from $j$ to $k$ and the probability of emitting nucleotide $S[m]$ is dependent on the previous nucleotide subsequence $S[j, m-1]$. Since the parameter space is exponential with respect to sequence length, gene finding software uses fixed order Markov models. In principle, higher order dependencies can be incorporated into the hidden Markov model framework shown in Figure 1.3 by adding states; however, with the increased number of states comes an increase in running time and memory requirements. In practice, gene finding programs such as Genscan and Genie use 5th order 3-periodic inhomogeneous Markov models for the protein coding intervals. If state $q_i$ represents a coding exon, which in Figure 1.4 is one of the six states: Internal 1, Internal 2, Internal 3, Initial, Terminal or Single, the probability of the exon interval in $S$ from index $j$ to $k$ is:
$$\prod_{m=j,\; r = m \bmod 3}^{k} P^r(S[m] | S[m-1], S[m-2], \ldots, S[m-o])$$
where $r$ refers to the codon phase, which is the relative position (0, 1 or 2) in the codon, and $o$ is the order of the Markov model. Higher order models are at least as accurate as a lower order model when there is sufficient data to accurately estimate the parameters of the model [147]. With newly sequenced genomes and limited training sets, the availability of large training sets cannot be assumed.
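A 3-periodic Markov model keeps one conditional distribution per codon phase. The following Python sketch shows one way to estimate and apply such a model; it is an assumption-laden illustration, not the Genscan or Genie implementation. The function names, the add-one pseudocount, the toy training sequences and the use of order 2 (rather than the 5th order models mentioned above) are all choices made for brevity, and training sequences are assumed to start in phase 0.

```python
import math
from collections import defaultdict

BASES = "ACGT"

def train_periodic_markov(coding_seqs, order, pseudocount=1.0):
    """Estimate P^r(base | previous `order` bases) for phase r = m mod 3."""
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(3)]
    for seq in coding_seqs:
        for m in range(order, len(seq)):
            counts[m % 3][seq[m - order:m]][seq[m]] += 1.0

    def prob(r, context, base):
        ctx = counts[r][context]
        total = sum(ctx[b] for b in BASES) + 4 * pseudocount
        return (ctx[base] + pseudocount) / total

    return prob

def log_prob_exon(seq, prob, order):
    """Sum of log P^r(S[m] | context) over m; the first `order` bases are skipped."""
    return sum(math.log(prob(m % 3, seq[m - order:m], seq[m]))
               for m in range(order, len(seq)))

# Hypothetical in-frame training sequences.
training = ["ATGGCCGCCGCCTAA", "ATGGCGGCCGCGTAG"]
prob = train_periodic_markov(training, order=2)
score = log_prob_exon("ATGGCCGCCTAA", prob, order=2)
print(prob(2, "GC", "C"), score)
```

With only two training sequences most contexts are unseen and fall back to the pseudocount, which is a small-scale view of the sparse data problem raised at the end of the paragraph above.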
One approach to get around this problem is interpolated Markov models (IMMs), which have been used in the gene finders GlimmerM [147], GlimmerHMM [103], GeneZilla [103] and EuGene [152]. GlimmerM, for example, uses 9 different 3-periodic Markov models labeled 0 through 8, where the number denotes the order of the model (e.g. 1st, 2nd order Markov chains). The goal is to use the highest order model whenever possible. Since in many cases the amount of training data is limited, the probability estimates for the higher order models are less reliable. In these cases the IMM looks at the lower order models to compensate for the lack of training data. A distinct IMM is defined for each coding phase $r$. The choice of models for a given phase is determined by an interpolation of the probabilities of Markov models from order 0 to 8 defined recursively as:
$$IMM^r(S, m, k) = w(S, m-1, k) \times P^r(S[m] | S[m-1], \ldots, S[m-k]) + (1 - w(S, m-1, k)) \times IMM^r(S, m, k-1)$$
with mono-nucleotide frequencies used when $k = 0$. The weighting function $w(S, m, k)$ compares the nucleic acid distribution of the $k$th order Markov model and the $(k-1)$th order Markov model using a $\chi^2$ test. If the function $c(S[x, y])$ returns the number of times the substring $S[x, y]$ is observed in a training set and $\cdot$ is the string concatenation operator, then
$$v = \sum_{b \in \{A,C,G,T\}} \frac{(c(b \cdot S[m-1-k, m]) - c(b \cdot S[m-2-k, m]))^2}{c(b \cdot S[m-2-k, m])}$$
When the higher order model does not significantly differ from the lower order model by a fixed threshold confidence value, the function $w(S, m, k)$ returns 1, otherwise
$$1 - (c/400) \sum_{b \in \{A,C,G,T\}} c(S[m-1-k, m-1] \cdot b)$$
is returned where $c$ is the $\chi^2$ confidence value.
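The recursive interpolation can be sketched independently of how the weights are computed. In the Python sketch below the weighting function and the fixed-order estimates are passed in as callables; `toy_prob` and the constant weights are hypothetical stand-ins, and the coding phase superscript is omitted for brevity. It illustrates only the recursion, not the $\chi^2$ based weighting itself.

```python
def imm_prob(S, m, k, prob_k, weight):
    """IMM(S, m, k) = w * P_k(S[m] | k previous bases) + (1 - w) * IMM(S, m, k - 1).

    prob_k(k, context, base) is a k-th order estimate and weight(S, m, k)
    is the interpolation weight; both are supplied by the caller.
    Requires m >= k so that the context lies inside S.
    """
    if k == 0:
        return prob_k(0, "", S[m])  # mono-nucleotide frequency
    w = weight(S, m - 1, k)
    context = S[m - k:m]
    return (w * prob_k(k, context, S[m])
            + (1.0 - w) * imm_prob(S, m, k - 1, prob_k, weight))

# Hypothetical estimates: 0.25 at order 0 and 0.7 at every higher order.
def toy_prob(k, context, base):
    return 0.25 if k == 0 else 0.7

trust_high = imm_prob("ACGTAC", 4, 2, toy_prob, lambda S, m, k: 1.0)
trust_low = imm_prob("ACGTAC", 4, 2, toy_prob, lambda S, m, k: 0.0)
mixed = imm_prob("ACGTAC", 4, 2, toy_prob, lambda S, m, k: 0.5)
print(trust_high, trust_low, mixed)
```

A weight of 1 at every order reproduces the highest order estimate, and a weight of 0 at every order falls through to the mono-nucleotide frequency.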
Intergenic and intron sequences do not display the same coding phase pattern found in exons, and therefore it is more appropriate to use fixed-order homogeneous Markov models.
The sequence signals, “start codon”, “stop codon” and splice sites, are typically modeled as first order inhomogeneous models, using a fixed sequence window. These are also called conditional probability matrices [148] or weight array models [189], which take the form of
$$P^{0}(S[k]) \prod_{i=k+1}^{k+w} P^{i-k}(S[i] | S[i-1])$$
where $w$ is the length of the sequence window and the term $i-k$ refers to the relative position in the window. Note that 0th order models are referred to as position weight matrices (PWMs).
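A weight array model can be estimated from a set of aligned signal windows by counting adjacent base pairs at each window position. The sketch below is a minimal illustration under stated assumptions: the function names, the add-one pseudocount and the four toy donor-site windows are hypothetical and do not come from any of the cited programs.

```python
import math
from collections import Counter

BASES = "ACGT"

def train_wam(windows, pseudocount=1.0):
    """Estimate first-order, position-specific probabilities from aligned windows."""
    width = len(windows[0])
    first = Counter(w[0] for w in windows)
    # pairs[i - 1] counts (previous base, base) at window position i >= 1.
    pairs = [Counter((w[i - 1], w[i]) for w in windows) for i in range(1, width)]

    def p_first(b):
        return (first[b] + pseudocount) / (len(windows) + 4 * pseudocount)

    def p_cond(i, prev, b):
        total = sum(pairs[i - 1][(prev, x)] for x in BASES)
        return (pairs[i - 1][(prev, b)] + pseudocount) / (total + 4 * pseudocount)

    return p_first, p_cond

def wam_log_score(window, p_first, p_cond):
    """Log probability of a window under the weight array model."""
    score = math.log(p_first(window[0]))
    for i in range(1, len(window)):
        score += math.log(p_cond(i, window[i - 1], window[i]))
    return score

# Hypothetical aligned donor-site windows.
donors = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA"]
p_first, p_cond = train_wam(donors)
consensus_score = wam_log_score("AGGTAAGT", p_first, p_cond)
background_score = wam_log_score("TTTTTTTT", p_first, p_cond)
print(consensus_score, background_score)
```

Dropping the dependence on the previous base in `p_cond` would reduce this to a position weight matrix, the 0th order case noted above.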
It has been shown that local dependencies cannot be assumed to accurately model certain features in the sequence [24]. For example, the base pairing of small nuclear RNA to the donor splice site (the splicing process is reviewed in Chapter 4) can be compensatory, so that if mutations occur in one region limiting the strength of the chemical bonds, compensatory base pairing is observed in another location [22]. Other “semi-local” effects are observed, for example modeling codons but ignoring the third position of the codon when it does not impact the encoded amino acid. These ideas are implemented in the gene finder GLIMMER [33] as interpolated context models (ICMs) and as maximal dependence decomposition (MDD) trees in Genscan. The basic idea is to define a model for each relative position of interest. For example, in the 3-periodic Markov models, three models are defined, one for each position in the codon. In the conditional probability matrices (or WAMs), each column in the matrix constitutes a separate model. The version implemented in Genscan [23] and used as part of the splice site prediction program GeneSplicer [129] is a binary decision tree. A path from the root to a leaf defines a context, where each node is a nucleotide occurring at a position in the sequence. A leaf in the tree contains all training examples that have nucleotides at the positions specified at the nodes in the path from the root to the leaf. Rather than the traditional $k$th order Markov model $P^r(S[i] | S[i-1], \ldots, S[i-1-k])$ the result is $P^r(S[i] | T(S, i-1, k))$, where the function $T$ is defined by the decision tree and the probability of the base at position $i$ is dependent potentially on the identity of bases at non-adjacent locations within the fixed window.
Let $X_i$ be the discrete random variable taking on the value of one of the four nucleotides (A, C, G, T) with some probability at position $i$ in the window. Intuitively, the goal is to measure the degree of statistical independence between each position $i$ and position $j$ (for $i \neq j$) in the sequence, which can be done using the $\chi^2$ test comparing the distributions $X_i$ and $X_j$. The interpolated context models (ICMs) used in GLIMMER [33] use a 4-ary tree rather than a binary tree. The ICMs model codons rather than splice signals, meaning there is greater variation in nucleotide type at each position. In the case of a donor site model, for example, where the subsequence forms a base-pairing with the snRNA U1 (see Chapter 4 for further description of the splicing process), there is usually one base at each position that occurs far more frequently than the others. In this case it makes sense to divide the data according to those examples that match the consensus base and those that do not, rather than storing a separate entry for all four nucleotide types. A second difference between the MDD and ICM approach is in the selection criteria for partitioning the data. The ICM model uses mutual information, rather than the $\chi^2$ statistic. There are three stopping criteria for the tree building procedure: 1) the distribution of nucleotide values at position $j$ appears to be independent of $i$, 2) the depth of the tree is $j$ and 3) the size of the remaining training set falls below a fixed threshold (to prevent unreliable probability estimates). Pseudo code for the tree building procedure is shown in Algorithm 1. Each leaf in the tree includes a probability in the form of a position weight matrix, which estimates the probability distribution of each nucleotide at each position in the window using a 0th order Markov model. Alternate tree construction criteria have been tested and used in the gene finding program SLAM; for a review see [193].
Algorithm 1: Tree building procedure for maximal dependence decomposition. The “partition” function divides the training data into two sets $D_1$ and $D_2$, with $D_1$ containing the examples with the consensus nucleotide $C_{i'}$ at position $i'$, and set $D_2$ containing the remaining examples.
build-tree(D)
    i' = argmax_i Σ_{j≠i} χ²(C_i, X_j)
    (D_1, D_2) = partition(D, C_{i'})
    if not valid stopping criteria on D_1 then
        build-tree(D_1)
    if not valid stopping criteria on D_2 then
        build-tree(D_2)
    return (D_1, D_2)
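The tree building procedure of Algorithm 1 can be made concrete with a short Python sketch. This is an illustration under simplifying assumptions, not the Genscan or GeneSplicer implementation: the function names are hypothetical, only the depth and training set size stopping criteria are applied (the independence test is omitted), and the toy training set is invented for the example.

```python
from collections import Counter

BASES = "ACGT"

def consensus(D, i):
    """Most frequent base at position i of the aligned examples D."""
    return Counter(s[i] for s in D).most_common(1)[0][0]

def chi_square(D, i, j):
    """Chi-square statistic between C_i (consensus match at i) and X_j (base at j)."""
    c = consensus(D, i)
    table = Counter((s[i] == c, s[j]) for s in D)
    n = len(D)
    stat = 0.0
    for match in (True, False):
        row = sum(table[(match, b)] for b in BASES)
        for b in BASES:
            col = table[(True, b)] + table[(False, b)]
            expected = row * col / n
            if expected > 0:
                stat += (table[(match, b)] - expected) ** 2 / expected
    return stat

def best_split(D):
    """Position i' maximizing the sum over j != i of chi_square(C_i, X_j)."""
    width = len(D[0])
    return max(range(width),
               key=lambda i: sum(chi_square(D, i, j) for j in range(width) if j != i))

def partition(D, i):
    c = consensus(D, i)
    return [s for s in D if s[i] == c], [s for s in D if s[i] != c]

def build_tree(D, depth=0, max_depth=2, min_size=4):
    """Recursively split D; each leaf holds the training subset for one context."""
    if depth >= max_depth or len(D) < min_size:
        return {"leaf": D}
    i = best_split(D)
    D1, D2 = partition(D, i)
    if not D2:  # every example matches the consensus: nothing to separate
        return {"leaf": D}
    return {"position": i, "consensus": consensus(D, i),
            "match": build_tree(D1, depth + 1, max_depth, min_size),
            "other": build_tree(D2, depth + 1, max_depth, min_size)}

# Hypothetical training set: positions 0 and 1 are perfectly correlated.
D = ["AAG"] * 3 + ["CCG"] * 2
tree = build_tree(D)
print(tree)
```

In a full implementation each leaf would be summarized by a position weight matrix estimated from its subset, as described in the text.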
1.2.3 Parameter Estimation
There have been some attempts to use unsupervised learning methods to estimate parameters of hidden Markov models using the Baum-Welch algorithm [66]. However, given the availability of labeled examples of gene structure, essentially all of the published gene finding methods make use of supervised learning. A maximum likelihood approach is used to iterate over a labeled training set. Example genes are taken from initial expression studies or a homologue to a sequenced gene of a closely related species. Problems, however, can still arise when analyzing novel genomes with limited amounts of labeled training data. Two studies show how bootstrapping methods can provide a useful complementary approach to supervised learning. The first bootstrapping method was introduced in the gene finder SNAP [86], which begins with model parameters estimated from labeled training data from another organism. A second bootstrapping method was introduced in the latest version of GeneMark.hmm [101], which initializes model parameters to default initial values. Once initial parameters are set, both programs predict genes in the genome using an inference algorithm. At each iteration, the newly predicted gene set is taken to be the labeled training set and used to estimate a new set of parameters. Once the newly predicted gene set matches the previous training set within some fixed threshold, the training procedure terminates and outputs the gene prediction model with the current parameter estimates.
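The bootstrapping loop shared by both programs can be written generically. The sketch below is not the SNAP or GeneMark.hmm procedure: the function `bootstrap_train`, the use of Jaccard overlap as the matching criterion, and the toy `predict` and `estimate` functions (which threshold hypothetical per-window GC fractions) are all assumptions made to keep the example self-contained.

```python
def bootstrap_train(genome, params, predict, estimate, threshold=0.95, max_iter=20):
    """Iterate predict -> re-estimate until the predicted gene set stabilizes.

    predict(genome, params) returns an iterable of predicted genes;
    estimate(genes) returns new parameters. Both are caller-supplied.
    """
    previous = set()
    for iteration in range(1, max_iter + 1):
        genes = set(predict(genome, params))
        union = genes | previous
        # Jaccard overlap between successive gene sets (an assumed criterion).
        overlap = len(genes & previous) / len(union) if union else 1.0
        if overlap >= threshold:
            return params, genes, iteration
        params = estimate(genes)  # predicted genes become the training set
        previous = genes
    return params, previous, max_iter

# Hypothetical toy problem: "genes" are windows whose GC fraction passes a cutoff.
genome = [0.2, 0.3, 0.6, 0.7, 0.8]

def predict(genome, cutoff):
    return {i for i, gc in enumerate(genome) if gc >= cutoff}

def estimate(genes):
    return sum(genome[i] for i in genes) / len(genes) - 0.15

params, genes, iterations = bootstrap_train(genome, 0.5, predict, estimate)
print(params, genes, iterations)
```

In this toy run the second round of prediction reproduces the first gene set, so the loop stops after two iterations with the re-estimated cutoff.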
1.2.4 Integration of Extrinsic Evidence
When trying to come up with an exhaustive list of genes for an organism it is important to make use of all the available evidence, since the presence of a gene may be detected using one method and missed by another. An extrinsic evidence source refers to gene structure evidence originating from a source other than the information contained within the organism’s genomic DNA. The most reliable evidence still comes from isolating RNA in the cell and establishing its identity through various sequencing methods. This is the ideal form of evidence because the gene is actually being observed “in action”. This experimental evidence used for gene structure prediction comes in the form of full length cDNAs and ESTs. Even after an mRNA is isolated and sequenced it can still be a non-trivial task to identify the originating location in the organism’s genome. Problems can occur for several reasons. First, poly-ploidy or population mutations occur so that differences are observed within the sequenced population, leading to differences between the reference genome and the isolated RNA. A second problem is genome duplication. Many genomes have had major genome duplication events (see Arabidopsis thaliana for example [74]) leading to duplicate regions in the genome, which give the appearance of a single RNA sequence having originated from multiple locations. Finally, although DNA sequencing is fairly accurate, sequencing errors can occur when using high throughput methods such as EST sequencing.
In addition to using evidence of expressed genes from the same organism, it is also useful to consider evidence of gene expression from other organisms. The term ortholog describes two genes occurring in two separate species, which have a common ancestor [47]. Mapping the evidence from a gene expressed in the related organism to the organism under study requires searching for the matching orthologous gene. The problem becomes more difficult as the possibility for greater differences between the two related genes increases.
Another source of evidence is gene expression from paralogs. Paralogs are genes which occur within the same species and share a common gene ancestor [47]. A fourth approach is to use sequence similarity as evidence for the existence of a gene. Use of an arbitrary similarity measure is justified by the hypothesis that regardless of the evolutionary history, when a sequence shows substantial similarity to a known gene, the sequence is more likely to share a similar function. One example of this approach is the searching of protein domain databases for similarities with genomic sequence used in Genie [93]. The evolutionary argument for the existence of a gene can be confounded by the occurrence of pseudo genes. Pseudo genes are genes that are no longer functional, having arisen from a duplication event in which one gene retains function and the pseudo gene accumulates nucleotide mutations rendering the gene non-functional [192]. Despite the accumulation of mutations much of the pseudo gene retains sequence common to the functioning paralog, making it difficult to distinguish the pseudo gene and the functional gene. Another important extrinsic source of evidence is the use of protein sequences, which are usually derived from translating sequenced mRNA. Comparing amino acid sequences can potentially reveal more distant evolutionary relationships since an accumulation of mutations at the DNA level may not affect the encoded protein. The algorithms reviewed here incorporate evidence of cDNAs, proteins and genomic DNA from other organisms into gene finding programs used to predict gene structure.
Many of the early programs used sequence conservation to identify candidate orthologous exons, which were assembled into gene structures using dynamic programming algorithms. Examples are found in CEM [9], ROSETTA [12] and SGP-1 [179]. Much of the recent gene finding work has focused on incorporating extrinsic sources of evidence into the HMM and GHMM framework.
One probabilistic approach to incorporating extrinsic evidence is to use a pair hidden Markov Model (pair-HMM), an approach taken in the program Doublescan [109]. Recall the single species hidden Markov model formulation $P(S[k]|q) \times P(q|q')$, which is used to model the probability of emitting nucleotide $S[k]$ at position $k$ in sequence $S$. The alphabet $\Sigma$, previously defined to be the set of all possible nucleic acids, is now expanded to include an “insertion” symbol and the probabilistic formulation is changed to $P(S_1[k_1], S_2[k_2] | q) \times P(q|q')$ where $S_1[k_1], S_2[k_2]$ are a pair of nucleotides, one occurring in sequence $S_1$ and the other occurring in sequence $S_2$, or a nucleotide paired with an insertion symbol. The indices $k_1$ and $k_2$ are the positions in the two respective sequences $S_1$ and $S_2$. An index $k_1$ or $k_2$ is only incremented when a state emits the nucleotide in the sequence and not when an insertion symbol is emitted; the insertion symbol can be thought of as emitting a “null” symbol [77], which is not observed. The two sequences are simultaneously aligned, scoring the degree of sequence conservation between the two sequences, while scoring the likelihood of the gene structures predicted to occur in the two sequences. The program Projector [110] is an extension of the pair-HMM implemented in Doublescan. In Projector, rather than assume two anonymous DNA sequences as input, an annotated gene structure for one sequence is included as input. The idea is that if a gene is known to exist in one species and there is evidence for the precise gene structure boundaries in the sequence, a rigorous probabilistic approach can be used to accurately identify the precise gene structure in the related species.
Generalized pair hidden Markov models have also been used to take advantage of modeling explicit sequence interval lengths, just as the GHMMs were used in the ab initio case. This is the approach taken by SLAM [3] and more recently TWAIN [104]. In this approach the generalized hidden Markov model is transformed to output pairs of sequence (including insertion symbols) of explicit variable length of the form
$$\prod_{i=1}^{x} P(y'_i|q_i) \times P(q_i|q_{i-1}) \times P(l(y'_i))$$
where $x$ is the number of hidden states taken to output both pairs of sequence and each state $q_i$ emits a pair of sequences $y'_i$.
In addition to aligning two genomic sequences, predictions are sometimes based on aligning cDNA and protein sequences to genomic DNA. Protein to genomic sequence alignments are specifically addressed in SLAM and GeneWise [14]. Both methods incorporate concepts introduced in profile hidden Markov models [40], which use “insertion”, “deletion” and “match” states to model the probability of conservation patterns occurring independent of the sequence itself. A sub-model $P(s|s')$ of the sequence pair model $P(y'_i|q_i)$ is introduced to estimate the probability of arriving in state $s$ given the previous state $s'$, where $s$ and $s'$ are one of the three states defined by the alignment sequence conservation information, “insertion”, “deletion” or “match/mismatch”.
The computational cost to output two sequences in parallel rises considerably from the single sequence case. The dominating cost becomes the product of the lengths of the two input sequences, which in the context of predicting the locations of genes in long contiguous stretches of genomic sequence is prohibitive. Therefore, in both the pair-HMM and the pair-GHMM methods it is necessary to first generate approximate alignments to limit the search space of possible sequence alignments. Overall, runtime in the pair-GHMM is $O(D^4 |Q|^2 N_1 N_2)$ where $D$ is a maximum length on the emitted sequences, $|Q|$ is the number of states and $N_1$ and $N_2$ are the lengths of the two respective evaluated sequences $S_1$ and $S_2$ [123]. Limiting the length of $N_1$ and $N_2$ and imposing a limit on the number of consecutive insertion/deletions
permitted between two sequences is employed in each of the pair-HMM and pair-GHMM programs to allow for annotating large genomes. Despite improved prediction accuracy of the generalized pair-HMM over the pair-HMM in simultaneously predicting gene structure in two genomes, an advantage of the pair-HMM approach is that it is computationally feasible to predict intron loss/gain [109]. Intron loss/gain is a prevalent feature of evolution [144] and occurs when an intron is either inserted into an ancestral exon or removed from an ancestral sequence (with the resulting gene structure retained in subsequent generations). The result is that, when predicting the orthologous genes in the two related species, it is possible for one sequence to contain one long contiguous exon with the matching exon in the other sequence being interrupted by an intron. Similarly, exon loss/gain is known to occur, and in these cases an intron occurs in one sequence where the matching intron from the other sequence is interrupted by an exon.
Another computational approach to using extrinsic evidence is to assume that a sequence alignment is given as input. Genie [93] is the earliest published report to use extrinsic evidence within the GHMM framework, but was followed by many other programs including Fgenesh+ [146], Genomescan [184] and Twinscan [87], each of which uses the GHMM framework, and the closely related approach of SGP-2 [125]. In Genie, an exon coding interval is scored based on a statistical codon model plus the scores computed from matched protein domain profiles [159] and blastx [56] alignments. Protein information is used when a protein match aligns to two adjacent regions implying an intron. When such a case arises, Genie requires that the upstream exon be paired with a downstream exon overlapping the matching protein alignment.
Genomescan uses blastx to find matches to sequences in a protein database and the