An integrated computational pipeline and database supporting whole genome sequence annotation

C.J. Mungall (3, 5), S. Misra (1, 4), B.P. Berman (1), J. Carlson (2), E. Frise (2), N. Harris (2, 4), B. Marshall (1), S. Shu (1, 4), E. Smith (1, 4), C. Wiel (1, 4), G. Rubin (1, 2, 3, 4), and S.E. Lewis (1, 4)

(1) Department of Molecular and Cellular Biology, University of California, Berkeley, CA
(2) Drosophila Genome Project, Lawrence Berkeley National Laboratory, Berkeley, CA
(3) Howard Hughes Medical Institute
(4) FlyBase-Berkeley, University of California, Berkeley, CA
(5) Corresponding author

Corresponding author: Christopher J. Mungall
Email: cjm@fruitfly.org
Phone: 510-486-6217
FAX: 510-486-6798
University of California, Life Sciences Addition, Rm. 539, Berkeley, CA 94720-3200, USA

10/19/22

ABSTRACT

Background

Any large-scale genome annotation project requires a computational pipeline that can coordinate a wide range of sequence analyses and a database that can monitor the pipeline and store the results it generates. The compute pipeline must be as sensitive as possible to avoid overlooking information and yet selective enough to avoid introducing extraneous information. The data management infrastructure must be capable of tracking the entire annotation process as well as storing and displaying the results in a way that accurately reflects the underlying biology.

Results

We present a case study of our experiences in annotating the Drosophila genome sequence.
The key decisions and choices in constructing a genomic analysis and data management system are presented, together with a critical evaluation of our current process. We describe several new open source software tools and a database schema to support large-scale genome annotation.

Conclusions

We have developed an integrated and reusable software system for whole genome annotation. The key contributing factors to overall annotation quality are marshalling high-quality, clean sequences for alignments and achieving flexibility in the design and architecture of the system.

BACKGROUND

The information held in genomic sequence is encoded, and to understand the import of the sequence we must therefore first assess and describe this primary data. This initial computational assessment generates some measure of the biologically relevant characteristics present in the sequence, for example coding potential or sequence similarity. Because of the amount of sequence to be examined and the volume of data generated, these measures must be automatically computed and carefully filtered.

At the time we launched this effort there were a few other computational pipelines available, and we investigated the possibility of reusing one of these for our project. Unfortunately, this was not possible, primarily because of differences in strategic approach, although other factors such as scalability, availability, and customizability also played a role. Many of these pipelines are intended primarily for ad hoc queries from individual users and would not scale up sufficiently; while useful, they were not under consideration for the comprehensive analysis of an entire genome [1, 2, 3].
For whole genome analysis there are essentially three different strategies: a computational synthesis to predict the best gene models; aggregations of community-contributed analyses that the person viewing the data integrates visually; and curation by experts using a full trail of evidence to support an integrated assessment. Groups that are charged with rapidly providing a dispersed community with finished genome annotations have chosen a purely computational route; examples of this strategy are Ensembl [4], NCBI [5], and Celera [6]. Aggregative approaches adapt well to the dynamics of collaborative groups who are focused on sharing results as they develop; examples of this strategy are the University of California Santa Cruz (UCSC) viewer [7] and the Distributed Annotation System (DAS) [8]. For organisms with well-established and cohesive communities the demand is for carefully reviewed and qualified annotations; the representatives of this approach are two of the oldest genome community databases, ACeDB for C. elegans [9] and FlyBase for Drosophila [10].

Our decision was to proceed directly towards the goal of actively examining every gene and feature of the genome to improve the quality of the annotations. The prerequisites for this goal are a computational pipeline, a database, and an editing tool for the experts. This paper discusses our solution to the first two requirements. The editing tool, Apollo, is described in an accompanying paper [11]. Our long-term goal is to provide a set of open source software tools to support large-scale genome annotation.

Our primary design requirement was flexibility, so that the pipeline could easily be attuned to the needs of the curators.
For example, the curators needed comparisons against unique data sets such as direct submissions from individual researchers, sequences generated by our internal EST and cDNA projects [12], and custom configurations of sequences from the public databases (a detailed description of the data sets used is available in Misra et al. [13]). The aim was to provide the biological experts with every salient piece of information possible and then enable them to efficiently summarize this information manually.

RESULTS

The sequence data sets are the primary input into the pipeline. There are three different categories we will discuss: the Drosophila genomic sequence on which we are trying to detect features; expressed sequences and other sequences that are of Drosophila origin; and informative sequences from other species.

Drosophila genomic sequence

The release 3 genomic sequence was generated using Bacterial Artificial Chromosome (BAC) clones that formed a complete tiling path across the genome [14]. We used the BAC sequences to assemble a single continuous sequence for each chromosome arm to verify order and overlaps. This was accomplished using in-house software that utilized tiling path data from physical mapping work (a combination of in situ and Sequence Tag Site [STS] mapping) to chain BAC sequences together. At a certain point the assembly for each chromosomal arm was frozen, both because all possible gaps had been filled and because it is essential for annotation that the underlying sequence is stable. This then became the release 3 sequence.

Choosing the unit of annotation, both for the pipeline and for the curators, is a 'chicken and egg' problem. There are two contrary and arbitrary breakdowns (BAC sequences and public sequence accessions) and one biological breakdown (the protein coding gene region).
Ideally we would annotate using the biological breakdown, but initially there is no way of knowing this, and so the entire process must be bootstrapped from the arbitrary breakdowns. We considered using the BAC sequences directly as the pipeline input, which has the advantage of one less processing step. Ultimately, however, we rejected this idea because the BAC sequences are relatively short and contain random portions of the genome. There is thus a high probability of splitting the exons of a single gene across multiple BAC sequences and, as a consequence, complicating the annotation of these genes. Instead, the main sequence unit we used in our genomic pipeline was the Genbank accession. These are usually of a size manageable by most analysis programs (around 300 kilobases), but we still faced the issue of genes straddling these arbitrary units.

As our solution we carried out a two-step analysis. First we fed the BAC sequences into a pre-analysis pipeline, which is a lightweight version of the full annotation pipeline. This gave us a rough idea of where the various genes were located. We then projected these analysis results from BAC clone coordinates into coordinates on the full arm sequence assembly. This step was followed by the use of another in-house software tool to divide up the arm sequence, trying to simultaneously optimize two constraints. One constraint is correspondence to the pre-existing release 2 accessions in Genbank/EMBL/DDBJ [15, 16, 17]. The other constraint is avoiding the creation of gene models that straddle the boundaries between two accessions, as determined by the rough pre-analysis of the BAC sequences.
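The second constraint can be illustrated with a short sketch. This is not the in-house tool, whose implementation is not described here; the function names, the 0-based forward-strand coordinate convention, and the example numbers are all assumptions made for illustration. The sketch projects rough gene spans from BAC coordinates onto the arm and then moves a preferred cut position (for instance, a release 2 accession boundary) to the nearest position that does not fall inside a predicted gene.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), inclusive, hypothetical convention


def project_to_arm(bac_offset: int, span: Span) -> Span:
    """Shift a span from BAC coordinates to arm coordinates.

    Assumes the BAC lies on the forward strand of the arm and begins at
    position `bac_offset` of the arm assembly (a simplification).
    """
    start, end = span
    return (bac_offset + start, bac_offset + end)


def merge_spans(spans: List[Span]) -> List[Span]:
    """Merge overlapping or directly adjacent spans into disjoint blocks."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def choose_cut(preferred: int, gene_spans: List[Span]) -> int:
    """Return the position nearest to `preferred` that is outside every gene."""
    for start, end in merge_spans(gene_spans):
        if start <= preferred <= end:
            before, after = start - 1, end + 1
            if before < 0:
                return after
            # Move the cut to whichever side of the block is closer.
            return before if preferred - before <= after - preferred else after
    return preferred


# Two overlapping rough gene predictions on a BAC placed at arm offset 1000.
spans = [project_to_arm(1000, (100, 200)), project_to_arm(1000, (180, 400))]
print(choose_cut(1150, spans), choose_cut(1390, spans), choose_cut(500, spans))
```

A real division tool would also have to weigh target accession size and the release 2 boundaries against each other; the sketch handles only the gene-avoidance rule for a single cut.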
Because this was an approximation, the cuts are later refined by the curators and Genbank. During the annotation process, if a curator discovers that a unit was divided wrongly, and that it in fact breaks a gene, they request an extension sufficient to cover the gene. Once extended, the sequence is reanalyzed and exported again. All extensions were made to the right, to avoid complicated coordinate adjustments. Further adjustments were made by Genbank to ensure that, to the degree possible, genes remained on the same sequence accession. It is these divisions of the genome sequence that are then fed into the full automated annotation pipeline. This computation can take up to a week to complete for a full chromosome arm.

Drosophila specific sequences

To reannotate a genome in sufficient detail, an extensive set of sequences was necessary for sequence alignments and searches for homologous sequences.

First, we collected the nucleic acid sequences of all of the Release 2 Drosophila predicted genes, to align them to the new finished genomic sequence for use as a starting point for the Release 3 annotations.

Second, we built Drosophila nucleic acid sequence datasets of full-length cDNA and EST data from three different sources: the Berkeley Drosophila Genome Project (BDGP), public Genbank submissions, and reviewed error reports sent to FlyBase directly and recorded as personal communications. From what was available at the BDGP full-length cDNA project we took care to include all sequence reads, including those that were not yet assembled, so as to have the most comprehensive and up-to-date information possible. We pulled from Genbank all Drosophila entries held in dbEST and the nucleic acid sequences from the INV division, excluding our own BDGP submissions and sequences held in other divisions, such as genome survey sequences (GSS).
The non-BDGP EST sequences were added to the BDGP EST sequences and provided us with a comprehensive EST set. The nucleic acid sequences from the INV dataset contained a redundant mix of complete and partial cDNAs as well as genomic sequences; in future, we plan to isolate the cDNA sequences using feature table information from the genomic sequence submissions (this set was not combined with the BDGP full-length cDNA sequences). The larger FlyBase research community sent complete and partial cDNA sequences and protein sequences to FlyBase as error reports, and these were manually collected and placed into datasets for pipeline analysis [18]. As a group, these cDNA and EST sequence sets, when aligned to the genomic sequence, were the key to improving the annotations by sensitively revealing the exon-intron structure of genes.

Third, along with the EST and complete cDNA sequences from the BDGP, FlyBase reviewed and collated sequences from the scientific community as Annotated Reference Gene Sequences (ARGS). These manually created sequences integrate information from the literature with every Genbank submission available for a particular gene to offer a gold-standard annotation for that gene, and they were utilized wherever possible.

Fourth, we obtained a curated amino acid set of those Drosophila translations supported by experimental evidence, to find proteins related to paralogs elsewhere in the fly genome. To avoid using previous predictions as evidence for the new release, which would be a circular argument for annotation, we limited these to SWISSPROT and SpTrEMBL proteins supported by independent experimental evidence [19].

Fifth, we retrieved non-protein-coding nucleic acid sequences for Drosophila tRNAs, snRNAs, snoRNAs, and microRNAs from Genbank via FlyBase and used these to manually generate independent datasets for each category [20].
The tRNA set was made comprehensive by utilizing the coordinates of previously identified tRNAs [21]. The genomic analysis of transposable elements is described separately [22], but these data provided the sequences that the program RepeatMasker [23] used prior to running BLASTX.

We also have two other types of sequence information available. One is the STS and BAC end sequences used for physical mapping, and the other is the flanking sequences from P element insertion events that are part of the mutagenesis project. These sequences were also aligned to the genome by the pipeline, but were not used directly during annotation.

Other organism sequences

To look for cross-species sequence similarity, we wanted to use the BLASTX program in conjunction with protein datasets that would be current and comprehensive but also as non-redundant and biologically accurate as possible. We decided to use the SPTR dataset [24], which supplements the manually annotated SWISSPROT protein dataset [25] with SpTrEMBL (computationally annotated proteins) and TrEMBLNew (proteins from the past week that are not yet in SpTrEMBL), but excludes RemTrEMBL (patent data and synthetic protein sequences). To ensure we had the best match from a variety of model organisms, we split SPTR and used the following subdivisions for separate BLASTX analyses: rodents, primates, C. elegans, S. cerevisiae, plants, other invertebrates, and other vertebrates. We also obtained from Genbank the nucleic acid Mus musculus UniGene set and the insect-encoded sequences from dbEST, to look for similarities by TBLASTX that might not be identified by BLASTX searching of proteins.
We originally used all of dbEST in our pipeline, but later decided to remove most ESTs in order to lower the compute load. As TBLASTX must translate both query and subject sequences, it is highly compute intensive, and the other EST alignments added little new information to the overall analysis.

The task monitoring and scheduling Pipeline

Software Infrastructure

There are three major infrastructure components of the pipeline: the database, the Perl modules (named Pipeline::*), and sufficient computational power, including a job management system to allocate this resource. The database is crucial because it maintains a persistent record reflecting the current state of all the tasks that are in progress. Maintaining the system state in a database is a much more robust and resilient approach than simply using a filing system, because it offers a transaction-locking mechanism to ensure that a series of operations is always fully completed. We used a MySQL [26] database to manage the large number of analyses run against the genome, transcriptome and proteome. The Perl modules provide an application programmer interface (API) that is used to launch and monitor jobs, retrieve results and support other interactions with the database. As an inexpensive solution to satisfy the computational requirements we built a Beowulf cluster.

MySQL is an open source Structured Query Language (SQL) database that has the advantage of being fast, free and simple to maintain. It has several disadvantages compared to other SQL databases, in that it implements only a subset of the SQL standard and lacks many special features found in other database systems. An SQL database manages data as a collection of tables.
Each table has a fixed set of columns (also called fields) and usually corresponds to a particular concept in the domain being modeled. Tables can be cross-referenced by using primary and foreign key fields. The database tables can be queried using the SQL language, which allows the dynamic combination of data from different tables [27]. A collection of these tables is called a database schema, and a particular instantiation of that schema with the tables populated is a database.

There are four basic abstractions that all components of the pipeline system operate upon: a sequence, a job, an analysis, and a batch. A sequence is defined as a string of amino or nucleic acids held either in the database or as an entry in a FASTA file (usually both). A job is an instance of a particular program being run to analyze a particular sequence; for example, running BLAST to compare one sequence to a peptide set is considered a single job. Jobs can be chained together. If a job A is dependent on the output of job B, then the pipeline software will not launch job A until job B is complete. This is the situation, for instance, with programs that require masked sequence as input. An analysis is a collection of jobs in which one program is run with the same arguments against a set of sequences. Lastly, a batch is a collection of analyses that a user launches simultaneously.

Jobs, analyses and batches all have a state attribute that is used to track their progress through the pipeline (Figure 1). The state of an analysis is the same as the state of the slowest job in that analysis, and the state of a batch is the same as that of the slowest analysis in that batch.
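The job, analysis and batch abstractions and their state roll-up can be sketched in a few lines. The sketch below is in Python rather than the Perl of the actual Pipeline::* modules, and it is an illustration under stated assumptions, not the pipeline's API: the class and method names are hypothetical, and of the state names only FIN and PROCD appear in the text, while NEW and RUN are placeholders for the earlier states.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Ordered from least to most advanced. FIN and PROCD are named in the text;
# NEW and RUN are hypothetical placeholders for the earlier states.
STATE_ORDER = ["NEW", "RUN", "FIN", "PROCD"]
RANK = {state: i for i, state in enumerate(STATE_ORDER)}


def slowest(states: List[str]) -> str:
    """Return the least advanced state in a non-empty list of states."""
    return min(states, key=RANK.__getitem__)


@dataclass
class Job:
    program: str
    sequence_id: str
    state: str = "NEW"
    depends_on: Optional["Job"] = None  # e.g. a masking job run beforehand

    def can_launch(self) -> bool:
        """A job may start only once the job it depends on has finished."""
        if self.state != "NEW":
            return False
        if self.depends_on is None:
            return True
        return RANK[self.depends_on.state] >= RANK["FIN"]


@dataclass
class Analysis:
    jobs: List[Job] = field(default_factory=list)

    @property
    def state(self) -> str:
        # An analysis is only as far along as its slowest job.
        return slowest([job.state for job in self.jobs])


@dataclass
class Batch:
    analyses: List[Analysis] = field(default_factory=list)

    @property
    def state(self) -> str:
        # A batch is only as far along as its slowest analysis.
        return slowest([analysis.state for analysis in self.analyses])


mask = Job("RepeatMasker", "seq1")
blast = Job("BLASTX", "seq1", depends_on=mask)
batch = Batch([Analysis([mask]), Analysis([blast])])
mask.state = "FIN"
print(blast.can_launch(), batch.state)
```

Deriving the analysis and batch states on demand, instead of storing them, keeps them consistent with the jobs by construction; whether the real system computes or stores them is not stated in the text.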
The three applications that use the Perl API are the pipe_launcher.pl script, the flyshell interactive command line interpreter, and the Internet browser front end. Both pipe_launcher.pl and flyshell provide pipeline users with a powerful variety of ways to launch and monitor jobs, analyses and batches. They are useful both to those with a basic understanding of Unix and bioinformatics tools and to those with a strong knowledge of object-oriented Perl. The web front end is used for monitoring the progress of the jobs in the pipeline.

pipe_launcher.pl: a command line tool that is useful for both programmers and non-programmers. To launch jobs, users create configuration files that specify input data sources and any number of analyses to be performed on each of these data sources, along with the arguments for each of the analyses. Most of these specifications can be overridden with command line options. This allows each user to create a library of configuration files for sending off large batches of jobs, which they can alter with command line arguments when necessary. pipe_launcher.pl returns the batch identifier generated by the database to the user. To monitor jobs in progress, the batch identifier can be used in a variety of commands, such as monitor, batch, deletebatch and query_batch.

flyshell.pl: provides a more flexible interface for power users who are familiar with object-oriented Perl. flyshell.pl is an interactive command line Perl interpreter that presents the Gadfly and pipeline APIs to the end user.

web front end: allows convenient, browser-based access for end users to follow the status of analyses. An HTML form allows users to query the pipeline database by job, analysis or batch identifier, as well as by sequence identifier.
The user can drill down through batches and analyses to get to individual jobs and obtain the status, raw job output and error files of each job. This window on the pipeline has proven to be a useful tool for quickly viewing results.

Once a job is finished (in the database the job's state is set to FIN), the raw results are recorded in the database and may be retrieved through the web interface or through either Perl interface. Following this, the raw results are parsed, filtered, and stored in the chosen Gadfly database (and the job's state is set to PROCD). At this point a GAME XML representation of the processed data can similarly be retrieved through either the Perl or web interfaces.

Analysis software

As might be expected, the pipeline involves numerous computational analyses that generate data. What is perhaps less obvious is that there is also a need to screen and filter data, and that this is equally important to the system. There are two primary reasons for this: one is to increase the efficiency of the pipeline by reducing the amount of data that compute-intensive tasks must process; the other is to increase the signal-to-noise ratio by eliminating results that lack informative content.

Sim4wrap: Sim4 [28] is a highly useful and largely accurate way of aligning cDNA and EST sequences against the genome. Unfortunately, it is highly compute expensive compared with BLASTN. To make the most of our resources, we split the alignment of Drosophila cDNA and EST sequences into two serial tasks and wrote a utility program (Sim4wrap) for this purpose.
Sim4wrap executes a first pass using BLASTN, using our genomic scaffold as the query sequence and the transcript sequences as the subject database. We run BLASTN with the "B 0" option, as we are only interested in the summary part of the BLAST report, not in the high scoring pairs (HSPs) portion where the alignments are shown. From this BLAST report summary, Sim4wrap parses out the sequence identifiers and filters the original database to produce a temporary FASTA data file that contains only these sequences. Finally, we run sim4, again using the genomic sequence as the query and, as the subject, the minimal set of sequences that we have culled.

Autopromote: The Drosophila genome is not a blank slate, because there are previous annotations from the release 2 genomic sequence. Therefore, before the curation of a chromosome arm begins, we first "autopromote" the release 2 annotations and certain computational analysis results to the status of full annotations. This speeds the annotation process by providing a starting point for the curators to work from.

The autopromotion software must be able to synthesize different analysis result tiers, some of which may be conflicting. Our autopromotion software is a component within the Gadfly software collection. It works by building a graph of exon-level intersections between all relevant result features. Different result features are weighted differently, and the intersection graph forms a voting network.
To resolve conflicts over whether a set of predictions should be multiple split genes or a single merged gene, we allow only one vote per feature, and voters must mutually support each other. This analysis includes an automated check to see whether any transcripts from the previous release are no longer present after this process.

Berkeley Output Parser (BOP) filtering: All BLAST jobs were run with very non-restrictive parameters in order to capture as much information as possible, and the results were then filtered. The reason for taking this approach is that the genes found on genomic sequence are not uniformly represented in the public databases. Genes that are richly covered in the public databases are often immediately adjacent to regions for which only a few distant homologies are currently available. Because it is difficult to normalize Genbank across species but still allow these faint signals to come through, we set the number of allowed alignments very high.

Sim4 was used strictly for alignments to Drosophila sequences, and for this reason we wanted to apply stringent measures before accepting an alignment. The available limits to the filters are controlled by parameters passed into the program. For sim4 these include:

• Score is the minimum percent identity that is required to retain an HSP or alignment. The default value is 95%.
• Coverage is the percentage of the total length of the sequence that is aligned to the genomic sequence. Any alignments that cover less than this percentage of the length are eliminated.
• Length is an absolute minimum length in base pairs required to accept a span, regardless of percent identity or percent length.
• Join 5' and 3' is a Boolean option used for EST data. If it is true, BOP will do two things. First, it will reverse the orientation of any hits where the name of the sequence contains the phrase 3prime. Second, it will merge all alignments where the prefixes of the name are the same. Originally this was used solely for the 5' and 3' ESTs that were available; however, when we introduced the internal sequencing reads from the cDNA project into the pipeline, this portion of code became an alternate means of effectively assembling the cDNA sequence. Using the intersection of each individual sequence alignment on the genomic sequence, a single virtual cDNA sequence was constructed, and this alignment was provided for annotation.
• Reverse 3' is another Boolean parameter used solely for EST data. Those sequences whose name ends in the suffix provided as the parameter argument will be reverse complemented.
• Discontinuity sets a maximum gap length in the aligned EST or cDNA sequence. The primary aim of this parameter is to eliminate chimeric clones.
• Remove polyA tail is a Boolean indicating that short terminal HSPs consisting primarily of runs of a single base are to be removed.

For BLAST these filtering options are available:

• Remove low complexity is specified by two values: the first is a repeat word size (the number of consecutive bases or amino acids) and the second is a threshold. The alignment is compressed using Huffman encoding to a bit length, and any hit where all HSP spans have a score lower than this value is discarded. Larger word sizes should have larger thresholds.
• Minimum expectation offers a simple cutoff for HSPs. Any HSP with an expectation greater than this value is deleted. The default is a generous 1.0.
• Maximum depth specifies the maximal number of matches that are allowed in a given genomic region. The default is 10 overlapping alignments. This parameter applies to both BLAST and sim4. The aim is to avoid excess reporting of matches in regions that are highly represented in the aligned data set (e.g. from a non-normalized EST library).

In addition there is a standard filter for BLAST that eliminates 'shadow' matches.
These are weak alignments to the same sequence in the same location on the reverse strand of the genomic sequence. BLAST matches are also reorganized if necessary to ensure that the HSPs are in sequential order along the length of the sequence. For example, a duplicated gene may appear in a BLAST report as a single alignment whose HSPs map to two different regions of the genomic sequence but to the same single portion of the aligned sequence. In these cases the alignment is split into separate alignments to the genomic sequence, each of which contains the aligned sequence just once in its HSPs.

In our pipeline the processing of Drosophila EST and cDNA sequences required a minimum percent identity of 95% and a percent length of 80% to retain a match. In addition, the 5' and 3' ESTs from a single cDNA clone were joined, polyA tails were removed, and the depth at a single genomic location was limited to 300 matches. We filtered BLASTX alignments using a minimum expect value of 1.0e-4, removed repetitive HSPs, removed 'shadows' and kept the depth to no more than 50 matches in the same genomic location.

BOP EST grouping: Another tactic for condensing primary results, but without removing any information, is to reconstruct all logically possible alternate transcripts from the raw EST alignments. This additional piece of code was added to BOP as well. First, the set of ESTs that overlap is collected; from these, one or more trees are built in which each node comprises the set of spans from these ESTs that share splice junctions. The possible transcripts correspond to the paths through the tree(s). This analysis produced an additional set of alignments augmenting the original EST alignments.

External pipelines

We believe that it is imperative for any annotation to utilize every possible bit of useful information.
Thanks to the generosity of three external groups, we were able to have results from the Celera, NCBI, and Ensembl pipelines incorporated into our database for 3 of the 5 chromosome arms (2L, 2R, and 3R) and to present these to the curators in addition to the results from our internal pipeline.

Hardware

To carry out the genomic analyses we first built a Beowulf cluster. A Beowulf cluster is a collection of compute nodes that are interconnected with a network. The sole purpose of these nodes and the network is to provide compute cycles. The nodes themselves are built from inexpensive, off-the-shelf processor chips, connected using standard networking technology, and running open source software. When these components are put together, a low cost but high performance compute system is available. Our nodes are all identical and use Linux as their base operating system, as is usual for Beowulf clusters. One consequence of building a system out of stock materials is that modifications are inevitable, and this makes it mandatory to utilize an open source operating system and development environment in order to have access to their source code for recompilation. The job control software we utilize is the Portable Batch System (PBS) developed by NASA [29].

The compute jobs were run on a Beowulf-style Linux cluster used as a compute farm. The cluster was built by Linux Networx (http://www.linuxnetworx.com). Linux Networx provided additional hardware (ICE box) and Clusterworx software to install the system software and to control and monitor the hardware of the nodes.
The cluster configuration used in this work consisted of 32 standard IA32 architecture nodes, each with dual Pentium III CPUs running at 700 MHz/1 GHz and 512 MB of memory. In addition, one single redundant Pentium III based master node was used to control the cluster nodes and distribute the compute jobs. Nodes were interconnected with standard 100BT Ethernet on an isolated subnet, with the master node as the only interface to the outside network. The private cluster 100BT network was connected with Gigabit Ethernet to the NAS-based storage volumes housing the data and user home directories. Each node had a 2 GB swap partition used to cache the sequence databases from the network storage volumes. To provide a consistent environment, the nodes had the same directory mount points as all other BDGP Unix computers. The network-wide NIS maps were translated to the internal cluster NIS maps with an automated script. Local hard disks on the nodes were used as temporary storage for the pipeline jobs.

Job distribution to the cluster nodes was done with the queuing system OpenPBS, version 2.3.12 (http://www.openpbs.org). PBS was configured with several queues, each having access to a dynamically resizable, overlapping fraction of the nodes. For optimal utilization of the resources, queues were configured to use one node at a time, either running one job using both CPUs (such as the multithreaded BLAST or InterPro motif analysis) or two jobs using one CPU each.

Due to the architecture of the pipeline, individual jobs were often small, but tens of thousands of them could be submitted at any given time. However, the default PBS FIFO scheduler, while providing a lot of flexibility, does not scale beyond about 5,000 to 10,000 jobs in any given queue. Thus, the FIFO scheduler was extended to cache the jobs of a queue in memory once a certain number of jobs was exceeded in that queue.
Job resource allocation was managed on a per-queue basis. Individual jobs could only request cluster resources based on the queue they were submitted to, and each queue was run on a strict FIFO basis. With those modifications PBS scaled to 100,000 jobs and beyond, while still permitting jobs to run with higher priority when submitted to a separate, higher-priority queue.

{END OF ERWIN’S TEXT}

Gadfly

{CHRIS IS TO REWORK ALL OF THE GADFLY SECTION}

We designed and constructed a generic annotation database called Gadfly (Genome Annotation Database of the Fly) and used it to store processed pipeline analysis results and curated gene models. Like the pipeline system, Gadfly is a combination of a MySQL database and a collection of Perl modules and libraries for accessing the database. In this respect, the architecture is similar to that of the Ensembl database. In the following section we first describe the MySQL database and then the Perl modules.

Gadfly is designed to handle a wide variety of biological features that can be associated with a nucleic or amino acid sequence. Examples of such features are transcripts, exons, untranslated regions (UTRs), tRNAs, promoters, and transmembrane regions. Three core tables in Gadfly accomplish this: seq, seq_feature, and sf_property (Figure 2). A sequence is simply a string of contiguous residues representing all or part of a DNA, RNA, or peptide molecule. In fact, a sequence may be an artificial entity, such as a genomic assembly, that is not physically stored in the database. Rather than creating a multitude of tables to represent all possible biological features, we use a single table called "seq_feature" to represent the set of all entities that can be localized onto a biological sequence.
This table contains the core information applicable to all biological sequence features: a unique identifier, an indication of the feature type (exon, transcript, etc.), the sequence coordinates of the feature boundaries, and an identifier indicating the sequence on which the feature is localized. All coordinates are relative to this sequence. A sequence feature itself defines a sequence (because any subsequence is a sequence), and it may be the antecedent of other sequences. For instance, a transcript feature can be localized with respect to a genomic clone sequence and also have an mRNA/cDNA sequence of its own.

Different kinds of features require different fields. For instance, tRNA features require fields indicating the anticodon and the amino acid transferred. How do we reconcile this with the one-table-many-features approach? We use a technique called generic modeling: we add a table to our schema that models properties. For our seq_feature table, the corresponding table is called sf_property. This table is cross-referenced to the seq_feature table and contains fields for the property name (key) and the property value. Using this table we can attach any number of arbitrary properties to a sequence feature.

There are disadvantages to this approach. By not modeling the properties directly in the SQL schema, we lose some of the advantages of the SQL model, such as strong typing and ease of querying. The main advantage, however, is extensibility: it is very easy to add new properties, because the schema does not need modification. Extensibility is important in biological modeling, as our knowledge of the data is constantly changing.

We also need to represent relationships between features. For instance, a gene structure can be viewed as a feature containing transcript features, which themselves contain exon features.
Protein-coding transcript features produce translation features. This model of what a gene consists of is illustrated in the following diagram:

[diagram: gadflygrouptalk/compgraph.png]

This model is not intended to be all-inclusive; for instance, there is no mention of promoters. Some features are left implicit: for instance, although you can query the Gadfly website for UTR and intron features, we do not store these explicitly in the database; rather, they are generated dynamically from the other features.

The above model also breaks down in the face of more complex biology. For example, a dicistronic messenger RNA produces two non-overlapping translations, each of which is considered a separate gene. There are plenty of cases of this in the Drosophila genome (see annotation paper). The corresponding graph model is more complex:

[diagram: gadflygrouptalk/compgraph2.png]

There are other cases where unusual biology conspires to break carefully designed data models, for example trans-splicing such as that exhibited by the mod(mdg4) gene in Drosophila (see paper?). This is a good reason for keeping the database schema flexible. Our schema allows for arbitrary relationships between any two "seq_feature" entities. We do this using the "sf_produces_sf" table (the name is historical and somewhat misleading; it can actually represent a wide variety of relationships). The table is extensible to potential future uses, for instance capturing relationships between transcription factors and binding site features.

[diagram: gadflygrouptalk/sf6.png]

Other tables

There are other tables in the schema (49 in all); discussion of these is beyond the scope of this article. They build upon the tables described above, and allow for the modeling of computational analyses and for attaching more detailed information to seq_features.
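The generic-modeling approach described in this section can be illustrated with a small, self-contained sketch. It is not the actual Gadfly schema: the table names seq, seq_feature, sf_property, and sf_produces_sf come from the text, but every column name, the relationship label "contains", and the demo rows are assumptions made for illustration, and SQLite (through Python's standard sqlite3 module) stands in for MySQL.

```python
import sqlite3

# Hypothetical column names; only the four table names come from the text.
SCHEMA = """
CREATE TABLE seq (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    residues TEXT              -- NULL for an artificial sequence such as an assembly
);
CREATE TABLE seq_feature (
    id         INTEGER PRIMARY KEY,
    type       TEXT NOT NULL,  -- 'gene', 'transcript', 'exon', 'tRNA', ...
    src_seq_id INTEGER NOT NULL REFERENCES seq(id),
    start_pos  INTEGER NOT NULL,
    end_pos    INTEGER NOT NULL
);
CREATE TABLE sf_property (
    seq_feature_id INTEGER NOT NULL REFERENCES seq_feature(id),
    pkey           TEXT NOT NULL,
    pval           TEXT
);
CREATE TABLE sf_produces_sf (
    subject_id INTEGER NOT NULL REFERENCES seq_feature(id),
    object_id  INTEGER NOT NULL REFERENCES seq_feature(id),
    rel_type   TEXT NOT NULL
);
"""


def build_demo():
    """Create an in-memory database holding one made-up gene and one made-up tRNA."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO seq (id, name) VALUES (1, 'demo_scaffold')")
    conn.executemany(
        "INSERT INTO seq_feature VALUES (?, ?, ?, ?, ?)",
        [
            (1, "gene", 1, 100, 900),
            (2, "transcript", 1, 100, 900),
            (3, "exon", 1, 100, 300),
            (4, "exon", 1, 600, 900),
            (5, "tRNA", 1, 1500, 1572),
        ],
    )
    # Type-specific fields live in sf_property, so no schema change is needed.
    conn.executemany(
        "INSERT INTO sf_property VALUES (?, ?, ?)",
        [(5, "anticodon", "CAT"), (5, "amino_acid", "Met")],
    )
    # Relationships between features: gene -> transcript -> exons.
    conn.executemany(
        "INSERT INTO sf_produces_sf VALUES (?, ?, ?)",
        [(1, 2, "contains"), (2, 3, "contains"), (2, 4, "contains")],
    )
    conn.commit()
    return conn


def properties_of(conn, feature_id):
    """Return all key/value properties attached to one feature as a dict."""
    rows = conn.execute(
        "SELECT pkey, pval FROM sf_property WHERE seq_feature_id = ?",
        (feature_id,),
    )
    return dict(rows.fetchall())


def children_of(conn, feature_id):
    """Return (id, type) of every feature related to feature_id as an object."""
    rows = conn.execute(
        "SELECT f.id, f.type FROM sf_produces_sf r "
        "JOIN seq_feature f ON f.id = r.object_id "
        "WHERE r.subject_id = ? ORDER BY f.id",
        (feature_id,),
    )
    return rows.fetchall()
```

In this sketch, adding a new kind of property (for example a promoter score) requires only new rows in sf_property. The cost is the one noted above: every value is stored as untyped text, and queries on properties need an extra join.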
The Gadfly schema also subsumes the GO database schema, which was also designed at BDGP.

The scope of the data currently stored in Gadfly includes:

• Annotated gene models, including sequence and positional information and detailed structured curator comments
• Evidence for the gene models, including alignments of various sequences to the genome: ESTs, cDNAs, protein sequences (both Drosophila and other species), transcript sequences from the previous release, and community sequences. The alignments are stored at a high level of detail, preserving the exact base-to-base correspondences
• Results of the gene prediction programs Genie [30], Genscan [31], and tRNAscan-SE [32]
• Results of promoter prediction analysis
• Gene Ontology assignments for the gene models
• Results of peptide analysis of the gene models

All of these are stored in normalized tables, making it possible to run a wide range of queries on the data. Example queries can be found at http://www.fruitfly.org/developers

As well as using Gadfly to house the release 3 annotated sequence data, we are also using separate instantiations of Gadfly to store comparative data on different Drosophila species (see paper ) and heterochromatin data.

Data flow with pipeline and curators

After the pipeline system completes a batch and the data moves into Gadfly, we export the results and auto-promoted annotations to individual scaffold files in GAME XML format. These files are collected by the curators and annotated using Apollo. On completion, the curator saves the file to a directory, from which it is then loaded into the database. One scaffold typically goes through several passes between Apollo and the database before it is complete. (Nota bene: we originally experimented with non-file-based data transfer between Apollo and the database; we developed a prototype CORBA client and server.
In the end we decided that a file-based system had the advantages of transparency and of allowing curators to work offline.)

Each annotation cycle on a sequence may affect the resulting proteins. As the generation of a high-quality peptide set is one of our primary goals, we needed to incorporate into this loop a means of evaluating these peptides and of providing this assessment to the curators for inspection during the next annotation session on that sequence. Whereas the genomic pipeline is launched at distinct stages, on an arm-by-arm basis, the peptide pipeline is more fluid. When a peptide sequence is modified (triggered by a curator altering a gene model and saving it to the Gadfly database), that sequence is reinserted into the peptide pipeline. In this way we make sure that we do not waste compute cycles reanalyzing sequences that have already been processed. To accomplish this, the database uniquely identifies every sequence by its name and its MD5 checksum [33]. The MD5 checksum provides a fast and convenient way of determining whether two sequences are identical. The RSA MD5 algorithm takes as input a message of arbitrary length (in this case a string of letters representing a biological sequence) and calculates as output a 128-bit (16-octet) checksum. Because the output has a fixed length, distinct inputs can in principle share a checksum, but for biological sequences the probability of this happening by chance is negligible, so in practice two different sequences can be treated as never sharing a checksum. Determining whether a peptide sequence has been altered is therefore a simple comparison of the prior checksum with a more recent one.

{END OF CHRIS’ GADFLY SECTION}

This pipeline provides general evaluations by using BLASTP to compare the peptides with peptides from other model organism genome sequences, and by analyzing the peptides for protein family motifs using InterProScan [34].
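The checksum comparison that decides whether a peptide re-enters the peptide pipeline can be sketched in a few lines. This is a minimal illustration rather than the Gadfly code: the function names, the dictionary layout, and the normalization step (stripping whitespace and upper-casing before hashing) are all assumptions; only the use of MD5 over the sequence string comes from the text.

```python
import hashlib


def seq_checksum(residues: str) -> str:
    """Return the MD5 hex digest of a sequence string.

    Whitespace is removed and letters are upper-cased first (an assumed
    normalization) so that formatting differences do not change the checksum.
    """
    normalized = "".join(residues.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def peptides_to_rerun(stored: dict, current: dict) -> list:
    """Return names of peptides that are new or whose sequence has changed.

    stored  maps peptide name -> checksum recorded at the last analysis
    current maps peptide name -> residues as they are now
    """
    changed = []
    for name, residues in sorted(current.items()):
        # A missing entry (new peptide) or a differing checksum both trigger a rerun.
        if stored.get(name) != seq_checksum(residues):
            changed.append(name)
    return changed
```

For example, with a stored checksum for a hypothetical peptide "pepA" that matches its current residues and a stored checksum for "pepB" that does not, `peptides_to_rerun` returns only "pepB", so "pepA" is not analyzed again.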
However, the most crucial aspect of the pipeline is the validation of the annotation peptides performed by our program, PEP-QC. This integrity check takes advantage of the carefully reviewed databases of published peptide or CDS sequences described in Misra et al. PEP-QC generates both summary status codes and detailed alignment information for each gene and each peptide. ClustalW (version 1.8 [35]) and showalign [36] are used to generate a multiple alignment from the annotated peptides for the gene and the corresponding SPTRREAL peptide or peptides. In addition, brief “discrepancy” reports are generated summarizing each SPTRREAL mismatch. For instance, an annotated peptide might contain any or all of the following mismatches:

    discrepancy ^1
        /description="N-terminal insertion: Q960X8 contains an additional stretch of 88 AA"
        /position="^M1 (CG2903-PB)"
        /position="M1 E88 (Q960X8)"
        /swissprot_date="16-OCT-2001 (Rel 40, Last sequence update)"
    discrepancy 163
        /description="Internal substitution of AA"
        /position="M163 (CG2903-PB)"
        /position="K163 (Q960X8)"
        /swissprot_date="16-OCT-2001 (Rel 40, Last sequence update)"
    discrepancy 1533 1537
        /description="C-terminal substitution of AA with 10 AA"
        /position="G1533 D1537 (CG2903-PB)"
        /position="P1580 Q1589 (Q960X8)"
        /swissprot_date="16-OCT-2001 (Rel 40, Last sequence update)"

In the above examples, CG2903-PB is the BDGP annotation and Q960X8 is the SPTRREAL entry. The ClustalW alignments and discrepancy reports are presented to the curator via the minigene reports, described below. A web report, grouped by genomic segment and annotator, is updated nightly and contains lists of genes indexed by status code and linked to their individual minigene reports.

Curators needed to see the validation report and other available information associated with a gene in order to refine an annotation efficiently. We developed automatically generated "minigene reports" to consolidate all relevant data about each gene into a single Web page.
Minigene reports include all the names and synonyms associated with a gene; its cytological location; and accessions for the genomic sequence, ESTs, PIR records, and Drosophila Gene Collection (DGC) assignments, if any. All of this information is hyperlinked to the appropriate databases (e.g. GenBank) for easy access. All literature references for the gene appear in the reports, with hyperlinks to the complete text. The minigene reports also consolidate any comments about the gene, including amendments to the gene annotation submitted by FlyBase curators or members of the Drosophila community. The minigene reports can be accessed directly from Apollo, or searched via a Web form by gene name, symbol, synonym (including FBgn), or genomic location (by scaffold).

Other integrity checks

Prior to submission to GenBank, a number of additional checks are run to detect potential oversights in the annotation. These checks include confirming the validity of any annotations with unusually short ORFs (less than 50 amino acids) or ORFs that are less than 25% of the transcript length. In the special case of known small genes, such as the short Drosophila Immune Response Genes (DIRGS), the genome is scanned to ensure that they have been accounted for. Similarly, the genome is scanned to verify the presence of particular annotations, including those that have been submitted as corrections by the community, are cited in the literature, or are euchromatic tRNAs, snRNAs, snoRNAs, microRNAs, or rRNAs documented in FlyBase. If the translation start site is absent, an explanation must be provided in the comments. Annotations may also be eliminated if annotations with different identifiers are found at the same genome coordinates, if a protein-coding gene overlaps a transposable element, or if a tRNA overlaps a protein-coding gene.
Conversely, duplicated gene identifiers that are found at different genome coordinates are either renamed or removed. A simple syntax check is also carried out on all the annotation symbols and identifiers. Known mutations in the sequenced strain are documented, and the wild-type peptide is submitted in place of the mutated version from the genomic strain.

As described previously, the cDNA sequences were aligned to the genome using the program sim4, processed using BOP, and stored in Gadfly. After annotation was completed, these results were used to find the intersection of cDNA alignments and exons. A cDNA is assigned to a gene when the cDNA overlaps most of the gene's exons. The independently predicted peptides of the cDNA and of the gene annotation are then compared, just as the SWISS-PROT peptides were, to detect and resolve discrepancies.

DISCUSSION

The main software engineering lesson we learned in the course of this project is the importance of flexibility. Nowhere was this more important than in the database schema. In any genome, unusual biology conspires to break carefully designed data models. Among the examples we encountered in annotating the Drosophila genome were: (1) the occurrence of distinct transcripts with overlapping UTRs, which modified our original definition of “alternate transcript”; (2) the existence of dicistronic genes, which required support for one-to-many relationships between transcripts and peptides; and (3) one case of trans-splicing, exhibited by the mod(mdg4) gene [37], which required a new data model.

We also needed to be able to adapt the pipeline to different types and qualities of input sequence. For example, in order to analyze the draft sequence of the repeat-rich heterochromatin [38], we needed not only to adjust the parameters and data sets used, but also to develop an entirely new repeat-masking approach to facilitate gene finding in highly repetitive regions.
We are now in the process of modifying the pipeline to exploit comparative genome sequences more efficiently. Our intention is to continue extending the system to accommodate new biological research situations.

Improvements to tools and techniques are often as fundamental to scientific progress as new discoveries, and thus the sharing of research tools is as essential as sharing the discoveries themselves. We are active participants in the Generic Model Organism Database (GMOD) project, which seeks to bring together open source software applications and utilities that are useful to the developers of biological and genomic databases. For example, we use the Perl-based software gbrowse [39] from GMOD for the visual display of the annotations. Automated pipelines and the management of downstream data require a significant investment in software engineering. To enable other groups to minimize this cost, we are contributing the software we have developed during this project to GMOD. The pipeline software, the database, and the annotation tool Apollo together provide a core set of utilities to any genome effort that shares our annotation strategy. It remains to be seen how portable they will prove to be, because there is often a tradeoff between customization and ease of use. We are aware that making a system easy to configure is often as difficult as building the original system, and we will only know the extent to which we have succeeded when other groups try to reuse and extend these software tools.

FIGURES

Figure 1. The allowed values for the status attribute are READY, RUN, FIN, PROCD, UNPROC, and FAIL.
With respect to jobs, READY means the jobs are ready to be sent to the pipeline queue, RUN means the jobs are on the queue or being run, FIN means the jobs have run but have not yet been processed by BOP to extract the results from the raw data, UNPROC generally means there was an error in the processing step, and FAIL means there was an error in job execution.

Figure 2. These 3 tables, for sequences, sequence features, and descriptive properties of these features, form the core of the Gadfly SQL schema. Note that this model allows for recursively nested locations of sequence features on sequences.

Figure 3

Figure 4

ACKNOWLEDGEMENTS

This work was supported by NIH grant HG00750 to G.M. Rubin, by NIH Grant HG00739 to FlyBase (P.I. W.M. Gelbart), and by the Howard Hughes Medical Institute.
We are grateful to our external contributors for finding the time and resources to provide additional computational pipeline results for us to consider: Karl Sirotkin at the NCBI, Mark Yandell, then at Celera Genomics and now with the BDGP, and Emmanuel Mongin of the Ensembl group.

REFERENCES

1. XGI [http://www.ncgr.org/xgi/]
2. Comparative Genomic Analysis Tools (CGAT) [http://inertia.bs.jhmi.edu/CGAT/CGAT.html]
3. Automated DNA Annotating and Parsing Tool (ADAPT) [http://wwwsequence.stanford.edu/~curtis/adapt.html]
4. Ensembl Analysis Pipeline [http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/NewAnalysisPipeline.html]
5. NCBI Annotation Process [http://www.ncbi.nlm.nih.gov/genome/guide/build.html#annot]
6. Kerlavage A, Bonazzi V, di Tommaso M, Lawrence C, Li P, Mayberry F, Mural R, Nodell M, Yandell M, Zhang J, Thomas P: The Celera Discovery System. Nucleic Acids Res 2002, 30:129-136
7. Kent JW, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The Human Genome Browser at UCSC. Genome Res 2002, 12
8. Dowell RD, Jokerst RM, Day A, Eddy SR, Stein L: The Distributed Annotation System. BMC Bioinformatics 2001, 2:7
9. Durbin R, Thierry-Mieg J: A C. elegans Database. Documentation, code and data available from anonymous FTP servers at lirmm.lirmm.fr, cele.mrclmb.cam.ac.uk and ncbi.nlm.nih.gov 1991
10. FlyBase Consortium: The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res 2002, 30:106-108
11. Lewis SE, Searle SMJ, Harris N, Gibson M, Iyer V, Richter J, Wiel C, Bayraktaroglu L, Birney E, Crosby MA, Matthews B, Rubin GM, Misra S, Mungall CJ, Clamp ME: Apollo: A Sequence Annotation Editor.
Genome Biology 2002 (this issue)
12. Stapleton et al.: Genome Biology 2002 (this issue)
13. Misra S, Crosby MA, Mungall CJ, Matthews BB, Campbell K, Hradecky P, Huang Y, Kaminker JS, Millburn GH, Prochnik SE, Smith CD, Tupy JL, Whitfield EJ, Bayraktaroglu L, Berman BP, Celniker SE, de Grey ADNJ, Drysdale RA, Harris NL, Richter J, Russo S, Shu S, Stapleton M, Yamada C, Ashburner M, Gelbart WM, Rubin GM, Lewis SE: Reannotation of the Drosophila Euchromatic Genome. Genome Biology 2002 (this issue)
14. Celniker S, et al.: Genome Biology (this issue)
15. Benson DA, Boguski MS, Lipman DJ, Ostell J, Ouellette BF: GenBank. Nucleic Acids Res 1998, 26:1-7
16. Stoesser G, Sterk P, Tuli MA, Stoehr PJ, Cameron GN: The EMBL Nucleotide Sequence Database. Nucleic Acids Res 1997, 25:7-14
17. Tateno Y, Imanishi T, Miyazaki S, Fukami-Kobayashi K, Saitou N, Sugawara H, Gojobori T: DNA Data Bank of Japan (DDBJ) for genome scale research in life science. Nucleic Acids Res 2002, 30:27-30
18. Millburn G, Kaminker J, Smith CD: personal communication
19. Whitfield E: personal communication
20. Huang Y: personal communication
21. Mount SM, Salz HK: Pre-messenger RNA processing factors in the Drosophila genome. J Cell Biol 2000, 150:F37-44
22. Kaminker J, et al.: Genome Biology 2002 (this issue)
23. Smit AFA, Green P: RepeatMasker [http://ftp.genome.washington.edu/RM/RepeatMasker.html]
24. SPTR: A comprehensive, non-redundant and up-to-date view of the protein sequence world [http://wserv1.dl.ac.uk/CCP/CCP11/newsletter/vol2_3/sptr.html]
25. Bairoch A, Apweiler R: The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res 2000, 28:45-48
26. MySQL [http://www.mysql.com/]
27. Date CJ: An Introduction to Database Systems. Addison-Wesley 1983
28. Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W: A computer program for aligning a cDNA sequence with a genomic DNA sequence.
Genome Res 1998, 8:967-974
29. NASA [http://parallel.nas.nasa.gov/]
30. Reese MG, Kulp D, Tammana H, Haussler D: Genie - gene finding in Drosophila. Genome Res 2000, 10:529-538
31. Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol 1997, 268:78-94
32. Lowe T, Eddy SR: tRNAscan-SE: a Program For Improved Detection of Transfer RNA genes in Genomic Sequence. Nucleic Acids Res 1997, 25:955-964
33. Preneel B: Analysis and Design of Cryptographic Hash Functions. Ph.D. Thesis, Katholieke University Leuven 1993
34. Zdobnov EM, Apweiler R: InterProScan - an integration platform for the signature-recognition methods in InterPro. Bioinformatics 2001, 17:847-848
35. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22:4673-4680
36. EMBOSS: showalign [http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/showalign.html]
37. Mongelard F, Labrador M, Baxter EM, Gerasimova TI, Corces VG: Trans-splicing as a novel mechanism to explain interallelic complementation in Drosophila. Genetics 2002, 160:1481-1487
38. Hoskins R, et al.: Genome Biology 2002 (this issue)
39. Stein LD, Mungall CJ, Shu SQ, Caudy M, Mangone M, Day A, Nickerson E, Stajich J, Harris TW, Arva A, Lewis S: The Generic Genome Browser: A Building Block for a Model Organism System Database. Genome Res 2002 (in press)