Evaluation and integration of functional annotation pipelines for newly sequenced organisms: The potato genome as a test case

DOCUMENT INFORMATION

Basic information

Format
Number of pages	14
File size	2.05 MB

Content

For most organisms, even if their genome sequence is available, little functional information about individual genes or proteins exists. Several annotation pipelines have been developed for functional analysis based on sequence, ‘omics’, and literature data. However, researchers encounter little guidance on how well they perform.

Amar et al. BMC Plant Biology 2014, 14:329
http://www.biomedcentral.com/1471-2229/14/329

RESEARCH ARTICLE   Open Access

Evaluation and integration of functional annotation pipelines for newly sequenced organisms: the potato genome as a test case

David Amar1, Itziar Frades2, Agnieszka Danek3, Tatyana Goldberg4, Sanjeev K Sharma5, Pete E Hedley5, Estelle Proux-Wera2,6, Erik Andreasson2, Ron Shamir1, Oren Tzfadia7* and Erik Alexandersson2

Abstract

Background: For most organisms, even if their genome sequence is available, little functional information about individual genes or proteins exists. Several annotation pipelines have been developed for functional analysis based on sequence, ‘omics’, and literature data. However, researchers encounter little guidance on how well they perform. Here, we used the recently sequenced potato genome as a case study. The potato genome was selected because it is newly sequenced and potato is a non-model plant, yet relatively ample information on individual potato genes and multiple gene expression profiles are available.

Results: We show that the automatic gene annotations of potato have low accuracy when compared to a “gold standard” based on experimentally validated potato genes. Furthermore, we evaluate six state-of-the-art annotation pipelines and show that their predictions are markedly dissimilar (Jaccard similarity coefficient of 0.27 between pipelines on average). To overcome this discrepancy, we introduce a simple GO structure-based algorithm that reconciles the predictions of the different pipelines. We show that the integrated annotation covers more genes, increases the number of highly co-expressed GO processes by over 50%, and obtains much higher agreement with the gold standard.

Conclusions: We find that different annotation pipelines produce different results, and show how to integrate them into a unified annotation that is of higher quality than each single pipeline. We offer an improved functional annotation of both
PGSC and ITAG potato gene models, as well as tools that can be applied to additional pipelines and improve annotation in other organisms. This will greatly aid future functional analysis of ‘-omics’ datasets from potato and other organisms with newly sequenced genomes. The new potato annotations are available with this paper.

Keywords: Functional annotation, Gene ontology, Gene co-expression, Potato, Genomics

* Correspondence: oren.tzfadia@weizmann.ac.il
Department of Plant Science, The Weizmann Institute of Science, Rehovot, Israel
Full list of author information is available at the end of the article

© 2014 Amar et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

Potato (Solanum tuberosum) is the third largest food crop in terms of human consumption [1]. It is therefore important for our food security, and understanding its genome is called for. Major challenges in potato research include its sensitivity to drought stress and its lack of resistance to certain diseases, e.g., the oomycete Phytophthora infestans, which caused the Irish famine in the 1840s. Farmers need to use large amounts of fungicides to protect their potato crops, thereby increasing the cost of cultivation and threatening the environment. For example, the global cost of protection and yield loss due to P. infestans has been estimated at €4800 M annually [2].

Recently, the potato genome (Solanum tuberosum group Phureja) was sequenced by the Potato Genome Sequencing Consortium (PGSC). The PGSC analysis of the genome reported gene models for 39,031 representative transcripts, and 56,218 including splicing variants [3]. In a later effort, the International Tomato Annotation Group (ITAG) produced new gene models by jointly analyzing the tomato and potato genomes [4]. These new gene models covered 34,727 and 35,004 predicted protein-coding genes for the tomato and the potato genomes, respectively. Unfortunately, few experimentally validated genes (e.g., by fluorescent-tagged proteins or gene knock-outs) are available in newly sequenced genomes, in which, unlike in established model organisms, few genes have verified functions, as is the case for potato. Comprehensive and accurate functional annotation of the genes in such recently sequenced genomes is a prerequisite to efficient exploitation of these genomic data.

A key tool for functional annotation is the Gene Ontology (GO), which provides a structured set of defined terms representing gene properties [5]. The structure of GO is composed of three major domains: cellular component (CC), the parts of a cell or its extracellular environment; molecular function (MF), the elemental activities of a gene product at the molecular level; and biological process (BP), which describes a set of functionally related molecular events. Thus, the complete GO structure provides a unified vocabulary of biological terms, which can also be used to evaluate biological similarity of different terms [6]. Annotating a gene means placing it within some or all of the three GO domains.

Recent advances in plant science are marked by the rapidly increasing availability and quality of high-throughput sequencing data. The most basic usage of these data is gene function prediction, wherein GO plays a pivotal part. There are several computational suites, like EXPANDER [7], MapMan [8], Mercator [9] and AmiGO [10], that enable biologists to run GO enrichment analyses in several plant model systems. This is
usually done by first identifying a group of genes that behave similarly in a given expression dataset, seeking ontology terms highly enriched in the group, and associating the highly enriched functions with unannotated genes that belong to the same group. This process is sometimes called “guilt by association”.

Automated gene function annotation is also relevant for well-investigated plant model organisms, such as Arabidopsis thaliana, tomato, Brachypodium and rice, wherein ~40% of the genes still do not have any known function [11]. In order to assign functional annotation to sequenced plant transcripts, researchers can use several sequence-based annotation pipelines. For a comprehensive summary of methods and principles behind automated functional annotation, see [12].

Some recent efforts have been made to characterize the annotation quality of plant genomes. For example, Jaramillo-Garzón et al. [13] used sequence features and showed high predictability of MF and CC terms and lower predictability of BP terms. However, the analysis was limited to a small subset of the GO terms (GO-Slim). Ramsak et al. [8] presented GOMapMan, a tool for visualization and analysis of gene annotation in plants. In potato, information from orthologous gene families across 26 sequenced plant genomes was analyzed in order to increase the number of potato genes associated with GO terms [14]. Still, a robust, automated approach to evaluate and compare genome-wide annotation pipelines is direly needed.

A typical genome-wide functional annotation of newly sequenced organisms starts by using a single ‘default’ pipeline. Here, we analyzed the two sets of potato gene models, from the ITAG and PGSC. We compared six annotation pipelines: Trinotate HMM, Trinotate BLAST [15], OrthoMCL-UniProt [16], BLAST2GO [17], Phytozome [18] and InterPro2GO provided in BioMart [19] (Figure 1). These pipelines were chosen because they seek to provide a comprehensive annotation of the whole genome. Some of these
pipelines are based solely on sequence similarity (BLAST), others rely on specific domains, and some are based on clustering of groups of orthologous gene families. As we shall show, one clear conclusion of this work is that functional annotations of genomes should rely on more than one annotation pipeline.

By examining the GO terms generated by these pipelines, we demonstrate that they predict very dissimilar annotations (e.g., on average, less than 30% of the genes annotated by two pipelines are assigned the same function). To evaluate the performance of the pipelines we first created a set of potato genes with known functional characterization (hereafter referred to as the “gold standard”), including genes from the well-characterized carotenoid biosynthetic pathway. We show that pipelines may have rather low accuracy compared to the gold standard. Since the size of the gold standard is rather modest (116 PGSC gene IDs), we used an additional validation scheme based on gene expression data. Under the premise that genes participating in the same biological process should have more similar expression patterns than expected by chance, we evaluated the predictions of each pipeline based on its intra-process gene co-expression level. We show that while all pipelines provide much higher intra-process co-expression than expected by chance, there are large differences among the methods.

We introduce a simple method to combine the results of the different pipelines into a single integrated annotation. Compared to the single pipelines, it improved gene coverage, prediction precision, and the overall co-expression of predicted GO processes. In addition to improved annotation of potato genes, our analysis provides generic tools that can be applied to improve the annotation of other newly sequenced plants.

Results and discussion

A compendium of the state-of-the-art annotation tools

In this study, we tested automatic annotation pipelines on the potato genome. We used six state-of-the-art
tools for GO gene function prediction: (1) Trinotate HMM, (2) Trinotate BLAST [15], (3) OrthoMCL-UniProt [16], (4) BLAST2GO [17], (5) Phytozome [18], and (6) InterPro2GO [19]. See Methods and Additional file 1: Methods S1-4 for details. We note that every program has its own set of parameters, and fitting the best parameter combination for a particular dataset is a substantial effort. The common practice in this area is to use published tools with the default parameter values (see e.g. [20,21]).

Figure 1 Overview of pipeline comparison, validation of accuracy and integration processes. (A) The PGSC and ITAG gene models were used as input for the six pipelines assessed. (B) The annotation from each pipeline was transformed into gene ID – GO term associations. (C) Annotations were compared by the number of annotated gene models, the number of GO terms associated per gene model, and GO similarity. (D) The quality and comprehensiveness of the annotation of each pipeline were calculated by comparing their predictions to experimentally validated annotation (gold standard). In addition, gene co-expression data were used to test if genes predicted to share the same GO processes are significantly co-expressed. (E) An integrated annotation using the ensemble of results of all pipelines was created and validated using the same criteria as in D. Results of the ensemble annotations were compared to those of the individual pipelines.

If necessary, we then mapped a pipeline's predicted functions to GO terms using automated mapping files such as Pfam2GO, and the genes and transcripts to protein identifiers. Thus, in our analysis a gene corresponds to either a transcript or a protein that appeared in the output of the pipelines. Next, the output of each pipeline was summarized as a set of predicted gene-GO term pairs. For each gene we then retained only the most “specific” GO terms. That is, in case a gene is
associated with two GO terms A and B, but B is a generalization of A (i.e., an ancestor of A in the GO hierarchy), we excluded B. We call this step ancestor removal. Note that after filtering, many genes were still associated with more than one GO term, since a gene can have several associated annotations none of which is an ancestor of another. For the output of all pipelines, see Additional files 2-7: Tables S1-S6 for PGSC, and Additional files 8-13: Tables S7-S12 for ITAG. Although Gene Ontology has its limitations, as it is biased towards what is already known, it is still a universal key tool for functional annotation, inferring functionality based on sequence identity, domains and structure, and literature studies.

Disparity among pipelines

The output from each pipeline can be represented as a triplet (P, G, GO), where P is the set of all predicted gene-GO term pairs (after ancestor removal), G is the set of genes covered by P, and GO is the set of GO terms covered by P. We measured the pairwise similarity between the triplets obtained from the six pipelines used in the study. Three different ways were used to compare the output of two pipelines A = (P_A, G_A, GO_A) and B = (P_B, G_B, GO_B). First, we measured the overlap between the predictions of the pipelines, P_A and P_B. This was done by calculating the ratio between the size of the intersection of P_A and P_B and the size of the union of P_A and P_B. This measure is called the Jaccard score [22,23]. Second, we measured the similarity between the covered gene sets G_A and G_B of the pipelines by calculating their Jaccard score. These two scores are complementary:
the first measures the overall similarity between A and B, whereas the second measures the tendency of A and B to cover the same genes. However, these scores ignore the GO structure and thus they are oblivious to the functional similarity among different GO terms. Therefore, we also used a similarity score based on the semantic similarity of GO terms [24]. Given a specific GO type GT (BP or MF), for each gene we measured the semantic similarity between its GO terms in A and its GO terms in B. We then took the average over all genes as the similarity of A and B in GT (see Methods for details). As this score uses the structure of the GO hierarchy, we call it structure-based.

An example of the structure-free similarity of the predictions is shown in Figure 2A. The figure shows the pairwise Jaccard score between the PGSC MF predictions of the pipelines. Overall the similarity is low, averaging 0.27. Nevertheless, local patterns can be observed. For example, InterPro2GO, Trinotate HMM, and Phytozome were more similar (average 0.46). Figure 2B shows the Jaccard similarity between the PGSC genes annotated by the different pipelines. The mean similarity was higher, at 0.54, which is still quite low. This indicates that different pipelines tend to cover different genes and, even when covering the same genes, they often associate distinct annotations with them. Even when re-computing the structure-free similarity restricted to the genes shared by each pair of pipelines (considering both MF and BP predictions), the average score was only 0.27.

The structure-based MF and BP similarity of PGSC genes is summarized in Figure 2C and 2D. Similar matrices on ITAG data are shown in Additional file 1: Figure S1. Again, pipelines tend to be very different, with average similarity of 0.29 in BP and 0.42 in MF. The scores are higher than for the structure-free approach because the structure-based approach assigns higher scores when predictions are different but biologically similar. Also, as in the structure-free scores in Figure 2A, InterPro2GO, Trinotate HMM, and Phytozome formed a cluster both in BP and in MF. Taken together, the discrepancies among pipelines show that pipelines differ in the sets of genes they cover, and that the annotation of the same genes in different pipelines can be quite dissimilar.

Ensemble of pipelines

The marked disparity in gene annotation by different pipelines calls for an integration of the different predictions in order to provide a unified potato gene annotation. We developed a simple ensemble algorithm inspired by previous studies [25]. Our algorithm takes as input the predictions of all pipelines and, for each gene, merges its predictions into a vector of scores denoted as the gene’s combined profile (Figure 3). Briefly, we first calculate the pipeline-specific gene profiles. For a specific pipeline, a gene G, and a GO term t, the t-th position of the profile is 1 if the pipeline associates G with t or with at least one of its descendants, and 0 otherwise (top right in Figure 3). The combined profile of each gene G is the sum of its pipeline-specific profiles (Figure 3, right). The value in the combined profile of a gene shows how many pipelines agree with each gene-GO term association. Given a threshold k, for each gene we report all GO terms with a combined score ≥ k. This process produces a list of GO terms for each gene. We call this variant Ensemble-k. Finally, we apply the ancestor removal filter described above. Thus, each value of k produces a different variant of the ensemble algorithm. Figure 3 shows a toy example of Ensemble-1 and 2. For clarity, in the next sections we use the name annotation method for both pipelines and variants of the ensemble algorithm. We also tested a more involved supervised ensemble method, which in addition ranks the pipelines by their average F-measure against a gold standard (see below), but this did not improve the results (see Additional file 1: Method S6).

We compared the
annotation methods in terms of gene coverage and the average number of GO terms per gene, which we denote as NGPG. Ideally, gene coverage should be as high as possible, while NGPG should be low [26]. The results are shown in Figure 4A and 4B. One can observe marked differences between the different pipelines, and between ITAG and PGSC gene models. For example, based on PGSC data, InterPro2GO and OrthoMCL-UniProt have the highest gene coverage (29,445 and 26,371, respectively) and NGPG score (7 and 7.1, respectively). However, based on ITAG data, OrthoMCL-UniProt’s results were similar to those for PGSC, while for InterPro2GO the number of genes dropped under 20,000 and the NGPG score increased to 8.1 (Figure 4B).

Figure 2 Comparison of annotations of the PGSC genes by different pipelines. Each similarity matrix shows all pairwise similarities between the pipelines. (A) Structure-free Jaccard similarity of the MF predictions of the pipelines. (B) Jaccard similarity of the gene sets covered by each pipeline. (C) Structure-based similarity between the GO MF predictions of the pipelines. Unlike (A), the calculation here used the GO hierarchy to quantify the similarity of the predictions (see Methods). (D) Structure-based similarity between the GO BP predictions of the pipelines.

Figure 4A and 4B also show the gene coverage and the NGPG of the ensemble algorithm. As expected, using either Ensemble-1 or 2 increased the gene coverage compared to the single pipelines using both ITAG and PGSC gene models. For example, based on PGSC the number of covered gene models (including splicing variants) was 41,668 (k = 1) and 29,495 (k = 2). Larger k values led to a sharp decrease in gene coverage, such that even single pipelines covered more genes. Using Ensemble-1, the NGPG score was similar to the highest score obtained by a single pipeline, reaching a score of 6.70 on PGSC data, and 8.15 on ITAG data.
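The Ensemble-k procedure and the two summary statistics (gene coverage and NGPG) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy GO graph, the gene and term names, and the function names (`ancestors`, `remove_ancestors`, `ensemble_k`, `coverage_and_ngpg`) are hypothetical, and the sketch assumes each pipeline's output is available as a mapping from gene to a set of GO terms.

```python
from collections import Counter

# Hypothetical toy GO graph (not real GO identifiers): term -> direct parents.
TOY_PARENTS = {
    "a": set(),
    "b": set(),
    "a1": {"a"},
    "a2": {"a"},
}

# Hypothetical output of three pipelines: gene -> set of predicted GO terms.
PIPELINES = [
    {"g1": {"a1"}, "g2": {"b"}},
    {"g1": {"a2"}, "g3": {"a"}},
    {"g1": {"a1"}},
]


def ancestors(term, parents):
    """Return all proper ancestors of a term in the GO graph."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(parents.get(parent, ()))
    return seen


def remove_ancestors(terms, parents):
    """Ancestor removal: drop every term that is an ancestor of another term."""
    terms = set(terms)
    redundant = set()
    for term in terms:
        redundant |= ancestors(term, parents)
    return terms - redundant


def ensemble_k(pipelines, parents, k):
    """Report, per gene, the most specific terms supported by at least k pipelines."""
    combined = {}  # gene -> Counter(term -> number of supporting pipelines)
    for annotation in pipelines:
        for gene, terms in annotation.items():
            # Pipeline-specific profile: predicted terms plus all their ancestors,
            # so a term scores 1 if the gene has it or one of its descendants.
            profile = set(terms)
            for term in terms:
                profile |= ancestors(term, parents)
            combined.setdefault(gene, Counter()).update(profile)
    result = {}
    for gene, counts in combined.items():
        kept = {term for term, count in counts.items() if count >= k}
        kept = remove_ancestors(kept, parents)
        if kept:
            result[gene] = kept
    return result


def coverage_and_ngpg(annotation):
    """Gene coverage and mean number of GO terms per gene (NGPG)."""
    n_genes = len(annotation)
    if n_genes == 0:
        return 0, 0.0
    return n_genes, sum(len(t) for t in annotation.values()) / n_genes


for k in (1, 2, 3):
    print(k, coverage_and_ngpg(ensemble_k(PIPELINES, TOY_PARENTS, k)))
```

On this toy input, k = 1 keeps both a1 and a2 for g1, k = 2 keeps only a1, and k = 3 falls back to the shared parent a. Genes supported by a single pipeline (g2, g3) drop out for k ≥ 2, which mirrors the trade-off between coverage and NGPG discussed in the text.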
Ensemble-2 led to a sharp decrease in NGPG: 4.39 on PGSC, and 4.68 on ITAG.

In summary, our results show that the ensemble algorithm increases the gene coverage considerably without increasing the NGPG score. Ensemble-1 increased gene coverage by more than 5000 genes on both ITAG and PGSC data, while keeping the NGPG score similar to that of the highest single pipelines. Ensemble-2 increased the gene coverage only moderately compared to the single pipelines, but the NGPG score declined sharply compared to all pipelines (except Phytozome, but the latter has low gene coverage), hence providing much more focused annotations. In the next sections we demonstrate that the aforementioned improvements were not achieved at the expense of precision.

Validation using the potato gold standard

To evaluate predictions of the different annotation methods we compiled a gold standard of 838 and 724 gene-GO term pairs based on PGSC and ITAG data, respectively, using manual annotation by experts (see Methods and Additional file 14: Table S13, Additional file 15: Table S14 and Additional file 16: Table S15). The number of genes included in the gold standard (43 with literature references, which are mapped to 116 PGSC gene IDs, see Additional file 14: Table S13) is small, but in an organism such as potato it still contains the majority of genes with experimental evidence. We evaluated the annotation methods by calculating their GO-based precision and recall. Use of the GO structure to calculate scores for gold standard validation has been previously suggested by [27]. The GO-based recall of a gene measures the extent to which its terms according to the gold standard are covered by its predicted GO terms. The GO-based precision of a gene measures the extent to which its predicted GO terms match the gold standard terms. For each pipeline we calculated the average precision and average recall (over the genes) and report the F-measure, which is the harmonic mean of the precision and the recall [28]. See Methods for a full description of these calculations.

Figure 3 A simple example of the ensemble algorithm. The input (top left) is a set of GO terms, the GO graph, and association between genes and GO terms. The example shows the ensemble process of a single gene G. First, the pipeline-specific gene profiles are calculated (top right). A GO term is assigned a value ‘1’ in the profile if G is associated with it or with at least one of its descendants, and ‘0’ otherwise. Second, the combined profile of G is the sum of its pipeline-specific profiles. The scores in the combined profile show how many pipelines agree with each of G’s GO term associations. Given a threshold k, the GO terms with a combined score lower than k are removed to provide a final list of GO terms associated with G (bottom). Each different value of k constitutes a different variant of the algorithm.

The results of the validation based on PGSC and ITAG data are illustrated in Figure 5 and Additional file 1: Figure S2, respectively. Figure 5A shows the F-measure for BP GO terms. Ensemble-1 and 2 reached F-measures of 0.8 and 0.77, respectively, while the top performing pipeline was InterPro2GO with only 0.61. Figure 5B shows the F-measure on the MF gold standard. Ensemble-1 and 2 reached F-measures of 0.84 and 0.83, respectively, whereas the top performing pipeline was InterPro2GO with an F-measure of only 0.71. Thus, the results are in agreement with the BP validation: Ensemble-1 and 2 performed best and improved upon the single pipelines. Taken together, our results indicate that Ensemble-1 and 2 provide a significant improvement in comparison to single pipelines.

Figure 4 Gene coverage and mean number of GO terms per gene (NGPG). For each annotation method (i.e., a pipeline and a variant of the ensemble algorithm) the gene coverage (A) and NGPG (B) are shown both for PGSC and ITAG gene models.

Figure 5 Validation of annotations based on gold standard. For each annotation method (i.e., a pipeline and a variant of the ensemble algorithm) the F-measure of the gold standard validation is shown on PGSC gene models; see Methods for a full description of the scores. A score of 1 means perfect agreement between an annotation method and the gold standard. A score close to zero means poor concordance with the gold standard. (A) F-measure of the BP annotations. (B) F-measure of the MF annotations. The results show that both in BP and MF the ensemble algorithm improves the results considerably when k is 1 or 2.

Validation using gene expression data

An obvious disadvantage of any gold standard is that it is limited to experimentally validated genes and subject to the opinion of experts. Consequently, we added an additional validation based on gene co-expression analysis, where we measured the ability of pipelines to predict the same GO term for highly co-expressed genes. Our co-expression analysis is based on the gene expression of 12,956 genes in 326 expression profiles from over 20 microarray studies. We used the Pearson correlation coefficient to measure co-expression between genes. We used the gene pairwise co-expression scores to validate predicted GO BP terms. In order to reduce noise, we ignored terms with >500 genes, or with fewer than five genes. Given a set of genes predicted to be associated with the same GO term according to a specific annotation method, we tested if the level of co-expression among its genes is higher than expected by chance (see Methods for details). Thus, for each term in a specific annotation method we calculated a single p-value. To summarize these values when comparing methods we calculated two scores: (1) the number of GO terms with p
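The co-expression check described above could be sketched as follows. The exact statistical test is given in the paper's Methods, which are not part of this excerpt, so the sketch substitutes one plausible stand-in and should be read as an assumption: it compares the mean pairwise Pearson correlation of a term's genes against random gene sets of the same size and reports an empirical p-value. The expression values, gene names, and function names are hypothetical.

```python
import math
import random
from itertools import combinations

# Hypothetical expression matrix: gene -> expression values across conditions.
TOY_EXPR = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.1, 6.0, 8.2],
    "g3": [0.5, 1.1, 1.4, 2.1],
    "g4": [3.0, 3.9, 5.2, 6.0],
    "g5": [1.0, 1.5, 2.2, 2.4],
    "g6": [4.0, 1.0, 3.0, 2.0],
    "g7": [2.0, 3.0, 1.0, 4.0],
    "g8": [3.0, 1.0, 4.0, 1.5],
    "g9": [1.0, 4.0, 2.0, 3.0],
    "g10": [2.5, 2.0, 3.5, 1.0],
}
TOY_TERM_GENES = ["g1", "g2", "g3", "g4", "g5"]


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mean_pairwise_corr(genes, expr):
    """Average Pearson correlation over all gene pairs in the set."""
    pairs = list(combinations(genes, 2))
    return sum(pearson(expr[g], expr[h]) for g, h in pairs) / len(pairs)


def term_is_usable(term_genes, min_size=5, max_size=500):
    """Size filter from the text: skip terms with <5 or >500 genes."""
    return min_size <= len(term_genes) <= max_size


def coexpression_pvalue(term_genes, expr, n_perm=1000, seed=0):
    """Empirical p-value (assumed test, not necessarily the paper's):
    fraction of random same-size gene sets whose mean pairwise
    correlation is at least as high as that of the term's genes."""
    rng = random.Random(seed)
    genes = [g for g in term_genes if g in expr]
    observed = mean_pairwise_corr(genes, expr)
    universe = sorted(expr)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(universe, len(genes))
        if mean_pairwise_corr(sample, expr) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


if term_is_usable(TOY_TERM_GENES):
    print(coexpression_pvalue(TOY_TERM_GENES, TOY_EXPR, n_perm=200))
```

A permutation test is used here only because it needs no distributional assumptions and is easy to read; with a background as small as this toy matrix the resulting p-value is coarse, and real data would call for the procedure specified in the paper's Methods.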

Date posted: 27/05/2020, 00:34