Scientific report: Reannotation of hypothetical ORFs in plant pathogen Erwinia carotovora subsp. atroseptica SCRI1043

Reannotation of hypothetical ORFs in plant pathogen Erwinia carotovora subsp. atroseptica SCRI1043

Ling-Ling Chen, Bin-Guang Ma and Na Gao
Shandong Provincial Research Center for Bioinformatic Engineering and Technique, Shandong University of Technology, Zibo, China

Keywords
clusters of orthologous groups of proteins; function assignment; hypothetical ORFs; plant pathogen; principal component analysis

Correspondence
L L. Chen, Shandong Provincial Research Center for Bioinformatic Engineering and Technique, Center for Advanced Study, Shandong University of Technology, Zibo 255049, China
Fax: +86 533 278 0271
Tel: +86 533 278 0271
E-mail: llchen@sdut.edu.cn

(Received 18 August 2007, revised 3 October 2007, accepted 12 November 2007)
doi:10.1111/j.1742-4658.2007.06190.x

Over-annotation of hypothetical ORFs is a common phenomenon in bacterial genomes, which makes it necessary to confirm the coding reliability of hypothetical ORFs and then to predict their functions. The important plant pathogen Erwinia carotovora subsp. atroseptica SCRI1043 (Eca1043) is a typical case because more than a quarter of its annotated ORFs are hypothetical. Our analysis focuses on annotation of Eca1043 hypothetical ORFs, and comprises two efforts: (a) based on the Z-curve method, 49 originally annotated hypothetical ORFs are recognized as noncoding, a result further supported by principal components analysis and other evidence; and (b) using sequence-alignment tools and some functional resources, more than half of the hypothetical genes were assigned functions. The potential functions of 427 hypothetical genes are summarized according to the cluster of orthologous groups functional category. Moreover, 114 and 86 hypothetical genes are recognized as putative 'membrane proteins' and 'exported proteins', respectively. Reannotation of Eca1043 hypothetical ORFs will benefit research into the lifestyle, metabolism and pathogenicity of this important plant pathogen. Our study also offers a model for the reannotation of hypothetical ORFs in microbial genomes.

Abbreviations
COG, cluster of orthologous groups; NCBI, National Center for Biotechnology Information; PCA, principal components analysis.

FEBS Journal 275 (2008) 198–206 © 2007 The Authors Journal compilation © 2007 FEBS

Currently, more than 500 completely sequenced microbial genomes are available from public databases, which provide an unprecedented opportunity to study the genetics, biochemistry and evolutionary features of these species. Such analyses depend strongly on the gene annotation of each species. However, many genomes contain a large number of hypothetical ORFs for which no functional information exists. Thanks to the progress of genome-sequencing projects, a large number of hypothetical ORFs can now be assigned functions. Furthermore, some annotated hypothetical ORFs do not actually encode proteins, so the number of annotated ORFs is usually greater than the number of actual protein-coding genes for most microbial genomes [1,2].

Erwinia carotovora subsp. atroseptica SCRI1043 (Eca1043) belongs to the Enterobacteriaceae, a family noted for its human pathogens [3]. Eca1043 is a commercially important plant pathogen that is restricted to potato in temperate regions; it can cause blackleg in the field and soft rot in tubers after harvest [3]. Although soft rot pathogenesis relies primarily on the prolific production of extracellular plant-cell-wall-degrading enzymes that cause extensive tissue maceration, recent discoveries suggest that the process may be far more complex than previously thought [4,5]. The Eca1043 genome was sequenced in 2004, and its annotated ORFs can be divided into two groups: (a) genes with known functions, and (b) hypothetical ORFs. Whether these hypothetical ORFs are protein-coding genes is uncertain and their functions are unknown. Because more than a quarter of ORFs in Eca1043 are hypothetical, it is necessary to reannotate them.

Using sequence-alignment tools (e.g. blast and fasta) [6–8] and other functional resources (e.g. interpro and kegg) [9,10], we predicted the functions of 427 hypothetical ORFs. The predicted functions of 109 hypothetical ORFs are highly reliable, with sequence coverage > 80%, identity ≥ 80% and E value < 1e-20 to their homologous proteins. Moreover, 114 and 86 hypothetical ORFs are recognized as putative 'membrane proteins' and 'exported proteins', respectively. In addition, 49 hypothetical ORFs are identified as noncoding ORFs using a methodology based on Z-curve theory [11]. Using principal components analysis (PCA), it can be observed directly that most of the identified noncoding ORFs lie away from the core of function-known genes, and close to random sequences. Other evidence also suggests that the 49 recognized noncoding ORFs are unlikely to code for proteins. Consequently, the number of hypothetical genes in Eca1043 decreases from 1254 to 578. These results are highly significant for research into the adaptation, lifestyle and pathogenicity of this important plant pathogen.

Results and Discussion

Identification of 49 noncoding ORFs

In the first stage of annotation, the 1254 hypothetical ORFs were re-identified using the Z-curve method [11]. First, the 2246 genes of known function were randomly divided into two almost equal parts. The former served as a training set to calculate Fisher coefficients, and the latter served as a test set to assess the accuracy of the algorithm. Both the training set and the test set should include positive and negative samples. In the Eca1043 genome, 80.6% of the whole-DNA sequences are coding and the remaining intergenic regions are dominated by structural RNA sequences, so it is difficult to prepare an appropriate set of negative samples. Thus, the following procedures were taken to produce negative samples.
Each of the known genes was randomly shuffled 10 000 times, so that it was transformed into a random sequence. Shuffled sequences then served as negative samples. The detailed process of discrimination analysis has been described previously [12]. The sensitivity $s_n$ and specificity $s_p$ were used to evaluate the algorithm, and were defined as $s_n = TP/(TP + FN)$ and $s_p = TN/(TN + FP)$, where TP, TN, FP and FN are the fractions of true-positive, true-negative, false-positive and false-negative predictions, respectively. The accuracy was defined as the average of $s_n$ and $s_p$. After performing 10-fold cross-validation tests, mean sensitivity, specificity and SD were obtained (Table 1). The prediction accuracy was as high as 99.58%. All positive samples in the first group and the corresponding negative samples were merged, forming a new and larger training set. The final Fisher coefficients and thresholds were based on the larger training set. Using the final Fisher coefficients and the criterion for deciding coding/noncoding, the hypothetical ORFs in Eca1043 were re-identified. A total of 49 of the 1254 hypothetical ORFs were recognized as noncoding (Table 2).

Table 1. The genome feature, sensitivity, specificity and accuracy over 10-fold cross-validation tests for Eca1043.
Length (bp)	GC content (%)	Sensitivity a (%)	Specificity a (%)	Accuracy b (%)
5 064 019	50.97	99.64 ± 0.002	99.53 ± 0.001	99.58
a Mean ± SD. b Accuracy is defined as the average of the sensitivity and specificity.

Table 2. The synonyms of the 49 recognized noncoding ORFs.
ECA0394	ECA0547	ECA0579	ECA0586	ECA0590	ECA0637	ECA0670
ECA0675	ECA0726	ECA1062	ECA1066	ECA1183	ECA1522	ECA1584
ECA1610	ECA1636	ECA1771	ECA2121	ECA2124	ECA2129	ECA2234
ECA2470	ECA2505	ECA2513	ECA2658	ECA2706	ECA2859	ECA2862
ECA2864	ECA2874	ECA2890	ECA2896	ECA3326	ECA3385	ECA3397
ECA3404	ECA3405	ECA3412	ECA3414	ECA3521	ECA3674	ECA3676
ECA3677	ECA3982	ECA4287	ECA4295	ECA4306	ECA4442	ECA4484

Why are the recognized noncoding ORFs unlikely to encode proteins?

The need to fold a peptide chain into a stable and functional protein imposes rigorous constraints on coding sequences. Many constraints have been observed and the generally accepted base usage pattern is the $R\bar{G}N$ prototype, where $R$, $\bar{G}$ and $N$ denote purine, nonguanine and any base at the first, second and third codon positions, respectively [13–17]. It is suggested that the first, second and third codon positions are associated with the biosynthetic pathway, hydrophobicity pattern and the α-helix- or β-strand-forming potentiality of the coded amino acid, respectively [13–17]. By contrast, the negative samples are shuffled sequences of function-known genes, so the frequencies of the bases at the three 'codon' positions are almost identical (note that the term 'codon' in a negative sample is meaningless). The base distribution pattern of negative sample sequences is the NNN type. The difference between the two codon types, $R\bar{G}N$ and NNN, forms the basis of our method for distinguishing between protein-coding and noncoding ORFs.

The difference between coding and noncoding sequences can be visualized using PCA. PCA captures the correlation among the variables of given data. The first derived direction is chosen to maximize the SD of the derived variable, the second to maximize the SD among directions uncorrelated with the first, and so forth [12]. Figure 1 shows the distribution of points on the principal plane spanned by the first two principal components. The coding and noncoding sequences are represented by open circles and triangles, respectively. It can be seen that the two principal axes separate the coding and noncoding sequences into two almost nonoverlapping clusters. The separation of the two regions reflects the markedly different base usage at the three codon positions of coding and noncoding sequences. The recognized noncoding ORFs are represented by filled stars, distributed far from the core of function-known genes, and close to random sequences. This implies that the 49 ORFs listed in Table 2 are unlikely to encode proteins.

In the latest version of the RefSeq annotation, clusters of orthologous groups (COGs) of proteins were added to the annotation file. Each COG is a group of three or more proteins that are inferred to be orthologs, i.e. they have evolved from a common ancestor [18,19]. Computational analysis of complete microbial genomes shows that prokaryotic proteins are generally highly conserved, with about 70% of them containing ancient conserved regions shared by homologs from distantly related species [18,19]. Therefore, an annotated ORF within a COG is highly likely to be a protein-coding gene with homologs from other species. Of the 2246 genes of known function, 84.3% are included in at least one COG; the ratio decreases to 75.3% in 'putative' and 'probable' ORFs, and decreases further to 40.6% in 'hypothetical' ORFs. Of the 49 recognized noncoding ORFs listed in Table 2, only 4 (8.2%) contain COG codes. In addition, previous statistics have shown that over-annotation of short ORFs is one of the major problems in prokaryotic genome annotation [1]. We therefore compared the average length of the 2246 function-known genes in the first group with that of the 49 recognized noncoding ORFs. The average length of the recognized noncoding ORFs (330 bp) is much shorter than that of the function-known genes (1112 bp; Table 3). All the above evidence strongly suggests that the 49 ORFs are over-annotated short ORFs.
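The negative-sample construction and the evaluation measures described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `shuffled_negative` and `evaluate` are hypothetical, and a single uniform shuffle is used here as a stand-in for the "10 000 shuffles" in the text, on the assumption that both yield a random permutation with the same base composition.

```python
import random


def shuffled_negative(seq: str, rng: random.Random) -> str:
    """Return a random permutation of seq.

    The base composition is preserved, but any codon-position
    structure of the original gene is destroyed.
    """
    bases = list(seq)
    rng.shuffle(bases)  # assumption: one uniform shuffle replaces repeated shuffling
    return "".join(bases)


def evaluate(tp: float, tn: float, fp: float, fn: float) -> tuple[float, float, float]:
    """Return (sensitivity, specificity, accuracy) as defined in the text."""
    sn = tp / (tp + fn)        # s_n = TP / (TP + FN)
    sp = tn / (tn + fp)        # s_p = TN / (TN + FP)
    accuracy = (sn + sp) / 2   # accuracy is the average of s_n and s_p
    return sn, sp, accuracy


if __name__ == "__main__":
    gene = "ATGGCTAAAGGTCTGTAA"  # toy sequence, not an Eca1043 gene
    negative = shuffled_negative(gene, random.Random(0))
    print(negative)
    print(evaluate(tp=90, tn=80, fp=20, fn=10))
```

Passing an explicit `random.Random` instance keeps the shuffling reproducible, which matters when the same negative set has to be reused across cross-validation folds.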
Of course, our conclusion is only theoretical and needs to be verified by experiments.

Fig. 1. The distribution of points on the principal plane spanned by the first (x) and second (y) principal axes using PCA in Eca1043. Open circles represent the function-known genes, open triangles represent the corresponding negative samples and filled stars denote ORFs recognized as noncoding. The first and second principal axes account for 26.2% and 22.3% of the total inertia of the 21-dimensional space, respectively. Note that the distribution of the open circles is separate from that of the open triangles, indicating that coding and noncoding sequences are well distinguished. Furthermore, most of the identified noncoding ORFs are far from the core of open circles, and close to the core of open triangles, implying that the 49 recognized noncoding ORFs listed in Table 2 are unlikely to encode proteins.

Table 3. Average length and percentage of ORFs with COG code for 2246 function-known protein-coding genes and 49 recognized noncoding ORFs in Eca1043.
Feature	Genes with known functions	Recognized noncoding ORFs
With COG code	84.3%	8.2%
Average length (bp)	1112	330

Function annotation of the hypothetical genes

After identifying the 49 noncoding ORFs, the next step was to assign functions to the remaining hypothetical genes. In the original annotation of the Eca1043 genome, the authors queried all the ORFs against the complete set of ORFs from 64 selected fully annotated bacterial genomes obtained from the National Center for Biotechnology Information (NCBI) to determine their functions [3]; nevertheless, more than a quarter of the annotated genes still had no functional information.
Three years have passed and more than 500 complete bacterial and archaeal genomes are now annotated at NCBI, so a large number of new functional genes can be obtained from public databases. Furthermore, many studies providing knowledge of Eca1043 genes have been published in the last 3 years. All this information provides valuable resources for assigning functions to a large number of hypothetical genes. After collecting all this information and systematically searching nonredundant nucleotide and protein databases, functions have been assigned to 109 hypothetical genes with high reliability; the synonyms, protein lengths, E values, identities and predicted functions (products) are listed in Table 4. The aligned length covered at least 80% of each gene, with identity ≥ 80% and E value < 1e-20. Furthermore, the functions of another 318 hypothetical genes have been assigned with query coverage > 80%, identity > 30% and E value < 1e-10 (see supplementary Table S1).

The predicted functions of the above 427 hypothetical genes were summarized according to COG functional categories. The latest version of COG is classified into 25 functional categories, each symbolized by a capital letter: J, A, K, L, B, D, Y, V, T, M, N, Z, W, O, U, C, G, E, F, H, I, P, Q, R and S. Details of the functions of the codes are listed in Table 5. The 25 functional categories are summarized into four functional groups. According to the COG functional category, 48, 79, 97 and 167 newly annotated hypothetical genes belong to the 'information storage and processing', 'cellular processes and signaling', 'metabolism' and 'poorly characterized' groups, respectively. Detailed information about annotated hypothetical genes in each COG functional category is summarized in Table 5. Of the 427 newly annotated genes, 50 can be classified into two or more functional categories and 36 cannot be assigned to any category.
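The two reliability tiers used for function assignment above (109 high-reliability genes, 318 further genes) can be expressed as a small filter. The following Python sketch is illustrative only: the `Hit` record, its field names and the tier labels are hypothetical, and only the numeric thresholds are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Best alignment hit for one hypothetical ORF (hypothetical record layout)."""
    coverage: float  # fraction of the query gene covered by the alignment, 0..1
    identity: float  # percent identity, 0..100
    evalue: float    # alignment E value


def reliability_tier(hit: Hit) -> str:
    """Classify a hit using the thresholds quoted in the text."""
    # Tier 1: coverage at least 80%, identity >= 80%, E value < 1e-20
    if hit.coverage >= 0.80 and hit.identity >= 80 and hit.evalue < 1e-20:
        return "high"
    # Tier 2: coverage > 80%, identity > 30%, E value < 1e-10
    if hit.coverage > 0.80 and hit.identity > 30 and hit.evalue < 1e-10:
        return "moderate"
    return "unassigned"


if __name__ == "__main__":
    # Toy values, not taken from the alignment output of the study
    for hit in (Hit(0.95, 98, 2e-39), Hit(0.90, 45, 1e-15), Hit(0.50, 90, 1e-30)):
        print(hit, reliability_tier(hit))
```

Checking the stricter tier first means a gene is reported once, at the highest tier it satisfies; manual assignments from the literature, which the study also used, are outside the scope of this sketch.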
As pointed out by Bell et al., Eca1043 has the ability to use a range of different nutrients to adapt to diverse environments [3]. In the original annotation study, 80 putative ABC transporters, 36 putative methyl-accepting chemotaxis protein genes and 336 putative regulators were annotated, which supports the view that Eca1043 is able to respond to a wide range of nutrient sources and live in different environments [3]. In our analysis, more genes associated with the variety of lifestyles and habitats of Eca1043 were identified, including 23 transporters, 17 regulators, 15 transferases and 1 methyl-accepting chemotaxis protein. Furthermore, in addition to the 427 newly annotated hypothetical genes, 114 hypothetical genes were recognized as putative 'membrane proteins' and 86 as 'exported proteins', which are detailed in supplementary Tables S2 and S3, respectively. It is quite possible that some of the putative 'exported proteins' are related to the pathogenicity of Eca1043.

In conclusion, 1254 hypothetical ORFs in the important plant pathogen Eca1043 are reannotated in this analysis. First, 49 originally annotated hypothetical ORFs are recognized as noncoding ORFs using a methodology based on the Z-curve method. The recognized noncoding ORFs are very unlikely to encode proteins, as supported by PCA evidence, average length distribution and COG functional category occupation. Second, using sequence alignment tools and some functional resources, potential functions for 427 hypothetical genes have been predicted. Moreover, 114 and 86 hypothetical genes are recognized as putative 'membrane proteins' and 'exported proteins', respectively. Therefore, the number of hypothetical genes decreases to 578. These results provide more information than the earlier annotation, and will benefit research into the lifestyle, metabolism and pathogenicity of this important plant pathogen.
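The final count of 578 can be checked by simple subtraction, under the assumption (consistent with the wording above) that the 49 noncoding ORFs, the 427 function-assigned genes, the 114 putative membrane proteins and the 86 putative exported proteins are disjoint sets:

```latex
\[
1254 - 49 - 427 - 114 - 86 = 578
\]
```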
Experimental procedures

The length of the Eca1043 genome is approximately 5.06 Mb and the original annotation was submitted to GenBank (accession number BX950851) in July 2004 [3]. Subsequently, a curated annotation was made available by RefSeq at NCBI (NC_004547). The number of annotated ORFs in the two databases is the same. The sequence and annotation files analyzed in this study were downloaded from NCBI RefSeq (updated 9 February 2007) and the number of annotated ORFs was 4472. Among them, two ORFs (ECA0773 and ECA2198) have lengths that are not divisible by three, which indicates that they are not protein-coding genes, and they are therefore excluded from this analysis. The remaining 4470 ORFs can be classified into two groups: the first contains 2246 genes with confirmed functions and 970 genes with 'putative' or 'probable' functions, of which the 2246 function-confirmed genes are used for training; the second group contains 1254 hypothetical ORFs, whose coding status is identified and whose functions are predicted in this analysis.

Table 4. Synonyms, COG functional categories and predicted functions (products) of 109 Eca1043 hypothetical genes with BLAST search identity ≥ 80%, E value < 1e-20 and aligned length covering at least 80% of each gene.
Synonym	E value	Length (aa) a	Functional category b	Identity (%)	Product
ECA0018	2e-39	89	S	98	YihD
ECA0019	0.0	311	R	95	YihE
ECA0054	1e-107	248	QR	82	SAM-dependent methyltransferases
ECA0061	6e-128	280	R	80	Protein involved in catabolism of external DNA
ECA0063	2e-38	95	JD	88	Addiction module toxin, RelE/StbE family
ECA0064	2e-31	83	D	87	Antitoxin of toxin-antitoxin stability system
ECA0130	7e-118	287	S	87	YicC N-terminal domain protein
ECA0264	1e-166	333	R	83	Twin-arginine translocation pathway signal
ECA0285	3e-153	285	R	96	Predicted P-loop-containing kinase
ECA0293	2e-151	357	MR	81	Predicted sugar phosphate isomerase involved in capsule formation
ECA0296	5e-113	260	Q	89	ABC-type transport system involved in resistance to organic solvents, permease component
ECA0298	9e-92	209	Q	82	ABC-type transport system involved in resistance to organic solvents, auxiliary component
ECA0313	2e-147	309	R	82	Putative Fe-S oxidoreductase
ECA0327	8e-127	261	S	84	Extradiol ring-cleavage dioxygenase, class III enzyme, subunit B
ECA0338	5e-133	287	G	80	Fructose-bisphosphate aldolase, class II family
ECA0383	1e-70	155	G	85	Beta-galactosidase, beta subunit
ECA0420	0.0	373	S	84	Cupin 4 family protein
ECA0444	3e-27	83	D	87	YefM protein
ECA0512	2e-27	80	K	90	Putative regulatory protein
ECA0631	4e-17	73	S	88	CsbD-like family
ECA0636	7e-45	131	S	82	DoxD-like family protein
ECA0696	2e-35	97	J	88	Putative RNA-binding protein
ECA0710	7e-74	140	S	99	YhbC-like protein
ECA0721	2e-143	292	O	81	Collagenase and related proteases
ECA0757	7e-79	215	R	83	Putative oxidoreductase
ECA0837	2e-66	148	I	86	Oligoketide cyclase/lipid transport protein
ECA0882	3e-137	299	P	80	Dyp-type peroxidase family
ECA0971	8e-142	277	R	90	Predicted TIM-barrel enzyme, possibly a dioxygenase
ECA0975	9e-39	90	CO	89	Fe(II) trafficking protein YggX
ECA0983	2e-92	238	P	82	Membrane protein TerC, possibly involved in tellurium resistance
ECA1010	9e-136	272	H	89	HesA/MoeB/ThiF family protein
ECA1024	2e-30	116	S	88	tRNA pseudouridine synthase C
ECA1071	7e-25	75	K	80	DNA-directed RNA polymerase, subunit M/Transcription elongation factor TFIIS
ECA1125	2e-67	149	K	91	Ribonucleotide reductase regulator NrdR-like
ECA1155	4e-112	231	L	85	ExsB protein
ECA1191	9e-64	159	S	80	YbaK/ebsC protein
ECA1196	5e-138	304	O	88	Membrane protease subunits, stomatin/prohibitin homologs
ECA1317	1e-61	159	R	81	Predicted metal-dependent hydrolase
ECA1319	0.0	474	J	88	tRNA-methylthiotransferase MiaB protein
ECA1333	2e-22	93	–	84	LexA regulated, putative SOS response
ECA1405	6e-167	358	G	84	Putative sugar ABC transporter
ECA1410	1e-172	369	D	81	Mrp protein
ECA1578	1e-66	179	S	86	Nucleoprotein/polynucleotide-associated enzyme
ECA1585	0.0	512	R	81	Deoxyribodipyrimidine photolyase-like protein
ECA1645	4e-116	254	–	88	Putative plasmid replication protein
ECA1663	1e-30	94	K	86	Predicted transcriptional regulators
ECA1684	2e-85	321	GER	90	Permeases of the drug/metabolite transporter (DMT) superfamily
ECA1762	2e-50	109	P	92	Sulfite reductase
ECA1763	5e-73	219	R	83	Putative transport protein
ECA1781	0.0	382	R	83	Rhodanese-like domain protein
ECA1782	4e-82	192	S	81	Protein yceI
ECA1809	2e-50	116	FGR	85	Histidine triad (HIT) protein
ECA1814	3e-85	180	R	82	Predicted esterase
ECA1816	3e-57	189	S	87	Predicted outer membrane lipoprotein
ECA1860	4e-188	499	O	85	FeS assembly protein SufB
ECA1927	3e-51	116	O	87	Glutaredoxin-like protein
ECA1956	1e-177	389	G	80	Predicted N-acetylglucosaminyl transferase
ECA1958	4e-34	109	J	87	Translation initiation factor SUI1
ECA1986	0.0	465	R	82	Predicted ATPase
ECA1995	3e-165	311	D	89	YdaO
ECA2292	1e-96	206	J	84	Putative translation factor (SUA5)
ECA2348	0.0	644	T	94	Putative Ser protein kinase
ECA2359	0.0	513	S	85	Putative sporulation protein
ECA2367	2e-30	95	S	82	Protein ycgL
ECA2464	0.0	484	J	10	Ribosomal RNA small subunit methyltransferase F
ECA2511	3e-116	259	QR	80	Predicted methyltransferase
ECA2512	7e-158	327	QR	81	SAM-dependent methyltransferases
ECA2525	9e-163	401	GEPR	82	Permeases of the major facilitator superfamily
ECA2529	2e-120	245	ER	84	Histidinol phosphatase and related hydrolases of the PHP family
ECA2560	6e-33	105	S	85	Putative alpha helix protein
ECA2683	0.0	562	R	89	TrkA, potassium channel-family protein
ECA2708	0.0	442	S	88	Putative FeS oxidoreductase
ECA2777	9e-32	109	–	80	Putative phage-related exported protein
ECA2812	6e-96	235	R	87	Integral membrane protein, interacts with FtsH
ECA2977	3e-96	177	G	96	ABC-type sugar transport system, periplasmic component
ECA3034	4e-86	199	R	86	Predicted hydrolases of HD superfamily
ECA3037	1e-77	164	S	82	YfbU family protein
ECA3057	2e-95	220	S	85	DedA protein (dsg-1 protein)
ECA3059	6e-151	336	E	80	Putative aspartate-semialdehyde dehydrogenase
ECA3070	3e-155	310	J	82	Adenine-specific methylase
ECA3087	2e-123	263	–	81	Necrosis-inducing protein
ECA3115	5e-205	413	E	93	Aspartate/tyrosine/aromatic aminotransferase
ECA3135	2e-59	133	S	80	Cupin 2, conserved barrel
ECA3223	0.0	398	R	85	Radical SAM enzyme, Cfr family
ECA3262	2e-151	308	R	99	N-acetylmuramic acid 6-phosphate etherase
ECA3288	5e-56	127	R	85	Autonomous glycyl radical cofactor
ECA3306	2e-52	115	S	93	Iron–sulfur cluster assembly accessory protein
ECA3361	1e-108	264	R	82	Cytochrome c assembly protein
ECA3382	6e-40	95	C	92	Rhs protein
ECA3428	1e-88	172	S	88	Putative hemolysin co-regulated protein
ECA3472	3e-137	255	R	86	Predicted glutamine amidotransferase
ECA3475	2e-55	116	V	90	HNH endonuclease
ECA3487	1e-97	205	G	83	Probable dehydratase
ECA3523	7e-87	187	E	82	D,D-heptose 1,7-bisphosphate phosphatase
ECA3623	5e-65	141	K	86	Putative negative regulator
ECA3774	8e-31	95	G	88	Phosphotransferase enzyme II, B component
ECA3792	0.0	477	G	84	Na+/melibiose symporter and related transporters
ECA3824	2e-77	152	S	94	MraZ protein
ECA3860	2e-51	125	P	80	ApaG protein
ECA3877	5e-144	313	H	81	FAD synthase
ECA3894	4e-66	160	S	82	CreA protein
ECA4059	1e-73	200	E	86	Lysine exporter protein
ECA4063	2e-61	135	O	88	Predicted redox protein, regulator of disulfide bond formation
ECA4134	2e-97	191	O	91	HesB/YadR/YfhF
ECA4157	3e-64	197	U	81	Multiple antibiotic resistance (MarC)-related protein
ECA4162	5e-112	231	R	83	Pirin-related protein

Re-recognizing hypothetical ORFs

The method adopted here is based on the Z-curve of DNA sequence [11], which has been successfully used to find genes in microbial [20,21] and eukaryotic genomes [22,23]. In this analysis, 21 Z-curve variables were adopted, including nine variables of phase-dependent single nucleotides and 12 of phase-independent dinucleotides. For details about these variables, please refer to Gao and Zhang [23]. The Fisher linear discrimination algorithm was used to differentiate protein-coding and noncoding sequences; the procedure was as detailed previously [20,23].
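As an illustration of the phase-dependent variables mentioned above, the following Python sketch computes nine Z-curve components (x, y and z at each of the three codon positions) from a sequence. It is a sketch under stated assumptions rather than the authors' implementation: it uses the commonly cited Z-curve transform (purine versus pyrimidine, amino versus keto, weak versus strong hydrogen bonding), the function name `zcurve_phase_variables` is hypothetical, and the 12 dinucleotide variables and the Fisher discriminant step are omitted.

```python
from collections import Counter


def zcurve_phase_variables(seq: str) -> list[float]:
    """Return 9 values: (x, y, z) for codon positions 1, 2 and 3.

    For the base frequencies a, c, g, t observed at one codon position:
      x = (a + g) - (c + t)   purine vs pyrimidine
      y = (a + c) - (g + t)   amino vs keto
      z = (a + t) - (g + c)   weak vs strong hydrogen bonding
    """
    seq = seq.upper()
    features: list[float] = []
    for phase in range(3):
        counts = Counter(seq[phase::3])           # bases at this codon position
        total = sum(counts[b] for b in "ACGT")    # ignore ambiguous symbols
        if total == 0:
            raise ValueError(f"no A/C/G/T bases at codon position {phase + 1}")
        a, c, g, t = (counts[b] / total for b in "ACGT")
        features.extend([(a + g) - (c + t), (a + c) - (g + t), (a + t) - (g + c)])
    return features


if __name__ == "__main__":
    # Toy ORF, not an Eca1043 gene
    print(zcurve_phase_variables("ATGGCTAAAGGTCTGTAA"))
```

In a shuffled sequence the base frequencies at the three positions are nearly equal, so the three (x, y, z) triples converge, whereas a genuine coding sequence tends to show position-specific values; this is the contrast that a linear discriminant can exploit.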
Assigning functions to hypothetical genes

Hypothetical ORFs were compared with nucleotide and protein sequences in public nonredundant databases using alignment tools such as blast [6,7] and fasta [8]. Other functional assignment resources, such as interpro [9] and kegg [10], were also used. Furthermore, studies from the past 3 years with information about Eca1043 were collected and used to manually assign functions to some hypothetical ORFs.

Table 4. (Continued).
Synonym	E value	Length (aa) a	Functional category b	Identity (%)	Product
ECA4275	1e-88	172	S	89	Putative hemolysin co-regulated protein
ECA4329	2e-94	218	G	81	Class II aldolase/adducin domain protein
ECA4353	4e-30	81	O	85	SirA protein
a Amino acid length of each hypothetical gene. b The 25 functional categories in the COG database, i.e. J, A, K, L, B, D, Y, V, T, M, N, Z, W, O, U, C, G, E, F, H, I, P, Q, R and S. In addition, '–' denotes that the corresponding gene cannot be assigned to any COG category.

Table 5. The number of newly annotated hypothetical genes in each of the 25 COG functional categories.
Group	Code	Description	Number a
Information storage and processing	J	Translation, ribosomal structure and biogenesis	9
Information storage and processing	A	RNA processing and modification	–
Information storage and processing	K	Transcription	24
Information storage and processing	L	Replication, recombination and repair	15
Information storage and processing	B	Chromatin structure and dynamics	–
Cellular processes and signaling	D	Cell cycle control, cell division, chromosome partitioning	12
Cellular processes and signaling	Y	Nuclear structure	–
Cellular processes and signaling	V	Defense mechanisms	6
Cellular processes and signaling	T	Signal transduction mechanisms	16
Cellular processes and signaling	M	Cell wall/membrane/envelope biogenesis	18
Cellular processes and signaling	N	Cell motility	2
Cellular processes and signaling	Z	Cytoskeleton	–
Cellular processes and signaling	W	Extracellular structures	–
Cellular processes and signaling	U	Intracellular trafficking, secretion, and vesicular transport	6
Cellular processes and signaling	O	Posttranslational modification, protein turnover, chaperones	19
Metabolism	C	Energy production and conversion	6
Metabolism	G	Carbohydrate transport and metabolism	33
Metabolism	E	Amino acid transport and metabolism	23
Metabolism	F	Nucleotide transport and metabolism	4
Metabolism	H	Coenzyme transport and metabolism	4
Metabolism	I	Lipid transport and metabolism	5
Metabolism	P	Inorganic ion transport and metabolism	13
Metabolism	Q	Secondary metabolites biosynthesis, transport and catabolism	9
Poorly characterized	R	General function prediction only	81
Poorly characterized	S	Function unknown	86
a '–' indicates that there is no newly annotated gene in this COG functional category.

Acknowledgements

The authors wish to thank Professor Hong-Yu Zhang and Dr Hong-Yu Ou for their valuable suggestions. The study was supported by the National Natural Science Foundation of China (30600119), the National Basic Research Program of China (2003CB114400) and the scientific research funds of Shandong University of Technology (grant 2004KJM29 and 04KQ14).

References

1 Nielsen P & Krogh A (2005) Large-scale prokaryotic gene prediction and comparison to genome annotation. Bioinformatics 21, 4322–4329.
2 Skovgaard M, Jensen LJ, Brunak S, Ussery D & Krogh A (2001) On the total number of genes and their length distribution in complete microbial genomes. Trends Genet 17, 425–428.
3 Bell KS, Sebaihia M, Pritchard L, Holden MT, Hyman LJ, Holeva MC, Thomson NR, Bentley SD, Churcher LJ, Mungall K et al. (2004) Genome sequence of the enterobacterial phytopathogen Erwinia carotovora subsp. atroseptica and characterization of virulence factors. Proc Natl Acad Sci USA 101, 11105–11110.
4 Pérombelon MCM (2002) Potato diseases caused by soft rot erwinias: an overview of pathogenesis. Plant Pathol 51, 1–12.
5 Toth IK, Bell KS, Holeva MC & Birch PRJ (2003) Soft rot erwiniae: from genes to genomes. Mol Plant Pathol 4, 17–30.
6 Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W & Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389–3402.
7 Schäffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV & Altschul SF (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 29, 2994–3005.
8 Pearson WR (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol 183, 63–98.
9 Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Buillard V, Cerutti L, Copley R et al. (2007) New developments in the InterPro database. Nucleic Acids Res 35, D224–D228.
10 Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M & Hirakawa M (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 34, D354–D357.
11 Zhang CT & Zhang R (1991) Analysis of distribution of bases in the coding sequences by a diagrammatic technique. Nucleic Acids Res 19, 6313–6317.
12 Dillon WR & Goldstein M (1984) Multivariate Analysis – Methods and Applications (Wiley Series in Probability and Mathematical Statistics). Wiley, New York, NY.
13 Trifonov EN (1987) Translation framing code and frame-monitoring mechanism as suggested by the analysis of mRNA and 16S rRNA nucleotide sequences. J Mol Biol 194, 643–652.
14 Zhang CT & Chou KC (1994) A graphic approach to analyzing codon usage in 1562 E. coli protein coding sequences. J Mol Biol 238, 1–8.
15 Gupta SK, Majumdar S, Bhattacharya TK & Ghosh TC (2000) Studies on the relationships between the synonymous codon usage and protein secondary structural units. Biochem Biophys Res Commun 269, 692–696.
16 Pan A, Dutta C & Das J (1998) Codon usage in highly expressed genes of Haemophillus influenzae and Mycobacterium tuberculosis: translational selection versus mutational bias. Gene 215, 405–413.
17 Chiusano ML, Alvarez-Valin F, Di Giulio M, D'Onofrio G, Ammirato G, Colonna G & Bernardi G (2000) Second codon positions of genes and the secondary structures of proteins. Relationships and implications for the origin of the genetic code. Gene 261, 63–69.
18 Tatusov RL, Koonin EV & Lipman DJ (1997) A genomic perspective on protein families. Science 278, 631–637.
19 Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4, 41.
20 Chen LL & Zhang CT (2003) Gene recognition from questionable ORFs in bacterial and archaeal genomes. J Biomol Struct Dyn 21, 99–110.
21 Guo FB, Ou HY & Zhang CT (2003) ZCURVE: a new system for recognizing protein-coding genes in bacterial and archaeal genomes. Nucleic Acids Res 31, 1780–1789.
22 Zhang CT & Wang J (2000) Recognition of protein coding genes in the yeast genome at better than 95% accuracy based on the Z curve. Nucleic Acids Res 28, 2804–2814.
23 Gao F & Zhang CT (2004) Comparison of various algorithms for recognizing short coding sequences of human genes. Bioinformatics 20, 673–681.

Supplementary material

The following supplementary material is available online:
Table S1. Synonyms, COG functional categories and predicted functions (products) of 318 Eca hypothetical genes with blast search identity > 30%, E value < 1e-10 and aligned length covering at least 80% of each gene.
Table S2. Synonyms of the 114 recognized membrane proteins.
Table S3. Synonyms of the 86 recognized exported proteins.
This material is available as part of the online article from http://www.blackwell-synergy.com
Please note: Blackwell Publishing are not responsible for the content or functionality of any supplementary materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the article.

Uploaded: 23/03/2014, 07:20
