Reannotation of hypothetical ORFs in plant pathogen
Erwinia carotovora subsp. atroseptica SCRI1043
Ling-Ling Chen, Bin-Guang Ma and Na Gao
Shandong Provincial Research Center for Bioinformatic Engineering and Technique, Shandong University of Technology, Zibo, China
Currently, more than 500 completely sequenced micro-
bial genomes are available from public databases,
which provide an unprecedented opportunity to study
the genetics, biochemistry and evolutionary features of
these species. Such analyses depend strongly on the
gene annotation of each species. However, for many
genomes, there are many hypothetical ORFs for which
no functional information exists. Thanks to the progress
of genome-sequencing projects, a large
number of hypothetical ORFs can now be assigned
functions. Furthermore, some annotated hypothetical
ORFs actually do not encode proteins, so the number
of annotated ORFs is usually greater than the number
of actual protein-coding genes for most microbial
genomes [1,2].
Erwinia carotovora subsp. atroseptica SCRI1043
(Eca1043) belongs to the Enterobacteriaceae, a family
noted for its human pathogens [3]. Eca1043 is a commercially
important plant pathogen that is restricted to
potato in temperate regions; it can cause blackleg in
the field and soft rot in tubers after harvest [3].
Although soft rot pathogenesis relies primarily on the
prolific production of extracellular plant-cell-wall-
degrading enzymes that cause extensive tissue macera-
tion, recent discoveries suggest that the process may be
far more complex than previously thought [4,5]. The
Eca1043 genome was sequenced in 2004, and its anno-
tated ORFs can be divided into two groups: (a) genes
with known functions, and (b) hypothetical ORFs.
Whether these hypothetical ORFs are protein-coding
genes is uncertain and their functions are unknown.
Because more than a quarter of ORFs in Eca1043 are
hypothetical, it is necessary to reannotate them. Using
sequence-alignment tools (e.g. BLAST and FASTA) [6–8]
Keywords
clusters of orthologous groups of proteins;
function assignment; hypothetical ORFs;
plant pathogen; principal component
analysis
Correspondence
L.-L. Chen, Shandong Provincial Research
Center for Bioinformatic Engineering and
Technique, Center for Advanced Study,
Shandong University of Technology,
Zibo 255049, China
Fax: +86 533 278 0271
Tel: +86 533 278 0271
E-mail: llchen@sdut.edu.cn
(Received 18 August 2007, revised 3
October 2007, accepted 12 November
2007)
doi:10.1111/j.1742-4658.2007.06190.x
Over-annotation of hypothetical ORFs is a common phenomenon in bacterial
genomes, which necessitates confirming the coding reliability of
hypothetical ORFs and then predicting their functions. The important plant
pathogen Erwinia carotovora subsp. atroseptica SCRI1043 (Eca1043) is a
typical case because more than a quarter of its annotated ORFs are
hypothetical. Our analysis focuses on annotation of Eca1043 hypothetical
ORFs, and comprises two efforts: (a) based on the Z-curve method, 49
originally annotated hypothetical ORFs are recognized as noncoding, a
result further supported by principal components analysis and other
evidence; and (b) using sequence-alignment tools and some functional
resources, more than half of the hypothetical genes were assigned
functions. The potential functions of 427 hypothetical genes are
summarized according to the cluster of orthologous groups functional
category. Moreover, 114 and 86 hypothetical genes are recognized as
putative ‘membrane proteins’ and ‘exported proteins’, respectively.
Reannotation of Eca1043 hypothetical ORFs will benefit research into the
lifestyle, metabolism and pathogenicity of this important plant pathogen.
Our study also offers a model for the reannotation of hypothetical ORFs in
microbial genomes.
Abbreviations
COG, cluster of orthologous groups; NCBI, National Center for Biotechnology Information; PCA, principal components analysis.
FEBS Journal 275 (2008) 198–206 © 2007 The Authors Journal compilation © 2007 FEBS
and other functional resources (e.g. InterPro and
KEGG) [9,10], we predicted the functions of 427 hypothetical
ORFs. The predicted functions of 109 hypothetical
ORFs are highly reliable, with sequence
coverage > 80%, identity ≥ 80% and E value < 1e-20
to their homologous proteins. Moreover, 114 and 86
hypothetical ORFs are recognized as putative ‘membrane
proteins’ and ‘exported proteins’, respectively. In
addition, 49 hypothetical ORFs are identified as noncoding
ORFs using a methodology based on Z-curve
theory [11]. Principal components analysis
(PCA) shows that most of the
identified noncoding ORFs lie far from the
core of function-known genes and close to random
sequences. Other evidence also suggests that the 49 recognized
noncoding ORFs are unlikely to code for proteins.
Consequently, the number of hypothetical genes
in Eca1043 decreases from 1254 to 578. These results
are valuable for research into the adaptation,
lifestyle and pathogenicity of this important plant
pathogen.
Results and Discussion
Identification of 49 noncoding ORFs
In the first stage of annotation, the 1254 hypothetical
ORFs were re-identified using the Z-curve method [11].
First, the 2246 genes of known function were ran-
domly divided into two almost equal parts. The former
served as a training set to calculate Fisher coefficients,
and the latter served as a test set to assess the accuracy
of the algorithm. Both the training set and the test set
should include positive and negative samples. In the
Eca1043 genome, 80.6% of the whole-DNA sequences
are coding and the remaining intergenic regions are
dominated by structural RNA sequences, so it is diffi-
cult to prepare an appropriate set of negative samples.
Thus, the following procedures were taken to produce
negative samples. Each of the known genes was ran-
domly shuffled 10 000 times, so that it was trans-
formed into a random sequence. Shuffled sequences
then served as negative samples. The detailed process
of discrimination analysis has been described
previously [12]. The sensitivity $s_n$ and specificity $s_p$
were used to evaluate the algorithm, and are defined
as $s_n = TP / (TP + FN)$ and $s_p = TN / (TN + FP)$, where
TP, TN, FP and FN are fractions of positive correct,
negative correct, false-positive and false-negative
predictions, respectively. The accuracy was defined as
the average of $s_n$ and $s_p$. After performing 10-fold
cross-validation tests, mean sensitivity, specificity and
SD were obtained (Table 1). The prediction accuracy
was as high as 99.58%. All positive samples in the first
group and the corresponding negative samples were
merged, forming a new and larger training set. The
final Fisher coefficients and thresholds were based on
the larger training set. Using the final Fisher coefficients
and the criterion for deciding coding/noncoding,
the hypothetical ORFs in Eca1043 were re-identified.
A total of 49 of the 1254 hypothetical ORFs were
recognized as noncoding (Table 2).
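The negative-sample construction and the evaluation measures described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `shuffle_sequence` and `evaluate` are hypothetical, the toy sequence and the counts are invented, and a single uniform shuffle stands in for the 10 000 repeated shuffles mentioned in the text.

```python
import random

def shuffle_sequence(seq, rng):
    """Return a random permutation of seq; base composition is preserved."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def evaluate(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy (their average) as defined in the text."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    return sn, sp, (sn + sp) / 2

rng = random.Random(0)                   # fixed seed so the sketch is repeatable
gene = "ATGGCTAAAGGTGAAACCGTTCTGTAA"     # made-up toy sequence, not an Eca1043 gene
negative = shuffle_sequence(gene, rng)   # one negative sample per known gene

# Illustrative counts only; the measured values are those reported in Table 1.
sn, sp, accuracy = evaluate(tp=995, fn=5, tn=990, fp=10)
```

Because shuffling keeps the base composition of each gene, any difference a classifier finds between the positive and negative sets must come from the arrangement of bases rather than from their overall frequencies.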
Why are the recognized noncoding ORFs unlikely
to encode proteins?
The need to fold a peptide chain into a stable and
functional protein imposes rigorous constraints on
coding sequences. Many constraints have been
observed and the generally accepted base usage pattern
Table 1. Genome features of Eca1043, and sensitivity, specificity and accuracy over 10-fold cross-validation tests.
Length (bp)  GC content (%)  Sensitivity [a] (%)  Specificity [a] (%)  Accuracy [b] (%)
5 064 019    50.97           99.64 ± 0.002        99.53 ± 0.001        99.58
[a] Mean ± SD.
[b] Accuracy is defined as the average of the sensitivity and specificity.
Table 2. The synonyms of the 49 recognized noncoding ORFs.
Synonym
ECA0394 ECA0547 ECA0579 ECA0586 ECA0590 ECA0637 ECA0670
ECA0675 ECA0726 ECA1062 ECA1066 ECA1183 ECA1522 ECA1584
ECA1610 ECA1636 ECA1771 ECA2121 ECA2124 ECA2129 ECA2234
ECA2470 ECA2505 ECA2513 ECA2658 ECA2706 ECA2859 ECA2862
ECA2864 ECA2874 ECA2890 ECA2896 ECA3326 ECA3385 ECA3397
ECA3404 ECA3405 ECA3412 ECA3414 ECA3521 ECA3674 ECA3676
ECA3677 ECA3982 ECA4287 ECA4295 ECA4306 ECA4442 ECA4484
is the $R\bar{G}N$ prototype, where R,
$\bar{G}$ and N denote purine,
nonguanine and any bases at the first, second and
third codon positions, respectively [13–17]. It is suggested
that the first, second and third codon positions
are associated with the biosynthetic pathway, hydrophobicity
pattern and the α-helix- or β-strand-forming
potentiality of the coded amino acid, respectively [13–17].
By contrast, the negative samples are shuffled
sequences of function-known genes, so the frequencies
of the bases at the three ‘codon’ positions are almost
identical (note that the term ‘codon’ in a negative sample
is meaningless). The base distribution pattern of
negative sample sequences is the NNN type. The
difference between the two codon types, $R\bar{G}N$ and NNN,
forms the basis of our method for distinguishing
between protein-coding and noncoding ORFs.
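The position-specific base usage discussed above can be made concrete with a short sketch. It tabulates base frequencies at each of the three codon positions of an in-frame sequence. The function name `position_frequencies` and the toy sequence are assumptions made for illustration; the sequence was chosen so that the first position is purine-rich and the second position lacks guanine.

```python
from collections import Counter

def position_frequencies(seq):
    """Base frequencies at codon positions 1, 2 and 3 of an in-frame sequence."""
    freqs = []
    for phase in range(3):
        bases = seq[phase::3]            # every third base, starting at this phase
        counts = Counter(bases)
        freqs.append({b: counts[b] / len(bases) for b in "ACGT"})
    return freqs

toy = "GCTGATGAAGCAAAAGTT"   # made-up codons: GCT GAT GAA GCA AAA GTT
first, second, third = position_frequencies(toy)
purine_at_first = first["A"] + first["G"]
non_g_at_second = 1.0 - second["G"]
```

For a shuffled sequence the three dictionaries would be expected to converge on the same values, which is the NNN pattern described in the text.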
The difference between coding and noncoding
sequences can be viewed intuitively using PCA. PCA
defines the correlation among the variables of given
data. The first derived direction is chosen to maximize
the SD of the derived variable and the second is to
maximize the SD among directions uncorrelated with
the first, and so forth [12]. Figure 1 shows the distribu-
tion of points on the principal plane spanned by the
first two principal components. The coding and non-
coding sequences are represented by open circles and
triangles, respectively. It can be seen that the two principal
axes separate the coding
and noncoding sequences into two almost nonoverlapping
clusters. The separation of the two regions
reflects the fact that base usage at the three codon positions
differs markedly between coding and noncoding sequences.
The recognized noncoding ORFs are represented by
filled stars, distributed far from the core of function-
known genes, and close to random sequences. This
implies that the 49 ORFs listed in Table 2 are unlikely
to encode proteins.
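The projection onto the first two principal axes can be sketched in plain Python. This is an illustrative sketch only: the data are synthetic 21-dimensional vectors rather than real Z-curve variables, all function names are hypothetical, and power iteration with deflation is used here for brevity; the paper does not state which numerical procedure was used.

```python
import random

def center(rows):
    """Subtract the column means from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(m, v):
    return [dot(row, v) for row in m]

def principal_axes(cov, k=2, iterations=300):
    """Leading k eigenvectors of cov by power iteration with deflation."""
    d = len(cov)
    work = [row[:] for row in cov]
    axes = []
    for _axis in range(k):
        v = [1.0 / (i + 1) for i in range(d)]       # arbitrary start vector
        for _step in range(iterations):
            w = matvec(work, v)
            norm = dot(w, w) ** 0.5
            v = [x / norm for x in w]
        eigenvalue = dot(v, matvec(work, v))
        axes.append(v)
        # Remove the variance already explained before finding the next axis.
        work = [[work[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
                for i in range(d)]
    return axes

rng = random.Random(0)
# Synthetic stand-ins for 21-dimensional feature vectors (not real Eca1043 data).
coding = [[rng.gauss(0.3, 0.05) for _ in range(21)] for _ in range(50)]
shuffled = [[rng.gauss(0.0, 0.05) for _ in range(21)] for _ in range(50)]
centered = center(coding + shuffled)
axes = principal_axes(covariance(centered))
scores = [(dot(r, axes[0]), dot(r, axes[1])) for r in centered]
```

Plotting `scores` with one marker per group would give a picture analogous to the principal plane described in the text, with the two synthetic groups separated along the first axis.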
In the latest version of RefSeq annotation, clusters
of orthologous groups (COGs) of proteins were
added to the annotation file. Each COG is a group
of three or more proteins that are inferred to be
orthologs, i.e. they have evolved from a common
ancestor [18,19]. Computational analysis of complete
microbial genomes shows that prokaryotic proteins
are generally highly conserved, with 70% of them
containing ancient conserved regions shared by homo-
logs from distantly related species [18,19]. Therefore,
an annotated ORF within a COG is highly likely to
be a protein-coding gene with homologs from other
species. Of the 2246 genes of known function, 84.3%
are included in at least one COG; the ratio decreases
to 75.3% in ‘putative’ and ‘probable’ ORFs, and
decreases further to 40.6% in ‘hypothetical’ ORFs.
Of the 49 recognized noncoding ORFs listed in
Table 2, only 4 (8.2%) contain COG codes. In addition,
previous statistics have shown that over-annotation
of short ORFs is one of the major problems
in prokaryotic genome annotation [1]. We therefore compared
the average length of the 2246 function-known
genes in the first group with that of the 49 recognized
noncoding ORFs. The average length of the recognized
noncoding ORFs (330 bp) is much shorter than that
of the function-known genes (1112 bp; Table 3). All
of the above evidence strongly suggests that the 49
ORFs are over-annotated short ORFs. Of course, our
conclusion is only theoretical and needs to be verified
by experiments.
[Fig. 1 plot: x-axis, the first principal component; y-axis, the second principal component]
Fig. 1. The distribution of points on the principal plane spanned by
the first (x) and second (y) principal axes using PCA in Eca1043.
Open circles represent the function-known genes, open triangles
represent the corresponding negative samples and filled stars
denote ORFs recognized as noncoding. The first and second princi-
pal axes account for 26.2% and 22.3% of the total inertia of the
21-dimensional space, respectively. Note that the distribution of the
open circles is separate from that of the open triangles, indicating
that coding and noncoding sequences are well distinguished. Fur-
thermore, most of the identified noncoding ORFs are far from the
core of open circles, and close to the core of open triangles, imply-
ing that the 49 recognized noncoding ORFs listed in Table 2 are
unlikely to encode proteins.
Table 3. Average length and percentage of ORFs with COG code for 2246 function-known protein-coding genes and 49 recognized noncoding ORFs in Eca1043.
Feature              Genes with known functions  Recognized noncoding ORFs
With COG code        84.3%                       8.2%
Average length (bp)  1112                        330
Function annotation of the hypothetical genes
After identifying the 49 noncoding ORFs, the next step
was to assign functions to the remaining hypothetical
genes. In the original annotation of the Eca1043 genome,
although the authors queried all the ORFs
against the complete set of ORFs from 64 selected
fully annotated bacterial genomes obtained from the
National Center for Biotechnology Information
(NCBI) to determine their functions [3], more than a
quarter of the annotated genes still had no functional
information. Three years have since passed, and > 500
complete bacterial and archaeal genomes are now annotated
at NCBI, so a large number of new functional
genes can be obtained from public databases. Furthermore,
many studies reporting on Eca1043 genes
have been published in the last 3 years. All this information
provides valuable resources for assigning functions
to a large number of hypothetical genes. After collecting
all this information and systematically searching nonredundant
nucleotide and protein databases, functions
were assigned to 109 hypothetical genes with high
reliability; the synonyms, protein lengths, E values,
identities and predicted functions (products) are
listed in Table 4. The aligned length covered at least
80% of each gene, with identity ≥ 80% and E value
< 1e-20. Furthermore, the functions of another 318
hypothetical genes were assigned with query
coverage > 80%, identity > 30% and E value < 1e-10
(see supplementary Table S1).
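The two reliability tiers described above amount to a pair of threshold tests on each best alignment hit. The sketch below encodes them directly. The `Hit` record, the function name `reliability_tier` and the tier labels are hypothetical, and the aligned length in the example is an assumed value, since Table 4 reports only the protein length.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    query_length: int     # length of the hypothetical protein (aa)
    aligned_length: int   # residues of the query covered by the alignment
    identity: float       # percent identity
    evalue: float

def reliability_tier(hit):
    """Classify a best hit using the thresholds quoted in the text."""
    coverage = 100.0 * hit.aligned_length / hit.query_length
    if coverage >= 80 and hit.identity >= 80 and hit.evalue < 1e-20:
        return "high"          # criteria of Table 4
    if coverage > 80 and hit.identity > 30 and hit.evalue < 1e-10:
        return "moderate"      # criteria of supplementary Table S1
    return "unassigned"

# ECA0018 values from Table 4; full-length alignment is an assumption.
example = reliability_tier(Hit(query_length=89, aligned_length=89,
                               identity=98.0, evalue=2e-39))
```

Checking the stricter tier first means that a hit satisfying both sets of thresholds is reported only once, at the higher level.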
The predicted functions of the above 427 hypothetical
genes were summarized according to COG functional
categories. The latest version of COG is
classified into 25 functional categories, each
symbolized by a capital letter: J, A, K, L, B,
D, Y, V, T, M, N, Z, W, O, U, C, G, E, F, H, I, P, Q, R
and S. Details of the functions of the
codes are listed in Table 5. The 25 functional categories
are grouped into four functional groups.
According to the COG functional category, 48, 79, 97
and 167 newly annotated hypothetical genes belong to
the ‘information storage and processing’, ‘cellular processes
and signaling’, ‘metabolism’ and ‘poorly characterized’
groups, respectively. Detailed information
about annotated hypothetical genes in each COG functional
category is summarized in Table 5. Of the 427
newly annotated genes, 50 can be classified into two or
more functional categories and 36 cannot be assigned
to any category.
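A tally of this kind can be sketched with a counter over category codes. The grouping of letters follows Table 5; the function name `tally` is hypothetical, the sample codes are a handful taken from Table 4, and counting a multi-letter code once in each of its categories is an assumption, because the text does not say how such genes were counted.

```python
from collections import Counter

GROUPS = {
    "Information storage and processing": "JAKLB",
    "Cellular processes and signaling": "DYVTMNZWUO",
    "Metabolism": "CGEFHIPQ",
    "Poorly characterized": "RS",
}
LETTER_TO_GROUP = {letter: group
                   for group, letters in GROUPS.items() for letter in letters}

def tally(codes):
    """Count genes per COG category and per functional group."""
    per_category = Counter()
    unassigned = 0
    for code in codes:
        if code in ("-", "–"):        # no COG category could be assigned
            unassigned += 1
            continue
        per_category.update(code)     # "QR" adds one to Q and one to R
    per_group = Counter()
    for letter, n in per_category.items():
        per_group[LETTER_TO_GROUP[letter]] += n
    return per_category, per_group, unassigned

# Codes of ECA0018, ECA0019, ECA0054, ECA0063, ECA0064, ECA1684 and ECA1333.
codes = ["S", "R", "QR", "JD", "D", "GER", "–"]
per_category, per_group, unassigned = tally(codes)
```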
As pointed out by Bell et al., Eca1043 has the ability
to use a range of different nutrients to adapt to
diverse environments [3]. In the original annotation
study, 80 putative ABC transporters, 36 putative
methyl-accepting chemotaxis protein genes and 336
putative regulators were annotated, which supports
the view that Eca1043 is able to respond to a wide range of
nutrient sources and live in different environments [3].
In our analysis, more genes associated with a variety
of lifestyles and habitats for Eca1043 were identified,
including 23 transporters, 17 regulators, 15 transferases
and 1 methyl-accepting chemotaxis protein. Furthermore,
in addition to the 427 newly annotated hypothetical
genes, 114 hypothetical genes were recognized as putative
‘membrane proteins’ and 86 as ‘exported proteins’,
which are detailed in supplementary Tables S2 and S3,
respectively. It is quite possible that some of the putative
‘exported proteins’ are related to the pathogenicity
of Eca1043.
In conclusion, 1254 hypothetical ORFs in the important
plant pathogen Eca1043 are reannotated in this
analysis. First, 49 originally annotated hypothetical
ORFs are recognized as noncoding ORFs using a
methodology based on the Z-curve method. The recognized
noncoding ORFs are very unlikely to encode
proteins, as supported by PCA evidence, average
length distribution and COG functional category occupation.
Second, using sequence alignment tools and
some functional resources, potential functions for 427
hypothetical genes have been predicted. Moreover, 114
and 86 hypothetical genes are recognized as putative
‘membrane proteins’ and ‘exported proteins’, respectively.
Therefore, the number of hypothetical genes
decreases to 578. These results provide more information
than the earlier annotation, and will benefit research
into the lifestyle, metabolism and pathogenicity of this
important plant pathogen.
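The final count of 578 follows from the figures quoted in the text, assuming that the four reannotated sets do not overlap. The short check below makes the arithmetic explicit; the variable names are chosen for illustration.

```python
hypothetical = 1254
noncoding = 49
function_assigned = 109 + 318   # high-reliability plus lower-stringency assignments
membrane = 114
exported = 86

remaining = hypothetical - noncoding - function_assigned - membrane - exported
```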
Experimental procedures
The length of the Eca1043 genome is 5.06 Mb and the
original annotation was submitted to GenBank (accession
number BX950851) in July 2004 [3]. Subsequently, a
curated annotation was made available by RefSeq at NCBI
(NC_004547). The number of annotated ORFs in the two
databases is the same. The sequence and annotation files
analyzed in this study were downloaded from NCBI RefSeq
(updated 9 February 2007) and the number of annotated
ORFs was 4472. Among them, two ORFs (ECA0773 and
ECA2198) have lengths that are not divisible by three,
which indicates that they are not protein-coding
genes; they were therefore excluded from this analysis. The remaining
4470 ORFs can be classified into two groups: the first
contains 2246 genes with confirmed functions and 970 genes
with ‘putative’ or ‘probable’ functions, of which the 2246
function-confirmed genes are used for training;
the second group contains 1254 hypothetical ORFs, whose
Table 4. Synonyms, COG functional categories and predicted functions (products) of 109 Eca1043 hypothetical genes with BLAST search
identity ≥ 80%, E value < 1e-20 and aligned length covering at least 80% of each gene.
Synonym  E value  Length [a] (aa)  Functional category [b]  Identity (%)  Product
ECA0018 2e-39 89 S 98 YihD
ECA0019 0.0 311 R 95 YihE
ECA0054 1e-107 248 QR 82 SAM-dependent methyltransferases
ECA0061 6e-128 280 R 80 Protein involved in catabolism of external DNA
ECA0063 2e-38 95 JD 88 Addiction module toxin, RelE ⁄ StbE family
ECA0064 2e-31 83 D 87 Antitoxin of toxin-antitoxin stability system
ECA0130 7e-118 287 S 87 YicC N-terminal domain protein
ECA0264 1e-166 333 R 83 Twin-arginine translocation pathway signal
ECA0285 3e-153 285 R 96 Predicted P-loop-containing kinase
ECA0293 2e-151 357 MR 81 Predicted sugar phosphate isomerase involved in capsule formation
ECA0296 5e-113 260 Q 89 ABC-type transport system involved in resistance to organic solvents, permease component
ECA0298 9e-92 209 Q 82 ABC-type transport system involved in resistance to organic solvents, auxiliary component
ECA0313 2e-147 309 R 82 Putative Fe-S oxidoreductase
ECA0327 8e-127 261 S 84 Extradiol ring-cleavage dioxygenase, class III enzyme, subunit B
ECA0338 5e-133 287 G 80 Fructose-bisphosphate aldolase, class II family
ECA0383 1e-70 155 G 85 Beta-galactosidase, beta subunit
ECA0420 0.0 373 S 84 Cupin 4 family protein
ECA0444 3e-27 83 D 87 YefM protein
ECA0512 2e-27 80 K 90 Putative regulatory protein
ECA0631 4e-17 73 S 88 CsbD-like family
ECA0636 7e-45 131 S 82 DoxD-like family protein
ECA0696 2e-35 97 J 88 Putative RNA-binding protein
ECA0710 7e-74 140 S 99 YhbC-like protein
ECA0721 2e-143 292 O 81 Collagenase and related proteases
ECA0757 7e-79 215 R 83 Putative oxidoreductase
ECA0837 2e-66 148 I 86 Oligoketide cyclase ⁄ lipid transport protein
ECA0882 3e-137 299 P 80 Dyp-type peroxidase family
ECA0971 8e-142 277 R 90 Predicted TIM-barrel enzyme, possibly a dioxygenase
ECA0975 9e-39 90 CO 89 Fe(II) trafficking protein YggX
ECA0983 2e-92 238 P 82 Membrane protein TerC, possibly involved in tellurium resistance
ECA1010 9e-136 272 H 89 HesA ⁄ MoeB ⁄ ThiF family protein
ECA1024 2e-30 116 S 88 tRNA pseudouridine synthase C
ECA1071 7e-25 75 K 80 DNA-directed RNA polymerase, subunit M ⁄ Transcription elongation factor TFIIS
ECA1125 2e-67 149 K 91 Ribonucleotide reductase regulator NrdR-like
ECA1155 4e-112 231 L 85 ExsB protein
ECA1191 9e-64 159 S 80 YbaK ⁄ ebsC protein
ECA1196 5e-138 304 O 88 Membrane protease subunits, stomatin ⁄ prohibitin homologs
ECA1317 1e-61 159 R 81 Predicted metal-dependent hydrolase
ECA1319 0.0 474 J 88 tRNA-methylthiotransferase MiaB protein
ECA1333 2e-22 93 – 84 LexA regulated, putative SOS response
ECA1405 6e-167 358 G 84 Putative sugar ABC transporter
ECA1410 1e-172 369 D 81 Mrp protein
ECA1578 1e-66 179 S 86 Nucleoprotein ⁄ polynucleotide-associated enzyme
ECA1585 0.0 512 R 81 Deoxyribodipyrimidine photolyase-like protein
ECA1645 4e-116 254 – 88 Putative plasmid replication protein
ECA1663 1e-30 94 K 86 Predicted transcriptional regulators
ECA1684 2e-85 321 GER 90 Permeases of the drug ⁄ metabolite transporter (DMT) superfamily
ECA1762 2e-50 109 P 92 Sulfite reductase
ECA1763 5e-73 219 R 83 Putative transport protein
ECA1781 0.0 382 R 83 Rhodanese-like domain protein
ECA1782 4e-82 192 S 81 Protein yceI
ECA1809 2e-50 116 FGR 85 Histidine triad (HIT) protein
ECA1814 3e-85 180 R 82 Predicted esterase
ECA1816 3e-57 189 S 87 Predicted outer membrane lipoprotein
ECA1860 4e-188 499 O 85 FeS assembly protein SufB
ECA1927 3e-51 116 O 87 Glutaredoxin-like protein
ECA1956 1e-177 389 G 80 Predicted N-acetylglucosaminyl transferase
ECA1958 4e-34 109 J 87 Translation initiation factor SUI1
ECA1986 0.0 465 R 82 Predicted ATPase
ECA1995 3e-165 311 D 89 YdaO
ECA2292 1e-96 206 J 84 Putative translation factor (SUA5)
ECA2348 0.0 644 T 94 Putative Ser protein kinase
ECA2359 0.0 513 S 85 Putative sporulation protein
ECA2367 2e-30 95 S 82 Protein ycgL
ECA2464 0.0 484 J 10 Ribosomal RNA small subunit methyltransferase F
ECA2511 3e-116 259 QR 80 Predicted methyltransferase
ECA2512 7e-158 327 QR 81 SAM-dependent methyltransferases
ECA2525 9e-163 401 GEPR 82 Permeases of the major facilitator superfamily
ECA2529 2e-120 245 ER 84 Histidinol phosphatase and related hydrolases of the PHP family
ECA2560 6e-33 105 S 85 Putative alpha helix protein
ECA2683 0.0 562 R 89 TrkA, potassium channel-family protein
ECA2708 0.0 442 S 88 Putative FeS oxidoreductase
ECA2777 9e-32 109 – 80 Putative phage-related exported protein
ECA2812 6e-96 235 R 87 Integral membrane protein, interacts with FtsH
ECA2977 3e-96 177 G 96 ABC-type sugar transport system, periplasmic component
ECA3034 4e-86 199 R 86 Predicted hydrolases of HD superfamily
ECA3037 1e-77 164 S 82 YfbU family protein
ECA3057 2e-95 220 S 85 DedA protein (dsg-1 protein)
ECA3059 6e-151 336 E 80 Putative aspartate-semialdehyde dehydrogenase
ECA3070 3e-155 310 J 82 Adenine-specific methylase
ECA3087 2e-123 263 – 81 Necrosis-inducing protein
ECA3115 5e-205 413 E 93 Aspartate ⁄ tyrosine ⁄ aromatic aminotransferase
ECA3135 2e-59 133 S 80 Cupin 2, conserved barrel
ECA3223 0.0 398 R 85 Radical SAM enzyme, Cfr family
ECA3262 2e-151 308 R 99 N-acetylmuramic acid 6-phosphate etherase
ECA3288 5e-56 127 R 85 Autonomous glycyl radical cofactor
ECA3306 2e-52 115 S 93 Iron–sulfur cluster assembly accessory protein
ECA3361 1e-108 264 R 82 Cytochrome c assembly protein
ECA3382 6e-40 95 C 92 Rhs protein
ECA3428 1e-88 172 S 88 Putative hemolysin co-regulated protein
ECA3472 3e-137 255 R 86 Predicted glutamine amidotransferase
ECA3475 2e-55 116 V 90 HNH endonuclease
ECA3487 1e-97 205 G 83 Probable dehydratase
ECA3523 7e-87 187 E 82 D,D-heptose 1,7-bisphosphate phosphatase
ECA3623 5e-65 141 K 86 Putative negative regulator
ECA3774 8e-31 95 G 88 Phosphotransferase enzyme II, B component
ECA3792 0.0 477 G 84 Na+ ⁄ melibiose symporter and related transporters
ECA3824 2e-77 152 S 94 MraZ protein
ECA3860 2e-51 125 P 80 ApaG protein
ECA3877 5e-144 313 H 81 FAD synthase
ECA3894 4e-66 160 S 82 CreA protein
ECA4059 1e-73 200 E 86 Lysine exporter protein
ECA4063 2e-61 135 O 88 Predicted redox protein, regulator of disulfide bond formation
ECA4134 2e-97 191 O 91 HesB ⁄ YadR ⁄ YfhF
ECA4157 3e-64 197 U 81 Multiple antibiotic resistance (MarC)-related protein
ECA4162 5e-112 231 R 83 Pirin-related protein
coding status is identified and their functions predicted in
this analysis.
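The grouping of annotated ORFs described above can be sketched as a small classifier. This is a hedged illustration: the function name `classify_orf` and the group labels are hypothetical, and matching on keywords in the product description is an assumption, since the text does not say how the groups were drawn from the annotation file.

```python
def classify_orf(length_bp, product):
    """Assign an annotated ORF to one of the groups described in the text."""
    if length_bp % 3 != 0:
        return "excluded"                 # not a whole number of codons
    label = product.lower()
    if "hypothetical" in label:
        return "hypothetical"
    if "putative" in label or "probable" in label:
        return "putative_or_probable"
    return "known_function"

# Group sizes quoted in the text add up to the ORFs kept after exclusion.
assert 2246 + 970 + 1254 == 4472 - 2
```

For example, `classify_orf(300, "hypothetical protein")` returns `"hypothetical"`, whereas the same description with a length of 301 bp returns `"excluded"`.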
Re-recognizing hypothetical ORFs
The method adopted here is based on the Z-curve of DNA
sequence [11], which has been successfully used to find
genes in microbial [20,21] and eukaryotic genomes [22,23]. In
this analysis, 21 Z-curve variables were adopted, including
nine variables of phase-dependent single nucleotides and 12
of phase-independent dinucleotides. For details about these
variables, please refer to Gao and Zhang [23]. The Fisher
linear discrimination algorithm was used to differentiate
protein-coding and noncoding sequences; the procedure
was as detailed previously [20,23].
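A 21-variable feature vector of the kind described above can be sketched as follows. The sketch assumes the usual Z-curve transform, in which four frequencies (a, c, g, t) are mapped to x = (a + g) - (c + t), y = (a + c) - (g + t) and z = (a + t) - (g + c); it applies this to the three codon positions and to the dinucleotides grouped by their first base. The function names are hypothetical, and the exact variable definitions used in the study are those given in the cited references.

```python
def zcurve_triplet(freq):
    """Map frequencies of A, C, G, T to the three Z-curve components."""
    a, c, g, t = (freq.get(b, 0.0) for b in "ACGT")
    return [(a + g) - (c + t),    # purine versus pyrimidine
            (a + c) - (g + t),    # amino versus keto
            (a + t) - (g + c)]    # weak versus strong hydrogen bonding

def frequencies(items, keys):
    """Relative frequency of each key among the items."""
    total = len(items)
    return {k: items.count(k) / total for k in keys}

def zcurve_variables(seq):
    """Return 9 phase-dependent plus 12 dinucleotide variables for seq."""
    variables = []
    for phase in range(3):                      # 3 x 3 = 9 variables
        variables += zcurve_triplet(frequencies(list(seq[phase::3]), "ACGT"))
    dinucs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    dinuc_freq = frequencies(dinucs, [x + y for x in "ACGT" for y in "ACGT"])
    for first in "ACGT":                        # 4 x 3 = 12 variables
        variables += zcurve_triplet({y: dinuc_freq[first + y] for y in "ACGT"})
    return variables

toy_variables = zcurve_variables("GCTGATGAAGCAAAAGTT")   # made-up sequence
```

Each sequence, whether a known gene or its shuffled counterpart, then becomes one point in a 21-dimensional space on which a linear discriminant can be trained.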
Assigning functions to hypothetical genes
Hypothetical ORFs were compared with nucleotide and
protein sequences in public nonredundant databases using
alignment tools such as BLAST [6,7] and FASTA [8]. Other
functional assignment resources, such as InterPro [9] and
KEGG [10], were also used. Furthermore, studies from the
past 3 years with information about Eca1043 were collected
and used to manually assign functions to some hypothetical
ORFs.
Acknowledgements
The authors wish to thank Professor Hong-Yu Zhang
and Dr Hong-Yu Ou for their valuable suggestions.
Table 4. (Continued).
Synonym  E value  Length [a] (aa)  Functional category [b]  Identity (%)  Product
ECA4275 1e-88 172 S 89 Putative hemolysin co-regulated protein
ECA4329 2e-94 218 G 81 Class II aldolase ⁄ adducin domain protein
ECA4353 4e-30 81 O 85 SirA protein
[a] Amino acid length of each hypothetical gene.
[b] The 25 functional categories in the COG database, i.e. J, A, K, L, B, D, Y, V, T, M, N, Z, W, O, U,
C, G, E, F, H, I, P, Q, R and S. In addition, ‘–’ denotes that the corresponding gene cannot be assigned to any COG category.
Table 5. The number of newly annotated hypothetical genes in each of the 25 COG functional categories.
Group                               Code  Description                                                    Number [a]
Information storage and processing  J     Translation, ribosomal structure and biogenesis                9
                                    A     RNA processing and modification                                –
                                    K     Transcription                                                  24
                                    L     Replication, recombination and repair                          15
                                    B     Chromatin structure and dynamics                               –
Cellular processes and signaling    D     Cell cycle control, cell division, chromosome partitioning     12
                                    Y     Nuclear structure                                              –
                                    V     Defense mechanisms                                             6
                                    T     Signal transduction mechanisms                                 16
                                    M     Cell wall ⁄ membrane ⁄ envelope biogenesis                     18
                                    N     Cell motility                                                  2
                                    Z     Cytoskeleton                                                   –
                                    W     Extracellular structures                                       –
                                    U     Intracellular trafficking, secretion, and vesicular transport  6
                                    O     Posttranslational modification, protein turnover, chaperones   19
Metabolism                          C     Energy production and conversion                               6
                                    G     Carbohydrate transport and metabolism                          33
                                    E     Amino acid transport and metabolism                            23
                                    F     Nucleotide transport and metabolism                            4
                                    H     Coenzyme transport and metabolism                              4
                                    I     Lipid transport and metabolism                                 5
                                    P     Inorganic ion transport and metabolism                         13
                                    Q     Secondary metabolites biosynthesis, transport and catabolism   9
Poorly characterized                R     General function prediction only                               81
                                    S     Function unknown                                               86
[a] ‘–’ indicates there is no newly annotated gene in this COG functional category.
The study was supported by the National Natural
Science Foundation of China (30600119), the
National Basic Research Program of China
(2003CB114400) and the scientific research funds
of Shandong University of Technology (grant
2004KJM29 and 04KQ14).
References
1 Nielsen P & Krogh A (2005) Large-scale prokaryotic
gene prediction and comparison to genome annotation.
Bioinformatics 21, 4322–4329.
2 Skovgaard M, Jensen LJ, Brunak S, Ussery D & Krogh
A (2001) On the total number of genes and their length
distribution in complete microbial genomes. Trends
Genet 17, 425–428.
3 Bell KS, Sebaihia M, Pritchard L, Holden MT, Hyman
LJ, Holeva MC, Thomson NR, Bentley SD, Churcher
LJ, Mungall K et al. (2004) Genome sequence of the
enterobacterial phytopathogen Erwinia carotovora
subsp. atroseptica and characterization of virulence fac-
tors. Proc Natl Acad Sci USA 101, 11105–11110.
4 Pérombelon MCM (2002) Potato diseases caused by
soft rot erwinias: an overview of pathogenesis. Plant
Pathol 51, 1–12.
5 Toth IK, Bell KS, Holeva MC & Birch PRJ (2003) Soft
rot erwiniae: from genes to genomes. Mol Plant Pathol
4, 17–30.
6 Altschul SF, Madden TL, Schäffer AA, Zhang J,
Zhang Z, Miller W & Lipman DJ (1997) Gapped
BLAST and PSI-BLAST: a new generation of protein
database search programs. Nucleic Acids Res 25, 3389–
3402.
7 Schäffer AA, Aravind L, Madden TL, Shavirin S,
Spouge JL, Wolf YI, Koonin EV & Altschul SF
(2001) Improving the accuracy of PSI-BLAST protein
database searches with composition-based statistics
and other refinements. Nucleic Acids Res 29, 2994–
3005.
8 Pearson WR (1990) Rapid and sensitive sequence com-
parison with FASTP and FASTA. Methods Enzymol
183, 63–98.
9 Mulder NJ, Apweiler R, Attwood TK, Bairoch A,
Bateman A, Binns D, Bork P, Buillard V, Cerutti L,
Copley R et al. (2007) New developments in the Inter-
Pro database. Nucleic Acids Res 35, D224–D228.
10 Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF,
Itoh M, Kawashima S, Katayama T, Araki M & Hira-
kawa M (2006) From genomics to chemical genomics:
new developments in KEGG. Nucleic Acids Res 34,
D354–D357.
11 Zhang CT & Zhang R (1991) Analysis of distribution
of bases in the coding sequences by a diagrammatic
technique. Nucleic Acids Res 19, 6313–6317.
12 Dillon WR & Goldstein M (1984) Multivariate Analysis
– Methods and Applications (Wiley Series in Probability
and Mathematical Statistics). Wiley, New York, NY.
13 Trifonov EN (1987) Translation framing code and
frame-monitoring mechanism as suggested by the analy-
sis of mRNA and 16S rRNA nucleotide sequences.
J Mol Biol 194, 643–652.
14 Zhang CT & Chou KC (1994) A graphic approach to
analyzing codon usage in 1562 E. coli protein coding
sequences. J Mol Biol 238, 1–8.
15 Gupta SK, Majumdar S, Bhattacharya TK & Ghosh
TC (2000) Studies on the relationships between the
synonymous codon usage and protein secondary structural
units. Biochem Biophys Res Commun 269, 692–696.
16 Pan A, Dutta C & Das J (1998) Codon usage in highly
expressed genes of Haemophillus influenzae and Myco-
bacterium tuberculosis: translational selection versus
mutational bias. Gene 215, 405–413.
17 Chiusano ML, Alvarez-Valin F, Di Giulio M,
D’Onofrio G, Ammirato G, Colonna G & Bernardi
G (2000) Second codon positions of genes and the
secondary structures of proteins. Relationships and
implications for the origin of the genetic code. Gene
261, 63–69.
18 Tatusov RL, Koonin EV & Lipman DJ (1997) A geno-
mic perspective on protein families. Science 278, 631–
637.
19 Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR,
Kiryutin B, Koonin EV, Krylov DM, Mazumder R,
Mekhedov SL, Nikolskaya AN et al. (2003) The COG
database: an updated version includes eukaryotes. BMC
Bioinformatics 4, 41.
20 Chen LL & Zhang CT (2003) Gene recognition from
questionable ORFs in bacterial and archaeal genomes.
J Biomol Struct Dyn 21, 99–110.
21 Guo FB, Ou HY & Zhang CT (2003) ZCURVE: a new
system for recognizing protein-coding genes in bacterial
and archaeal genomes. Nucleic Acids Res 31, 1780–
1789.
22 Zhang CT & Wang J (2000) Recognition of protein
coding genes in the yeast genome at better than 95%
accuracy based on the Z curve. Nucleic Acids Res 28,
2804–2814.
23 Gao F & Zhang CT (2004) Comparison of various
algorithms for recognizing short coding sequences of
human genes. Bioinformatics 20, 673–681.
Supplementary material
The following supplementary material is available
online:
Table S1. Synonyms, COG functional categories and
predicted functions (products) of 318 Eca hypothetical
genes with BLAST search identity > 30%, E value < 1e-10
and aligned length covering at least 80% of each
gene.
Table S2. Synonyms of the 114 recognized membrane
proteins.
Table S3. Synonyms of the 86 recognized exported
proteins.
This material is available as part of the online article
from http://www.blackwell-synergy.com
Please note: Blackwell Publishing are not responsible
for the content or functionality of any supplementary
materials supplied by the authors. Any queries (other
than missing material) should be directed to the corre-
sponding author for the article.