Prediction of missing enzyme genes in a bacterial metabolic network
Reconstruction of the lysine-degradation pathway of Pseudomonas aeruginosa

Yoshihiro Yamanishi1, Hisaaki Mihara2, Motoharu Osaki2, Hisashi Muramatsu3, Nobuyoshi Esaki2, Tetsuya Sato1, Yoshiyuki Hizukuri1, Susumu Goto1 and Minoru Kanehisa1

1 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Japan
2 Division of Environmental Chemistry, Institute for Chemical Research, Kyoto University, Japan
3 Department of Biology, Graduate School of Science, Osaka University, Japan

Keywords: kernel methods; lysine degradation pathway; metabolic network; missing enzymes; network inference

Correspondence: Y. Yamanishi, Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
Fax: +81 774 38 3269
Tel: +81 774 38 3270
E-mail: yoshi@kuicr.kyoto-u.ac.jp

(Received December 2006, revised 17 February 2007, accepted March 2007)

doi:10.1111/j.1742-4658.2007.05763.x

The metabolic network is an important biological network which consists of enzymes and chemical compounds. However, a large number of metabolic pathways remain unknown, and most organism-specific metabolic pathways contain many missing enzymes. We present a novel method to identify the genes coding for missing enzymes using available genomic and chemical information from bacterial genomes. The proposed method consists of two steps: (a) estimation of the functional association between the genes with respect to chromosomal proximity and evolutionary association, using supervised network inference; and (b) selection of gene candidates for missing enzymes based on the original candidate score and the chemical reaction information encoded in the EC number. We applied the proposed method to infer the metabolic network for the bacterium Pseudomonas aeruginosa from two genomic datasets: gene position and phylogenetic profiles. Next, we predicted several missing enzyme genes to reconstruct the lysine-degradation
pathway in P. aeruginosa using EC number information. As a result, we identified PA0266 as a putative 5-aminovalerate aminotransferase (EC 2.6.1.48) and PA0265 as a putative glutarate semialdehyde dehydrogenase (EC 1.2.1.20). To verify our prediction, we conducted biochemical assays and examined the activity of the products of the predicted genes, PA0265 and PA0266, in a coupled reaction. We observed that the predicted gene products catalyzed the expected reactions; no activity was seen when both gene products were omitted from the reaction.

Most biological functions involve the coordinated actions of many proteins, and the complexity of living systems arises as a result of such interactions. It is therefore important to understand biological systems by analyzing the relationships among many proteins. A challenge in recent genome science is to computationally predict the systemic functional behaviors of proteins from genomic and molecular information for industrial and other practical applications [1,2]. Recent sequencing projects and developments in biotechnology have contributed to an increasing amount of high-throughput genomic data for biomolecules and their interactions. These data are useful sources from which to computationally infer many types of biological networks [3–6].

Abbreviations: OGC, ortholog gene cluster; ROC, receiver operating curve.

FEBS Journal 274 (2007) 2262–2273 © 2007 The Authors Journal compilation © 2007 FEBS

The metabolic network is an important class of biological network, consisting of enzymes and chemical compounds. Recent developments in pathway databases, such as KEGG PATHWAY [7] and EcoCyc [8], enable us to analyze known metabolic networks. Unfortunately, most organism-specific metabolic networks contain many 'missing enzymes' in their known pathways. Because the experimental determination of metabolic networks remains challenging, even for the most basic organisms, there is a need to develop methods to infer the unknown
parts of metabolic networks and identify genes coding for missing enzymes in known metabolic pathways [9–11]. Thanks to the development of homology detection tools [12–14], enzyme genes can be easily found in fully sequenced genomes using comparative genomics [15], but it can be difficult to assign them a precise biological role within a pathway. Missing enzymes are an obstacle to understanding the functional behavior of enzymes in metabolic pathways.

There are two research directions for finding the genes of missing enzymes. The first is to use genomic information to predict candidate genes coding for the missing enzymes. Examples include using information about the gene order along the chromosome in bacterial genomes [16], gene fusion [17,18], genomic context [19,20], gene-expression patterns [21,22], statistical methods [23] and multiple genomic datasets [5,6,24]. The second approach is to use information about the chemical compounds with which the enzymes are involved. An example is the path-computation approach [25], in which all possible paths between two compounds are searched by loosening the substrate-specificity restriction. However, this system tends to produce too many candidates, and it is difficult to select reliable paths. It is more natural to use both genomic data and chemical information simultaneously, rather than to use each individually.

This study presents a novel method to identify genes coding for missing enzymes from genomic data and chemical information for bacterial genomes. First, we designed kernel-similarity measures [26] between genes based on gene positions and phylogenetic profiles. This is motivated by the observation that functionally related genes tend to be closely located along bacterial chromosomes [16,27] or to evolve in a correlated manner [28–30]. Next, we predict a global gene network by applying supervised network inference to the kernels derived from the genomic datasets, following a previously developed network inference
algorithm [24,31]. Finally, we collect genes that have potential functional relations with enzyme genes adjacent to the target missing enzyme using the original candidate score, and select genes based on the enzyme commission (EC) numbers of the target enzymes in the pathway. Figure 1 illustrates this procedure.

We applied the proposed method to the metabolic network of Pseudomonas aeruginosa and attempted to find several missing enzyme genes. We focused on the lysine-degradation pathway of P. aeruginosa (Fig. 2) because it contains many missing enzymes for which the coding genes have not yet been identified. Our survey of missing enzymes in the KEGG PATHWAY database suggests that the lysine-degradation pathway map for P. aeruginosa is missing 28 of its 62 enzymes (45%). Lysine catabolism is notable for its biochemical diversity across organisms. Enzymatic reactions in the lysine pathway in bacteria are completely different from those seen in eukaryotes and archaea, and there is also variation within bacteria. We also focused on the lysine-degradation pathway because the substrates and intermediates of the pathway are structurally simple, so the reactions can be easily examined biochemically. Thus, the computational prediction of missing enzymes can be verified using relatively simple biochemical experiments.

We selected gene candidates for some of the missing enzymes in the lysine-degradation pathway based on the candidate scores, which in turn were based on association scores with known enzymes that catalyze similar reactions according to their EC numbers. For example, we identified PA0266 as a putative 5-aminovalerate aminotransferase (EC 2.6.1.48) and PA0265 as a putative glutarate semialdehyde dehydrogenase (EC 1.2.1.20). To verify the prediction, we conducted wet-lab experiments, in which the PA0265 and PA0266 genes were cloned and expressed in Escherichia coli, and the proteins purified. The activity of PA0265 and PA0266 was examined, and we found
that the enzymes catalyzed the expected reactions. Therefore, we concluded that PA0265 is glutarate semialdehyde dehydrogenase and PA0266 is 5-aminovalerate aminotransferase. In this way, we successfully reconstructed the metabolic pathway for lysine degradation.

Fig. 1. Procedure for predicting missing enzyme genes. First, we estimated the functional associations between genes by predicting a global gene network from the chromosomal proximity and phylogenetic profiles, using the supervised network inference method. Second, we looked for sets of genes sharing high association scores with the neighbors of missing enzymes. Finally, we selected candidates for the missing enzymes based on the chemical reaction information encoded in the first three digits of the EC numbers. [Diagram panels: Gene Location; Phylogenetic Profile; Predicted Gene Network; PATHWAY Database.]

Results

Inference of potential gene network

First, we attempted to infer a global network consisting of the potential functional relationships between the genes of P. aeruginosa from two genomic datasets: the gene position along the genome, and phylogenetic profiles. Details of our network inference method are given in the Experimental procedures and the original references [24,31]. In previous studies, the usefulness of the network inference method was confirmed by a cross-validation experiment which attempted to recover the metabolic network in the KEGG PATHWAY database, as follows. In each cross-validation step, the known enzyme genes were randomly divided into two sets, the training set and the test set, in the proportion of nine to one. First, we used the training dataset for a learning process.
Second, we predicted the network involving the enzyme genes in the test set. Finally, we evaluated the accuracy of the prediction using ROC scores, defined as the area under the receiver operating curve (ROC), that is, the area under the plot of true positives as a function of false positives, normalized to 1 for a perfect prediction and 0.5 for a random prediction.

To evaluate the biological relevance of the gene position and the phylogenetic profile to metabolic networks, we computed ROC scores by applying the cross-validation test as in previous studies. Table 1 shows the ROC scores for gene position, phylogenetic profile, and the integration of both datasets. Both gene position and phylogenetic profile seemed to capture information for reconstructing the metabolic network. We evaluated the biological relevance of each data source by (ROC score − 0.5), and used these values to weight the data integration process. The resulting weights for gene position and phylogenetic profile are 0.48 and 0.52, respectively, where the sum of the weights is normalized to 1. We also observed a significant effect of integrating the two genomic datasets into a single set via the sum of the kernel-similarity matrices.

Finally, we predicted a global network for all the genes of P. aeruginosa. In the inference process, we used all the current knowledge about the metabolic network as training data. The predicted network enabled us to predict unknown functional relations between genes. The results of the predicted gene network can be obtained from http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/pae/. A web server to carry out the network inference procedure is in preparation.

Missing enzyme gene prediction

There are many missing enzymes whose coding genes have not been identified in known pathways. In this study, we focused on the reconstruction of the lysine-degradation pathway of P. aeruginosa, because this pathway contains many missing enzymes and our understanding of the detailed enzymatic behavior in this
pathway is far from complete. Figure 2 shows the lysine-degradation pathway stored in the KEGG PATHWAY database, where a green box indicates that the enzyme's gene has been identified for P. aeruginosa, and the red color indicates missing enzymes for which genes have yet to be identified. Based on the predicted gene network, we attempted to predict the candidate genes corresponding to missing enzymes in this pathway.

Fig. 2. Lysine-degradation pathway of P. aeruginosa. A small circle corresponds to one chemical compound and a rectangle corresponds to one enzyme protein. Green indicates that the coding enzyme genes have been identified, and red indicates that the coding enzyme genes have not yet been identified. '?' indicates that the enzyme has not been assigned an EC number. [KEGG pathway map 'LYSINE DEGRADATION'; image not reproduced.]

Table 1. Prediction accuracy for gene network inference: ROC scores.

Method                             ROC score
Gene position only                 0.65
Phylogenetic profile only          0.7
Integration of the genomic data    0.79

There are two paths from L-lysine to glutarate in the lysine-degradation pathway: L-lysine → 5-amino pentanamide → 5-amino pentanoate → glutarate semialdehyde → glutarate, and L-lysine → cadaverine → 1-piperideine → 5-amino pentanoate → glutarate semialdehyde → glutarate. The second pathway is known to exist for P. aeruginosa [32]. However, several of the enzyme genes involved have not been identified; therefore we focused on the second pathway, which is illustrated in Fig. 3.

We used PA1586 (EC 2.3.1.61) and PA0447 (EC 1.3.99.7) as seed genes, because they are adjacent to the missing enzymes. These enzyme genes, PA1586 and PA0447, are known to work in the lysine-degradation pathway. We looked for genes with high graphical association scores to PA1586 and PA0447 in our predicted gene network, using our original candidate score (see Experimental procedures for more details). Table 2 shows a list of the top 50 high-scoring genes. Several of these high-scoring genes may be functionally related to PA1586 and PA0447.

Taking into account the first three digits of the EC numbers, we assigned the high-scoring genes to each missing enzyme. For example, the first three digits of the EC number for PA1589 (EC 6.2.1.5) are the same as those for the missing enzyme (EC 6.2.1.6); therefore we predicted that PA1589 is a
candidate for the enzyme gene corresponding to EC 6.2.1.6. In a similar manner, we predicted PA0265 (EC 1.2.1.16) and PA0266 (EC 2.6.1.19) as enzyme gene candidates for EC 1.2.1.20 and EC 2.6.1.48, respectively. The chemical reactions between cadaverine, 1-piperideine and 5-amino pentanoate have not been assigned an EC number by the International Union of Biochemistry and Molecular Biology (IUBMB) at the time of writing. To obtain putative EC number information, we used the E-zyme system [24], which is an automatic EC number assignment system developed in the KEGG database. Using the E-zyme system, we carried out EC number predictions based on the chemical structures of 1-piperideine and 5-amino pentanoate. As a result, the E-zyme system returned EC 1.1.1.- for the chemical reaction. The list of high-scoring genes contains PA1576 (EC 1.1.1.31), so we assigned it to the missing enzyme involved in the reaction between 1-piperideine and 5-amino pentanoate. Unfortunately, the current version of the E-zyme system could not generate a prediction for the reaction between cadaverine and 1-piperideine, because there is no template information describing the target reaction in the current system.

Fig. 3. A series of chemical reactions focused on in this study. Cadaverine-based path from L-lysine to glutarate via cadaverine, 1-piperideine, 5-amino pentanoate, and glutarate semialdehyde. [Reaction steps labeled in the diagram: (F) L-lysine → cadaverine, EC 4.1.1.18; (E) cadaverine → 1-piperideine, EC ?; (D) 1-piperideine → 5-aminopentanoate, EC ?; (C) 5-aminopentanoate → glutarate semialdehyde, EC 2.6.1.48; (B) glutarate semialdehyde → glutarate, EC 1.2.1.20; (A) glutarate → glutaryl-CoA, EC 6.2.1.6.]

For EC 4.1.1.18, the list of high-scoring genes does not contain any genes whose first three EC number digits match. Therefore, we were not able to assign any specific gene to the missing enzyme
EC 4.1.1.18. However, there are many hypothetical proteins with high candidate scores in the list given in Table 2, so one of these hypothetical proteins might work as an enzyme in the target chemical reaction. Table 3 summarizes our gene assignment for the corresponding missing enzymes in the lysine-degradation pathway.

Finally, we conducted a wet-lab experiment based on biological assays in order to verify that our predicted genes were involved in the target chemical reactions. We focused on a successive reaction: 5-amino pentanoate → glutarate semialdehyde → glutarate. Recall that we predicted that PA0266 was a putative 5-aminovalerate aminotransferase (EC 2.6.1.48) and PA0265 a putative glutarate semialdehyde dehydrogenase (EC 1.2.1.20).

Expression and purification of recombinant enzymes

The PA0265 and PA0266 genes were cloned by PCR and expressed in E. coli, and the proteins were purified to homogeneity as C-terminal histidine-tagged fusion proteins. SDS/PAGE analysis of the purified PA0265 and PA0266 proteins gave single bands with subunit molecular masses of 53 and 46 kDa, respectively, in good agreement with those calculated from the amino acid sequences (53 142 and 46 285 Da, respectively). Purified PA0266 exhibits a yellow color and UV-visible spectra characteristic of a pyridoxal 5′-phosphate-dependent enzyme (data not shown).

Enzymatic activity of predicted genes

The activity of PA0265 and PA0266 was examined in a coupled reaction, in which conversion of 5-amino pentanoate → glutarate semialdehyde → glutarate was monitored by the increase in the amount of NADPH at 340 nm (Fig. 4A). We found that the enzymes catalyzed the expected reactions (Fig. 4B), and no activity was seen when both enzymes were omitted from the reaction. The reaction mixture containing only PA0266 showed a slight increase in A340, due to the formation of pyridoxamine 5′-phosphate from pyridoxal 5′-phosphate via a transamination reaction catalyzed by PA0266. Therefore, we concluded that PA0265 is glutarate semialdehyde dehydrogenase and PA0266 is 5-aminovalerate aminotransferase.

Table 2. Top 50 high-scoring genes in our candidate scores.

Score   Candidate   Annotation
0.51    PA1587      lipoamide dehydrogenase-glc (EC 1.8.1.4)
0.495   PA1591      hypothetical protein
0.48    PA1593      hypothetical protein
0.485   PA1585      2-oxoglutarate dehydrogenase (E1 subunit) (EC 1.2.4.2)
0.47    PA1594      hypothetical protein
0.46    PA1592      hypothetical protein
0.46    PA1589      succinyl-CoA synthetase alpha chain (EC 6.2.1.5)
0.46    PA1579      hypothetical protein
0.45    PA1595      hypothetical protein
0.455   PA0265      succinate-semialdehyde dehydrogenase (EC 1.2.1.16)
0.44    PA0266      4-aminobutyrate aminotransferase (EC 2.6.1.19)
0.445   PA1584      succinate dehydrogenase (B subunit) (EC 1.3.99.1)
0.43    PA1597      hypothetical protein
0.435   PA1599      probable transcriptional regulator
0.435   PA1582      succinate dehydrogenase (D subunit) (EC 1.3.99.1)
0.435   PA1578      hypothetical protein
0.435   PA1576      probable 3-hydroxyisobutyrate dehydrogenase (EC 1.1.1.31)
0.42    PA4330      probable enoyl-CoA hydratase/isomerase (EC 4.2.1.17)
0.42    PA1588      succinyl-CoA synthetase beta chain (EC 6.2.1.5)
0.42    PA1581      succinate dehydrogenase (C subunit) (EC 1.3.99.1)
0.425   PA1571      hypothetical protein
0.425   PA1570      probable transcriptional regulator
0.425   PA0456      probable cold-shock protein
0.41    PA1603      probable transcriptional regulator
0.41    PA1577      hypothetical protein
0.41    PA1573      conserved hypothetical protein
0.415   PA1601      probable aldehyde dehydrogenase
0.405   PA2013      probable enoyl-CoA hydratase/isomerase (EC 4.2.1.17)
0.4     PA1574      conserved hypothetical protein
0.39    PA1998      probable transcriptional regulator
0.39    PA1596      heat shock protein HtpG
0.39    PA1583      succinate dehydrogenase (A subunit) (EC 1.3.99.1)
0.395   PA1600      probable cytochrome c
0.38    PA1985      pyrroloquinoline quinone biosynthesis protein A
0.38    PA1604      hypothetical protein
0.385   PA1605      hypothetical protein
0.37    PA1590      branched chain amino acid transporter
0.375   PA0450      probable phosphate transporter
0.36    PA1629      probable enoyl-CoA hydratase/isomerase (EC 4.2.1.17)
0.36    PA0455      RNA helicase DbpA
0.365   PA0854      fumarate hydratase (EC 4.2.1.2)
0.365   PA0446      conserved hypothetical protein
0.35    PA2556      probable transcriptional regulator
0.35    PA2401      probable nonribosomal peptide synthetase
0.35    PA1606      hypothetical protein
0.355   PA2400      probable nonribosomal peptide synthetase
0.355   PA2250      lipoamide dehydrogenase-Val (EC 1.8.1.4)
0.355   PA1575      hypothetical protein
0.34    PA4333      probable fumarase (EC 4.2.1.2)
0.34    PA1628      putative 3-hydroxybutyryl-CoA dehydrogenase (EC 1.1.1.157)
0.3     PA3416      pyruvate dehydrogenase E1 component, beta subunit (EC 1.2.4.1)

Table 3. Assignment of genes to missing enzymes in the lysine-degradation pathway of P. aeruginosa.

Reaction          Candidate gene
A EC:6.2.1.6      PA1589 (succinyl-CoA synthetase; EC 6.2.1.5)
B EC:1.2.1.20     PA0265 (dehydrogenase; EC 1.2.1.16)
C EC:2.6.1.48     PA0266 (amino-transferase; EC 2.6.1.19)
D 5ami.1-pip      PA1576 (dehydrogenase; EC 1.1.1.31)
E Cadav.Delta     not specified
F EC:4.1.1.18     not specified

Fig. 4. Enzymatic activity of predicted genes. (A) Schematic drawing of reactions catalyzed by aminovalerate aminotransferase (PA0266) and glutarate semialdehyde dehydrogenase (PA0265). (B) Activity of PA0265 and PA0266. The reaction was carried out in the presence of PA0266 and PA0265 (a), PA0266 (b), PA0265 (c), or in the absence of the enzymes (d). [Panel A labels: 5-aminovalerate, 2-oxoglutarate, glutamate, glutarate semialdehyde, NADP+, NADPH, glutarate. Panel B axes: absorbance at 340 nm against time (s).]

Discussion

Here, we have proposed a novel method to predict genes coding for missing enzymes in metabolic pathways using genomic data and chemical information for bacterial genomes. As an application of this technique, we attempted to reconstruct the enzyme gene network of the lysine-degradation pathway in P. aeruginosa. We filled in some of the enzyme genes in the lysine-degradation pathway, for example, by predicting PA0266 as a putative 5-aminovalerate aminotransferase and PA0265 as a putative glutarate semialdehyde dehydrogenase. Recently, a report has suggested candidate genes for 5-aminovalerate aminotransferase and glutarate semialdehyde dehydrogenase in the lysine-degradation pathway of P. putida [34]. These genes have an orthologous relationship with those predicted for P. aeruginosa, so this is additional evidence for our prediction. We also confirmed the validity of our prediction by conducting biochemical assays. We examined enzyme activity in successive enzymatic reactions, and observed that the products of the genes PA0266 and PA0265 work as 5-aminovalerate aminotransferase and glutarate semialdehyde dehydrogenase, catalyzing successive chemical reactions from 5-amino pentanoate to glutarate. There is a hypothesis that the predicted gene products PA0266 and PA0265 might have broad substrate specificity. For example, the E. coli gene of EC 2.6.1.19 (on which many experimental studies have been performed) has high sequence similarity with P. aeruginosa gene PA0266, and the corresponding gene cluster structure is well conserved.

To date, techniques for reconstructing metabolic networks have depended heavily on sequence homology detection [35]. A typical computational approach to reconstructing the metabolic network from the genome sequence of a certain organism is as follows: (a) Assign an EC number to enzyme candidate genes by detecting homology based on comparative genomics across different organisms. (b) Obtain compound information, such as the substrates and products with which the enzyme genes are involved, from reaction knowledge based on the EC number. (c) Assign each enzyme gene to appropriate positions in metabolic pathway maps, created from current biochemical knowledge for many organisms. (d) Visualize metabolic pathways that are specific to a target organism. However, this procedure does not always work well in reconstructing the correct metabolic pathways, and tends to lead to many missing enzymes or gaps in known metabolic pathways. If we cannot detect a significant sequence homology with enzyme genes whose pathway information is known in other organisms, it is not possible to identify candidate genes for missing enzymes. This has been one cause of missing enzymes or pathway gaps in the predicted metabolic network, as suggested previously [9–11].

There are two possible reasons for missing enzymes in predicted pathways. First, there may be alternative paths between the two compounds on either side of the gap. To solve this, a path computation approach has been proposed [25]. This method searches all possible pathways between two compounds if the enzyme linking the compounds is missing. However, it has been pointed out that this system tends to show too many possible pathways. Second, the EC number annotation might be wrong for the enzyme linking the compounds. We often observe that the sequence homology for enzymes sharing the first three digits of the EC number is well conserved across different organisms; however, sequence homology corresponding to the substrate specificity represented by the fourth digit of the EC number is not strongly conserved. Therefore, it is suspected that wrongly annotated genes may have been the cause of some of the pathway gaps or missing enzymes. It is also suspected that many genes have been assigned incorrect EC numbers and the wrong biological roles. Even so, the first three digits of the EC number remain useful for predicting potential enzyme genes, and if the first three digits in the EC number are the same between two enzymes, those enzymes can be considered to catalyze similar types of chemical
reactions. Therefore, our gene-selection method for missing enzymes can be regarded as reasonable from a chemical viewpoint. It should also be pointed out that our method is applicable to any reaction, even when no EC numbers are assigned to the reactions, because our procedure includes the process of estimating the possible EC sub-subclass for the reactions based on biochemical structure transformation patterns [33]. There are many reactions for which EC numbers have not been assigned, especially in secondary metabolism. We expect that our approach works well for such complex metabolic pathways.

From a technical viewpoint, we transformed all the predictor datasets into kernel-similarity matrices in order to estimate functional associations between genes. In this study, we used the gene position and phylogenetic profiles because they reflect the following two properties of bacterial genomes. First, functionally interacting genes in metabolic pathways tend to be closely located along the chromosome, as seen in operon structures [16,27]. Second, functionally interacting genes in metabolic pathways tend to evolve in a correlated manner [28–30]. Performance depends on the design of the kernel-similarity measure, so there remains room for improvement in the evaluation of gene–gene similarities based on each data source. For gene position data, the incorporation of the direction of genes into the similarity would be interesting. For phylogenetic profiles, the use of a real-valued phylogenetic profile [36] might improve the performance. Additional use of other genomic information, such as gene fusion [17,18], in the framework of kernel methods will be studied in future.

Another solution to the problem of missing enzymes would be to use other experimental data such as gene-expression data [21,22]. The pattern of gene expression based on several experimental conditions makes it possible to observe the expression behavior of thousands of genes and estimate
potential functional associations between them. It has been confirmed that the gene-expression pattern of successively working enzyme pairs is more similar than that of randomly selected enzyme pairs [21]. Therefore, gene-expression data would be a useful source of additional data in our study. However, microarray technology is expensive, so the information is not always available for the target organism, and we were not able to obtain microarray gene-expression data for P. aeruginosa. Another problem is that microarray data tend to contain considerable noise. By contrast, our method opens up a new possibility for the systematic prediction of potential functional relationships between genes. Our predicted network enables us to suggest unknown gene–gene relations and estimate missing enzyme genes using just the adjacency information and comparative genomics.

The originality of this study is also seen in the collaboration between computational prediction and experimental validation. In this study, the biological validity of the prediction was confirmed by conducting a biochemical assay, and it was observed that the enzymes corresponding to the predicted genes catalyzed successive reactions in the target metabolic pathway. This type of collaborative work will become standard in research in the near future. Furthermore, we expect to identify more missing enzyme genes in other pathways by a similar application of our approach. Comprehensive identification of missing enzyme genes in the entire metabolic network will be carried out in the future.

Experimental procedures

Datasets

In this study, we focused on the metabolic pathways of P. aeruginosa. As a gold standard for the enzyme gene network, we used the KEGG PATHWAY database [7]. The resulting enzyme network contains 799 nodes and 2782 edges. Note that this network is based
on biological phenomena and represents known molecular interaction networks in various cellular processes. We obtained information about enzyme genes from the KEGG GENES database, in which EC numbers are assigned to candidate enzyme genes. At the time of writing, in P. aeruginosa, 1133 genes have been assigned at least one EC number, but only 799 have been assigned at least one precise role in metabolic pathways.

The dataset for the gene position on the genome was constructed from the KEGG GENES database. We obtained information about the start and end positions of each gene region (ORF region), and we computed all pair-wise distances between the genes. The gene position data can be regarded as a dataset representing the spatial association between genes along chromosomes.

Phylogenetic profiles were constructed from a set of ortholog gene clusters (OGCs) obtained from comprehensive cluster analysis for all the genes of fully sequenced organisms in KEGG GENES. A group of genes identified as a quasi-clique in the graph of the KEGG SSDB (sequence similarity database) is thought to be a candidate for the OGC. The concept of OGC is similar to that of the COG database [37]. In this study, we focus on organisms with fully sequenced genomes, including 11 eukaryotes, 16 archaea, and 118 bacteria. Each phylogenetic profile consists of a string of bits, in which the presence and absence of an orthologous gene are coded as 1 and 0, respectively, across the above 145 organisms.

We obtained chemical information for the enzymes, for example chemical reactions, substrates and products, from their EC numbers, using the KEGG LIGAND database [38], which contains 11 817 compounds and 6349 reactions at the time of writing. EC numbers are a numerical classification scheme for enzymes, based on the chemical reactions they catalyze. We focused on the first three digits in the EC number, because the fourth digit in the EC number is often just a serial number. In cases where a target reaction has not been
assigned an EC number, we used the E-zyme system, which was recently developed in the KEGG database. The E-zyme system is an EC number assignment system for chemical reactions, which enabled us to estimate the first three digits of the EC number for the target reaction by taking into account the structural information of two given chemical compounds [33].

Data representation and integration

To deal with the heterogeneity of genomic datasets, we propose to transform all the datasets into kernel-similarity matrices [26]. In recent years, kernel methods such as the support vector machine have received much attention in computational biology. An advantage of using kernel methods is that we can apply a variety of statistical analyses to any structured data, for example graphs, strings and trees.

Suppose that we have a set of genes $\{x_i\}_{i=1}^{n}$, where n is the number of genes. For the gene position data, we computed all the pair-wise distances between genes along the chromosome, where the distance $d_{ij}$ between gene i and gene j is defined by the number of nucleotides between the end of the i-th gene and the start of the j-th gene along the chromosomes. We then derived a distance kernel using the formula

\[ K_{\mathrm{position}}(x_i, x_j) = \exp(-d_{ij}/h) \quad \text{for } i, j = 1, 2, \ldots, n, \]

where h is a positive constant parameter. In this study the parameter h is set to 105. This means that the larger the distance between two genes along the chromosome, the smaller the value of the similarity score. The resulting kernel matrix (similarity matrix) is denoted as $K_{\mathrm{position}}$.

The phylogenetic profiles are sets of numerical vectors. Suppose that we have n genes and q organisms. Let us define x as the phylogenetic profile for each gene (145 dimensional vector) and y as the phylogenetic profile for each organism (5525 dimensional vector). Here we used a weighted linear kernel (weighted inner product) as follows:

\[ K_{\mathrm{phylogenetic}}(x_i, x_j) = x_i^{T} W x_j \quad \text{for } i, j = 1, 2, \ldots, n, \]

where W is a diagonal matrix whose elements are given as
\[ (W)_{kk} = \mathrm{corr}(y_{\mathrm{pae}}, y_k) \quad \text{for } k = 1, 2, \ldots, q, \]

where q is the number of organisms, $y_{\mathrm{pae}}$ is the phylogenetic profile for P. aeruginosa, and corr(·) refers to Pearson's correlation coefficient. This means that the more similar the gene inheritance pattern between two genes, the larger the value of the similarity score. The resulting kernel-similarity matrix is denoted as $K_{\mathrm{phylogenetic}}$. The weight is introduced to reduce the effect of organisms related to P. aeruginosa.

All the kernel-similarity matrices are supposed to be normalized so that the diagonal elements are all 1. This means that the maximum value of the similarity score is 1 and the minimum value of the similarity score is […].

To integrate the above information of gene position and phylogenetic profile into a single one, we constructed a new kernel-similarity matrix by taking the weighted sum of the above kernel matrices as follows:

\[ K_{\mathrm{genomic}} = w_1 K_{\mathrm{position}} + w_2 K_{\mathrm{phylogenetic}}. \]

The usefulness of this type of data integration has been shown previously [24,39].

Network inference

A straightforward approach to network reconstruction is a similarity-based approach, which is based on an assumption that functionally related enzyme pairs are likely to share high similarity with respect to a given dataset. Intuitively, the kernel value $K(x_i, x_j)$ can often be considered as a measure of similarity between gene $x_i$ and gene $x_j$. This strategy is therefore to predict an edge between two genes whenever the kernel value between these genes is above a threshold to be determined. We refer to this approach as the direct approach. The discrete version of this approach corresponds to the joint graph method [17]. However, we sometimes meet cases in which gene pairs sharing high similarity based on the data do not have any functional relation.

In this study, we used a recently proposed algorithm to perform the supervised inference of the metabolic gene
network [24,31]. As opposed to the direct approach, these methods require a partial knowledge of the true metabolic network. An advantage of using the supervised network inference method is that we can distinguish functionally related gene pairs from functionally meaningless gene pairs that merely have numerically high similarity values based on the data. This formalism is more suitable to our current situation, because we can obtain partially known networks from, for example, the KEGG PATHWAY database.

Here, we make a brief review of the supervised network inference method. This algorithm involves a training process, where a mapping of all genes to a low-dimensional space is learned by exploiting the partial knowledge of the network, and a test process where new edges are inferred. Roughly speaking, the training process finds a projection f(·) which minimizes the following criterion:

\[ \sum_{i \sim j} \left( f(x_i) - f(x_j) \right)^2, \]

where $i \sim j$ means gene i and gene j are adjacent on the training network. Note that $f(x) = (f^{(1)}(x), f^{(2)}(x), \ldots, f^{(L)}(x))^{T}$ and L is the number of features of interest. The test process is simply the direct approach performed after genes are mapped to the low-dimensional feature space, that is, pairs of genes separated by short distances are connected.

Following the spirit of the direct approach, we use a similarity measure to evaluate the closeness between genes in the feature space. In this study, Pearson's correlation coefficient between $f(x_i)$ for gene i and $f(x_j)$ for gene j is used as an indicator of the presence or absence of edges. This is referred to as the graphical association score, and the resulting matrix whose elements represent the graphical association scores is denoted as S. For example, $S(x_i, x_j)$ represents a graphical association score between genes $x_i$ and $x_j$. High scoring gene pairs are expected to be connected in the target network, therefore the output of this algorithm is thought of as a weighted graph. In this study, we adopted the
kernel CCA-based algorithm [24], and set the number of features L (dimension of the feature space) to 50 and the regularization parameter k (trade-off parameter to avoid over-fitting in the training process) to 0.1 in the application, because the usefulness of those parameter values had been confirmed through systematic cross-validation experiments in our previous studies [24].

Selecting candidate genes coding for missing enzymes

Missing enzymes in metabolic pathways are found visually by looking at the connectivity between the enzyme genes on the pathway map reflecting current pathway knowledge. Suppose that there is a pathway hole between known enzyme gene a and known enzyme gene b, and this pathway hole consists of missing enzymes. To find genes coding for such missing enzymes, we search for a set of genes having a high graphical association score with the known enzyme genes a and b in our predicted network. More generally, suppose that there are multiple known enzyme genes around a target pathway hole as $A = \{a_1, a_2, \ldots, a_{|A|}\}$, where |A| is the number of known enzyme genes that are adjacent to missing enzymes in a target pathway hole. We define the candidate score for a gene x as follows:

\[ \frac{1}{|A|} \sum_{p=1}^{|A|} S(x, a_p), \]

where S is the graphical association matrix whose elements correspond to weighted edges in the predicted network. High-scoring genes are chosen as candidates for target missing enzymes.

We then select genes for which the first three digits of the EC number are the same as those of the corresponding missing enzymes. This strategy is based on the following properties of the EC numbers. The first three digits of the EC number represent the chemical reaction types with which an enzyme is involved, while the fourth digit represents the substrate specificity or serial number [24]. Therefore, enzymes that share the first three digits of the EC number are suspected of catalyzing similar reactions.

Cloning and gene expression

DNA fragments
containing the PA0265 and PA0266 genes were amplified by PCR from the genomic DNA of P. aeruginosa PAO1 (M Olson, University of Washington, Seattle, WA) and cloned into pET21a(+) (Novagen, Madison, WI). The primers used for the PCR cloning of PA0265 were as follows: 5’-GGAATTCCATATGCAACTCAAAGATGCCAAGCTG-3’ and 5’-CCCAAGCTTGATACCGCCCAGGCAGAGGTACTTG-3’. The primers used for the PCR cloning of PA0266 were as follows: 5’-GGAATTCCATATGAGCAAGACCAACGAATCCC-3’ and 5’-CCGCTCGAGAGCGAGTTCGTCGAAGCACTCGG-3’. PCR was performed using KOD-plus DNA polymerase (Toyobo Co., Ltd, Osaka, Japan) with 30 cycles of 94 °C for 30 s, 60 °C for 30 s, and 68 °C for 120 s. The resulting PA0265 DNA fragment was digested with NdeI and HindIII, and the PA0266 fragment was digested with NdeI and XhoI. Each digested fragment was ligated into the corresponding sites of pET21a(+) (Novagen) to obtain pETPA0265 and pETPA0266. The proteins with a C-terminal His6-tag were overexpressed in the E. coli BL21(DE3) cells carrying pETPA0265 or pETPA0266 at 37 °C (for PA0265) and 28 °C (for PA0266; the lower temperature was to prevent the formation of inclusion bodies).

PA0265 was purified as follows. E. coli BL21(DE3)/pETPA0265 cells were harvested, resuspended in binding buffer (20 mM Tris/HCl, pH 7.9, […] mM imidazole), and disrupted by sonication. Cell debris was removed by centrifugation. The resulting supernatant was loaded onto a 10 mL His-Bind column (Novagen) equilibrated with the binding buffer. The column was washed with 200 mL of a wash buffer (20 mM Tris/HCl, pH 7.9, 60 mM imidazole). The enzyme was eluted using a linear gradient of 60 to 500 mM imidazole in a buffer. The enzyme fractions were pooled and dialyzed against 20 mM Tris/HCl (pH 8.0). The purified enzyme was concentrated and stored at −80 °C until use. Purification of PA0266 was performed in the same
manner as the purification of PA0265, except that all buffers contained 20 µM pyridoxal 5’-phosphate.

Enzyme assay

A coupled enzymatic reaction was carried out in 100 mM Tris/HCl (pH 8.0) containing 20 mM 5-amino pentanoate, 20 mM α-ketoglutaric acid, 0.2 mM NADP+, 0.1 mM pyridoxal 5’-phosphate, 50 mg·mL⁻¹ PA0266, and 20 mg·mL⁻¹ PA0265 at 35 °C. An increase in A340 due to the formation of NADPH was monitored with a UV-2450 spectrophotometer (Shimadzu, Kyoto, Japan).

Acknowledgements

This study was supported by ICR Grants for Young Scientists, grants from the Ministry of Education, Culture, Sports, Science and Technology, and the Japan Science and Technology Corporation. The computational resource was provided by the Bioinformatics Center, Institute for Chemical Research, Kyoto University.

References

1 Kanehisa M (2001) Prediction of higher order functional networks from genomic data. Pharmacogenomics 2, 373–385.
2 Kanehisa M & Bork P (2003) Bioinformatics in the post-sequence era. Nat Genet 33, 305–310.
3 Toh H & Horimoto K (2002) Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling. Bioinformatics 18, 287–297.
4 Covert MW, Knight EM, Reed JL, Herrgard MJ & Palsson BO (2004) Integrating high-throughput and computational data elucidates bacterial networks. Nature 429, 92–96.
5 Hu Z, Mellor J, Wu J, Yamada T, Holloway D & Delisi C (2005) VisANT: data-integrating visual framework for biological networks and modules. Nucleic Acids Res 33, W352–W357.
6 von Mering C, Huynen M, Jaeggi D, Schmidt S, Bork P & Snel B (2003) STRING: a database of predicted functional associations between proteins. Nucleic Acids Res 31, 258–261.
7 Kanehisa M, Goto S, Kawashima S, Okuno Y & Hattori M (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res 32, D277–D280.
8 Keseler IM, Collado-Vides J, Gama-Castro S, Ingraham J, Paley S, Paulsen IT, Peralta-Gil M & Karp PD (2005) EcoCyc: a comprehensive database resource for
Escherichia coli. Nucleic Acids Res 33, D334–D337.
9 Karp PD (2004) Call for an enzyme genomics initiative. Genome Biol 5, 401.
10 Osterman A & Overbeek R (2003) Missing genes in metabolic pathways: a comparative genomics approach. Curr Opin Chem Biol 7, 238–251.
11 Francke C, Siezen RJ & Teusink B (2005) Reconstructing the metabolic network of a bacterium from its genome. Trends Microbiol 13, 550–558.
12 Smith TF & Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147, 195–197.
13 Altschul SF, Gish W, Miller W, Myers E & Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215, 403–410.
14 Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W & Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389–3402.
15 Brenner SE, Chothia C & Hubbard TJP (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc Natl Acad Sci USA 95, 6073–6078.
16 Overbeek R, Fonstein M, D’Souza M, Pusch GD & Maltsev N (1999) The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA 96, 2896–2901.
17 Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO & Eisenberg D (1999) A combined algorithm for genome-wide prediction of protein function. Nature 402, 83–86.
18 Enright AJ, Iliopoulos I, Kyrpides NC & Ouzounis CA (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature 402 (6757), 25–26.
19 Snel B, Lehmann G, Bork P & Huynen MAB (2000) STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res 28, 3442–3444.
20 Huynen M, Snel B, Lathe W & Bork P (2000) Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res 10, 1204–1210.
21 Kharchenko P, Vitkup D & Church GM (2004) Filling gaps in a metabolic network using
expression information. Bioinformatics 20, 449–453.
22 David H, Hofmann G, Oliveira AP, Jarmer H & Nielsen J (2004) Metabolic network driven analysis of genome-wide transcription data from Aspergillus nidulans. Genome Biol 7, R108.
23 Green ML & Karp PD (2004) A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics 5, 76.
24 Yamanishi Y, Vert J-P & Kanehisa M (2004) Protein network inference from multiple genomic data: a supervised approach. Bioinformatics (in ISMB2004) 20, i363–i370.
25 Goto S, Bono H, Ogata H, Fujibuchi W, Nishioka T, Sato K & Kanehisa M (1996) Organizing and computing metabolic pathway data in terms of binary relations. Pacific Symp Biocomputing 2, 175–186.
26 Schoelkopf B, Tsuda K & Vert J-P (2004) Kernel Methods in Computational Biology. MIT Press, Cambridge, MA.
27 Ogata H, Fujibuchi W, Goto S & Kanehisa M (2000) A heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters. Nucleic Acids Res 28, 4029–4036.
28 Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D & Yeates TO (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 96, 4285–4288.
29 Goh C, Bogan AA, Joachimiak M, Walther D & Cohen FE (2000) Co-evolution of proteins with their interaction partners. J Mol Biol 299, 403–410.
30 Pazos F & Valencia A (2001) Similarity of phylogenetic trees as indicator of protein–protein interaction. Protein Eng 14, 609–614.
31 Vert J-P & Yamanishi Y (2005) Supervised graph inference. Adv Neural Inform Process Systems 17, 1433–1440.
32 Fothergill JC & Guest JR (1977) Catabolism of L-lysine by Pseudomonas aeruginosa. J Gen Microbiol 99, 139–155.
33 Kotera M, Okuno Y, Hattori M, Goto S & Kanehisa M (2004) Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions. J Am Chem Soc 126, 16487–16498.
34 Revelles O, Espinosa-Urgel M, Fuhrer T, Sauer U &
Ramos JL (2005) Multiple and interconnected pathways for L-lysine catabolism in Pseudomonas putida KT2440. J Bacteriol 187, 7500–7511.
35 Bono H, Ogata H, Goto S & Kanehisa M (1998) Reconstruction of amino acid biosynthesis pathways from the complete genome sequence. Genome Res 8, 203–210.
36 Marcotte EM, Xenarios I, van Der Bliek AM & Eisenberg D (2000) Localizing proteins in the cell from their phylogenetic profiles. Proc Natl Acad Sci USA 97, 12115–12120.
37 Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND & Koonin EV (2001) The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 29, 22–28.
38 Goto S, Okuno Y, Hattori M, Nishioka T & Kanehisa M (2002) LIGAND: database of chemical compounds and reactions in biological pathways. Nucleic Acids Res 30, 402–404.
39 Yamanishi Y, Vert J-P, Nakaya A & Kanehisa M (2003) Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis. Bioinformatics (in ISMB2003) 19, i323–i330.
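
Supplementary sketch

The distance kernel described under "Data representation and integration" and the candidate score described under "Selecting candidate genes coding for missing enzymes" can be illustrated with a short Python example. This is a minimal sketch under stated assumptions, not the implementation used in the study. The function names (`position_kernel`, `candidate_scores`, `select_candidates`), the toy coordinates and the toy EC annotations are hypothetical. The chromosome is treated as linear, and the distance is made symmetric by taking the gap between the two ORFs. The graphical association matrix S, which the study obtains from the supervised kernel CCA step, is simply taken as a given input here.

```python
import numpy as np


def position_kernel(starts, ends, h):
    """Distance kernel K(x_i, x_j) = exp(-d_ij / h) on gene coordinates.

    Assumption: d_ij is the number of nucleotides in the gap between the
    two ORFs on a linear coordinate axis (0 if they overlap or i == j).
    """
    starts = np.asarray(starts, dtype=float)
    ends = np.asarray(ends, dtype=float)
    gap = (np.maximum(starts[:, None], starts[None, :])
           - np.minimum(ends[:, None], ends[None, :]))
    d = np.maximum(gap, 0.0)
    return np.exp(-d / h)


def candidate_scores(S, known_idx):
    """Candidate score (1/|A|) * sum_p S(x, a_p) for every gene x.

    S is a precomputed graphical association matrix (assumed given);
    known_idx lists the row indices of the known enzyme genes in A.
    """
    S = np.asarray(S, dtype=float)
    return S[:, list(known_idx)].mean(axis=1)


def ec_prefix(ec):
    """Return the first three digits of an EC number string."""
    return ".".join(ec.split(".")[:3])


def select_candidates(S, known_idx, gene_ecs, target_ec, top=5):
    """Rank genes by candidate score, keeping those whose EC annotation
    shares its first three digits with the missing enzyme."""
    scores = candidate_scores(S, known_idx)
    known = {int(i) for i in known_idx}
    target = ec_prefix(target_ec)
    ranked = np.argsort(-scores, kind="stable")  # highest score first
    hits = []
    for i in ranked:
        i = int(i)
        if i in known:
            continue  # genes in A are already placed in the pathway
        if any(ec_prefix(ec) == target for ec in gene_ecs[i]):
            hits.append((i, float(scores[i])))
    return hits[:top]


if __name__ == "__main__":
    # Toy data only: three genes, two close together and one far away.
    K = position_kernel([0, 1500, 400000], [1000, 2500, 401000], h=1000.0)
    print(np.round(K, 3))

    # Toy association matrix and illustrative EC annotations for four genes.
    S = np.array([[1.0, 0.8, 0.1, 0.6],
                  [0.8, 1.0, 0.2, 0.4],
                  [0.1, 0.2, 1.0, 0.3],
                  [0.6, 0.4, 0.3, 1.0]])
    gene_ecs = [["1.2.1.20"], ["2.6.1.48"], ["2.6.1.1"], ["1.2.1.3", "3.5.1.30"]]
    print(select_candidates(S, [0, 1], gene_ecs, "1.2.1.20"))
```

In this sketch the EC filter is applied after ranking, so a gene with a high score but a different reaction type is skipped rather than reported. Whether the filter is applied before or after ranking does not change which genes are selected, only how many are inspected.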