SCIENTIFIC JOURNAL OF HANOI METROPOLITAN UNIVERSITY − VOL.62/2022

USING SOME BIOINFORMATIC TOOLS TO MINING GENES CODING CELLOBIOHYDROLASE FROM METAGENOME DATA OF THE BACTERIA SURROUNDING WHITE-ROT FUNGI (Trametes versicolor) IN CUC PHUONG NATIONAL PARK

Nguyen Thi Binh1*, Le Thi Thu Hong2, Truong Nam Hai2
Hanoi Metropolitan University; Academy of Science and Technology, Vietnam Academy of Science and Technology

Abstract: Cellobiohydrolase (EC 3.2.1.91) is one of the important enzymes involved in cellulose hydrolysis. In this study, gene sequences encoding cellobiohydrolase were extracted from the metagenome DNA data of microorganisms surrounding white-rot fungi in Cuc Phuong National Park based on the KEGG database. 73 ORFs encoding cellobiohydrolase were obtained, of which 15 ORFs contained complete genes and 6 ORFs had functional regions. The expression level of the proteins in E. coli was estimated with Periscope software, which showed that the gene code GL0212614 had the highest expression level, 742 mg/l. The secondary and tertiary structures of GL0212614 were predicted with Phyre2: the structure was modelled on the c3nfvA template with 46% coverage and 100% confidence. The secondary structure comprises 25% α helix, 29% β helix, 2% TM helix and 14% unidentified. GL0212614 is predicted to be an acidic enzyme with an optimal temperature for enzyme activity of 55°C-65°C. These results are an important basis for choosing gene expression conditions.

Keywords: Bioinformatics, cellobiohydrolase, DNA metagenome, E. coli, expression level

Received 10 May 2022
Revised and accepted for publication 26 July 2022
(*) Email: ntbinh@daihocthudo.edu.vn

INTRODUCTION

Cellulose is one of the most important and abundant biomass sources today. Degrading this biomass effectively requires the participation of cellulase enzymes: endo-1,4-β-D-glucanase (endocellulase, EC 3.2.1.4), exo-1,4-β-D-glucanase (exocellulase or cellobiohydrolase, EC 3.2.1.91) and β-glucosidase
(cellobiose hydrolase, EC 3.2.1.21). Endoglucanases (endocellulases) cleave at random points within the cellulose chain, producing oligosaccharides of variable size. Exocellulases (cellobiohydrolases) act on the terminal ends of the oligosaccharide chains produced by endocellulases, cleaving glycosidic bonds and releasing glucose or cellobiose [1]. β-glucosidase breaks cellobiose down into glucose molecules. Of these three groups of enzymes, cellobiohydrolase is an important component of the cellulase system and plays a major role in biofuel production from plant biomass [2].

Cellobiohydrolase is usually produced by fungi, but many bacteria also carry genes encoding this enzyme. Because microorganisms have a rather distinctive cellulosomal system, many studies have investigated and exploited their cellobiohydrolase genes, for example the gene encoding cellobiohydrolase from Clostridium clariflavum [2], the gene HmCel6A and its variant HmCel6A-3SNP from bacteria in a hot spring area [3], and the gene Cel6A from Penicillium [4].

Soil is a promising ecosystem with abundant and diverse microorganisms. It is considered an important source in the search for new enzymes that degrade cellulose with high efficiency [5], especially the area surrounding white-rot fungi. White-rot fungi can effectively metabolize all the components of wood, and this hydrolysis is often associated with enzymes from bacteria living in the same ecosystem. While the fungi decompose wood, redox reactions occur that acidify the environment; in addition, the fungi release secondary metabolites into the environment. Bacteria that survive under these conditions must therefore have properties suited to that environment.

To efficiently exploit genes from microorganisms in different ecosystems, metagenomics techniques have been used to search for new
genes from unculturable microorganisms. Sequencing yields very large metagenome datasets. To exploit these data efficiently, bioinformatics tools are used to screen and predict candidate genes encoding proteins of interest before experimental studies are conducted. In this study, we present how to use some bioinformatics tools to mine new cellobiohydrolase genes from metagenome DNA data of microorganisms surrounding white-rot fungi in Cuc Phuong National Park.

MATERIALS AND METHODS

Resources: The 51.8 Gb of metagenome DNA data from the microbial sample surrounding the wood-degrading white-rot fungus (T. versicolor) in the Cuc Phuong rainforest were sequenced using the Illumina HiSeq sequencing system (Illumina, San Diego, USA) at BGI, Hong Kong.

Research Methods

Prediction of ORFs using MetaGene Annotator (MGA) software: The short reads of the 51.8 Gb of sequenced metagenome DNA data were assembled into 2,611,883 contigs using IDBA software (http://www.cs.hku.hk/~alse/idba_ud). The mean length was 898 bp, and 4,104,872 ORFs were identified using the MGA software (http://metagene.nig.ac.jp/metagene/metagene.html). These ORFs were then compared with the KEGG (Kyoto Encyclopedia of Genes and Genomes) data to find the ORF sequences encoding cellobiohydrolase.

Prediction of functional regions of ORFs using Pfam and HMMER: Pfam (http://pfam.xfam.org/) is a database holding a large collection of protein families and domains. To predict the functional regions of an ORF with Pfam, we submitted the protein sequences with an e-value of 1.0 together with a personal e-mail address and confirmed the submission via the HMMER website; the results are returned after 2-3 days. HMMER is online software that allows functional regions of proteins in Pfam to be predicted quickly, based on a representative HMM model (https://www.ebi.ac.uk/Tools/hmmer/search/phmmer).

Prediction of protein expression level inferred from ORF using
Periscope software: Protein expression levels in E. coli cells were predicted using the Periscope software available at http://lightning.med.monash.edu/periscope/. Periscope classifies the expression of soluble proteins into three levels (high, moderate, low) and also predicts the amount of soluble protein in mg/l.

Predicting the spatial structure of proteins: Phyre2 software (http://www.sbg.bio.ic.ac.uk/~phyre2/html/page.cgi?id=index) was used. To predict the higher-order structure of a protein, the user submits the protein sequence; the server determines the secondary and tertiary structure of the models, the domain composition, and the model quality. The structure prediction results are returned to the sender's e-mail address.

Prediction of some physical properties of proteins: The AcalPred software at http://lin-group.cn/server/AcalPred was used to predict whether a protein is acidic or alkaline. The user enters the target protein sequence into the search box, and the software returns the acidic or alkaline classification within a few minutes. TBI software (http://www.tbi.org.tw/tools/) was used to predict the optimum temperature of enzyme activity. The inputs to TBI are amino acid sequences, and the results are available within a few minutes.

RESULTS AND DISCUSSION

3.1 Prediction of ORFs encoding the enzyme cellobiohydrolase

Based on the KEGG database and using the MGA software, 73 ORFs were predicted to encode cellobiohydrolase. Of these, 15 ORFs (20.55%) contain the entire gene (complete gene); of the remainder, 11 ORFs lack the 3' end, 5 ORFs lack the 5' end and 42 ORFs lack both the 5' and 3' ends. In the analysis of genes encoding cellobiohydrolase, we prioritized the complete ORFs for further analysis.

3.2 Analysis of functional regions of ORF

Proteins usually consist of one or more functional regions called domains, so searching for the domains present in a protein provides insight into its function. To evaluate the function of enzymes, we
analyzed the domains of the 15 complete ORFs. Of these, 6 ORFs had functional domains: one ORF each with the Alginate_lyase, Amidase3, CBM2, CBP_BcsO, GH128 + Laminin G3 and Znribbon domains (the domains of the genes are shown in Figure 1). These domains are involved in the function of the genes. The subsequent predictions were therefore carried out on the complete ORFs with defined functional regions.

Figure 1. Diagram showing the functional domains of the ORFs

3.3 Prediction of expression levels of genes encoding the enzyme cellobiohydrolase

E. coli is considered the most popular recombinant protein expression system today. Expressing proteins in soluble form in E. coli not only facilitates purification of the target protein but also improves the chance of obtaining structurally intact and biologically active protein. The expression level of soluble protein in E. coli was predicted with Periscope software for the 6 complete ORFs with identified functional regions. The predicted expression levels of the ORFs are shown in Table 1.

Table 1. Predicted expression level of cellobiohydrolase genes in E. coli

No.  Gene code   Domain               Expression level (mg/l)
1    GL0212614   Alginate_lyase       742.5445
2    GL0221923   Amidase3             9.2954
3    GL2034110   CBM2                 9.1483
4    GL0879211   CBP_BcsO             15.3828
5    GL0058533   GH128 + Laminin G3   13.5843
6    GL0733968   Znribbon             0.1945

The predictions showed that the gene code GL0212614, containing the Alginate_lyase domain, had the highest expression level, 742 mg/l. The remaining gene codes all had low predicted expression levels, which would make further expression studies difficult. The gene GL0212614 was therefore selected for property estimation before further experiments. The gene sequence and amino acid sequence of GL0212614 are shown in Figure 2.

atg aaa gta att gtt ttc ctg att tta atg gtg gtt cta aac agc tgt tct ttg gct ttt
M K V I V F L I L M V V L N S C S L A F
gcc caa tca ttt gtt cat ccg ggt gga tta cat acc ctc gcc gac tta aac cga atg aaa
A Q S F V H P G G L H T L A D L N R M K
gat atg gtg aag aag cgg gcg cat cca tgg ata gac agt tgg aac aaa ctt atc caa gat
D M V K K R A H P W I D S W N K L I Q D
cca ctt gca caa aac acc tat aca gct gca ccc aag gca aat atg ggc gat agt cgg cag
P L A Q N T Y T A A P K A N M G D S R Q
cgt gca tca acc gat gcg cac gcg gct tat ttg aat gcc ata cgc tgg tac atc aca ggt
R A S T D A H A A Y L N A I R W Y I T G
gat cgc agt tat ggg gat tgt gcg att tcc atc tgt aac gca tgg tcc ggc acc gtt gat
D R S Y G D C A I S I C N A W S G T V D
cga gtg cca tca ggt gta gac att ccc gga ctg agt gga atc gct atc gct gag ttt gca
R V P S G V D I P G L S G I A I A E F A
ttg gcc gca gaa gta ctt cgg ctg aat gaa cgg tgg gaa atc gat gaa att agg cgt ttt
L A A E V L R L N E R W E I D E I R R F
aaa acc atg atg act acc tat ttt tat ccg gtt tgc cat gat ttc ttg acg aac cat gct
K T M M T T Y F Y P V C H D F L T N H A
gga agg tgt gcc gat tat ttt tgg gca aac tgg gat gcc tgt aat ata gct gca tta att
G R C A D Y F W A N W D A C N I A A L I
gct atg ggt gta ctt tgc gat gat cgg aat att tat gac gaa gga gtt gaa tat ttt aaa
A M G V L C D D R N I Y D E G V E Y F K
cac gga gat ggc gcc ggc agc atc gaa cac gcc gtt gcc tac att cat tcc ggt aat ctc
H G D G A G S I E H A V A Y I H S G N L
ggg caa tgg cag gaa agc ggc agg gat cag gaa cat gca cag tta gga gtg gga ctt ttg
G Q W Q E S G R D Q E H A Q L G V G L L
gct gca gcc tgt cag gtt gcg tgg aat cag gga ttg gac cta ttc agt tat gat aat aac
A A A C Q V A W N Q G L D L F S Y D N N
cgg ctt ctt gct ggt gcc gaa tat gta gca aaa tat aac cta tgg cag gat gta cct ttt
R L L A G A E Y V A K Y N L W Q D V P F
aaa tat tat aac agc tgc cag cag gta aac cat aat tgg tca tct att aat gga agg gga
K Y Y N S C Q Q V N H N W S S I N G R G
agg ttg gat gat cgc ccg ctt tgg gag tta att tac aat cat tat gtc gtt aga aaa agg
R L D D R P L W E L I Y N H Y V V R K R
ttg aac gca cct aat tca aaa tta atg gct gaa ctc atg aga ccc gag cat ggc agt aac
L N A P N S K L M A E L M R P E H G S N
gat cat ttt gga tac ggt aca ctg aca ttt acg ttg gat gga aag cag tca ccc tat cct
D H F G Y G T L T F T L D G K Q S P Y P
gca ctt gca aca cca gcc att ccg acc cat ctg act gct aca gca ggt gta aat aga gta
A L A T P A I P T H L T A T A G V N R V
tat ctc aca tgg cat cca tct gaa gga tat act gcg cag gga tat gag gtg caa cgg gct
Y L T W H P S E G Y T A Q G Y E V Q R A
ata agt agc gcc ggt cct tat aac atc att acc aaa tgg aat gat cat aca tca cca caa
I S S A G P Y N I I T K W N D H T S P Q
tat ata gat ccg gat gta aca aat gga aca aat tac tac tac cgg gtg gcg gca ttg aac
Y I D P D V T N G T N Y Y Y R V A A L N
caa tca ggt act agt tcg tat tct tcc att gtc cag gcc agt cct cag gct gca gga gaa
Q S G T S S Y S S I V Q A S P Q A A G E
ctt cct gcg aaa tgg aaa aat aca tta atc ggg aaa gga aat gat ggc aat gcc gct ttt
L P A K W K N T L I G K G N D G N A A F
gct gcc gtt ggc gaa gga acc ttt att gtt aaa gga aac gga act gat ctc gga gga aat
A A V G E G T F I V K G N G T D L G G N
gaa gat caa ata acc tat act tac tgt cgt gta gaa gga gat ttt gtg atc acc gca aga
E D Q I T Y T Y C R V E G D F V I T A R
att tcg gat att act ggg cct aat cag aaa aca ggg ata atg gtt agg gaa tcg ctg gct
I S D I T G P N Q K T G I M V R E S L A
gca gac gcg aaa gca gtg agc ata acc ttg gga gat gca ggc gga cgt ttt gcc cga atg
A D A K A V S I T L G D A G G R F A R M
ggc aaa cgt aaa aat gac aaa gaa aaa atg tct ttt aca ttg gga aac gct tat aca tgg
G K R K N D K E K M S F T L G N A Y T W
ttg ccg gcg tgg ttc agg tta gaa cgg act gga agc tct tat aaa gca ttt gaa tct tcc
L P A W F R L E R T G S S Y K A F E S S
gat ggg acg cat tgg ttt aag gtt tct act gaa aac ttc agc atg tca aaa aca gca ttt
D G T H W F K V S T E N F S M S K T A F
gtc gga ttg gtt gtt gct tca ggt agt gcg tca gga ata gat act gtc acc ttc gat cat
V G L V V A S G S A S G I D T V T F D H
gta aag atc acc aaa agt act aat tct ggc aaa caa ggc gaa tga
V K I T K S T N S G K Q G E

Figure 2. Gene sequence and
amino acid sequence of the gene GL0212614

3.4 Predicting the spatial structure and some physical properties of proteins

Since protein structures tend to be more conserved than their amino acid sequences during evolution, we predicted the spatial structure of the protein encoded by the GL0212614 gene using the Phyre2 software. The model was built on the alginate lyase template c3nfvA_ from Bacteroides ovatus, with a coverage of 46% and a confidence level of 100% (Figure 3). The secondary structure comprised 25% α helix, 29% β helix, 2% TM helix and 14% unidentified.

Figure 3. Structural model of the GL0212614 gene using Phyre2

Some physical properties of GL0212614 were also predicted. When the amino acid sequence was submitted to the AcalPred software, the acidic and alkaline indices were 0.919904 and 0.080096, respectively. According to this prediction, the selected gene encodes an acidic enzyme. This result is consistent with previous studies showing that such enzymes are active under acidic pH conditions [4]. TBI assigns the optimal temperature for enzyme activity to one of three levels: above 65°C, 55°C-65°C, and below 55°C. The melting temperature (Tm) index predicted for GL0212614 was 0.8289, so the optimal temperature for enzyme activity is 55°C-65°C. This result will help us choose suitable temperature and pH conditions in future studies.

CONCLUSION

We identified 15 complete genes encoding cellobiohydrolase in the metagenome data of microorganisms surrounding white-rot fungi in Cuc Phuong National Park. Of these, 6 genes had functional regions. Gene code GL0212614, with the functional region Alginate_lyase, had the highest predicted expression level, 742 mg/l. The structure of GL0212614 was modelled on the alginate lyase template c3nfvA_ from Bacteroides ovatus with 46% coverage and 100% confidence. In the secondary
structure, there were 25% α helix, 29% β helix, 2% TM helix and 14% unidentified. GL0212614 is predicted to be an acidic protein, and the optimal temperature for enzyme activity is 55°C-65°C. These results are an important basis for choosing gene expression conditions.

Acknowledgments: This study was supported by a grant from the Bilateral International Project, code NĐT.50.GER/18, of the Ministry of Science and Technology (MOST), Vietnam, and the Federal Ministry of Education and Research, Germany, and used the facilities of the National Key Laboratory of Gene Technology, Institute of Biotechnology, Vietnam Academy of Science and Technology (VAST), Vietnam.

REFERENCES

[1] F. L. Soares Júnior et al. (2013), "Endo- and exoglucanase activities in bacteria from mangrove sediment," Brazilian J. Microbiol., vol. 44, no. 3, p. 969, doi: 10.1590/S1517-83822013000300048.
[2] A. Zafar et al. (2021), "Efficient biomass saccharification using a novel cellobiohydrolase from Clostridium clariflavum for utilization in biofuel industry," RSC Adv., vol. 11, no. 16, pp. 9246–9261, doi: 10.1039/D1RA00545F.
[3] M. Takeda et al. (2022), "Metagenomic mining and structure-function studies of a hyperthermostable cellobiohydrolase from hot spring sediment," Commun. Biol., vol. 5, no. 1, pp. 1–11, doi: 10.1038/s42003-022-03195-1.
[4] L. Gao, F. Wang, F. Gao, L. Wang, J. Zhao, and Y. Qu (2011), "Purification and characterization of a novel cellobiohydrolase (PdCel6A) from Penicillium decumbens JU-A10 for bioethanol production," Bioresour. Technol., vol. 102, no. 17, pp. 8339–8342, doi: 10.1016/J.BIORTECH.2011.06.033.
[5] T.-T.-H. Le et al. (2022), "De Novo Metagenomic Analysis of Microbial Community Contributing in Lignocellulose Degradation in Humus Samples Harvested from Cuc Phuong Tropical Forest in Vietnam," Diversity, vol. 14, no. 3, p. 220, doi: 10.3390/D14030220.

USING SOME BIOINFORMATICS TOOLS TO MINE GENES ENCODING THE ENZYME CELLOBIOHYDROLASE FROM METAGENOME DATA OF THE BACTERIAL COMMUNITY AROUND WHITE-ROT FUNGI (Trametes versicolor) IN CUC PHUONG NATIONAL PARK

Abstract: Cellobiohydrolase (EC 3.2.1.91) is an important enzyme involved in cellulose hydrolysis. In this study, gene sequences encoding cellobiohydrolase were mined from the metagenome DNA data of microorganisms around white-rot fungi in Cuc Phuong National Park based on the KEGG database. 73 ORFs encoding cellobiohydrolase were obtained, of which 15 ORFs contained complete genes and 6 ORFs had functional regions. The expression level of the proteins in E. coli, estimated with Periscope software, showed that the gene code GL0212614 had the highest expression level, 742 mg/l. The secondary and tertiary structures of GL0212614 were predicted with Phyre2, which showed that the structure of GL0212614 was determined on the c3nfvA template with 46% coverage and 100% confidence. The secondary structure of GL0212614 has 25% α helix, 29% β helix, 2% TM helix and 14% unidentified. GL0212614 is an acid-tolerant gene product, and the optimal temperature for enzyme activity is 55°C-65°C. These results are an important basis for choosing gene expression conditions.

Keywords: Cellobiohydrolase, DNA metagenome, E. coli, expression level, bioinformatics
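Supplementary sketch for the GL0212614 sequence figure: the amino acid rows printed under the codon rows can be reproduced by translating each codon with the standard genetic code. The Python below is only an illustration and not the authors' tooling. The names `CODON_TABLE`, `translate`, `FIRST_ROW` and `LAST_ROW` are hypothetical, and the use of the standard genetic code is an assumption (it is consistent with the planned expression in E. coli). The two codon rows are copied from the first and last rows of the figure.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: the
# standard table applies to this bacterial gene).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA string (whitespace ignored) until the first stop codon."""
    seq = "".join(dna.split()).upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    protein = []
    for i in range(0, len(seq), 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":  # stop codon: end of the open reading frame
            break
        protein.append(aa)
    return "".join(protein)


# First and last codon rows of the GL0212614 figure.
FIRST_ROW = ("atg aaa gta att gtt ttc ctg att tta atg "
             "gtg gtt cta aac agc tgt tct ttg gct ttt")
LAST_ROW = "gta aag atc acc aaa agt act aat tct ggc aaa caa ggc gaa tga"

if __name__ == "__main__":
    print(translate(FIRST_ROW))
    print(translate(LAST_ROW))
```

Running the same function over all codon rows in order would give a quick consistency check of the printed amino acid sequence. The final `tga` codon is a stop, which is why the last amino acid row ends at E.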
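Supplementary sketch for Sections 3.1 and 3.3: the ORF bookkeeping and the choice of GL0212614 can be checked with a few lines of arithmetic. This is a reader-side illustration, not part of the authors' pipeline, and all variable names are hypothetical. The counts and the predicted expression values are copied from the text and from the Periscope table (decimal commas written as points). The number of ORFs lacking only the 5' end is computed as the remainder of the 73 ORFs after the other three categories.

```python
# ORF categories reported in Section 3.1.
TOTAL_ORFS = 73
COMPLETE = 15
LACK_3_PRIME = 11
LACK_BOTH_ENDS = 42

# ORFs lacking only the 5' end, derived as the remainder.
lack_5_prime = TOTAL_ORFS - COMPLETE - LACK_3_PRIME - LACK_BOTH_ENDS

# Share of complete ORFs, as a percentage rounded to two decimals.
complete_pct = round(100 * COMPLETE / TOTAL_ORFS, 2)

# Predicted soluble expression in E. coli (mg/l), from the Periscope table.
PREDICTED_MG_PER_L = {
    "GL0212614": 742.5445,
    "GL0221923": 9.2954,
    "GL2034110": 9.1483,
    "GL0879211": 15.3828,
    "GL0058533": 13.5843,
    "GL0733968": 0.1945,
}

# Gene code with the highest predicted expression level.
best_gene = max(PREDICTED_MG_PER_L, key=PREDICTED_MG_PER_L.get)

if __name__ == "__main__":
    print(f"ORFs lacking only the 5' end: {lack_5_prime}")
    print(f"Complete ORFs: {complete_pct}%")
    print(f"Highest predicted expression: {best_gene}")
```

The percentage agrees with the 20.55% quoted for complete ORFs, and the top-ranked gene code matches the one selected for structure and property prediction.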