www.nature.com/scientificreports

OPEN

Received: 03 October 2016
Accepted: 18 January 2017
Published: 20 February 2017

Genome-wide primary transcriptome analysis of H2-producing archaeon Thermococcus onnurineus NA1

Suhyung Cho1,*, Min-Sik Kim2,*, Yujin Jeong1,*, Bo-Rahm Lee3, Jung-Hyun Lee2, Sung Gyun Kang2 & Byung-Kwan Cho1,3

In spite of their pivotal roles in transcriptional and post-transcriptional processes, the regulatory elements of archaeal genomes are not yet fully understood. Here, we determine the primary transcriptome of the H2-producing archaeon Thermococcus onnurineus NA1. We identified 1,082 purine-rich transcription initiation sites along with well-conserved TATA box, A-rich B recognition element (BRE), and promoter proximal element (PPE) motifs in promoter regions, a high pyrimidine nucleotide content (T/C) at the −1 position, and Shine-Dalgarno (SD) motifs (GGDGRD) in 5′ untranslated regions (5′ UTRs). Along with differential transcript levels, 117 leaderless genes and 86 non-coding RNAs (ncRNAs) were identified, representing diverse cellular functions and potential regulatory functions under the different growth conditions. Interestingly, we observed low GC content in ncRNAs for RNA-based regulation via unstructured forms or interaction with other cellular components. Further comparative analysis of T. onnurineus upstream regulatory sequences with those of closely related archaeal genomes demonstrated that transcription of orthologous genes is initiated by highly conserved promoter sequences; however, their upstream sequences for transcriptional and translational regulation are largely diverse. These results provide the genetic information of T. onnurineus for its future application in metabolic engineering.

Archaea are unique organisms with ecological significance that have a similar genome organization and cellular structure to those of bacteria, and comparable molecular transcription and translation mechanisms to those of eukaryotes [1–3]. Particularly, the
archaeal transcription apparatus is similar to the eukaryotic RNA polymerase (RNAP) II system, which requires a set of transcription factors to initiate transcription. In an early stage of archaeal transcription, the transcription apparatus, composed of an RNAP and associated initiation factors, is assembled at the promoter region and transcription start site (TSS), defined as the +1 position of the 5′ UTR of mRNA, to form the closed complex, commonly referred to as the pre-initiation complex [4]. Along with the pivotal role of RNAP and its related transcription factors in initiating archaeal transcription, DNA sequences such as the TATA box and BRE are cis-encoded determinants of transcription initiation, acting as guide signals embedded in the genome [4]. Thus, determining the precise architecture of transcript 5′ ends allows us to reveal diverse cis-encoded determinants, including promoter elements, 5′ UTRs, and TSSs. Also, the precise location of TSSs determined by experimental methods provides the sequence and structure of the mRNA 5′ end for investigating transcription regulation, mRNA stability, and translational efficiency [5].

1 Department of Biological Sciences and KI for the BioCentury, Korea Advanced Institute of Science and Technology, Daejeon 305-701, Republic of Korea. 2 Korea Institute of Ocean Science and Technology, Ansan 426-744, Republic of Korea. 3 Intelligent Synthetic Biology Center, Daejeon 305-701, Republic of Korea. * These authors contributed equally to this work. Correspondence and requests for materials should be addressed to S.C. (email: shcho95@kaist.ac.kr) or B.-K.C. (email: bcho@kaist.ac.kr).

Scientific Reports | 7:43044 | DOI: 10.1038/srep43044

For genome-scale determination of TSSs, two RNA-sequencing methods have been intensively used for diverse bacterial strains, such as Escherichia coli, Helicobacter pylori, Streptomyces coelicolor, and Synechocystis sp. PCC6803 [5–8]. Those methods are differential RNA-seq (dRNA-seq) for annotating TSSs [7,9] and strand-specific RNA-seq (ssRNA-seq) for measuring mRNA transcript levels [5,10,11]. In addition to these bacterial species, recent transcriptome sequencing studies on archaeal strains such as Thermococcus kodakarensis, Methanosarcina mazei, Methanolobus psychrophilus, and Haloferax volcanii identified the genome-wide location of TSSs and further revealed the significance of post-transcriptional regulation and extensive ncRNA-based regulation [12–15]. Along with understanding the cis-encoded determinants of each archaeal strain, it is also important to compare upstream non-coding regions to elucidate how closely related archaeal strains respond to unique environmental conditions. Among such archaeal strains, T. onnurineus NA1 showed a close phylogenetic relationship to T. kodakarensis KOD1 with gene rearrangement at the level of chromosomal segments [16–18].

Figure 1. Determining transcriptional architecture of the T. onnurineus NA1 genome. (A) Example of dRNA-seq and ssRNA-seq profiles mapped onto the T. onnurineus NA1 genome. For TSS determination, total RNA samples were isolated from independent biological replicates of the mid-exponential growth phase cultures and two sequencing libraries were constructed, one from TEX-treated (+TEX) and the other from untreated total RNA (−TEX). The expression levels of mRNA transcripts were obtained from ssRNA-seq, representing the genomic region between TON_1301 and TON_1310, in yeast peptone sulfur (YPS), MM1-CO (MMC), and MM1-Formate (MMF) media conditions. (B) TSS confirmation for TON_1301 and TON_1306 using the 5′ Rapid Amplification of cDNA Ends derivative method (5′tagRACE) and Sanger sequencing. Full-length gels are included in Supplementary Fig. S1. (C) Average normalized intensity of each position for maximum peak within ±200 nt from the identified TSSs. (D) A total of 1,082 TSSs were identified and classified according to their positions relative to adjacent ORFs. TSSs located from 300 bp upstream to 50 bp downstream of the start codon of the annotated ORF were classified as the primary (P) or secondary TSSs (S). The peaks found within the coding regions were assigned as the internal TSSs (I). TSSs located on the reverse strands of annotated ORFs and in non-coding regions were classified as the antisense (A) and intergenic TSSs (N), respectively.

Here, we exploited the dRNA-seq method to obtain an accurate map of TSSs across the T. onnurineus NA1 genome. T. onnurineus NA1 is a hyperthermophilic archaeon belonging to the order Thermococcales, along with Pyrococcus species found in deep-sea hydrothermal vents [16]. This strain has a relatively small genome of approximately 1.8 Mbp, consisting of 2,026 genes, and uses monocarbon substrates such as carbon monoxide (CO) and formate as carbon and energy sources for cell growth under anaerobic conditions [19]. Extensive analyses were then performed to elucidate the cis-encoded determinants of transcription initiation and the upstream regulatory regions around the promoters and 5′ UTRs. Also, we generated a map of genome-scale ncRNAs and their differential utilization under formate and
CO conditions, which are known to be upregulated for the production of hydrogen gas, based on measuring levels of individual transcripts using ssRNA-seq [20]. This comprehensive genome-scale view of transcript architecture using upstream regulatory features provides a better understanding of transcriptional and post-transcriptional regulation in archaeal genomes.

Results

Determination of the primary transcriptome. To determine the TSSs in the T. onnurineus NA1 genome, a dRNA-seq method was exploited and the abundance of RNA transcripts was quantified using the ssRNA-seq method (Fig. 1A). Briefly, we isolated total RNA from independent biological replicates of mid-exponential growth phase cultures and obtained primary transcripts by treating with terminator exonuclease (TEX), which removes transcripts lacking tri-phosphorylated 5′ ends (e.g., rRNA and tRNA) and any degraded transcripts [6,7,11]. As a control, a TEX-untreated cDNA library was prepared in parallel, which represents the 5′ ends of the whole transcriptome including the intact, processed, and degraded RNAs. Consequently, two independent cDNA libraries were constructed, TEX-treated (+TEX) and untreated (−TEX), and both were sequenced on an Illumina sequencing platform. The resulting sequence reads were trimmed and mapped to the reference genome (NC_011529), resulting in 67- and 247-fold coverage for the +TEX and −TEX libraries, respectively (Supplemental Table S1) [16]. Additionally, the proportion of sequence reads mapped to ribosomal RNA (rRNA) was 0.03–0.04%, indicative of a high rRNA depletion efficiency for the archaeal dRNA-seq method [14]. By integrating the mapping results with plausible criteria (see Materials and Methods for detection criteria), we newly assigned 1,082 TSSs in the T. onnurineus NA1 genome (Supplemental Table S2). Only TSSs present in both +TEX and −TEX libraries within ±5 bp resolution were retained. TSSs were then curated using iterative cluster subdivision as described previously with some modifications [21]. These enriched signals enabled the determination of TSSs for transcription units (TUs) by the contiguous gene expression signals obtained from ssRNA-seq [6,11].

Figure 2. Analysis of 5′ UTRs in T. onnurineus NA1. (A) Distribution of 5′ UTR lengths is shown within 300 nt. Another distribution at 0–5 nt was found and considered to produce leaderless mRNAs (lmRNAs, n = 117). (B) Examples of these lmRNAs are illustrated (TON_1055 and TON_1056). (C) Functional enrichment analysis of proteins encoded by the umRNAs (n = 643) and lmRNAs (n = 98) using COG mapping.

Currently, a total of 2,026 genes are annotated in the T. onnurineus NA1 genome, including 1,975 protein-coding genes [16]. In addition, 1,161 operons have been predicted by the DOOR2 database, where 1,224 genes are organized into 410 multi-gene operons (number of genes in an operon ≥2) and the other operons are each transcribed from a single structural gene (751 genes in total) [22]. We identified primary TSSs from the upstream sequences of 834 operons (71.8%). Among those, 302 multi-gene operons were determined by the primary TSSs (73.7%) (Supplemental Table S2). For example, the transcription of TON_1301, TON_1302, and TON_1303 is initiated by a single TSS at the genomic position of 1,182,769, indicating that these genes are transcribed from a single TU (TU0722) (Fig.
1A). The operon harboring the largest number of genes was a ribosomal operon consisting of 23 genes (from TON_0065 to TON_0088). Although TSS information is useful to predict the TU architecture encoded in bacterial and archaeal genomes, it is limited by the absence of information on the length of the mRNA transcribed from the TU and by the conditional use of TSSs in response to environmental conditions. For instance, it was predicted that the longest TU is 16,819 bp in length, encoding 18 genes (from TON_1563 to TON_1580) involved in H2 production [16]. However, additional TSSs may also be present within this long TU.

Furthermore, independent verification of the identified TSSs was obtained by using the 5′ Rapid Amplification of cDNA Ends derivative method (5′tagRACE) and Sanger sequencing [23]. For instance, we obtained the targeted PCR products for several TSSs from the cDNA library (Supplemental Fig. S1), such as those for TON_1301 (237 nt) and TON_1306 (204 nt) (Fig. 1B). The 5′ end sequence of each PCR product was confirmed by Sanger sequencing (Supplemental Fig. S1).

The primary transcript profile was correlated with the normalized ssRNA-seq profiles, which showed a sharp signal increase at the TSS and covered up to 150 bp downstream from the start codons of the annotated ORFs (Fig. 1C). The ssRNA-seq profiles also indicated a sharp signal at the same TSS positions; however, the maximum peak signals were observed 25–50 bp downstream. This is consistent with the fact that most mRNAs in the ssRNA-seq libraries lack 20–30 nt from their proximal 5′ terminus [24]. This suggests that the canonical ssRNA-seq profiles reflect both intact transcripts with preserved 5′ ends and processed (or degraded) transcripts. TSSs were further categorized by their genomic locations and levels of enrichment (Fig.
1D and Supplemental Table S2), giving 961 primary TSSs (P), each selected as the TSS with the maximum peak height within 300 bp upstream of the start codon of the annotated ORF. The secondary TSSs (S) were collected from the second highest peak in the same region as the primary TSSs, accounting for 12 ORFs, and the internal TSSs (I) were called from the peaks found within 23 coding regions. Interestingly, T. onnurineus NA1 showed a low preference for secondary TSSs (1.1%) under the growth conditions examined, which is similar to the finding that the closely related archaeon T. kodakarensis showed limited use of secondary TSSs (7.8%) [12]. The antisense (A) and intergenic TSSs (N) were collected from the reverse strands of 29 annotated coding regions and from 57 non-coding regions, respectively. Consequently, primary TSSs correspond to 88.8% of detected TSSs, and the other TSSs account for the remaining 11.2% (Supplemental Table S2).

Analysis of 5′ UTRs. We next examined the length distribution of 5′ UTRs between the defined primary TSSs and start codons of 941 annotated protein-coding genes. The 5′ UTR sequence promotes ribosomal binding through the Shine-Dalgarno sequence and frequently influences translational efficiency and post-translational regulation [25]. The primary transcript 5′ UTR lengths were mostly 0–50 nt (84.6%), with a median length of 12 nt and a maximum length of 479 nt (Fig. 2A and Supplemental Fig. S2). Interestingly, we observed TSSs lacking 5′ UTRs, defined as leaderless mRNAs (lmRNAs), which is consistent with the fact that archaeal mRNAs typically have short 5′ UTRs [26–28]. In contrast, bacterial 5′ UTRs show a median length of 55 nt; mRNAs with such 5′ UTRs are defined as 5′ UTR-associated leadered mRNAs (umRNAs) [12,27,29]. For instance, TON_1056 and TON_1055 encode an unknown protein and a 3-methyladenine DNA glycosylase, respectively (Fig.
2B). Based upon the fact that the length of the consensus archaeal RBS in the 5′ UTR is 6 nt, transcripts having 5′ UTRs shorter than 5 nt were classified as lmRNAs. Previous studies showed that translation of bacterial lmRNAs was highly efficient when the genomic position of the start codon is identical to the transcription start position (i.e., 5′ UTR length = 0) and that the limit of the 5′ UTR length for lmRNA translation was ~5 nt [27]. A total of 117 TSSs (12.4% of primary TSSs) generated a substantial subset of lmRNAs. In particular, 70% of the lmRNAs showed complete overlap of the transcription initiation site with the annotated start codon (i.e., the length of 5′ UTR = 0).

Figure 3. Genome-scale analysis of upstream sequences. (A) Determination of promoter elements. TATA boxes (5′-TTWTAW), polyA BRE motifs upstream of the TATA box, and PPE motifs (“A” at −10) were identified relative to the TSS (+1). The bottom panels show the conservation of each motif in umRNAs and lmRNAs. (B) Proportion of each nucleotide at the TSS (+1) and at two nt upstream and downstream of the TSS for umRNAs and lmRNAs. The pyrimidine-purine dinucleotide motif is shown at the −1 and +1 positions. (C) Start codon usage of all ORFs (primary TSS-associated ORFs, umRNAs, and lmRNAs). (D) Conserved ribosome-binding site (RBS) motif (5′-GGTG) for umRNAs (n = 796) and lmRNAs (n = 117). For lmRNAs, the absence of an RBS was confirmed.

To determine whether lmRNAs are over- or underrepresented in certain biological functions, the umRNAs and lmRNAs were each categorized into clusters of orthologous groups (COGs). The results showed that a total of 416 umRNAs (64.7%) and 66 lmRNAs (67.3%) were assigned to COG categories (Fig.
2C) [30]. For both umRNAs (47.8%) and lmRNAs (65.2%), the major portion was highly enriched in groups R and S, which represent unknown functions, while the remaining transcripts were found across diverse functions including replication, recombination, repair, inorganic ion transport, and metabolism. The number of lmRNAs in the T. onnurineus NA1 genome is similar to the numbers in T. kodakarensis (~14%) [12] and Methanolobus psychrophilus (~15%) [14]. However, comparison with Sulfolobus solfataricus P2 (~69%) [31], Haloferax volcanii (72%) [15], and Pyrococcus abyssi (few lmRNAs) [32] suggests that the usage of leaderless translation is evolutionarily diverse among archaeal species. Most proteins encoded by lmRNAs, however, have not yet been functionally annotated.

Characterization of promoters and RBSs. Archaeal promoters contain two major conserved sequence elements, the A/T-rich TATA box sequence of 8 bp and the BRE, which are centered approximately 27 nt and 33 nt upstream of the TSS, respectively [4]. In addition to these two major conserved sequences, the A/T-rich promoter proximal element (PPE) is located approximately 10 bp upstream of the TSS, and the initiator element (INR) is conserved within the initially transcribed region. It has been suggested that neither the PPE nor the INR element is required for initiating transcription, but they can regulate the strength of transcription output [3]. To determine the promoter elements in the T. onnurineus NA1 genome, we searched for conserved sequences between −50 nt and +10 nt from the identified TSSs using MEME [33]. 1,006 TSSs (93.0%; p