BOSTON UNIVERSITY COLLEGE OF ENGINEERING
Dissertation
ANALYSIS OF RIBOSOMAL PROTEIN BLOCK STRUCTURE: FUNCTIONAL CHARACTERIZATION, EVOLUTIONARY IMPLICATIONS AND DISTANT HOMOLOGY SEARCH
USING DISCRETE STATE MODELS
PAOLA FAVARETTO
Laurea Degree, Università degli Studi di Padova, Italy, 2001
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
2007
UMI Number: 3246605
UMI Microform 3246605. Copyright 2007 by ProQuest Information and Learning Company.
All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company
300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346
Acknowledgments
I would like to express deep gratitude to my advisor, Professor Temple F. Smith, for his guidance and support during the completion of this project. His unending drive and extraordinary enthusiasm for science have been a great inspiration and motivation in my work.
I am grateful to Professor Scott C. Mohr for his thoughtful advice and encouraging help, especially in the early stage of the project. His meticulous work and rigorous attitude have shown me the importance of details and analytical thinking.
I am indebted to Professor Hyman Hartman for his important contribution to the project and his constant interest in my work. His comments and insights were always very much appreciated.
I would like to acknowledge Professor Sandor Vajda and Professor Lucia M. Vaina for serving on my Ph.D. committee, and Professor Jadwiga Bienkowska for serving on the prospectus committee.
My appreciation goes to all the members of the BioMolecular Engineering Research Center for their friendship and numerous stimulating conversations: Arjun Bhutkar, Prashant Vishwanath, Kavitha Venkatesan, Filiz Aslan, Hongxian He, Esther Epstein, Nancy Sands and Sean Quinlan.
I would also like to thank my parents for always believing in me and for giving me the opportunity to pursue my goals.
Finally, I would like to thank my husband Atul for his profound support and immense help throughout the completion of this degree. He has been a constant source of strength and encouragement, and I am honored to have him in my life.
ANALYSIS OF RIBOSOMAL PROTEIN BLOCK STRUCTURE:
FUNCTIONAL CHARACTERIZATION, EVOLUTIONARY IMPLICATIONS AND DISTANT HOMOLOGY SEARCH USING DISCRETE STATE MODELS
(Order No )
PAOLA FAVARETTO
Boston University College of Engineering, 2007
Major Professor: Temple F. Smith, Professor of Biomedical Engineering
ABSTRACT
The ribosome, a very complex molecular machine, plays a fundamental role in all living organisms and exhibits extraordinary engineering design concepts. This investigation examined the complexity of the translational apparatus, seeking to understand how its components evolved to their present configuration. It includes detailed sequence and structural comparative analyses of the ribosome and its associated proteins, with functional characterization of these components across the three phylogenetic domains.
Amino acid sequence alignments of ribosomal proteins revealed an unusual taxon-specific block structure, with some blocks universally conserved and others specific to one or two phylogenetic domains. Statistical and phylogenetic analyses of the universal blocks imply that modern Bacteria, Archaea and Eukarya clearly have a common ancestor, while the phylodomain-specific blocks suggest that these groups also share more recent, taxon-specific cenancestors. Major evolutionary implications of the observed block structure are: (i) the crenarchaeal, endosymbiotic origin of the modern eukaryotic translational apparatus; and (ii) the occurrence of a prokaryotic bottleneck that drastically reduced the diversity of modern species progenitors about 2.2 billion years ago.
Surprisingly, the highly conserved blocks identified in most of the translation-related proteins do not associate consistently with any identifiable particular function or structural feature, or even with rRNA contacts. A comprehensive investigation of the rRNA-ribosomal protein interactions, however, demonstrated a major role of ribosomal proteins in constraining the rRNA conformational space and stabilizing its correct, universally conserved core fold.
In order to identify possible evolutionary relationships between the taxon-specific block structure and other proteins, a new stochastic tool for the identification of distant homologous domains in single-, repeated- and multi-domain contexts was implemented. The approach uses sequence and structure information embedded in Discrete State Models, and a Markov threading technique to estimate the compatibility of any query sequence with the models under consideration. The method was successfully applied to a variety of cases, including the ribosomal blocks, the WD40-repeat domain and the very diverse ubiquitin-like family.
2 Taxon-specific Block Structure in Ribosomal Proteins
2.1 Introduction
2.2 Sequence Dataset
2.3.1 Alignment Refinements
2.4 Sequence Analysis
2.5 Structural Analysis
2.7.1 Ribosomal Phylogeny Reconstruction
3.1 Structural Role of Ribosomal Proteins
3.2 RNA-chaperone Activity of Ribosomal Proteins
3.3 Consensus Secondary Structure of rRNA
3.4 RNA Secondary-structure Prediction Software: mfold
3.5 Fold Prediction Scoring Scheme
3.6.1 Individual Protein Constraints
3.6.2 Binding Pathway Constraints
3.6.3 Artificial and Hypothetical Protein Constraints
3.6.4 Conclusions: Role of Ribosomal Proteins in rRNA Folding
3.7.2 Conclusions: Sequence Composition Impact on rRNA Folding
3.9 Conclusions: Constraining rRNA Conformational Space
4 Discrete State Models for Distant Homology Search of Ribosomal Blocks
4.2 Discrete State-space Models (DSMs)
4.2.1 Creation of Discrete State Models
4.2.2 Distant Homology Search: Problem Formulation
4.4 Method and Models Validation
4.4.1 Identification of Repeated-domain Proteins: WD40-repeat Family
4.4.2 WD40-repeat Model Validation
4.4.3 Identification of Small Domain Proteins and Concatenated Domains:
4.5 Discrete State Models for Ribosomal Protein Blocks
5 Discussion
5.1 Evolutionary Implications of the Ribosomal Protein Block Structure
5.2 The Archaeal Origin of the Eukaryotic Translational Apparatus
5.3 Ribosomal Block Structure and Interactions with rRNA
5.4 Protein Domain Identification Using DSMs
A PDB Structures of Proteins in the Translational Apparatus
B rRNA and Ribosomal Protein Interactions
C Phylogenetic Reconstruction Parameters
D WD40-repeat Prediction in S. cerevisiae
Ribosomal proteins in the 30S subunit and their phylogenetic assignment
Ribosomal proteins in the 50S subunit and their phylogenetic assignment
Structural folds of ribosomal proteins and their spatial location in the ribo-
Interactions between ribosomal proteins and RNA helices in the 30S subunit
Interactions between ribosomal proteins and RNA helices in the 50S subunit
Examples of binding pathways and their impact on the conformational variability of 16S and accuracy of predicted folds
Examples of binding pathways and their impact on the conformational variability of 23S and accuracy of predicted folds
Domain variability in the 30S and 50S subunits
Generic DSMs used as competing models in the Bayesian estimation of pos-
Comparison of predicted repeat boundaries for sequence PDB code 1GXRA
Ubiquitin-like protein subfamilies
Ubiquitin and ubiquitin-like model classes
Ubiquitin-like domain occurrences identified in a sample of representative complete eukaryotic genomes
Ubiquitin-like domain occurrences identified in a sample of representative archaeal genomes
Ubiquitin-like domain occurrences identified in a sample of representative bacterial genomes
List of translational protein structures in the Protein Data Bank (PDB)
Distribution of rRNA-ribosomal protein interaction (RPI) patterns
Distribution of rRNA-ribosomal protein interaction across the amino acid types associated with residues in the universal blocks
Distribution of hydrogen-bond interactions between rRNA and ribosomal proteins (RPI) across block types
Distribution of rRNA-ribosomal protein interaction across the amino acid
Distribution of individual amino acids in the proteins of the small ribosomal
Distribution of individual amino acids in the proteins of the large ribosomal
Results of WD40-repeat search on the entire S. cerevisiae genome, identified by both the DSM method and the profile-based approach
Results of WD40-repeat search on the entire S. cerevisiae genome, identified
List of PDB structures for ubiquitin-like proteins
Ribosomal proteins distribution among the three phylogenetic domains
Schematic representation of the multiple alignments in the universal ribosomal protein set (UN) of the 30S subunit
Schematic representation of the multiple alignments in the universal ribosomal protein set (UN) of the 50S subunit
Schematic representation of the multiple alignments in the archaeal-eukaryotic ribosomal protein set (AE) of the 30S subunit
Schematic representation of the multiple alignments in the archaeal-eukaryotic ribosomal protein set (AE) of the 50S subunit
Length distribution of the taxon-specific blocks identified in the ribosomal
Average number of blocks in ribosomal proteins
Block distribution among the ribosomal proteins
2.9 Average number of conserved positions in subsets of universal ribosomal pro-
2.10 Average number of informative positions in subsets of universal ribosomal
2.11 Distribution of rRNA-ribosomal protein interactions (RPI) among block types
2.12 Distribution of rRNA-ribosomal protein interaction (RPI) types in both ribosomal subunits
2.13 Distributions of rRNA-ribosomal protein interactions among amino acids in
2.14 rRNA-ribosomal protein interactions mapped into the secondary structure of the 23S rRNA of H. marismortui (3' end)
2.15 rRNA-ribosomal protein interactions mapped into the secondary structure of the 23S rRNA of H. marismortui (5' end)
2.16 rRNA-ribosomal protein interactions mapped into the secondary structure of 16S rRNA of T. thermophilus
2.17 Phylogenetic reconstructions using the positional variation among aligned
3.1 Consensus secondary-structure diagram for 16S rRNA
3.2 Consensus secondary-structure diagram for 23S rRNA (5' end)
Consensus secondary-structure diagram for 23S rRNA (3' end)
Number of folds predicted when individual ribosomal protein constraints were applied to the E. coli 16S sequence
Number of folds predicted when individual ribosomal protein constraints were applied to the H. marismortui 23S sequence
Assembly map for the small ribosomal subunit (30S)
Assembly map for the large ribosomal subunit (50S)
Validation of results using artificial and hypothetical protein constraints
Correlation between the number of predicted folds and the number of applied
Average percentage of native base pairs predicted correctly for 16S E. coli sequences with canonical base-pair substitutions
Average total number of base pairs predicted for sequences with canonical
3.14 Average number of secondary-structure folds predicted within 5% of the minimum free-energy fold for H. marismortui 23S sequences with canonical base-
4.5 WD40-repeat model validation results
4.6 Log Likelihood Ratio (LLR) plot for sequence PDB code 1GKR
4.7 Log Likelihood Ratio (LLR) plot for sequence PDB code 1GXRA
4.8 Specificity and sensitivity of ubiquitin-like model classes
4.9 Validation of the ubiq model
4.10 Log Likelihood Ratio (LLR) obtained by threading yeast's RUB1 sequence (SGD YDR139C) through the ubiq model
4.11 Log Likelihood Ratio (LLR) obtained by threading yeast ribosomal protein L40 sequence (SGD YIL148W) through the ubiq model
4.12 Log Likelihood Ratio (LLR) obtained by threading yeast polyubiquitin sequence (SGD YLL039C) through the ubiq model
4.13 Distribution of the ubiquitin domain location within protein hits across a representative sample of eukaryotic genomes
Chapter 1
Introduction
The remarkable accomplishment represented by the completion of the ribosomal subunit crystal structures [119, 8, 17, 16, 126, 83, 14, 66], together with the solution of the ribosome in complex with ancillary proteins, has spurred new interest in the scientific community and brought renewed attention to the protein translation machinery in general. At the same time, the database of proteins involved in translation has rapidly expanded, mainly due to the ever-increasing number of completely sequenced genomes. Considerable information has also become available on the functional sites of the ribosome, as well as on protein-RNA and protein-protein interactions. This wealth of new biological information, which spans the sequence, structural and functional levels, is the motivation for the present work, which aims at a deeper, more complete analysis of the ribosome and the translational apparatus as a whole. A detailed characterization of these biological complexes can potentially have an enormous impact on the attempt to understand how Nature engineered this extremely complex machinery, which is universally conserved across all phylogenetic domains and plays such a crucial role in all living organisms.
The present work aimed to perform a multi-level analysis of the protein translation apparatus, with particular focus on the ribosome, by using the new and more broadly representative data now available, with three specific objectives: (i) characterization of functional, structural and sequence features of the translation-related proteins across the three phylogenetic domains (Bacteria, Archaea and Eukarya); (ii) phylogenetic analysis of translation-related proteins and its evolutionary implications, with particular focus on the origin of eukaryotes; (iii) design and implementation of a new, more sophisticated computational tool that uses both sequence and structural information for the identification of distant homologs of small conserved regions identified in the translation-related proteins.
The significance of the project undertaken here is two-fold: (i) dissecting the components and interactions of the protein translational machinery allows for a thorough understanding of some of the mechanisms that drive protein translation; (ii) the phylogenetic reconstructions derived by using the translation-associated proteins across a wide range of organisms can shed light on the history of the protein synthesis mechanisms and can suggest a more accurate evolutionary hypothesis for the origin of Life on earth.
1.2 Protein Translation
The basic principle of translation is that the genetic information stored in the chromosomal DNA and subsequently transcribed into the messenger RNA (mRNA) directs the ordered arrangement of the amino acids into the polypeptide chain. This process takes place in the ribosome, which consists of two unequally sized subunits composed of proteins and RNAs. The small ribosomal subunit contains the decoding center, where the recognition of the mRNA codon by the transfer RNA (tRNA) anticodon takes place. The large ribosomal subunit catalyzes the formation of peptide bonds between amino acids, thus forming the polypeptide chain. Each amino acid used in protein synthesis reaches the ribosome attached to its specific tRNA, which plays the role of an adaptor between the signal decoded by its anticodon sequence and the amino acid that it is carrying. Aminoacyl tRNA synthetases are the enzymes that charge the tRNA with the specific amino acid.
Despite some differences among the three phylogenetic domains, the main characteristics and mechanisms involved in translation appear to be universally conserved. In initiation, the ribosome is assembled at the initiation codon in the mRNA with a methionyl initiator tRNA bound in its peptidyl (P) site. In elongation, aminoacyl tRNAs enter the acceptor (A) site, where decoding takes place. If they are the correct (cognate) tRNA, the ribosome catalyzes the formation of a peptide bond. After the tRNAs and mRNA are translocated such that the next codon is moved into the A-site, the process is repeated. Termination takes place when a stop codon in the mRNA is encountered and the complete peptide is released from the ribosome. In the final stage of recycling, the ribosomal subunits are dissociated, releasing the mRNA and deacylated tRNA, and setting the stage for another round of initiation. See [72, 93, 83, 16] for complete reviews of the mechanisms involved in protein translation.
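As a toy illustration of the initiation, elongation and termination steps described above, the following minimal Python sketch locates the first AUG in an mRNA string, reads codons in frame, and stops at the first stop codon. The function name `translate` and the abbreviated codon table are assumptions made for illustration only: just a handful of codons from the standard genetic code are included, and ribosomal sites, tRNAs and translation factors are not modeled.

```python
# Abbreviated subset of the standard genetic code (illustrative only).
CODON_TABLE = {
    "AUG": "M", "UUU": "F", "UUC": "F", "GGU": "G", "GGC": "G",
    "GCU": "A", "GCC": "A", "AAA": "K", "AAG": "K", "GAA": "E", "GAG": "E",
}
STOP_CODONS = {"UAA", "UAG", "UGA"}


def translate(mrna: str) -> str:
    """Translate from the first AUG up to the first in-frame stop codon."""
    start = mrna.find("AUG")                      # initiation: locate the start codon
    if start == -1:
        return ""
    peptide = []
    for i in range(start, len(mrna) - 2, 3):      # elongation: one codon per step
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:                  # termination: stop reading
            break
        peptide.append(CODON_TABLE.get(codon, "X"))  # X = codon absent from the toy table
    return "".join(peptide)
```

For example, `translate("GGAUGUUUGGCAAAUAAGG")` reads the codons AUG, UUU, GGC and AAA before reaching the stop codon UAA.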
A complex network of molecular chaperones aids in the correct folding of the nascent polypeptide chains and in their translocation across the membranes to reach their sites of function. Some of these chaperones are trigger factors [60, 92], proteins involved in the Sec translocase pathway [63, 89] and the signal recognition particle (SRP) [62, 31, 75].
1.3 Protein Translation Apparatus
1.3.1 The Ribosome
The ribosome is a ribonucleic particle whose size varies with the species. In most ribosomes, the mass of the ribosomal RNA (rRNA) is significantly larger than that of the ribosomal proteins, so it is not surprising that ribosomal proteins interact extensively with the rRNA molecules. The rRNA forms the core of the ribosome and provides the binding sites for the ribosomal proteins. These serve primarily to stabilize the rRNA and to organize its proper functional three-dimensional structure, but other extra-ribosomal functions have been associated with the ribosomal proteins [122].
In prokaryotes, the small subunit has one rRNA molecule (16S) and the large subunit has two rRNA molecules (23S and 5S). In eukaryotes, the small subunit has one rRNA molecule (18S) and the large subunit has three rRNA molecules (5.8S, 28S, 5S). A comparative analysis of rRNA sequences from hundreds of species [85] gave rise to a consensus secondary structure that showed arrangement of the rRNA into helices and domains. The 23S rRNA has approximately 100 helices arranged into six domains; the 16S rRNA has only about 45 helices arranged into four domains. Despite the significant variation in size of the rRNAs among the three phylogenetic domains, the core of the secondary structure remains, since the main differences are found in the size of some loops [67, 94, 87].
A large number of ribosomal proteins are bound to the rRNA. The exact enumeration of the proteins has met with some difficulties. There are approximately 54 proteins in Bacteria and chloroplasts and 70-80 proteins in Eukarya and mitochondria. Archaea seem to have an intermediate number of ribosomal proteins. It has been observed that a large number of ribosomal proteins can be deleted without apparent effects on cell viability [26]. On the other hand, a broad-range comparison of the completely sequenced genomes shows that the few proteins that are universally conserved are primarily ribosomal proteins [71] and translation factors [86]. This suggests that, as expected, a subgroup of the ribosomal proteins must be essential for the viability of the cell.
The recent elucidation of the crystal structures of the ribosomal subunits has opened a new era for the work on protein translation [8, 119, 98, 40, 126]. The structures have confirmed some previous expectations and confuted others, and they provide a solid base for the formulation of hypotheses, the planning of new experiments, and the interpretation of results.
Irrespective of the species, the small ribosomal subunit has the general shape of a right-hand mitten. Some of its features have been given names: the body, the thumb (also called platform), the head (corresponding to the finger parts of the mitten), the nose or beak (end of the head), the shoulder (upper part of the body on the opposite side to the platform), and finally the toe or spur (minor protuberance at the bottom part of the body). The inner part of the mitten interacts with the large subunit, whereas the outer part (also called back side) is exposed to the solvent.
When seen from the side of the subunit interface, the large subunit presents a structure similar to a crown. The three protuberances on the particle are called the central protuberance, the right-hand side or L12-stalk, and the left-hand side or L1-stalk. When observed from the side, the particle has a hemispherical shape, and the flatter surface corresponds to the interface side. The exit tunnel for the nascent polypeptide runs through the large subunit from the interface side to the external surface [79, 123, 8].
1.3.2 Translation Factors
In vivo protein synthesis is catalyzed by a number of translation factors that bind transiently to the ribosome during different phases of translation. Most bacterial translation factors have been studied extensively, and this has led to a proposed detailed mechanism for bacterial translation. In Archaea and Eukarya this mechanism is still under investigation.
Initiation
Initiation of protein synthesis is performed on the small subunit. In Bacteria, the mRNA is wrapped around the neck of the small subunit and the initiator AUG codon is placed in the P-site with the aid of the Shine-Dalgarno interaction, where an A- and G-rich region of the mRNA binds to a region of the 3' terminal sequence of 16S RNA. Three bacterial initiation factors (IF1, IF2, IF3) assist the placement of the fMet-tRNA at the AUG start codon in the P-site, and IF2 catalyzes the association of the two ribosomal subunits. Eukaryotic initiation is performed with the aid of more than twelve initiation factors, some of which have homologs in other phylogenetic domains, suggesting that at least some common elements of initiation were present in the universal ancestor [69].
Elongation
During elongation, the elongation factor EF-tu (EF1 in Archaea and Eukarya) is a GTP-activated enzyme that binds charged tRNAs to the ribosomal A-site. The complex between EF-tu, GTP and aminoacyl tRNA, called the ternary complex (TC), binds to the ribosome. If the anticodon of the aminoacyl tRNA bound to EF-tu matches the codon of the mRNA in the decoding part of the A-site, EF-tu hydrolyzes its GTP to GDP and undergoes conformational changes leading to its dissociation from the ribosome and the tRNA.
A number of antibiotics inhibit the function of EF-tu [68], either by blocking the formation of the ternary complex (e.g., pulvomycin) or by blocking the release of EF-tu from the ribosome after GTP hydrolysis (e.g., kirromycin).
The elongation factor EF-G (EF2 in Archaea and Eukarya) functions as a translocase in protein synthesis. After the peptidyl transfer, a peptidyl tRNA is located in the A-site and a deacylated tRNA in the P-site. EF-G catalyzes the translocation of both tRNAs to the P- and E-sites respectively, and also the movement of the mRNA so that a new codon is exposed in the A-site. The deacylated tRNA in the E-site will subsequently fall off the ribosome, and when EF-G is dissociated, the ribosome is ready for a new cycle of elongation.
The antibiotic fusidic acid affects the function of EF-G by trapping the elongation factor on the ribosome after GTP hydrolysis and translocation. This inhibition is similar to that performed by kirromycin, which locks EF-tu on the ribosome. In the case of EF2, the antibiotic sordarin has effects similar to those of fusidic acid.
Termination
Prokaryotic termination (or release) factors RF1 and RF2 hydrolyze and release the completed polypeptide from the P-site tRNA when a stop codon is read [37, 65]. RF1 and RF2 are homologous, the only difference being the type of stop codon recognized. In Eukarya there is only one factor, eRF1, which recognizes all three stop codons [129]. Once the release factors have terminated the synthesis of a protein, the ribosome recycling factor RRF prevents the random reinitiation of the synthesis by recycling the components bound to the mRNA [58]. There are two different views on the specific role of RRF: one view is that RRF dissociates the polysomes into monosomes and releases mRNA and tRNA [53], and the other view is that RRF separates the two ribosomal subunits without releasing any other molecule [61].
1.3.3 Chaperone Proteins in Translation
The proper folding of the nascent chain is an important aspect of translation because a newly synthesized protein is exposed to a crowd of proteolytic enzymes and other proteins in the cell and is therefore at risk of aggregating with the wrong partners or being degraded. Even though some proteins may fold spontaneously at the end of the exit tunnel, in many cases chaperone proteins are needed to ensure proper folding of the emerging polypeptide [42]. Some of these chaperones are heat shock proteins (Hsps), like the bacterial trigger factor and DnaK, which are monomeric proteins that primarily prevent the aggregation of the growing polypeptide [29, 111]. Trigger factor homologs have been proposed in Eukarya, as well as Hsp70 and Hsp40 versions [36].
Proteins are synthesized in the cytoplasm and then transported to their final destination by a complex transport system, which includes the signal recognition particle (SRP), the signal recognition particle receptor and the translocon [60, 92, 63]. The ribosome interacts with several components of this machinery, although the interactions and the processes involved are only partially characterized [62].
structure-, statistical, and phylogenetic analysis on the following subsets: (i) eukaryotic sequences of the universal ribosomal proteins; (ii) archaeal and eukaryotic sequences of the archaeal-eukaryotic specific proteins; (iii) other sequences of translation-associated proteins, such as translation factors.
2.2 Sequence Dataset
An initial set of protein sequences was identified by considering the database of ribosomal proteins at Swiss-Prot [13] for three representative organisms: E. coli for Bacteria, M. jannaschii for Archaea and S. cerevisiae for Eukarya. These organisms were chosen because they represent the best studied species in each phylogenetic domain and their annotation could be assumed to be reliable with high confidence. This initial set of sequences was used to query the current versions of the Swiss-Prot, Genbank [2] and Tigr [3] databases using BLAST and PSI-BLAST [5, 4, 97] in order to identify homologous protein sets in other organisms. Finally, from the resulting sequences, a smaller subset of representative sequences was drawn by considering the organisms for which structural and functional information was available, such that the widest possible taxonomic and habitat range was included. In this way, the analysis could not only benefit from structural and functional insights, but also account for the maximum extent of sequence variation that could occur in each protein family. Table 2.1 presents a list of the species considered and their phylogenetic domain assignments.
Table 2.1: List of eukaryotic, archaeal and bacterial species used in the analysis
Eukarya
ARATH | Arabidopsis thaliana | Viridiplantae; Streptophyta
CAEEL | Caenorhabditis elegans | Metazoa; Nematoda
DROME | Drosophila melanogaster | Metazoa; Arthropoda; Insecta; Drosophila
ENNCU | Encephalitozoon cuniculi | Fungi; Microsporidia
HUMAN | Homo sapiens sapiens | Metazoa; Vertebrata; Mammalia; Primates; Hominidae
ICTPU | Ictalurus punctatus | Metazoa; Vertebrata; Otocephala
PLAFA | Plasmodium falciparum | Alveolata
SCHPO | Schizosaccharomyces pombe | Fungi; Schizosaccharomyces
THEHY | Tetrahymena sp. | Alveolata; Ciliophora; Oligohymenophorea
Archaea
ARCFU | Archaeoglobus fulgidus | Euryarchaeota; Archaeoglobi
AERPE | Aeropyrum pernix | Crenarchaeota; Thermoprotei; Desulfurococcales
HALMA | Haloarcula marismortui | Euryarchaeota; Halobacteria
METJA | Methanococcus jannaschii | Euryarchaeota; Methanococci
METKA | Methanopyrus kandleri | Euryarchaeota; Methanopyri
METMA | Methanosarcina mazei | Euryarchaeota; Methanomicrobia
METTH | Methanobacterium thermoautotrophicum | Euryarchaeota; Methanobacteria
NANEQ | Nanoarchaeum equitans | Nanoarchaeota
PYRAE | Pyrobaculum aerophilum | Crenarchaeota; Thermoprotei; Thermoproteales
PYRAB | Pyrococcus abyssi | Euryarchaeota; Thermococci; Thermococcales
SULSO | Sulfolobus solfataricus | Crenarchaeota; Thermoprotei; Sulfolobales
THEAC | Thermoplasma acidophilum | Euryarchaeota; Thermoplasmata
Bacteria
AQUAE | Aquifex aeolicus | Aquificae; Aquificae (class); Aquificales
BACSU | Bacillus subtilis | Firmicutes; Bacilli
CAUCR | Caulobacter crescentus | Proteobacteria; Alphaproteobacteria; Caulobacterales
CHLTE | Chlorobium tepidum | Bacteroidetes/Chlorobi group; Chlorobi
CHLTR | Chlamydia trachomatis | Chlamydiae/Verrucomicrobia group; Chlamydiae
DEIRA | Deinococcus radiodurans | Deinococcus-Thermus; Deinococci
ECOLI | Escherichia coli | Proteobacteria; Gammaproteobacteria; Enterobacteriales
FUSNN | Fusobacterium nucleatum | Fusobacteria; Fusobacterales
HELPY | Helicobacter pylori | Proteobacteria; Epsilonproteobacteria
SYNY3 | Synechocystis sp. PCC 6803 | Cyanobacteria; Chroococcales
THEMA | Thermotoga maritima | Thermotogae; Thermotogae (class); Thermotogales
THETH | Thermus thermophilus | Deinococcus-Thermus; Deinococci; Thermales
2.3 Multiple Alignments
Preliminary multiple sequence alignments were obtained using the prior-based profile method in PIMA [103, 27, 28] and the progressive method of CLUSTALW [112]. In the PIMA method, profiles are generated by pairwise alignments of the protein set using local dynamic programming and are refined through iterative steps that maximize the information content and information density of each profile. The progressive multiple alignments produced by CLUSTALW are built by considering the most closely related sequences first and gradually refining the alignment to accommodate the more distant ones. The two methods were used in combination to provide a rough alignment for the sequences of each ribosomal protein, restricted to each single phylogenetic domain, followed by visual inspection and manual refinement where necessary.
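The "most closely related sequences first" idea behind progressive alignment can be sketched as a ranking of sequence pairs by similarity. The snippet below is only a simplified illustration under stated assumptions: it uses plain percent identity on equal-length toy sequences, whereas CLUSTALW derives its merge order from a guide tree built on pairwise alignment distances. The names `percent_identity` and `merge_order` are hypothetical.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("toy example expects non-empty, equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def merge_order(seqs: dict[str, str]) -> list[tuple[str, str, float]]:
    """Rank sequence pairs from most to least similar."""
    pairs = [
        (m, n, percent_identity(seqs[m], seqs[n]))
        for m, n in combinations(sorted(seqs), 2)
    ]
    # The most similar pair comes first; it would be aligned first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

With three toy sequences, the pair sharing the most identical positions heads the list and would be merged before the more distant sequence is added.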
2.3.1 Alignment Refinements
The preliminary alignments were edited and optimized manually, taking into account the amino acid similarity classes (see Table 2.2) and the structural information currently available (see Appendix A). Whenever secondary-structure information was not available, the secondary-structure assignment of each residue was determined using DSSP [59] or JPred [23]. These assignments guided the adjustment of positions so as to restrict gaps to loop regions as much as possible. The alignments were further refined to maximize the conservation of hydrophobicity and polarity patterns, as well as the conservation of residues interacting with other amino acids or RNA residues.
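The criterion of restricting gaps to loop regions can be checked mechanically. The sketch below flags alignment columns in which a gap falls inside a helix or a strand. It assumes a simplified three-state annotation per alignment column ('H' helix, 'E' strand, 'C' coil or loop). The function name and this encoding are hypothetical and are not the native DSSP or JPred output formats.

```python
def gaps_in_structured_columns(aligned_seq, column_states):
    """Return indices of columns where a gap ('-') lies in a helix or strand.

    aligned_seq   : one row of the alignment, with '-' for gaps
    column_states : three-state string of the same length
                    ('H', 'E' or 'C'); an assumed, simplified encoding
    """
    if len(aligned_seq) != len(column_states):
        raise ValueError("sequence and annotation must have equal length")
    return [
        i
        for i, (residue, state) in enumerate(zip(aligned_seq, column_states))
        if residue == "-" and state in ("H", "E")
    ]
```

An empty result means that every gap in the row falls in a loop column, which is the situation the manual refinement aims for.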
Table 2.2: Amino acid similarity classes used in the refinement of the multi-sequence alignments
2.4 Sequence Analysis
The ribosomal protein set consists of 102 proteins (40 in the small subunit and 62 in the large subunit) subdivided as follows: 34 proteins are found in all three phylogenetic domains and are therefore called universal (UN); 33 are found exclusively in Archaea and Eukarya (AE); 11 are found only in Eukarya (EE); and finally, 23 are found only in Bacteria (BB) (Figure 2.1). One protein (Lxa) has been found exclusively in some, but not all, archaeal representatives. We excluded this protein from our analysis because of the limited sequence information available and because its true phylogenetic assignment is still controversial. Currently there are no known proteins that belong exclusively to Bacteria and Eukarya, or exclusively to Bacteria and Archaea. See Table 2.3 and Table 2.4 for a list of the ribosomal proteins used in the present work and their classification within the three phylogenetic domains.
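The counts above can be tallied as a quick consistency check. The four categories sum to 101, so the stated total of 102 appears to include the single archaeal protein Lxa that was excluded from the analysis. This reading is an inference from the numbers, not a statement made in the text, and the function name is hypothetical.

```python
def tally_protein_set():
    """Tally the ribosomal protein categories quoted in the text."""
    categories = {"UN": 34, "AE": 33, "EE": 11, "BB": 23}
    excluded_archaeal = 1  # Lxa; assumed here to be counted in the total
    subunits = {"small": 40, "large": 62}

    analysed = sum(categories.values())
    total = analysed + excluded_archaeal
    return analysed, total, sum(subunits.values())


analysed, total, by_subunit = tally_protein_set()
print(analysed, total, by_subunit)  # 101 102 102
```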
Figure 2.1: Venn diagram of the distribution of ribosomal proteins among the three phylogenetic domains. (a) Complete ribosomal protein set. (b) Archaeal-eukaryotic ribosomal protein set. UN = Bacteria, Archaea and Eukarya; AE = Archaea and Eukarya; BE = Bacteria and Eukarya; BA = Bacteria and Archaea; BB = Bacteria; AA = Archaea; EE = Eukarya; C = Crenarchaea; Y = Euryarchaea. The two numbers in parentheses refer to the number of proteins in the small and large subunit, respectively. Adapted from Lecompte et al., 2002 [71].
Table 2.3: Ribosomal proteins in the 30S subunit and their phylogenetic assignment. The conservation of the protein in all the organisms investigated within each phylogenetic domain is denoted by X, whereas the presence of the protein in some, but not all, representative species is denoted by x. Adapted from Lecompte et al., 2002 [71].
Table 2.4: Ribosomal proteins in the 50S subunit and their phylogenetic assignment. The conservation of the protein in all the organisms investigated within each phylogenetic domain is denoted by X, whereas the presence of the protein in some, but not all, representative species is denoted by x. Adapted from Lecompte et al., 2002 [71].
Archaea and Eukarya (AE) break up into blocks conserved across both domains (ae) and blocks specific to each domain (aa, ee; Figure 2.5 and Figure 2.4).
The distinct blocks have lengths varying from 6 to 170 amino acids, with clear and well-defined block transitions. They are shorter than typical protein domains, yet longer than segments associated with enzyme active sites. On average, in the universal protein set the universal blocks are longer than the archaeal-eukaryotic specific blocks (50 vs. 30 amino acids) and longer than the other phylodomain-specific blocks (30 and 20 amino acids for the eukaryotic-specific and bacterial-specific blocks, respectively). Similarly, in the archaeal-eukaryotic specific proteins, the common blocks are longer than the phylodomain-specific blocks (see Figure 2.6). The common blocks are particularly long in the large subunit, where they average 67 amino acids.
In the universal ribosomal protein set we observe on average approximately 2.5 and 2.1 universal blocks per protein in the large and the small subunit, respectively (Figure 2.7). This fact, together with the observation that the average length of each universal block is generally greater than the average length of the phylodomain-specific blocks, supports the conclusion that in each universal ribosomal protein the number of positions carrying a universal signal is on average greater than the number of positions carrying a phylodomain-specific signal. This is also true of the archaeal-eukaryotic specific proteins: even though the average number of eukaryotic blocks is slightly greater than the average number of common blocks, the latter are much longer.
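The argument rests on the fact that the number of positions carrying a given signal is the number of blocks of that type multiplied by their lengths. A minimal sketch of that bookkeeping follows. The block list is invented purely for illustration and does not come from the alignments, and the function name is hypothetical.

```python
from collections import defaultdict


def positions_by_block_type(blocks):
    """Sum block lengths per block type for a single protein.

    blocks : iterable of (block_type, length) pairs, e.g. ("uu", 50)
    """
    totals = defaultdict(int)
    for block_type, length in blocks:
        totals[block_type] += length
    return dict(totals)


# Hypothetical universal protein: two universal blocks (uu), one
# archaeal-eukaryotic block (ae) and one bacterial block (bb).
example_blocks = [("uu", 50), ("ae", 30), ("uu", 45), ("bb", 20)]
print(positions_by_block_type(example_blocks))
# {'uu': 95, 'ae': 30, 'bb': 20}
```

In this toy case the universal positions outnumber the phylodomain-specific ones because the universal blocks are both more numerous and longer, which is the pattern described above.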
The universal ribosomal proteins present a similar distribution of block types in both
Figure 2.2: Schematic representation of the multiple alignments in the universal ribosomal protein set (UN) of the 30S subunit. Orange denotes blocks alignable among all three phylogenetic domains (uu); purple denotes blocks alignable between Archaea and Eukarya only (ae); green denotes blocks alignable only within Bacteria (bb); blue denotes blocks alignable only within Eukarya (ee); dark green denotes blocks alignable between Archaea and Bacteria only (ab). Dotted lines represent sequence regions of varying length that are not alignable across the phylogenetic domains.
Figure 2.3: Schematic representation of the multiple alignments in the universal ribosomal protein set (UN) of the 50S subunit. Orange denotes blocks alignable among all three phylogenetic domains (uu); purple denotes blocks alignable between Archaea and Eukarya only (ae); green denotes blocks alignable only within Bacteria (bb); red denotes blocks alignable only within Archaea (aa); blue denotes blocks alignable only within Eukarya (ee). Dotted lines represent sequence regions of varying length that are not alignable across the phylogenetic domains.