Doctoral dissertation: Analysis of ribosomal protein block structure: Functional characterization, evolutionary implications and distant homology search using discrete state models


BOSTON UNIVERSITY COLLEGE OF ENGINEERING

Dissertation

ANALYSIS OF RIBOSOMAL PROTEIN BLOCK STRUCTURE: FUNCTIONAL CHARACTERIZATION, EVOLUTIONARY IMPLICATIONS AND DISTANT HOMOLOGY SEARCH

USING DISCRETE STATE MODELS

PAOLA FAVARETTO

Laurea Degree, Università degli Studi di Padova, Italy, 2001

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

2007


UMI Number: 3246605


Acknowledgments

I would like to express deep gratitude to my advisor, Professor Temple F. Smith, for his guidance and support during the completion of this project. His unending drive and extraordinary enthusiasm for science have been a great inspiration and motivation in my work.

I am grateful to Professor Scott C. Mohr for his thoughtful advice and encouraging help, especially in the early stage of the project. His meticulous work and rigorous attitude have shown me the importance of details and analytical thinking.

I am indebted to Professor Hyman Hartman for his important contribution to the project and his constant interest in my work. His comments and insights were always very much appreciated.

I would like to acknowledge Professor Sandor Vajda and Professor Lucia M. Vaina for serving on my Ph.D. committee, and Professor Jadwiga Bienkowska for serving on the prospectus committee.

My appreciation goes to all the members of the BioMolecular Engineering Research Center for their friendship and numerous stimulating conversations: Arjun Bhutkar, Prashant Vishwanath, Kavitha Venkatesan, Filiz Aslan, Hongxian He, Esther Epstein, Nancy Sands and Sean Quinlan.

I would also like to thank my parents for always believing in me and for giving me the opportunity to pursue my goals.

Finally, I would like to thank my husband Atul for his profound support and immense help throughout the completion of this degree. He has been a constant source of strength and encouragement, and I am honored to have him in my life.


ANALYSIS OF RIBOSOMAL PROTEIN BLOCK STRUCTURE:

FUNCTIONAL CHARACTERIZATION, EVOLUTIONARY IMPLICATIONS AND DISTANT HOMOLOGY SEARCH USING DISCRETE STATE MODELS

(Order No )

PAOLA FAVARETTO

Boston University College of Engineering, 2007

Major Professor: Temple F. Smith, Professor of Biomedical Engineering

ABSTRACT

The ribosome, a very complex molecular machine, plays a fundamental role in all living organisms and exhibits extraordinary engineering design concepts. This investigation examined the complexity of the translational apparatus, seeking to understand how its components evolved to their present configuration. It includes detailed sequence and structural comparative analyses of the ribosome and its associated proteins, with functional characterization of these components across the three phylogenetic domains.

Amino acid sequence alignments of ribosomal proteins revealed an unusual taxon-specific block structure, with some blocks universally conserved and others specific to one or two phylogenetic domains. Statistical and phylogenetic analyses of the universal blocks imply that modern Bacteria, Archaea and Eukarya clearly have a common ancestor, while the phylodomain-specific blocks suggest that these groups also share more recent, taxon-specific cenancestors. Major evolutionary implications of the observed block structure are: (i) the crenarchaeal, endosymbiotic origin of the modern eukaryotic translational apparatus; and (ii) the occurrence of a prokaryotic bottleneck that drastically reduced the diversity of modern species progenitors about 2.2 billion years ago.

Surprisingly, the highly conserved blocks identified in most of the translation-related proteins do not associate consistently with any identifiable function or structural feature, or even with rRNA contacts. A comprehensive investigation of the rRNA-ribosomal protein interactions, however, demonstrated a major role of ribosomal proteins in constraining the rRNA conformational space and stabilizing its correct, universally conserved core fold.

In order to identify possible evolutionary relationships between the taxon-specific block structure and other proteins, a new stochastic tool for the identification of distant homologous domains in single-, repeated- and multi-domain contexts was implemented. The approach uses sequence and structure information embedded in Discrete State Models, and a Markov threading technique to estimate the compatibility of any query sequence with the models under consideration. The method was successfully applied to a variety of cases, including the ribosomal blocks, the WD40-repeat domain and the very diverse ubiquitin-like family.

Contents

2 Taxon-specific Block Structure in Ribosomal Proteins
2.1 Introduction
2.2 Sequence Dataset
2.3.1 Alignment Refinements
2.4 Sequence Analysis
2.5 Structural Analysis
2.7.1 Ribosomal Phylogeny Reconstruction
3.1 Structural Role of Ribosomal Proteins
3.2 RNA-chaperone Activity of Ribosomal Proteins
3.3 Consensus Secondary Structure of rRNA
3.4 RNA Secondary-structure Prediction Software: mfold
3.5 Fold Prediction Scoring Scheme
3.6.1 Individual Protein Constraints
3.6.2 Binding Pathway Constraints
3.6.3 Artificial and Hypothetical Protein Constraints
3.6.4 Conclusions: Role of Ribosomal Proteins in rRNA Folding
3.7.2 Conclusions: Sequence Composition Impact on rRNA Folding
3.9 Conclusions: Constraining rRNA Conformational Space
4 Discrete State Models for Distant Homology Search of Ribosomal Blocks
4.2 Discrete State-space Models (DSMs)
4.2.1 Creation of Discrete State Models
4.2.2 Distant Homology Search: Problem Formulation
4.4 Method and Models Validation
4.4.1 Identification of Repeated-domain Proteins: WD40-repeat Family
4.4.2 WD40-repeat Model Validation
4.4.3 Identification of Small Domain Proteins and Concatenated Domains:
4.5 Discrete State Models for Ribosomal Protein Blocks
5 Discussion
5.1 Evolutionary Implications of the Ribosomal Protein Block Structure
5.2 The Archaeal Origin of the Eukaryotic Translational Apparatus
5.3 Ribosomal Block Structure and Interactions with rRNA
5.4 Protein Domain Identification Using DSMs
A PDB Structures of Proteins in the Translational Apparatus
B rRNA and Ribosomal Protein Interactions
C Phylogenetic Reconstruction Parameters
D WD40-repeat Prediction in S. cerevisiae

List of Tables

Ribosomal proteins in the 30S subunit and their phylogenetic assignment
Ribosomal proteins in the 50S subunit and their phylogenetic assignment
Structural folds of ribosomal proteins and their spatial location in the ribo-
Interactions between ribosomal proteins and RNA helices in the 30S subunit
Interactions between ribosomal proteins and RNA helices in the 50S subunit
Examples of binding pathways and their impact on the conformational variability of 16S and accuracy of predicted folds
Examples of binding pathways and their impact on the conformational variability of 23S and accuracy of predicted folds
Domain variability in the 30S and 50S subunits
Generic DSMs used as competing models in the Bayesian estimation of pos-
Comparison of predicted repeat boundaries for sequence PDB code 1GXRA
Ubiquitin-like protein subfamilies
Ubiquitin and ubiquitin-like model classes
Ubiquitin-like domain occurrences identified in a sample of representative complete eukaryotic genomes
Ubiquitin-like domain occurrences identified in a sample of representative archaeal genomes
Ubiquitin-like domain occurrences identified in a sample of representative bacterial genomes
List of translational protein structures in the Protein Data Bank (PDB)
Distribution of rRNA-ribosomal protein interaction (RPI) patterns
Distribution of rRNA-ribosomal protein interaction across the amino acid types associated with residues in the universal blocks
Distribution of hydrogen-bond interactions between rRNA and ribosomal proteins (RPI) across block types
Distribution of rRNA-ribosomal protein interaction across the amino acid
Distribution of individual amino acids in the proteins of the small ribosomal
Distribution of individual amino acids in the proteins of the large ribosomal
Results of WD40-repeat search on the entire S. cerevisiae genome, identified by both the DSM method and the profile-based approach
Results of WD40-repeat search on the entire S. cerevisiae genome, identified
List of PDB structures for ubiquitin-like proteins

List of Figures

Ribosomal proteins distribution among the three phylogenetic domains
Schematic representation of the multiple alignments in the universal ribosomal protein set (UN) of the 30S subunit
Schematic representation of the multiple alignments in the universal ribosomal protein set (UN) of the 50S subunit
Schematic representation of the multiple alignments in the archaeal-eukaryotic ribosomal protein set (AE) of the 30S subunit
Schematic representation of the multiple alignments in the archaeal-eukaryotic ribosomal protein set (AE) of the 50S subunit
Length distribution of the taxon-specific blocks identified in the ribosomal
Average number of blocks in ribosomal proteins
Block distribution among the ribosomal proteins
2.9 Average number of conserved positions in subsets of universal ribosomal pro-
2.10 Average number of informative positions in subsets of universal ribosomal
2.11 Distribution of rRNA-ribosomal protein interactions (RPI) among block types
2.12 Distribution of rRNA-ribosomal protein interaction (RPI) types in both ribosomal subunits
2.13 Distributions of rRNA-ribosomal protein interactions among amino acids in
2.14 rRNA-ribosomal protein interactions mapped into the secondary structure of the 23S rRNA of H. marismortui (3' end)
2.15 rRNA-ribosomal protein interactions mapped into the secondary structure of the 23S rRNA of H. marismortui (5' end)
2.16 rRNA-ribosomal protein interactions mapped into the secondary structure of 16S rRNA of T. thermophilus
2.17 Phylogenetic reconstructions using the positional variation among aligned
3.1 Consensus secondary-structure diagram for 16S rRNA
3.2 Consensus secondary-structure diagram for 23S rRNA (5' end)
Consensus secondary-structure diagram for 23S rRNA (3' end)
Number of folds predicted when individual ribosomal protein constraints were applied to the E. coli 16S sequence
Number of folds predicted when individual ribosomal protein constraints were applied to the H. marismortui 23S sequence
Assembly map for the small ribosomal subunit (30S)
Assembly map for the large ribosomal subunit (50S)
Validation of results using artificial and hypothetical protein constraints
Correlation between the number of predicted folds and the number of applied
Average percentage of native base pairs predicted correctly for 16S E. coli sequences with canonical base-pair substitutions
Average total number of base pairs predicted for sequences with canonical
3.14 Average number of secondary-structure folds predicted within 5% of the minimum free-energy fold for H. marismortui 23S sequences with canonical base-
4.5 WD40-repeat model validation results
4.6 Log Likelihood Ratio (LLR) plot for sequence PDB code 1GKR
4.7 Log Likelihood Ratio (LLR) plot for sequence PDB code 1GXRA
4.8 Specificity and sensitivity of ubiquitin-like model classes
4.9 Validation of the ubiq model
4.10 Log Likelihood Ratio (LLR) obtained by threading yeast's RUB1 sequence (SGD YDR139C) through the ubiq model
4.11 Log Likelihood Ratio (LLR) obtained by threading yeast ribosomal protein L40 sequence (SGD YIL148W) through the ubiq model
4.12 Log Likelihood Ratio (LLR) obtained by threading yeast polyubiquitin sequence (SGD YLL039C) through the ubiq model
4.13 Distribution of the ubiquitin domain location within protein hits across a representative sample of eukaryotic genomes

Chapter 1

Introduction

The remarkable accomplishment represented by the completion of the ribosomal subunit crystal structures [119, 8, 17, 16, 126, 83, 14, 66], together with the solution of the ribosome in complex with ancillary proteins, has spurred new interest among the scientific community and brought renewed attention to the protein translation machinery in general. At the same time, the database of proteins involved in translation has rapidly expanded, mainly due to the ever-increasing number of completely sequenced genomes. Considerable information has also become available on the functional sites of the ribosome as well as on protein-RNA and protein-protein interactions. This wealth of new biological information, which spans the sequence, structural and functional levels, is the motivation for the present work, which aims at a deeper, more complete analysis of the ribosome and the translational apparatus as a whole. A detailed characterization of these biological complexes can potentially have an enormous impact on the attempt to understand how Nature engineered this extremely complex machinery, which is universally conserved across all phylogenetic domains and plays such a crucial role in all living organisms.

The present work aimed to perform a multi-level analysis of the protein translation apparatus, with particular focus on the ribosome, by using the new and more broadly representative data now available, with three specific objectives: (i) characterization of functional, structural and sequence features of the translation-related proteins across the three phylogenetic domains (Bacteria, Archaea and Eukarya); (ii) phylogenetic analysis of translation-related proteins and its evolutionary implications, with particular focus on the origin of eukaryotes; (iii) design and implementation of a new, more sophisticated computational tool that uses both sequence and structural information for the identification of distant homologs of the small conserved regions identified in the translation-related proteins.

The significance of the project undertaken here is two-fold: (i) dissecting the components and interactions of the protein translational machinery allows for a thorough understanding of some of the mechanisms that drive protein translation; (ii) the phylogenetic reconstructions derived by using the translation-associated proteins across a wide range of organisms can shed light on the history of the protein synthesis mechanisms and can suggest a more accurate evolutionary hypothesis for the origin of life on Earth.


1.2 Protein Translation

The basic principle of translation is that the genetic information stored in the chromosomal DNA and subsequently transcribed into the messenger RNA (mRNA) directs the ordered arrangement of the amino acids into the polypeptide chain. This process takes place in the ribosome, which consists of two unequally sized subunits composed of proteins and RNAs. The small ribosomal subunit contains the decoding center, where the recognition of the mRNA codon by the transfer RNA (tRNA) anticodon takes place. The large ribosomal subunit catalyzes the formation of peptide bonds between amino acids, thus forming the polypeptide chain. Each amino acid used in protein synthesis reaches the ribosome attached to its specific tRNA, which plays the role of an adaptor between the signal decoded by its anticodon sequence and the amino acid that it is carrying. Aminoacyl tRNA synthetases are the enzymes that charge each tRNA with its specific amino acid.

Despite some differences among the three phylogenetic domains, the main characteristics and mechanisms involved in translation appear to be universally conserved. In initiation, the ribosome is assembled at the initiation codon in the mRNA with a methionyl initiator tRNA bound in its peptidyl (P) site. In elongation, aminoacyl tRNAs enter the acceptor (A) site, where decoding takes place. If the incoming tRNA is the correct (cognate) one, the ribosome catalyzes the formation of a peptide bond. After the tRNAs and mRNA are translocated such that the next codon is moved into the A-site, the process is repeated. Termination takes place when a stop codon in the mRNA is encountered and the complete peptide is released from the ribosome. In the final stage of recycling, the ribosomal subunits are dissociated, releasing the mRNA and deacylated tRNA, and setting the stage for another round of initiation. See [72, 93, 83, 16] for complete reviews of the mechanisms involved in protein translation.
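The initiation, elongation and termination steps described above can be illustrated with a minimal Python sketch. It is only a toy model of decoding, not a model of the ribosome: it assumes the standard genetic code, an uppercase RNA alphabet (A, C, G, U), and initiation at the first AUG found. The function name `translate` and the example sequence are illustrative and do not come from this work.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, with codons enumerated in U, C, A, G order at each
# of the three positions; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Return the peptide encoded from the first AUG to the first stop codon."""
    start = mrna.find("AUG")  # initiation: locate the start codon
    if start < 0:
        return ""
    peptide = []
    # Elongation: read one codon per cycle, moving three nucleotides at a time.
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":  # termination: a stop codon is encountered
            break
        peptide.append(residue)
    return "".join(peptide)


print(translate("GCAUGGCUAAAUAAGG"))  # prints MAK
```

In the example, decoding starts at the AUG at the third nucleotide, reads GCU and AAA, and stops at UAA.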

A complex network of molecular chaperones aids in the correct folding of the nascent polypeptide chains and in their translocation across the membranes to reach their sites of function. Some of these chaperones are trigger factors [60, 92], proteins involved in the Sec translocase pathway [63, 89] and the signal recognition particle (SRP) [62, 31, 75].

1.3 Protein Translation Apparatus

1.3.1 The Ribosome

The ribosome is a ribonucleoprotein particle whose size varies with the species. In most ribosomes, the mass of the ribosomal RNA (rRNA) is significantly larger than that of the ribosomal proteins, so it is not surprising that ribosomal proteins interact extensively with the rRNA molecules. The rRNA forms the core of the ribosome and provides the binding sites for the ribosomal proteins. These serve primarily to stabilize the rRNA and to organize its proper functional three-dimensional structure, but extra-ribosomal functions have also been associated with the ribosomal proteins [122].

In prokaryotes, the small subunit has one rRNA molecule (16S) and the large subunit has two rRNA molecules (23S and 5S). In eukaryotes, the small subunit has one rRNA molecule (18S) and the large subunit has three rRNA molecules (5.8S, 28S and 5S). A comparative analysis of rRNA sequences from hundreds of species [85] gave rise to a consensus secondary structure that showed the arrangement of the rRNA into helices and domains. The 23S rRNA has approximately 100 helices arranged into six domains; the 16S rRNA has only about 45 helices arranged into four domains. Despite the significant variation in the size of the rRNAs among the three phylogenetic domains, the core of the secondary structure is preserved, since the main differences are found in the size of some loops [67, 94, 87].

A large number of ribosomal proteins are bound to the rRNA. The exact enumeration of the proteins has met with some difficulties. There are approximately 54 proteins in Bacteria and chloroplasts and 70-80 proteins in Eukarya and mitochondria. Archaea seem to have an intermediate number of ribosomal proteins. It has been observed that a large number of ribosomal proteins can be deleted without apparent effects on cell viability [26]. On the other hand, a broad-range comparison of the completely sequenced genomes shows that the few proteins that are universally conserved are primarily ribosomal proteins [71] and translation factors [86]. This suggests that, as expected, a subgroup of the ribosomal proteins must be essential for the viability of the cell.

The recent elucidation of the crystal structures of the ribosomal subunits has opened a new era for the work on protein translation [8, 119, 98, 40, 126]. The structures have confirmed some previous expectations and refuted others, and they provide a solid base for the formulation of hypotheses, the planning of new experiments, and the interpretation of results.

Irrespective of the species, the small ribosomal subunit has the general shape of a right-hand mitten. Some of its features have been given names: the body, the thumb (also called platform), the head (corresponding to the finger parts of the mitten), the nose or beak (end of the head), the shoulder (upper part of the body on the opposite side to the platform), and finally the toe or spur (minor protuberance at the bottom part of the body). The inner part of the mitten interacts with the large subunit, whereas the outer part (also called back side) is exposed to the solvent.

When seen from the side of the subunit interface, the large subunit presents a structure similar to a crown. The three protuberances on the particle are called the central protuberance, the right-hand side or L12-stalk, and the left-hand side or L1-stalk. When observed from the side, the particle has a hemispherical shape, and the flatter surface corresponds to the interface side. The exit tunnel for the nascent polypeptide runs through the large subunit from the interface side to the external surface [79, 123, 8].

1.3.2 Translation Factors

In vivo protein synthesis is catalyzed by a number of translation factors that bind transiently to the ribosome during different phases of translation. Most bacterial translation factors have been studied extensively, and this has led to a proposed detailed mechanism for bacterial translation. In Archaea and Eukarya this mechanism is still under investigation.

Initiation

Initiation of protein synthesis takes place on the small subunit. In Bacteria, the mRNA is wrapped around the neck of the small subunit and the initiator AUG codon is placed in the P-site with the aid of the Shine-Dalgarno interaction, in which an A- and G-rich region of the mRNA binds to a region of the 3' terminal sequence of the 16S rRNA. Three bacterial initiation factors (IF1, IF2, IF3) assist the placement of the fMet-tRNA at the AUG start codon in the P-site, and IF2 catalyzes the association of the two ribosomal subunits. Eukaryotic initiation is performed with the aid of more than twelve initiation factors, some of which have homologs in other phylogenetic domains, suggesting that at least some common elements of initiation were present in the universal ancestor [69].
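As a rough illustration of the Shine-Dalgarno interaction, the following Python sketch scans the region upstream of a start codon for the best match to a purine-rich motif. Several things here are assumptions that do not come from the text: the commonly cited consensus `AGGAGG` is used as the motif, and the search window and the minimum number of matching positions are arbitrary illustrative values. The function name and the example sequence are hypothetical.

```python
def find_shine_dalgarno(mrna, start, motif="AGGAGG", window=20, min_matches=4):
    """Return (position, matches) of the best motif match upstream of `start`.

    `start` is the index of the first nucleotide of the start codon.
    Returns None when no window position reaches `min_matches`.
    The motif, window and threshold are illustrative assumptions.
    """
    best = None
    first = max(0, start - window)
    # Only consider motif placements that end before the start codon.
    for pos in range(first, start - len(motif) + 1):
        segment = mrna[pos:pos + len(motif)]
        matches = sum(a == b for a, b in zip(segment, motif))
        if matches >= min_matches and (best is None or matches > best[1]):
            best = (pos, matches)
    return best


mrna = "GCUAGGAGGUUUAACAUGGCU"
start = mrna.find("AUG")
print(find_shine_dalgarno(mrna, start))  # prints (3, 6)
```

Counting identical positions is the simplest possible score; a more realistic treatment would evaluate base pairing with the 3' end of the 16S rRNA, which is beyond this sketch.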

Elongation

During elongation, the elongation factor EF-Tu (EF1 in Archaea and Eukarya), a GTP-activated enzyme, delivers charged tRNAs to the ribosomal A-site. The complex between EF-Tu, GTP and aminoacyl tRNA, called the ternary complex (TC), binds to the ribosome. If the anticodon of the aminoacyl tRNA bound to EF-Tu matches the codon of the mRNA in the decoding part of the A-site, EF-Tu hydrolyzes its GTP to GDP and undergoes conformational changes leading to its dissociation from the ribosome and the tRNA.

A number of antibiotics inhibit the function of EF-Tu [68], either by blocking the formation of the ternary complex (e.g., pulvomycin) or by blocking the release of EF-Tu from the ribosome after GTP hydrolysis (e.g., kirromycin).

The elongation factor EF-G (EF2 in Archaea and Eukarya) functions as a translocase in protein synthesis. After peptidyl transfer, a peptidyl tRNA is located in the A-site and a deacylated tRNA in the P-site. EF-G catalyzes the translocation of these tRNAs to the P- and E-sites respectively, and also the movement of the mRNA so that a new codon is exposed in the A-site. The deacylated tRNA in the E-site subsequently falls off the ribosome and, once EF-G has dissociated, the ribosome is ready for a new cycle of elongation.

The antibiotic fusidic acid affects the function of EF-G by trapping the elongation factor on the ribosome after GTP hydrolysis and translocation. This inhibition is similar to that of kirromycin, which locks EF-Tu on the ribosome. In the case of EF2, the antibiotic sordarin has effects similar to those of fusidic acid.

Termination

Prokaryotic termination (or release) factors RF1 and RF2 hydrolyze and release the completed polypeptide from the P-site tRNA when a stop codon is read [37, 65]. RF1 and RF2 are homologous, the only difference being the type of stop codon recognized. In Eukarya there is only one factor, eRF1, which recognizes all three stop codons [129]. Once the release factors have terminated the synthesis of a protein, the ribosome recycling factor RRF prevents the random reinitiation of synthesis by recycling the components bound to the mRNA [58]. There are two different views on the specific role of RRF: one is that RRF dissociates the polysomes into monosomes and releases mRNA and tRNA [53]; the other is that RRF separates the two ribosomal subunits without releasing any other molecule [61].
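The division of labor among the release factors can be written down as a small lookup. The specific codon assignments for RF1 (UAA, UAG) and RF2 (UAA, UGA) are standard textbook knowledge rather than something stated in the text above, and the function name and the domain labels are illustrative.

```python
# Stop codons recognized by each release factor. The RF1/RF2 assignments are
# standard textbook values, not taken from the surrounding text.
RELEASE_FACTOR_CODONS = {
    "RF1": {"UAA", "UAG"},
    "RF2": {"UAA", "UGA"},
    "eRF1": {"UAA", "UAG", "UGA"},
}


def release_factors(codon, domain):
    """List the release factors that recognize `codon` in the given domain.

    `domain` is "prokaryotic" or "eukaryotic" (labels chosen for this sketch).
    """
    names = ("RF1", "RF2") if domain == "prokaryotic" else ("eRF1",)
    return [name for name in names if codon in RELEASE_FACTOR_CODONS[name]]


print(release_factors("UGA", "prokaryotic"))  # prints ['RF2']
print(release_factors("UAA", "prokaryotic"))  # prints ['RF1', 'RF2']
print(release_factors("UAG", "eukaryotic"))   # prints ['eRF1']
```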


1.3.3 Chaperone Proteins in Translation

The proper folding of the nascent chain is an important aspect of translation because a newly synthesized protein is exposed to a crowd of proteolytic enzymes and other proteins in the cell, and is therefore at risk of aggregating with the wrong partners or being degraded. Even though some proteins may fold spontaneously at the end of the exit tunnel, in many cases chaperone proteins are needed to ensure proper folding of the emerging polypeptide [42]. Some of these chaperones are heat shock proteins (Hsps), like the bacterial trigger factor and DnaK, which are monomeric proteins that primarily prevent the aggregation of the growing polypeptide [29, 111]. Trigger factor homologs have been proposed in Eukarya, as well as Hsp70 and Hsp40 versions [36].

Proteins are synthesized in the cytoplasm and then transported to their final destination by a complex transport system, which includes the signal recognition particle (SRP), the signal recognition particle receptor and the translocon [60, 92, 63]. The ribosome interacts with several components of this machinery, although the interactions and the processes involved are only partially characterized [62].


structure-, statistical, and phylogenetic analysis on the following subsets: (i) eukaryotic sequences of the universal ribosomal proteins; (ii) archaeal and eukaryotic sequences of the archaeal-eukaryotic specific proteins; (iii) other sequences of translation-associated proteins, such as translation factors.

2.2 Sequence Dataset

An initial set of protein sequences was identified by considering the database of ribosomal proteins at Swiss-Prot [13] for three representative organisms: E. coli for Bacteria, M. jannaschii for Archaea and S. cerevisiae for Eukarya. These organisms were chosen because they represent the best studied species in each phylogenetic domain and their annotation could be assumed to be reliable with high confidence. This initial set of sequences was used to query the current versions of the Swiss-Prot, GenBank [2] and TIGR [3] databases using BLAST and PSI-BLAST [5, 4, 97] in order to identify homologous protein sets in other organisms. Finally, from the resulting sequences, a smaller subset of representative sequences was drawn by considering the organisms for which structural and functional information was available, such that the widest possible taxonomic and habitat range was included. In this way, the analysis could not only benefit from structural and functional insights, but also account for the maximum extent of sequence variation that could occur in each protein family. Table 2.1 presents a list of the species considered and their phylogenetic domain assignments.
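The selection of a representative subset can be sketched as a simple filter. This is a hypothetical illustration of the two stated criteria (availability of structural information and breadth of taxonomic coverage), not the procedure actually used: it keeps one organism per taxonomic group and prefers organisms flagged as having structural data. The record fields, the function name and the `has_structure` flags in the example are all illustrative assumptions.

```python
def pick_representatives(candidates):
    """Keep one organism per taxonomic group, preferring structural data.

    `candidates` is a list of dicts with the hypothetical keys
    'organism', 'group' and 'has_structure'.
    """
    chosen = {}
    # Stable sort: organisms with structural information come first,
    # otherwise the input order is preserved.
    for record in sorted(candidates, key=lambda r: not r["has_structure"]):
        chosen.setdefault(record["group"], record)
    return sorted(record["organism"] for record in chosen.values())


# Illustrative input only; the flags are placeholders, not curated data.
candidates = [
    {"organism": "Escherichia coli", "group": "Proteobacteria", "has_structure": False},
    {"organism": "Helicobacter pylori", "group": "Proteobacteria", "has_structure": False},
    {"organism": "Deinococcus radiodurans", "group": "Deinococcus-Thermus", "has_structure": False},
    {"organism": "Thermus thermophilus", "group": "Deinococcus-Thermus", "has_structure": True},
]
print(pick_representatives(candidates))
# prints ['Escherichia coli', 'Thermus thermophilus']
```

The species set in Table 2.1 keeps several organisms per group, so a realistic version would rank candidates rather than keep a single one.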


Table 2.1: List of eukaryotic, archaeal and bacterial species used in the analysis

Code | Species | Taxonomy

Eukarya
ARATH | Arabidopsis thaliana | Viridiplantae; Streptophyta
CAEEL | Caenorhabditis elegans | Metazoa; Nematoda
DROME | Drosophila melanogaster | Metazoa; Arthropoda; Insecta; Drosophila
ENNCU | Encephalitozoon cuniculi | Fungi; Microsporidia
HUMAN | Homo sapiens sapiens | Metazoa; Vertebrata; Mammalia; Primates; Hominidae
ICTPU | Ictalurus punctatus | Metazoa; Vertebrata; Otocephala
PLAFA | Plasmodium falciparum | Alveolata
SCHPO | Schizosaccharomyces pombe | Fungi; Schizosaccharomyces
THEHY | Tetrahymena sp. | Alveolata; Ciliophora; Oligohymenophorea

Archaea
ARCFU | Archaeoglobus fulgidus | Euryarchaeota; Archaeoglobi
AERPE | Aeropyrum pernix | Crenarchaeota; Thermoprotei; Desulfurococcales
HALMA | Haloarcula marismortui | Euryarchaeota; Halobacteria
METJA | Methanococcus jannaschii | Euryarchaeota; Methanococci
METKA | Methanopyrus kandleri | Euryarchaeota; Methanopyri
METMA | Methanosarcina mazei | Euryarchaeota; Methanomicrobia
METTH | Methanobacterium thermoautotrophicum | Euryarchaeota; Methanobacteria
NANEQ | Nanoarchaeum equitans | Nanoarchaeota
PYRAE | Pyrobaculum aerophilum | Crenarchaeota; Thermoprotei; Thermoproteales
PYRAB | Pyrococcus abyssi | Euryarchaeota; Thermococci; Thermococcales
SULSO | Sulfolobus solfataricus | Crenarchaeota; Thermoprotei; Sulfolobales
THEAC | Thermoplasma acidophilum | Euryarchaeota; Thermoplasmata

Bacteria
AQUAE | Aquifex aeolicus | Aquificae; Aquificae (class); Aquificales
BACSU | Bacillus subtilis | Firmicutes; Bacilli
CAUCR | Caulobacter crescentus | Proteobacteria; Alphaproteobacteria; Caulobacterales
CHLTE | Chlorobium tepidum | Bacteroidetes/Chlorobi group; Chlorobi
CHLTR | Chlamydia trachomatis | Chlamydiae/Verrucomicrobia group; Chlamydiae
DEIRA | Deinococcus radiodurans | Deinococcus-Thermus; Deinococci
ECOLI | Escherichia coli | Proteobacteria; Gammaproteobacteria; Enterobacteriales
FUSNN | Fusobacterium nucleatum | Fusobacteria; Fusobacterales
HELPY | Helicobacter pylori | Proteobacteria; Epsilonproteobacteria
SYNY3 | Synechocystis sp. PCC 6803 | Cyanobacteria; Chroococcales
THEMA | Thermotoga maritima | Thermotogae; Thermotogae (class); Thermotogales
THETH | Thermus thermophilus | Deinococcus-Thermus; Deinococci; Thermales


2.3 Multiple Alignments

Preliminary multiple sequence alignments were obtained using the prior-based profile method in PIMA [103, 27, 28] and the progressive method of CLUSTALW [112]. In the PIMA method, profiles are generated by pairwise alignments of the protein set using local dynamic programming and are refined through iterative steps that maximize the information content and information density of each profile. The progressive multiple alignments produced by CLUSTALW are built by considering the most closely related sequences first, and gradually refining the alignment to accommodate the more distant ones. The two methods were used in combination to provide a rough alignment for the sequences of each ribosomal protein, restricted to one phylogenetic domain at a time, followed by visual inspection and manual refinement where necessary.
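The local dynamic programming step mentioned above can be illustrated with a minimal Smith-Waterman scoring sketch in Python. It is not the PIMA implementation: it assumes a simple match/mismatch score and a linear gap penalty, whereas real protein alignment would use a substitution matrix and typically affine gap costs. The function name and the scoring values are illustrative.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Smith-Waterman recurrence with a linear gap penalty; only the previous
    row of the dynamic programming matrix is kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if x == y else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(local_alignment_score("MKVLAT", "KVLA"))  # prints 8
```

In the example, the four residues KVLA align exactly within the longer sequence, giving four matches at two points each.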

2.3.1 Alignment Refinements

The editing and optimization of the preliminary alignments were carried out manually by considering the amino acid similarity classes (see Table 2.2) and the structural information currently available (see Appendix A). The secondary-structure assignment of each residue was determined using DSSP [59], or JPred [23] whenever secondary-structure information was not available. The secondary-structure assignment of each residue in the alignment guided the adjustment of positions so as to restrict gaps to loop regions as much as possible. The alignments were further refined to maximize the conservation of hydrophobicity and polarity patterns, as well as the conservation of residues interacting with other amino acids or RNA residues.
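The gap-placement criterion can be illustrated with a short check. The following is a minimal sketch, not the procedure used in this work, where the refinement was manual. It assumes a per-column secondary-structure string in DSSP-style one-letter codes and treats H, G, I, E and B as structured states and every other symbol as loop. The function name, the grouping of states and the toy sequences are illustrative assumptions.

```python
# Illustrative sketch only: the refinement described in the text was done by hand.
# Assumption: DSSP-style codes H, G, I (helices) and E, B (strands/bridges) count
# as structured; any other symbol is treated as a loop.
STRUCTURED_STATES = frozenset("HGIEB")


def gaps_in_structured_columns(aligned_seq, ss_by_column, gap="-"):
    """Return the 0-based alignment columns where a gap falls in a helix or strand."""
    if len(aligned_seq) != len(ss_by_column):
        raise ValueError("sequence and secondary-structure strings must have equal length")
    return [
        col
        for col, (residue, state) in enumerate(zip(aligned_seq, ss_by_column))
        if residue == gap and state in STRUCTURED_STATES
    ]


# Toy example (hypothetical data): gaps sit at columns 3, 4 and 8.
aligned = "MKV--LTA-G"
ss = "EEETTHHHHC"
print(gaps_in_structured_columns(aligned, ss))  # [8]
```

In this toy case the gaps at columns 3 and 4 fall in a turn and are acceptable, whereas the gap at column 8 interrupts a helix and would be a candidate for manual adjustment.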

Table 2.2: Amino acid similarity classes used in the refinement of the multi-sequence alignments


2.4 Sequence Analysis

The ribosomal protein set consists of 102 proteins (40 in the small subunit and 62 in the large subunit) subdivided as follows: 34 proteins are found in all three phylogenetic domains and are therefore called universal (UN); 33 are found exclusively in Archaea and Eukarya (AE); 11 are found only in Eukarya (EE); and finally, 23 are found only in Bacteria (BB) (Figure 2.1). One protein (LXa) has been found exclusively in some, but not all, archaeal representatives. We excluded this protein from our analysis because of the limited sequence information available and because its true phylogenetic assignment is still controversial. Currently there are no known proteins that belong exclusively to Bacteria and Eukarya, or exclusively to Bacteria and Archaea. See Table 2.3 and Table 2.4 for a list of the ribosomal proteins used in the present work and their classification within the three phylogenetic domains.
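As a quick consistency check, the counts quoted above can be tallied. The sketch below uses only the numbers stated in the text and assumes that the single archaeal protein excluded from the analysis accounts for the difference between the four categories and the total of 102. The variable names are illustrative.

```python
# Category counts as stated in the text.
category_counts = {"UN": 34, "AE": 33, "EE": 11, "BB": 23}
# Assumption: the excluded archaeal protein is the remaining member of the set.
excluded_count = 1
subunit_counts = {"small": 40, "large": 62}

analysed = sum(category_counts.values())
total = analysed + excluded_count

print(f"proteins in the four categories: {analysed}")
print(f"total including the excluded protein: {total}")
print(f"total by subunit: {sum(subunit_counts.values())}")
```

Under this reading the four categories cover 101 proteins, and both the category tally and the subunit tally give 102.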

Figure 2.1: Venn diagram of ribosomal protein distribution among the three phylogenetic domains. (a) Complete ribosomal protein set. (b) Archaeal-eukaryotic ribosomal protein set. UN = Bacteria, Archaea and Eukarya; AE = Archaea and Eukarya; BE = Bacteria and Eukarya; BA = Bacteria and Archaea; BB = Bacteria; AA = Archaea; EE = Eukarya; C = Crenarchaea; Y = Euryarchaea. The two numbers in parentheses refer to the number of proteins in the small and large subunit, respectively. Adapted from Lecompte et al., 2002 [71].


Table 2.3: Ribosomal proteins in the 30S subunit and their phylogenetic assignment. The conservation of the protein in all the organisms investigated within each phylogenetic domain is denoted by X, whereas the presence of the protein in some, but not all, representative species is denoted by x. Adapted from Lecompte et al., 2002 [71].


Table 2.4: Ribosomal proteins in the 50S subunit and their phylogenetic assignment. The conservation of the protein in all the organisms investigated within each phylogenetic domain is denoted by X, whereas the presence of the protein in some, but not all, representative species is denoted by x. Adapted from Lecompte et al., 2002 [71].


Archaea and Eukarya (AE) break up into blocks conserved across both domains (ae) and blocks specific to each domain (aa, ee; Figure 2.5 and Figure 2.4).

The distinct blocks have lengths varying from 6 to 170 amino acids, with clear and well-defined block transitions. They are shorter than typical protein domains, yet longer than segments associated with enzyme active sites. On average, in the universal protein set the universal blocks are longer than the archaeal-eukaryotic specific blocks (50 vs. 30 amino acids) and longer than the other phylodomain-specific blocks (30 and 20 amino acids for the eukaryotic-specific and bacterial-specific blocks, respectively). Similarly, in the archaeal-eukaryotic specific proteins, the common blocks are longer than the phylodomain-specific blocks (see Figure 2.6). In particular, the average length of the common blocks is quite remarkable in the large subunit (67 amino acids).

In the universal ribosomal protein set we observe on average approximately 2.5 and 2.1 universal blocks in the large and the small subunit, respectively (Figure 2.7). This fact, together with the observation that the average length of each universal block is generally greater than the average length of the phylodomain-specific blocks, supports the conclusion that in each universal ribosomal protein the number of positions carrying a universal signal is on average greater than the number of positions carrying a phylodomain-specific signal. This is also true in the case of archaeal-eukaryotic specific proteins, where even though the average number of eukaryotic blocks is slightly greater than the average number of common blocks, the latter are much longer.
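The argument above can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative and is not taken from the analysis itself. It approximates the number of positions per protein as the mean number of blocks times the mean block length, which is a simplification, and it applies the overall 50-residue universal block length to both subunits, which is an assumption. Because the mean numbers of phylodomain-specific blocks are not quoted in the text, the code computes instead how many such blocks would be needed to match the universal positions.

```python
def mean_positions(mean_blocks_per_protein, mean_block_length):
    """Approximate positions per protein as (mean block count) x (mean block length)."""
    return mean_blocks_per_protein * mean_block_length


def break_even_block_count(universal_positions, specific_block_length):
    """Number of specific blocks of the given length needed to match the universal positions."""
    return universal_positions / specific_block_length


# Values quoted in the text (lengths in amino acids).
UNIVERSAL_LENGTH = 50  # assumed here to apply to both subunits
SPECIFIC_LENGTHS = {"ae": 30, "ee": 30, "bb": 20}
UNIVERSAL_BLOCKS = {"large": 2.5, "small": 2.1}

for subunit, n_blocks in UNIVERSAL_BLOCKS.items():
    universal = mean_positions(n_blocks, UNIVERSAL_LENGTH)
    for block_type, length in SPECIFIC_LENGTHS.items():
        needed = break_even_block_count(universal, length)
        print(f"{subunit} subunit: ~{universal:.0f} universal positions; "
              f"{needed:.1f} {block_type} blocks of {length} aa would match them")
```

Under these assumptions a large-subunit universal protein carries roughly 125 universal positions, so a phylodomain-specific class would need on average more than four 30-residue blocks, or more than six 20-residue blocks, per protein to carry as many positions.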

The universal ribosomal proteins present a similar distribution of block types in both


Figure 2.2: Schematic representation of the multiple alignments in the universal ribosomal protein set (UN) of the 30S subunit. Orange denotes blocks alignable among all three phylogenetic domains (uu); purple denotes blocks alignable between Archaea and Eukarya only (ae); green denotes blocks alignable only within Bacteria (bb); blue denotes blocks alignable only within Eukarya (ee); dark green denotes blocks alignable between Archaea and Bacteria only (ab). Dotted lines represent sequence regions of varying length that are not alignable across the phylogenetic domains.


Figure 2.3: Schematic representation of the multiple alignments in the universal ribosomal protein set (UN) of the 50S subunit. Orange denotes blocks alignable among all three phylogenetic domains (uu); purple denotes blocks alignable between Archaea and Eukarya only (ae); green denotes blocks alignable only within Bacteria (bb); red denotes blocks alignable only within Archaea (aa); blue denotes blocks alignable only within Eukarya (ee). Dotted lines represent sequence regions of varying length that are not alignable across the phylogenetic domains.


Title	Analysis of ribosomal protein block structure: Functional characterization, evolutionary implications and distant homology search using discrete state models
Author	Paola Favaretto
Advisors	Temple F. Smith, Ph.D., Scott Mohr, Ph.D., Hyman Hartman, Ph.D., Sandor Vajda, Ph.D.
Institution	Boston University, College of Engineering
Field	Biomedical Engineering
Type	Dissertation
Year of publication	2007
City	Boston
