Doctoral dissertation: Exploring the relationship between secondary structure and native topology in protein domains


Exploring The Relationship Between Secondary Structure And Native Topology In Protein Domains

by

Haipeng Gong

A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy

Baltimore, Maryland

August, 2006


UMI Number: 3240716


ProQuest Information and Learning Company

300 North Zeeb Road

P.O. Box 1346

Ann Arbor, MI 48106-1346

Since the introduction of Pauling's groundbreaking model, numerous experiments have shown that hydrogen-bonded secondary structure is an important factor in protein folding. Under folding conditions, the linear polypeptide chain can form marginally stable elements of secondary structure on a rapid time scale. Such elements, which are in dynamic equilibrium with their respective coil states, interact with one another, further organizing and stabilizing the protein. We hypothesize that this latter step is rate limiting in the folding of a protein domain. To validate this idea, I tested whether the logarithm of the folding rate constant is linearly correlated with a protein's secondary structure content. The observed, large correlation coefficient is consistent with our hypothesis and underscores the importance of secondary structure elements in organizing the folding process.

Przytycka and Rose proposed that the sequence of secondary structure elements is sufficient to capture a protein's native conformation, and they tested this proposal for a large collection of representative protein domains by showing that the hierarchic tree derived by aligning secondary structure sequences is almost identical to the one derived by direct three-dimensional structure comparison.

To extend this idea, I developed a dynamic programming algorithm to compare domain structures by aligning mesostate sequences, where a mesostate is a coarse-grained representation of a backbone torsion angle. Comparison of the performance of this algorithm against several existing fold recognition algorithms further supports the proposition that the sequence of secondary structure elements determines the protein's three-dimensional conformation.

To retrieve the information about native conformation that is implicit in the mesostate sequence, I developed a fragment replacement Monte-Carlo algorithm that uses only this information to generate tertiary structure. Specifically, a crude potential including only hydrogen bonding, steric exclusion, and spatial confinement was sufficient to regenerate native-like backbone topology from the coarse-grained torsion angle restraints imposed by the native mesostate sequence.

This dissertation is divided into three major parts, each of which corresponds to one of the three topics mentioned above. Together, these three inter-related approaches highlight the central role that secondary structure plays in the protein folding process.

Thesis Advisor: George D. Rose

Second Reader: Douglas E. Barrick

The work for this dissertation could not have been completed without the help of many people, only a few of whom I name here. However, my acknowledgment goes to all of those who helped me.

First, I thank my advisor George Rose, not only because all of the work was done under his direction, but also because I learned from him how to conduct scientific research. From George, I learned how to partition a dauntingly large problem into several small ones that can be solved sequentially. At the same time, his suggestions always kept me from sinking so far into the details of small projects that I forgot the physical meaning of the original problem. Additionally, his insight in both science and philosophy shaped my view of the world: science should be simple and elegant, and one should remain skeptical about any existing theory. George helped me not only as an advisor but also as a friend. Without his help, my written English could not have improved so much, nor could I have become accustomed to American life so quickly.

I also thank the faculty members who taught the courses I took at Johns Hopkins University, especially Biophysics I and II, which conveyed important definitions, ideas, and experimental methods in modern biophysics. The faculty members who participated in my annual thesis reviews also helped me greatly, both in expediting my research and in improving my presentation skills.

My colleagues in the Rose lab were also very helpful to me. Rajgopal Srinivasan guided me through the detailed implementation of LINUS simulations and helped me a great deal in grasping the Python programming language. Teresa Przytycka and Rohit Pappu frequently gave me suggestions in mathematical and physical terms. Patrick Fleming not only taught me a great deal about molecular simulation, but also participated in my project and wrote useful programs for me. Additionally, he is usually the first to polish my written work. Nicholas Fitzkee is always my encyclopedia for computers and programming languages. Nicholas Panasik and Timothy Street regularly supplied insightful suggestions in group meetings.

I thank Ranice Crosby, Lisa Jia, Jerry Levins, and Ken Rutledge for their administrative support and patience. Ranice, who is usually the first person I consult, has helped me greatly with everyday matters, even beyond regular administration.

I must thank my parents, who have been so supportive of me, both personally and financially, during the last six years. The money for my transportation and for my living expenses during my first couple of months in America was an astronomical sum to them and could not have been saved without ten years of diligent work. This dissertation is dedicated to my father, who is now convalescing from coronary heart disease.

Table of Contents

Abstract
Acknowledgement
Table of Contents
Abbreviations
List of Tables
List of Figures

Chapter 1 Introduction
  1.1 Protein structure
    1.1.1 Hierarchical definition of structures
    1.1.2 Parameters to define topology
    1.1.3 Classification of protein domain structures
    1.1.4 Structure Prediction
  1.2 Protein folding problem
    1.2.1 Thermodynamics of the folding process
    1.2.2 Dynamic view of folding
  1.3 Unfolded state
    1.3.1 Polyproline II helix
    1.3.2 Native-like residual structure
    1.3.3 Invalidity of Flory's independent pair hypothesis

Chapter 2 Local Secondary Structure Content Predicts Folding Rate for Simple, Two-state Proteins
  2.1 Abstract
  2.2 Introduction
  2.3 Methods
  2.4 Results
  2.5 Discussion
  2.6 Acknowledgements

Chapter 3 Does Secondary Structure Determine Tertiary Structure in Proteins?
  3.1 Abstract
  3.2 Introduction
  3.3 Materials and Methods
    3.3.1 Dynamic Programming
    3.3.2 Substitution Matrices
    3.3.3 Mesostate and Secondary Structure Assignment
    3.3.4 SCOP benchmark test
  3.4 Results and Discussion
    3.4.1 Substitution Matrices
    3.4.2 Benchmark Tests
    3.4.3 Discussion
  3.5 Acknowledgements

Chapter 4 Building native protein conformation from highly approximate backbone torsion angles
  4.1 Abstract
  4.2 Introduction
  4.3 Methods
    4.3.1 Fragment Library Construction
    4.3.2 Fragment Replacement Criteria
    4.3.3 Fragment Assembly by Monte Carlo Simulation
    4.3.4 Energy Function
    4.3.5 Clustering
    4.3.6 Test Protein Set
  4.4 Results
  4.5 Discussion
  4.6 Acknowledgements

Reference
Vita

Polyproline II conformation
Relative contact order
Radius of gyration
Root mean square distance
Tight β-turns
Backbone phi torsion angle
Backbone psi torsion angle

List of Tables

Table 2.1 Predicted folding rates for four recently characterized proteins

Table 3.1 Programs used in SCOP benchmark tests

Table 3.2 Examples of Structural Similarities Identified by Meso_Align But Not by Other Methods

Table 4.1 Protein test set

Table 4.2 Backbone rmsd of the most stable conformation

Table 4.3 Topological clusters from each ensemble


List of Figures

Figure 1.1 Mesostate definition in dihedral angle space

Figure 2.2 Correlation between folding rate and DSSP secondary structure

Figure 2.3 Correlation between folding rates and secondary structure prediction

Figure 3.1 Substitution matrices

Figure 3.2 Specificity vs Sensitivity curve of SCOP family benchmark test

Figure 3.3 Specificity vs Sensitivity curve of SCOP superfamily benchmark test

Figure 3.4 Specificity vs Sensitivity curve of SCOP fold benchmark test

Figure 4.1 Flowchart

Figure 4.2 Distribution of Rg

Figure 4.3AB Superposition of simulation and native conformations

Figure 4.3CD Superposition of simulation and native conformations

Figure 4.3EF Superposition of simulation and native conformations

Figure 4.4A Distribution of energy potentials for 2GB1

Figure 4.4B Distribution of energy potentials for 1UBQ

Figure 4.4C Distribution of energy potentials for 1C9OA

Figure 4.4D Distribution of energy potentials for 1IFB

Figure 4.4E Distribution of energy potentials for 1VII

Figure 4.4F Distribution of energy potentials for 1R69

Chapter 1 Introduction

1.1 Protein structure

Proteins, one of the key classes of macromolecules in living organisms, are involved in almost all aspects of biological activity. A protein cannot perform its normal biological function without folding into a specific three-dimensional structure, called the native conformation, although natively unfolded proteins may be an exception. The first protein structure to be solved by x-ray crystallography was myoglobin, and its conformation revealed that globular proteins are not repetitive structures like DNA; rather, they are compact objects with complex topologies. Kendrew solved the structure of myoglobin almost a half-century ago. Since that time, many more protein structures have been solved by x-ray diffraction and NMR, and the number of structures in the Protein Data Bank (PDB) has increased exponentially.

1.1.1 Hierarchical definition of structures

Protein molecules are linear polymers of amino acid residues, covalently joined via peptide bonds. This linear sequence of residues is called the primary structure. These residues self-organize into specific hydrogen-bonded spatial arrangements called secondary structures, which include α-helices, strands of β-sheet, and tight turns. Recent studies have shown that a significant population of polyproline II conformation (PPII) is also found in both folded and unfolded proteins. PPII is a sterically forced conformation for polyproline peptides in aqueous solution. Consecutive residues in PPII conformation form PPII helices, which are left-handed, all-trans extended helices with average backbone torsion angles of (φ, ψ) = (−75°, +145°). With exactly three residues per turn of helix, the PPII helix is more extended than the α-helix. PPII helical conformation is observed frequently in collagen, where three left-handed PPII helices intertwine to form a right-handed, coiled-coil collagen helix.

Despite its name, there is no restriction on the residue composition of a PPII helix; other residues besides proline can adopt this conformation. As discussed below, polyalanine has a marked propensity to form PPII helices at room temperature in water, and PPII conformation is observed frequently in the unfolded state of proteins. Additionally, PPII conformation is also observed frequently in folded globular proteins. Sreerama and Woody estimated that about 10% of individual amino acid residues in proteins are found in PPII conformation. Owing to the absence of intrachain backbone hydrogen bonds, PPII helices are important in binding and recognition motifs where ligand:protein recognition may involve hydrogen bonding with unsatisfied backbone hydrogen bond donors and acceptors in PPII helices. Studies have shown that PPII conformation is the common binding motif in both WW domains and SH3 domains.

Both sequential and non-sequential regions of the protein can interact to form compact, independently stable structural and/or functional domains. Many isolated domains can fold into their unique three-dimensional conformation independently. From an evolutionary point of view, domain swapping is a possible path for generating new proteins. Thus, an early step in protein structure analysis is usually domain decomposition because direct comparison of multi-domain proteins might lead to spurious results in structure clustering and classification. Additionally, domain decomposition is necessary when predicting protein structure by "threading" because homologs are retained at the domain level, not the protein level.

Structural domains are defined as compact structures in which there is a tendency for hydrophobic residues to be buried in the interior and hydrophilic residues to be exposed to polar solvent at the surface. The most authoritative database of existing protein domains is SCOP, which codifies and classifies domains by several criteria, some of them subjective. In view of the exponential increase of known protein structures, it is desirable to develop domain decomposition algorithms that can be run automatically and do not rely on human intervention.

Several algorithms, such as PUU, DETECTIVE, and DOMAK, have been developed for domain decomposition based on the operational premise that residues within the same domain will experience more intra-domain than inter-domain contacts. However, according to Jones et al., none of these automatic methods has an accuracy that exceeds 80%. Although 100% of the proteins tested could be successfully decomposed into domains when all three of these methods are in agreement, a successful consensus prediction was possible for only 52% of the dataset.

This result was later adopted for use in the CATH database, another codified collection of protein domains. In this case, proteins are decomposed into domains automatically when all of the preceding three methods are in agreement. Otherwise, human judgment is employed.

Guo et al. developed a new algorithm using graph theory and neural networks in 2003, which predicts domains with an accuracy in excess of the 80% threshold as judged by comparison with manually curated SCOP domains. In another approach, Kundu et al. employed a Gaussian Network Model to recognize domains within proteins based on the rationale that residues within the same domain will move in concert. Given that none of the automatic methods can parse proteins into domains with 100% accuracy, manual decomposition remains indispensable, especially in the construction of domain databases.

In 1979, Rose discovered that proteins have a hierarchic architecture, with structural domains emerging naturally from such an organization. Regardless of whether they are regarded as structural, functional, or evolutionary units, domains can be hierarchically decomposed into smaller units of supersecondary and secondary structure. In this dissertation we adopt this reductive approach, focusing on the structure and topology of domains and their constituent parts rather than on whole proteins.

1.1.2 Parameters to define topology

Backbone topology is frequently invoked as an important criterion in the classification of domains. However, it is relatively difficult to quantify backbone topology in a strictly mathematical sense because diverse architectures are topologically equivalent. Consequently, several characteristic parameters have been used to describe a protein's backbone topology, including the order parameter, contact order, Cα-distance matrices, backbone torsion angles, and the mesostate sequence. We will discuss these, each in turn.

The order parameter, Q, is usually defined as the degree of similarity between a given conformation and the corresponding native structure. Q is often used as a convenient reaction coordinate when describing the progress of the folding reaction. Usually Q is normalized such that Q = 0 represents the completely unfolded state and Q = 1 represents the native conformation. When calculating Q, similarity can be assessed in either Cartesian space or torsion angle space.
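One common Cartesian-space choice for Q is the fraction of native contacts that a conformation retains. The sketch below illustrates that choice only; the text does not say which similarity measure is used. The function names, the 8 Å cutoff, the minimum sequence separation of three residues, and the toy coordinates are all assumptions made for illustration.

```python
import math

def contacts(coords, cutoff=8.0, min_sep=3):
    """Return the set of index pairs (i, j), j >= i + min_sep, within `cutoff`."""
    pairs = set()
    for i in range(len(coords)):
        for j in range(i + min_sep, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                pairs.add((i, j))
    return pairs

def order_parameter_q(coords, native_coords, cutoff=8.0, min_sep=3):
    """Fraction of native contacts also present in `coords` (0 = none, 1 = all)."""
    native = contacts(native_coords, cutoff, min_sep)
    if not native:
        raise ValueError("native structure has no contacts under this definition")
    formed = contacts(coords, cutoff, min_sep)
    return len(native & formed) / len(native)

# Hypothetical toy coordinates (one point per residue, in angstroms).
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0),
          (0.0, 3.8, 0.0), (0.0, 3.8, 3.8)]
extended = [(3.8 * i, 0.0, 0.0) for i in range(5)]

q_native = order_parameter_q(native, native)
q_extended = order_parameter_q(extended, native)
```

With this definition the native structure scores 1 against itself, and a fully extended chain, which keeps none of the native contacts, scores 0, matching the normalization described above.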

Since the discovery of a linear correlation between the relative contact order (equation 1.1) and the folding rate constant for two-state proteins, contact order has been adopted as a topological signature. As further described below in Folding Models, Plaxco et al. hypothesized that the contact order is actually a proxy for topology, which is highly correlated with the folding rate constant.

The relative contact order, RCO, is defined as:

\[ \mathrm{RCO} = \frac{1}{L \cdot N} \sum^{N} \Delta S_{i,j} \qquad (1.1) \]

where $\Delta S_{i,j}$ is the number of residues in the linear chain between the spatially contacting residue pair i and j, N is the overall number of contacting residue pairs, and L is the length of the protein. In other words, the relative contact order is the average residue separation between pairs of residues in physical contact, normalized by protein size.
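The definition of relative contact order can be sketched in a few lines. This is a minimal illustration, not the procedure used in the dissertation. It assumes one point per residue (for example, the α-carbon), a hypothetical 6 Å distance cutoff to decide which pairs are in contact, and the sequence separation |i − j| as the separation term.

```python
import math

def relative_contact_order(coords, cutoff=6.0):
    """Average sequence separation of contacting pairs, divided by chain length."""
    length = len(coords)          # L in the definition
    total_separation = 0          # running sum of sequence separations
    n_contacts = 0                # N in the definition
    for i in range(length):
        for j in range(i + 1, length):
            if math.dist(coords[i], coords[j]) <= cutoff:
                total_separation += j - i
                n_contacts += 1
    if n_contacts == 0:
        return 0.0
    return total_separation / (length * n_contacts)

# Hypothetical toy chains with 3.8 angstrom spacing between consecutive residues.
extended_chain = [(3.8 * i, 0.0, 0.0) for i in range(4)]
compact_chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]

rco_extended = relative_contact_order(extended_chain)
rco_compact = relative_contact_order(compact_chain)
```

In the extended chain only sequence neighbors are in contact, so the value is low. Folding the chain back on itself adds contacts between residues that are distant in sequence and raises the value.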

The Cα-distance matrix is defined as a symmetric, square matrix of dimension N × N, where N is the number of residues in the protein. The matrix element in the ith row and jth column is the distance between the α-carbons of the ith and jth residues. This matrix has been widely applied in structural biology to represent backbone topology, for example by the structural comparison program DALI in the FSSP domain database. The distance matrix is also frequently plotted as a distance map to facilitate topological comparison.
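The distance matrix described above can be built directly from a list of α-carbon coordinates. The sketch below uses only the Python standard library; the function name and the three-residue example are hypothetical.

```python
import math

def ca_distance_matrix(ca_coords):
    """Return the N x N matrix whose (i, j) entry is the distance between points i and j."""
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical three-residue fragment (coordinates in angstroms).
fragment = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
matrix = ca_distance_matrix(fragment)
```

The result is symmetric with zeros on the diagonal. Because it depends only on internal distances, it is unchanged when the structure is rotated or translated, which is what makes it convenient for comparing backbone topologies.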

Backbone torsion angles, φ and ψ, can also be used to capture protein topology. To simplify the structural representation of a protein using backbone torsion angles, we discretize φ,ψ-space into 60° × 60° bins, called mesostates. Each of the 36 possible mesostates is represented by a letter of the alphabet (see Figure 1.1). In this way, overall protein topology can be represented simply but coarsely by a one-dimensional mesostate sequence, where each character represents the approximate position of the corresponding residue in φ,ψ-space. The utility of this approximation will be discussed later in this dissertation.
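The discretization into mesostates can be sketched as follows. The labeling scheme of Figure 1.1 is not reproduced in this text, so the 36-symbol alphabet below (A to Z, then 0 to 9) and the placement of the bin boundaries at multiples of 60° starting from −180° are placeholders chosen for illustration only.

```python
import string

BIN_WIDTH = 60                      # degrees, as stated in the text
BINS_PER_AXIS = 360 // BIN_WIDTH    # six bins along each torsion axis
# Placeholder alphabet of 36 symbols; NOT the labels used in Figure 1.1.
SYMBOLS = string.ascii_uppercase + string.digits

def bin_index(angle):
    """Map an angle in degrees onto one of six 60-degree bins covering [-180, 180)."""
    wrapped = (angle + 180.0) % 360.0
    return int(wrapped // BIN_WIDTH) % BINS_PER_AXIS

def mesostate(phi, psi):
    """Return the placeholder symbol for the bin that contains (phi, psi)."""
    return SYMBOLS[bin_index(phi) * BINS_PER_AXIS + bin_index(psi)]

def mesostate_sequence(torsions):
    """Encode a list of (phi, psi) pairs as a one-dimensional string."""
    return "".join(mesostate(phi, psi) for phi, psi in torsions)

# Two helix-like residues followed by one PPII-like residue (hypothetical values).
example = mesostate_sequence([(-70.0, -40.0), (-90.0, -30.0), (-75.0, 145.0)])
```

Residues whose torsion angles fall in the same bin receive the same symbol, so the first two residues in the example are indistinguishable after encoding. That loss of detail is the coarse-graining the text describes.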

1.1.3 Classification of protein domain structures

The classification of protein domains is usually based on their structures. Given that homologous proteins have similar structures, domain classification can be a key step when recognizing evolutionary relationships. Additionally, domain classification is an indispensable step in structural prediction methods like homology modeling and threading. The essential step in structural classification is structural comparison, which provides a distance between any pair of domains in domain space, and from which a hierarchic tree can be drawn.

Given that a protein's three-dimensional structure is a consequence of its primary sequence, a set of structural comparison methods based on sequence alignment has been developed. Examples of such methods include BLAST and FASTA, which build on dynamic programming and the Smith-Waterman algorithm. Domains with a sufficiently high degree of aligned sequence similarity are likely homologs, and thus sequence similarity varies inversely with structural distance. Information of this sort can be obtained from sequence alignment, with similarity between residues quantified by similarity matrices such as BLOSUM and PAM.

The success of these pairwise sequence alignment programs in detecting evolutionary relationships diminishes markedly when sequence identity falls below ~30%, where proteins with similar backbone topologies can have dissimilar sequences. In response, newer sequence-to-profile algorithms and profile-to-profile algorithms have been developed that boost detection sensitivity for distantly related proteins having similar structures. Although newer programs that incorporate multiple sequence alignment instead of pairwise alignment do improve detection sensitivity substantially, structure-based methods still outperform sequence-based methods in both sensitivity and specificity.

To overcome some of the shortcomings of sequence alignment, many investigators have developed domain classification approaches based on optimal three-dimensional alignment. Domains in SCOP, the most authoritative domain database, are classified hierarchically into CLASS, FOLD, SUPERFAMILY, and FAMILY, based on structural and evolutionary information as mediated by the subjective judgment of the human classifier. Proteins are first clustered into families based on their sequence and then grouped into superfamilies based on known evolutionary relationships. Superfamilies or families are then further grouped into folds by backbone topology, resulting finally in five classes: (a) all alpha proteins, (b) all beta proteins, (c) proteins with interspersed alpha-helices and beta-strands, (d) proteins with segregated alpha-helices and beta-strands, and (e) proteins composed of domains in different folds or domains with no known homologues. The SCOP database is generally accepted as the gold standard because classification is performed subjectively by experts in this field.

CATH, another well-known domain database, classifies domains hierarchically into CLASS, ARCHITECTURE, TOPOLOGY, and HOMOLOGOUS SUPERFAMILY, analogous to SCOP. Although classification is asserted to be "semi-automatic", the key levels, i.e., the structure-determined levels (architecture and topology), are assessed manually by human experts.

With the exponential increase of deposited structures in the Protein Data Bank (PDB), fully automatic classification methods become increasingly necessary. The FSSP database, based solely on domain structures, uses the DALI algorithm to recognize structural neighbours. DALI measures the structural similarity between two proteins by matching their Cα-distance matrices (described in Parameters to define topology above). In contrast, the ENTREZ database identifies protein neighbors using VAST, an alignment algorithm that compares superimposed arrays of vectors between the secondary structure elements within each respective domain.

The structural comparison methods described above are based on either sequence or direct three-dimensional structure alignment. Recently, Honig and coworkers introduced a profile-to-profile alignment program that improves the performance for remote homolog detection by combining both primary and secondary structure information.

Earlier, Przytycka and Rose had already proposed that secondary structure alone may be sufficient to recognize tertiary structure. In their study, 183 proteins with less than 30% aligned sequence identity were represented as linear strings of secondary structure elements, including turns and loops. Using a simple scoring matrix, conventional pairwise sequence comparisons between these strings were performed and used to construct a Przytycka-tree (P-tree), in which the distance between any two nodes is proportional to the difference in score between their aligned secondary structure strings. The P-tree is generated completely automatically, and it reflects the global secondary structure relationships among the proteins used to construct it: the closer the nodes, the greater the similarity of secondary structure among their corresponding proteins. Surprisingly, the straightforward P-tree was found to be largely in agreement with the SCOP tree, although the latter is a complex construct based on structure, evolutionary knowledge, and human judgment. This result lends support to the hypothesis that successful fold recognition can be derived solely from knowledge of secondary structure.

In my graduate work, I sought to extend this idea by quantifying the degree to which approximate backbone conformation can determine the protein fold (chapter 3). A dynamic programming algorithm was devised to compare domain structures by aligning their approximate backbone torsion angles, represented as mesostate sequences (described above).
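To make the idea of aligning mesostate strings by dynamic programming concrete, here is a minimal global-alignment sketch in the Needleman-Wunsch style. It is not the algorithm of chapter 3, which relies on substitution matrices described there. The match, mismatch, and linear gap scores below are hypothetical stand-ins, and the example strings use an arbitrary alphabet.

```python
def global_alignment_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Best global alignment score of two strings under a simple scoring scheme."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,   # align the two characters
                score[i - 1][j] + gap,        # gap in seq_b
                score[i][j - 1] + gap,        # gap in seq_a
            )
    return score[-1][-1]

identical = global_alignment_score("IIILL", "IIILL")
one_deletion = global_alignment_score("IIILL", "IILL")
unrelated = global_alignment_score("AAA", "BBB")
```

Each table cell holds the best score for aligning two prefixes, so the bottom-right cell is the score of the optimal full alignment. A pairwise score of this kind is the sort of quantity from which a distance between two domains can be derived.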

The specific hypothesis being tested is that domains with similar mesostate sequences have similar structures. The converse proposition is certainly true: similar structures always have similar mesostate sequences. Consequently, validation of our hypothesis means that structure recognition could be accomplished successfully by mesostate alignment.

Having shown that domain structures can indeed be recognized successfully from their approximate backbone conformations, I extended this work by designing an algorithm to actually rebuild the native structure from its mesostate sequence (chapter 4). The process was implemented by fragment assembly Monte-Carlo simulation starting from an extended polyalanine chain, with mesostate-constrained backbone torsion angles.

1.1.4 Structure Prediction

All information needed to encode a protein's structure is stored in its sequence, but the question of how to extract and utilize this information for successful prediction remains unanswered. The hierarchic architecture of proteins suggests that the initial predictive step should be focused on secondary structure. Historically, bioinformatics research started with secondary structure prediction. The earliest research in this area concentrated on mapping a protein's sequence into a three-state secondary structure model, comprising α-helices, β-strands, and coils (i.e., all other structures). In this approach, the operating assumption was that different residues would exhibit differing propensities to populate distinct local regions in φ,ψ-space. Both the Chou-Fasman and GOR algorithms are of this type. The most recent version of the GOR algorithm, GOR IV, has a prediction accuracy of about 65% and remains one of the most popular secondary structure prediction programs in current use.

A new generation of secondary structure prediction methods, however, has been developed to take advantage of machine learning strategies, such as neural networks, hidden Markov models, and support vector machines. Such methods capture local sequence patterns by multiple sequence alignment and extend prediction categories beyond the conventional three states. For example, PHDsec, one of the most successful among this new generation of methods, uses a neural network together with evolutionary information from related sequences to predict secondary structures with greater than 70% accuracy.

The current accuracy of secondary structure predictions has reached a plateau near 80%. A systematic analysis of several popular secondary structure prediction algorithms, including PHDsec, PSIPRED, Jnet, and PREDATOR, using a large dataset of 2777 non-homologous proteins, concluded that prediction accuracy correlates negatively with residue contact order and that inclusion of long-range interactions would be needed for any further improvement.

The accuracy of tertiary structure prediction still lags that of secondary structure prediction. The most accurate tertiary structure prediction method is comparative modeling, which utilizes known homologs of the unknown target protein to build a suitable structural model under the assumption that homologous proteins share the same backbone topology. Sidechain torsions are assigned once the backbone conformation is established, based on other work showing that sidechain and backbone conformation are tightly coupled. This method works well in those cases where a target protein homolog exists in the Protein Data Bank and can be identified successfully.

In cases where a homolog is not available, the only recourse is ab initio prediction, in which tertiary structure is predicted solely from the sequence. Two well-known ab initio algorithms, LINUS and ROSETTA, are discussed here.

LINUS simulations attempt to capture the native conformation from first principles using the Metropolis Monte-Carlo algorithm to search conformational space (torsion angle space). The energy scoring function includes only hard-sphere sterics, hydrogen bonds, contact energy, and solvation energy.
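The Metropolis criterion at the heart of such a search can be sketched generically. The code below is not LINUS: it applies the standard acceptance rule to a single hypothetical torsion angle with a toy quadratic energy, and every name, the move set, and the temperature (expressed in energy units, i.e. kT) are assumptions made for illustration.

```python
import math
import random

def metropolis_accept(delta_energy, temperature, rng=random):
    """Accept downhill moves always; accept uphill moves with probability exp(-dE / kT)."""
    if delta_energy <= 0:
        return True
    return rng.random() < math.exp(-delta_energy / temperature)

def run_metropolis(energy, start, steps, step_size, temperature, seed=0):
    """Random-walk search over one variable, returning the final state and its energy."""
    rng = random.Random(seed)
    state, current = start, energy(start)
    for _ in range(steps):
        trial = state + rng.uniform(-step_size, step_size)
        trial_energy = energy(trial)
        if metropolis_accept(trial_energy - current, temperature, rng):
            state, current = trial, trial_energy
    return state, current

def toy_energy(angle):
    """Hypothetical single-well energy with its minimum at 60 degrees."""
    return (angle - 60.0) ** 2 / 100.0

final_angle, final_energy = run_metropolis(
    toy_energy, start=-120.0, steps=5000, step_size=10.0, temperature=0.1
)
```

Occasionally accepting uphill moves lets the search escape local minima. In a real simulation the state would be the full set of backbone torsion angles, and the energy would be a scoring function such as the one listed above.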

ROSETTA employs a fragment assembly method, again using Monte-Carlo simulation. In each step of the simulation, a three- or nine-residue database fragment is substituted in the target, a strategy that avoids most local collisions by adopting viable local fragments directly from existing PDB structures. Recently, Baker and coworkers built several models with surprisingly high accuracy (sometimes less than 1 Å root-mean-square difference from the native conformation) by employing several homologous sequences.

1.2 Protein folding problem

Anfinsen and coworkers have shown that the native conformation of a protein is uniquely determined solely by information encoded within its residue sequence. Although Anfinsen's results show that proteins adopt their functional three-dimensional structure spontaneously, the mechanism by which this process occurs has remained elusive. This famous folding problem has been formulated as the reversible transition between folded and unfolded states, unfolded ⇌ folded, and it can be further decomposed into two views: (1) the thermodynamic and macroscopic view or (2) the dynamic and microscopic view. Lacking experimental methods that can track the folding pathway of single macromolecules, the focus of most current experiments is on thermodynamic and kinetic studies of protein ensembles. Protein folding is a highly cooperative process, as shown by numerous thermodynamic experiments on the folding transition obtained under different chemical and physical conditions. Kinetic experiments on folding rates following perturbations such as a temperature jump have been performed on many proteins. Macroscopic variables, such as the free energy, entropy, and enthalpy changes, and even reaction rate constants, can be obtained from such experiments. In contrast, microscopic data cannot be extracted readily from existing experiments, and must usually be obtained from computer simulations and/or theory.

1.2.1 Thermodynamics of the folding process

What stabilizes the native conformation?

The folded state, regarded as the ensemble of near-native conformations, has some degree of flexibility, without which proteins could not perform their biological functions. Although folded conformations have lower energies than unfolded ones, globular proteins are only marginally stable at room temperature. No covalent bonds are made or broken in the folding process except for disulfide bonds, and the major contributions stabilizing the native conformation are thought to be from nonbonding potentials, including: (1) electrostatics, (2) van der Waals interactions, (3) hydrogen bonding, and (4) hydrophobic interactions. The huge loss in configurational entropy that accompanies protein folding, however, counteracts these favorable contributions despite the gain in solvent entropy, resulting in the marginal stability of the native conformation. Against this backdrop of marginal stability, each of these favorable energy terms is important because even small differences could change the direction of the folding transition. These four energy terms are now discussed in greater detail.

Charged particles give rise to electrostatic interactions, which are long-range. According to Coulomb's law, the electrostatic potential is inversely proportional to both charge separation and the dielectric constant. Although these interactions are long-range, the high dielectric constant of water (~80) screens them substantially. The protein interior, however, is much more hydrophobic than bulk water, with a low dielectric constant (~12-20). Consequently, ionizable groups inside the protein are energetically significant. In most globular proteins, the charged groups are localized on the surface as anticipated, given the large energy cost of burying uncompensated charges in the interior. At physiological pH, ionizable groups are not uniformly positive or negative. Rather, the protein surface is "bristling with both positive and negative charges" that facilitate solubility and stabilize the native conformation. At extremes of pH, below the pK of acidic groups or above the pK of basic groups, the resultant repulsion between ions can denature proteins. Although ion pairing may contribute ~1-3 kcal/mol to protein stability, it is clear that ion pairing is not a dominant force in protein stability. The folded conformation shows little dependence on pH or salt, both of which could influence the electrostatic potential substantially. Additionally, structural studies of protein homologs show that ion pairs are not well conserved in evolution.
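The screening argument can be illustrated with a short calculation. The following is a minimal sketch, not part of the original analysis: it evaluates Coulomb's law for a hypothetical pair of opposite unit charges at an assumed separation of 4 Å, using the dielectric constants quoted in the text. The conversion constant 332.06 kcal·Å/(mol·e²) is the standard value for these units; the function name and the chosen separation are illustrative assumptions.

```python
# Standard Coulomb conversion constant for charges in units of e,
# distances in Angstrom, and energies in kcal/mol.
COULOMB_KCAL = 332.06


def coulomb_energy(q1, q2, r_angstrom, dielectric):
    """Coulomb energy (kcal/mol) of two point charges in a uniform dielectric."""
    return COULOMB_KCAL * q1 * q2 / (dielectric * r_angstrom)


# Hypothetical ion pair (+1 / -1) at an assumed separation of 4 Angstrom.
# The dielectric constants are the values quoted in the text.
for label, eps in [("bulk water", 80.0),
                   ("protein interior, upper estimate", 20.0),
                   ("protein interior, lower estimate", 12.0)]:
    energy = coulomb_energy(+1, -1, 4.0, eps)
    print(f"{label:34s} eps = {eps:5.1f}   E = {energy:6.2f} kcal/mol")
```

Under these assumptions the same ion pair interacts several times more strongly in the low-dielectric interior than in bulk water, which is the sense in which buried ionizable groups are energetically significant.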

The van der Waals force is a weak, short-range interaction occurring between all atoms, both polar and non-polar. It can be decomposed into two components: an attractive part arising from the interaction between transient dipoles, and a repulsive part that arises when the electron clouds of two non-bonded atoms overlap. The potential is frequently approximated by the Lennard-Jones function

$$E(r_{ij}) = \frac{A}{r_{ij}^{12}} - \frac{B}{r_{ij}^{6}} \qquad (1.2)$$

where $r_{ij}$ is the distance between two atoms, i and j. Van der Waals forces are expected to stabilize the native conformation rather than the unfolded state owing to short-range interactions within the close-packed protein core.
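To make the short-range character of this potential concrete, the sketch below evaluates the 12-6 Lennard-Jones form and locates its minimum analytically. It is an illustration only: the well depth (0.1 kcal/mol) and optimal separation (3.8 Å) are hypothetical values chosen for the example, and the coefficients A and B are derived from them rather than taken from any force field.

```python
def lennard_jones(r, a, b):
    """12-6 potential: repulsive a/r^12 term minus attractive b/r^6 term."""
    return a / r**12 - b / r**6


def lj_minimum(a, b):
    """Separation and energy at the minimum, obtained by setting dE/dr = 0."""
    r_min = (2.0 * a / b) ** (1.0 / 6.0)
    return r_min, lennard_jones(r_min, a, b)


# Hypothetical parameters: well depth 0.1 kcal/mol at 3.8 Angstrom.
WELL_DEPTH = 0.1
R_OPT = 3.8
A = WELL_DEPTH * R_OPT**12        # repulsive coefficient
B = 2.0 * WELL_DEPTH * R_OPT**6   # attractive coefficient

r_min, e_min = lj_minimum(A, B)
print(f"minimum at r = {r_min:.2f} Angstrom, E = {e_min:.3f} kcal/mol")

# The attraction decays quickly with distance, and the repulsion rises
# steeply once the atoms approach more closely than the optimum.
for r in (3.0, 3.8, 5.0, 8.0):
    print(f"r = {r:4.1f} Angstrom   E = {lennard_jones(r, A, B):8.4f} kcal/mol")
```

Because the attractive term falls off as the sixth power of distance, only atoms in near contact contribute appreciably, which is why a close-packed core favors the native conformation.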

A hydrogen bond is a short-range interaction between a polarized hydrogen bond donor, D-H, where D is the hydrogen-donating atom, and the polarized nonbonding orbitals of an acceptor atom, A. Although hydrogen bonds have some degree of covalent bond character, they are viewed primarily as electrostatic dipole-dipole interactions. The optimum distance separating the donor and the acceptor ranges from 0.26 to 0.30 nm, with an almost co-linear geometric arrangement of the three participating atoms: donor (D), hydrogen (H), and acceptor (A). It was Pauling who first proposed that hydrogen bonds play a significant role in macromolecular folding and stability. In seminal articles published in 1951, Pauling et al. hypothesized that hydrogen bonds provide the driving force to form α-helices and β-sheets, and estimated hydrogen bond strength to range between 2 and 10 kcal/mol. Soon after, Schellman proposed that an intrachain hydrogen bond is energetically favorable relative to a hydrogen bond with water by ~1.5 kcal/mol, based on the measured formation of urea dimers in solution. Experiments on helix-coil transitions and β-sheet formation provided additional support for Pauling's hypothesis that hydrogen bonds are the principal driving force in such processes.

Intramolecular hydrogen bond donors and acceptors are abundant in folded proteins. On average, there are 1.1 intramolecular hydrogen bonds per residue, which compensate for lost intra- and intermolecular hydrogen bonds in the unfolded protein. Accordingly, hydrogen bonds were considered to be a key factor in stabilizing the native conformation. However, Kauzmann questioned this conclusion, arguing that the energy of intrachain hydrogen bonds in the folded state would not differ significantly from the energy of corresponding peptide:water hydrogen bonds in the unfolded state. Reasoning from the strengths of hydrogen bond interactions in model compounds, he hypothesized that the hydrophobic effect, rather than hydrogen bonds and van der Waals interactions, would be the principal driving force in protein stability. Kauzmann's proposal was bolstered by several later studies. For example, Klotz and Franzen measured hydrogen bond formation in N-methylacetamide (NMA) and found that in water, the enthalpy of NMA dimerization is approximately zero, which implies that hydrogen bonds are not stabilizing after including the entropic cost of intrachain hydrogen bond formation. Similar experiments on another small molecule, ε-caprolactam, reached the same conclusion. These experiments persuaded the field that hydrophobic interactions, rather than hydrogen bonds, are the dominant source of protein stabilization, although there remained some evidence to the contrary from studies of cyclic dipeptides and diketopiperazines.

In more recent work, the pendulum seems to be swinging back in Pauling's direction. Pace and co-workers performed numerous site-directed mutagenesis studies, and concluded that the enthalpic stabilization provided by hydrogen bonds is about 1.6 kcal/mol, larger than the hydrophobic effect. After inclusion of an entropy correction term, a hydrogen bond still stabilizes the native conformation by 0.6 kcal/mol; summed over the entire protein, this is comparable to the magnitude of the hydrophobic effect. Makhatadze and Privalov reached a similar conclusion from different proteins and model compound data. Analyses of protein x-ray structures confirm the importance of hydrogen bonds by showing that most buried groups in globular proteins are hydrogen-bonded. Fleming and Rose recently proposed that all peptide hydrogen bond donors or acceptors should be satisfied either by intrachain partners or by water molecules. Their conclusion was supported by careful examination of high resolution crystal structures, where it was found that apparent exceptions can be rationalized.

Nowadays, hydrogen bonds are accepted as key factors that influence protein folding. Although the question of whether they are the dominant contributor to folding stability remains controversial, it is widely accepted that hydrogen bonds are important in the specificity of protein folding. Along these lines, DePristo et al. implemented an explicit hydrogen bond potential in their x-ray refinement program and successfully generated an ensemble of conformations compatible with experimental diffraction data. In addition, Baker and colleagues successfully built high resolution protein structures using de novo methods by emphasizing hydrogen bonds and van der Waals interactions.

Kauzmann coined the term "hydrophobic bond" in connection with his famous "oil drop" model. A hydrophobic bond is not a direct physical interaction between atoms; rather, it describes the tendency of non-polar residues to reduce their solvent-accessible surface by clustering together, and in this sense it is not a conventional chemical bond.

Kauzmann's proposition that the hydrophobic interaction is the principal driving force for protein folding was later supported by at least four sets of experimental observations. (1) It was observed that non-polar solvents can denature proteins. (2) Kauzmann's proposal slightly predated successful protein x-ray crystallography. Soon after, as x-ray crystal structures became available, it could be seen that apolar residues are preferentially buried in the molecular interior and, conversely, non-hydrogen bonded polar residues are preferentially solvent-exposed. (3) Calorimetry experiments revealed the similarity between the temperature dependence of the free energy change upon protein folding and that of the transfer free energy of non-polar molecules from water into non-polar solvent. (4) Proteins were also found to unfold at low temperature, a process called cold denaturation.

In the initial model proposed by Kauzmann, the hydrophobic effect was measured by the transfer free energy of a non-polar compound from organic liquid into water. Ben-Naim and coworkers modified this simple model by introducing backbone atoms, arguing that non-polar sidechains connected to backbones are more appropriate models when studying the protein hydrophobic effect because the hydrophilic component also plays an important role in the free energy of transfer. Thus, solvation energy, rather than the traditional hydrophobic interaction, should be used to characterize this effect. Although Pace and coworkers have argued persuasively that the hydrophobic interaction and hydrogen bonds make comparable contributions to protein stability, it is still generally accepted that hydrophobic interactions are the principal driving force in protein stability. Of course, this view may change in time.

Counteracting these four weak forces that stabilize the native conformation, entropic terms destabilize it. The entropy change during folding can be partitioned into two parts: configurational entropy loss and solvent entropy gain. Solvent entropy gain is included in the hydrophobic interaction. The configurational entropy loss was assumed to be large and positive because the denatured ensemble has been regarded as a featureless statistical coil. Recent studies, however, indicate that the unfolded state is not actually a featureless statistical coil with an astronomical number of available states; instead, it retains some degree of order. Thus, the entropy loss on folding will be smaller than previously expected. This topic will be discussed later in the Introduction (Unfolded State).

1.2.2 Dynamic view of folding

Levinthal paradox

Perhaps the most famous concept in protein folding theory is the "Levinthal paradox". In this back-of-the-envelope calculation, each residue in a protein molecule can adopt several different conformations, consistent with the two degrees of backbone freedom, φ and ψ, for each amino acid residue. For a polypeptide chain of 100 residues, the total number of conceivable conformations is then an astronomical number. Even if each residue were limited to only two states (the alpha and beta regions on a Ramachandran plot), the total number of possible conformations is still $2^{100} \approx 10^{30}$. Taking the rate of single-bond rotations into account, the native state could not be attained in a biologically relevant time frame by random search. In fact, the typical folding time for a protein molecule is in the range of microseconds to seconds. For Levinthal, this was no paradox at all but rather a demonstration that protein folding is not a random search. Instead, there must be a specific pathway or mechanism that guides the protein from the unfolded ensemble to its native conformation.
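The back-of-the-envelope estimate can be reproduced in a few lines. This is a minimal sketch: the chain length (100 residues) and the two states per residue come from the text, whereas the sampling rate of 10^13 conformations per second is an assumed, illustrative figure standing in for the rate of single-bond rotations.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def levinthal_estimate(n_residues, states_per_residue, samples_per_second):
    """Return (number of conformations, years needed to visit each one once)."""
    n_conformations = states_per_residue ** n_residues
    seconds = n_conformations / samples_per_second
    return n_conformations, seconds / SECONDS_PER_YEAR


# 100 residues and 2 states per residue are taken from the text; the
# sampling rate of 1e13 conformations per second is an assumption.
n_conf, years = levinthal_estimate(100, 2, 1e13)
print(f"conformations: {n_conf:.2e}")
print(f"exhaustive search time: {years:.1e} years")
```

Even with this generous sampling rate, an exhaustive search would take billions of years, against observed folding times of microseconds to seconds.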

Energy Landscape Theory

To solve the Levinthal paradox, Wolynes and colleagues considered folding from the vantage point of the protein's energy surface. Using a minimalist model, they studied the behavior of random heteropolymers using statistical mechanics and computer simulations and found that the energy landscape looks like a smooth funnel. Later studies extended this work to proteins, in which the energy landscape is also funnel-shaped, but with a more rugged surface. In the protein funnel, a small set of conformations at the bottom corresponds to native conformers.

Seen from this vantage point, the resolution of the Levinthal paradox is to include those cooperative, energetically favorable interactions that accumulate as proteins fold and drag the molecules toward native conformations. As depicted in the Levinthal paradox, the energy landscape is analogous to a flat golf course, with a single hole representing the native conformation. The probability that a molecule would find such a hole via random search is negligibly small. In sharp contrast, protein energy landscape theory predicts a funnel-shaped energy surface, in which an unfolded molecule is dragged to its native conformation. The protein energy surface is rugged because polypeptide chains can sample multiple conformations during folding, some of which are non-native. Indeed, some low troughs on the rough surface may correspond to major off-pathway misfolders found in experiments. The degree of roughness also affects the folding rate; the funnel landscape for small, topologically simple proteins is comparatively smooth, with correspondingly faster folding rates. Although this theory rationalizes the Levinthal paradox, it has not been validated by experiment and does not address the folding pathway in microscopic detail.

The next section discusses some other theoretical folding models that are more closely tied to experimental work.

Folding Models

In the Karplus and Weaver diffusion-collision model, proposed in 1976, the initial folding phase involves the rapid equilibrium formation and decomposition of fluctuating quasiparticles, called microdomains. These can be hydrophobic clusters or even elements of secondary structure. These microdomains diffuse and collide, sometimes productively, leading to coalescence into larger intermediates. The native conformation is then produced by iterative coalescence of such intermediates. The Karplus and Weaver model does not exclude parallel folding pathways, and it assumes that folding is a solvent-dependent process, in accord with later experiments.

The Kim and Baldwin framework model, proposed six years after the diffusion-collision model, emphasized hierarchic folding. In the framework model, simple, local structures form first and then assemble into more complex structures by longer-range interactions. The model derives its name from the idea that elements of hydrogen bonded secondary structure form early and act as a framework for subsequent tertiary structure formation. The framework model and the diffusion-collision model are not mutually exclusive. If the microdomains of the diffusion-collision model correspond to hydrogen-bonded secondary structures, then the two models are essentially identical. Hierarchic folding is implicit in the diffusion-collision model, and appropriately so, as shown later by several experiments and computer simulations. Together, the combination of the diffusion-collision model and the framework model provides a way out of the Levinthal paradox by uncoupling the formation of secondary structure from that of tertiary structure and restricting the search process to the diffusion and collision of secondary structure elements.

In 1985, Dill proposed a hydrophobic collapse model based on the idea that collapse precedes formation of specific structures. In refolding experiments, solvent conditions change from "good" solvent, which favors unfolding, to "poor" solvent, which favors folding. In poor solvent, proteins collapse around their hydrophobic sidechains to minimize solvent access to hydrophobic residues. Collapse to a smaller volume reduces the number of accessible conformations, thereby partially resolving the Levinthal paradox.

The nucleation-condensation model, proposed by Fersht in 1997, was motivated by his Φ-value analysis of several small proteins, e.g., CI-2. The Φ-value is defined as the degree to which a residue of interest is native-like at the protein's transition state. Experimentally, the Φ-value is obtained by measuring how much a mutation affects the transition state relative to the extent to which it affects the native state. Based on Φ-value analysis, secondary and tertiary structure form in parallel in CI-2, and several long-range interactions are important in forming the transition state topology. In nucleation-condensation, the most stable residues in the transition state are considered to be a nucleus around which condensation occurs, followed by rapid folding to the native conformation. The rate-limiting step in this model is the formation of nuclei, so the search process is not a random one. The model is assumed to apply only to the folding of small, two-state proteins. For larger proteins, it is assumed that smaller modules form first by nucleation-condensation and then dock hierarchically.
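The experimental definition of the Φ-value can be written as a ratio of two free energy changes. The sketch below is an illustration under stated assumptions, not the procedure used in the cited work: it takes Φ as the mutational change in the free energy of the transition state divided by the change in the free energy of the native state, both measured relative to the denatured state, and it estimates the former from folding rate constants with the commonly used relation ΔΔG = RT ln(k_wt / k_mut). The rate constants and the stability change in the example are hypothetical.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def ddg_from_rates(k_wild_type, k_mutant, temperature=298.0):
    """Change in activation free energy (kcal/mol) implied by a change in folding rate.

    Assumes the transition-state relation ddG = RT * ln(k_wt / k_mut).
    """
    return R_KCAL * temperature * math.log(k_wild_type / k_mutant)


def phi_value(ddg_transition_state, ddg_native):
    """Ratio of the mutational effect on the transition state to that on the native state."""
    if ddg_native == 0:
        raise ValueError("the mutation must perturb native-state stability")
    return ddg_transition_state / ddg_native


# Hypothetical mutant: folding rate drops from 50 to 10 per second, and the
# native state is destabilized by 1.5 kcal/mol (both values are assumed).
ddg_ts = ddg_from_rates(50.0, 10.0)
phi = phi_value(ddg_ts, 1.5)
print(f"ddG(transition state) = {ddg_ts:.2f} kcal/mol, phi = {phi:.2f}")
```

A value near 1 indicates that the mutated residue is as structured in the transition state as in the native state, a value near 0 that it is as unstructured as in the denatured state, and intermediate values indicate partial structure.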

Two years later, Debe et al. proposed the topomer-sampling model. A topomer is defined as a cluster of conformations with the same degree of compaction, all of which are interchangeable under local backbone coordinate transformations that maintain the backbone covalent bonding. To test this idea, they exhaustively enumerated all topomers for a polypeptide chain of length N residues and found that a 100-residue protein could diffusively search the entire set and find the native topomer within a biologically realistic time frame (100 ms). Once found, the molecule is postulated to fold to its native conformation by rapid condensation. Plaxco and coworkers renamed this model the topomer search model and used it to successfully predict the folding rate of two-state proteins, operating on the assumption that relative folding rates are proportional to the probability of achieving the native topomer which, in turn, depends on the topological complexity of the protein in question.

We proposed our own model for two-state protein folding, a combination of the framework and topomer search models that proceeds in three distinct stages. In the first stage, marginally stable elements of secondary structure are formed. Collapse occurs in the second stage, which involves a diffusive search for the native topomer. The rate constant of this second stage is postulated to correlate negatively with the length of interconnecting loops between secondary structure elements formed in the first stage. Finally, after reaching the native topomer, there is a third stage in which condensation to the native fold occurs. For two-state proteins lacking cis-trans isomerization of prolines, the rate-limiting step would occur in the second stage. Consequently, the measured folding rate should correlate negatively with the length of linking loops, the larger the

Title	Exploring The Relationship Between Secondary Structure And Native Topology In Protein Domains
Author	Haipeng Gong
Advisors	George D. Rose, Douglas E. Barrick
University	Johns Hopkins University
Field	Biochemistry
Type	Dissertation
Year of publication	2006
City	Baltimore
