Exploring The Relationship Between Secondary Structure And
Native Topology In Protein Domains
by
Haipeng Gong
A dissertation submitted to Johns Hopkins University in conformity with the
requirements for the degree of Doctor of Philosophy
Baltimore, Maryland
August, 2006
UMI Number: 3240716
UMI Microform 3240716
Copyright 2007 by ProQuest Information and Learning Company.
All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, MI 48106-1346
Abstract
Since the introduction of Pauling's groundbreaking model, numerous experiments have shown that hydrogen-bonded secondary structure is an important factor in protein folding. Under folding conditions, the linear polypeptide chain can form marginally stable elements of secondary structure on a rapid time scale. Such elements, which are in dynamic equilibrium with their respective coil states, interact with one another, further organizing and stabilizing the protein. We hypothesize that this latter step is rate limiting in the folding of a protein domain. To validate this idea, I tested whether the logarithm of the folding rate constant is linearly correlated with a protein's secondary structure content. The observed, large correlation coefficient is consistent with our hypothesis and underscores the importance of secondary structure elements in organizing the folding process.
Przytycka and Rose proposed that the sequence of secondary structure elements is sufficient to capture a protein's native conformation, and they tested this proposal for a large collection of representative protein domains by showing that the hierarchic tree derived by aligning secondary structure sequences is almost identical to the one derived by direct three-dimensional structure comparison.
To extend this idea, I developed a dynamic programming algorithm to compare domain structures by aligning mesostate sequences, where a mesostate is a coarse-grained representation of a backbone torsion angle. Comparison of the performance of this algorithm against several existing fold recognition algorithms further supports the proposition that the sequence of secondary structure elements determines the protein's three-dimensional conformation.
To retrieve the information about native conformation that is implicit in the mesostate sequence, I developed a fragment replacement Monte-Carlo algorithm that uses only this information to generate tertiary structure. Specifically, a crude potential including only hydrogen bonding, steric exclusion, and spatial confinement was sufficient to regenerate native-like backbone topology from the coarse-grained torsion angle restraints imposed by the native mesostate sequence.
This dissertation is divided into three major parts, each of which corresponds to one of the three topics mentioned above. Together, these three inter-related approaches highlight the central role that secondary structure plays in the protein folding process.
Thesis Advisor: George D. Rose
Second Reader: Douglas E. Barrick
Acknowledgement
The work for this dissertation could not have been completed without the help of many people, only a few of whom I name here. However, my acknowledgment is due to all of those who helped me.
First, I thank my advisor George Rose, not only because all of the work was done under his direction, but also because I learned how to conduct scientific research from him. From George, I learned how to partition a terribly huge problem into several small ones, which could be solved sequentially. At the same time, his suggestions always kept me from sinking so far into the details of small projects that I forgot the physical meaning of the original huge problem. Additionally, his insight into both science and philosophy shaped my view of the world: science should be simple and elegant; one should remain skeptical of any existing theory. George helped me not only as an advisor, but also as a friend. Without his help, my written English could not have improved so much, nor could I have become accustomed to American life so quickly.
I also thank the faculty members who taught the courses I took at Johns Hopkins University, especially Biophysics I and II, which convey important definitions, ideas, and experimental methods in modern biophysics. The faculty members who participated in my annual thesis reviews also helped me greatly, both in expediting my research and in improving my presentation skills.
My colleagues in the Rose lab have also been very helpful to me. Rajgopal Srinivasan guided me through the detailed implementation of LINUS simulations and helped me a lot in grasping the programming language Python. Teresa Przytycka and Rohit Puppu frequently gave me suggestions in mathematical and physical terms. Patrick Fleming not only taught me a great deal about molecular simulation, but also participated in my project and wrote useful programs for me. Additionally, he is usually the first polisher of my written work. Nicholas Fitzkee is always my encyclopedia for computers and programming languages. Nicholas Panasik and Timothy Street regularly supplied insightful suggestions in group meetings.
I thank Ranice Crosby, Lisa Jia, Jerry Levins, and Ken Rutledge for their administrative support and patience. Ranice, who is usually the first person I consult, has helped me a great deal with everyday matters, even beyond regular administration.
I must thank my parents, who have been so supportive of me both personally and financially during the last six years. The money for my transportation and living expenses during my first couple of months in America was astronomical to them and could not have been saved without ten years of diligent work. This dissertation is dedicated to my father, who is now convalescing from coronary heart disease.
Table of Contents
Abstract
Acknowledgement
Table of Contents
Abbreviations
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Protein structure
1.1.1 Hierarchical definition of structures
1.1.2 Parameters to define topology
1.1.3 Classification of protein domain structures
1.1.4 Structure Prediction
1.2 Protein folding problem
1.2.1 Thermodynamics of the folding process
1.2.2 Dynamic view of folding
1.3 Unfolded state
1.3.1 Polyproline II helix
1.3.2 Native-like residual structure
1.3.3 Invalidity of Flory's independent pair hypothesis
Chapter 2 Local Secondary Structure Content Predicts Folding Rate for Simple, Two-state Proteins
2.1 Abstract
2.2 Introduction
2.3 Methods
2.4 Results
2.5 Discussion
2.6 Acknowledgements
Chapter 3 Does Secondary Structure Determine Tertiary Structure in Proteins?
3.1 Abstract
3.2 Introduction
3.3 Materials and Methods
3.3.1 Dynamic Programming
3.3.2 Substitution Matrices
3.3.3 Mesostate and Secondary Structure Assignment
3.3.4 SCOP benchmark test
3.4 Results and Discussion
3.4.1 Substitution Matrices
3.4.2 Benchmark Tests
3.4.3 Discussion
3.5 Acknowledgements
Chapter 4 Building native protein conformation from highly approximate backbone torsion angles
4.1 Abstract
4.2 Introduction
4.3 Methods
4.3.1 Fragment Library Construction
4.3.2 Fragment Replacement Criteria
4.3.3 Fragment Assembly by Monte Carlo Simulation
4.3.4 Energy Function
4.3.5 Clustering
4.3.6 Test Protein Set
4.4 Results
4.5 Discussion
4.6 Acknowledgements
Reference
Vita
Abbreviations
Polyproline II conformation
Relative contact order
Radius of gyration
Root mean square distance
Tight β-turns
Backbone phi torsion angle
Backbone psi torsion angle
List of Tables
Table 2.1 Predicted folding rates for four recently characterized proteins
Table 3.1 Programs used in SCOP benchmark tests
Table 3.2 Examples of Structural Similarities Identified by Meso_Align But Not by Other Methods
Table 4.1 Protein test set
Table 4.2 Backbone rmsd of the most stable conformation
Table 4.3 Topological clusters from each ensemble
List of Figures
Figure 1.1 Mesostate definition in dihedral angle space
Figure 2.2 Correlation between folding rate and DSSP secondary structure
Figure 2.3 Correlation between folding rates and secondary structure prediction
Figure 3.1 Substitution matrices
Figure 3.2 Specificity vs Sensitivity curve of SCOP family benchmark test
Figure 3.3 Specificity vs Sensitivity curve of SCOP superfamily benchmark test
Figure 3.4 Specificity vs Sensitivity curve of SCOP fold benchmark test
Figure 4.1 Flowchart
Figure 4.2 Distribution of Rg
Figure 4.3AB Superposition of simulation and native conformations
Figure 4.3CD Superposition of simulation and native conformations
Figure 4.3EF Superposition of simulation and native conformations
Figure 4.4A Distribution of energy potentials for 2GB1
Figure 4.4B Distribution of energy potentials for 1UBQ
Figure 4.4C Distribution of energy potentials for 1C9OA
Figure 4.4D Distribution of energy potentials for 1IFB
Figure 4.4E Distribution of energy potentials for 1VII
Figure 4.4F Distribution of energy potentials of 1R69
Chapter 1 Introduction
1.1 Protein structure
Proteins, one of the key macromolecules in living organisms, are involved in almost all aspects of biological activity. A protein cannot perform its normal biological function without folding into a specific three-dimensional structure, called the native conformation, although natively unfolded proteins may be an exception. The first protein structure to be solved by x-ray crystallography was myoglobin, and its conformation revealed that globular proteins are not repetitive structures like DNA; rather, they are compact objects with complex topologies. Kendrew solved the structure of myoglobin almost a half-century ago. Since that time, many more protein structures have been solved by x-ray diffraction and NMR, and the number of protein structures in the protein database increases exponentially with each passing year.
1.1.1 Hierarchical definition of structures
Protein molecules are linear polymers of amino acid residues, covalently joined via peptide bonds. This linear sequence of residues is called the primary structure. These residues self-organize into specific hydrogen-bonded spatial arrangements called secondary structures, which include α-helices, strands of β-sheet, and tight turns. Recent studies have shown that a significant population of polyproline II conformation (PPII) is also found in both folded and unfolded proteins. PPII is a sterically forced conformation for polyproline peptides in aqueous solution. Consecutive residues in PPII conformation form PPII helices, which are left-handed, all-trans extended helices with average backbone torsion angles of (φ,ψ) = (−75°,+145°). With exactly three residues per turn of helix, the PPII helix is more extended than the α-helix. PPII helical conformation is observed frequently in collagen, where three left-handed PPII helices intertwine to form a right-handed, coiled-coil collagen helix.
Despite its name, there is no restriction on the residue composition of a PPII helix; other residues besides proline can adopt this conformation. As discussed below, polyalanine has a marked propensity to form PPII helices at room temperature in water, and PPII conformation is observed frequently in the unfolded state of proteins. Additionally, PPII conformation is also observed frequently in folded globular proteins. Sreerama and Woody estimated that about 10% of individual amino acid residues in proteins are found in PPII conformation. Owing to the absence of intrachain backbone hydrogen bonds, PPII helices are important in binding and recognition motifs where ligand:protein recognition may involve hydrogen bonding with unsatisfied backbone hydrogen bond donors and acceptors in PPII helices. Studies have shown that PPII conformation is the common binding motif in both WW domains and SH3 domains.
Both sequential and non-sequential regions of the protein can interact to form compact, independently stable structural and/or functional domains. Many isolated domains can fold into their unique three-dimensional conformation independently. From an evolutionary point of view, domain swapping is a possible path for generating new proteins. Thus, an early step in protein structure analysis is usually domain decomposition because direct comparison of multi-domain proteins might lead to spurious results in structure clustering and classification. Additionally, domain decomposition is necessary when predicting protein structure by "threading" because homologs are retained at the domain level, not the protein level.
Structural domains are defined as compact structures in which there is a tendency for hydrophobic residues to be buried in the interior and hydrophilic residues to be exposed to polar solvent at the surface. The most authoritative database of existing protein domains is SCOP, which codifies and classifies domains by several criteria, some of them subjective. In view of the exponential increase of known protein structures, it is desirable to develop domain decomposition algorithms that can be run automatically and do not rely on human intervention.
Several algorithms, such as PUU, DETECTIVE, and DOMAK, have been developed for domain decomposition based on the operational premise that residues within the same domain will experience more intra-domain than inter-domain contacts. However, according to Jones et al., none of these automatic methods has an accuracy that exceeds 80%. Although 100% of the proteins tested could be successfully decomposed into domains when all three of these methods are in agreement, a successful consensus prediction was possible for only 52% of the dataset.
This result was later adopted for use in the CATH database, another codified collection of protein domains. In this case, proteins are decomposed into domains automatically when all of the preceding three methods are in agreement. Otherwise human judgment is employed.
Guo et al. developed a new algorithm using graph theory and neural networks in 2003, which predicts domains with an accuracy in excess of the 80% threshold as judged by comparison with manually curated SCOP domains. In another approach, Kundu et al. employed a Gaussian Network Model to recognize domains within proteins based on the rationale that residues within the same domain will move in concert. Given that none of the automatic methods can parse proteins into domains with 100% accuracy, manual decomposition remains indispensable, especially in the construction of domain databases.
In 1979, Rose discovered that proteins have a hierarchic architecture, with structural domains emerging naturally from such an organization. Regardless of whether they are regarded as structural, functional, or evolutionary units, domains can be hierarchically decomposed into smaller units of supersecondary and secondary structure. In this dissertation we adopt this reductive approach, focusing on the structure and topology of domains and their constituent parts rather than on whole proteins.
1.1.2 Parameters to define topology
Backbone topology is frequently invoked as an important criterion in the classification of domains. However, it is relatively difficult to quantify backbone topology in a strictly mathematical sense because diverse architectures are topologically equivalent. Consequently, several characteristic parameters have been used to describe a protein's backbone topology, including the order parameter, contact order, Cα-distance matrices, backbone torsion angles, and the mesostate sequence. We will discuss these, each in turn.
The order parameter, Q, is usually defined as the degree of similarity between a given conformation and the corresponding native structure. Q is often used as a convenient reaction coordinate when describing the progress of the folding reaction. Usually Q is normalized such that Q=0 represents the completely unfolded state and Q=1 represents the native conformation. When calculating Q, similarity can be assessed in either Cartesian space or torsion angle space.
Since the discovery of a linear correlation between the relative contact order (equation 1.1) and the folding rate constant for two-state proteins, contact order has been adopted as a topological signature. As further described below in Folding Models, Plaxco et al. hypothesized that the contact order is actually a proxy for topology, which is highly correlated with the folding rate constant.
The relative contact order, RCO, is defined as:
$$\mathrm{RCO} = \frac{1}{L \cdot N} \sum^{N} \Delta S_{ij} \qquad (1.1)$$
where $\Delta S_{ij}$ is the number of residues in the linear chain between the spatially contacting residue pair i and j, N is the overall number of contacting residue pairs, and L is the length of the protein. In other words, the relative contact order is the average residue separation between pairs of residues in physical contact, normalized by protein size.
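The definition of RCO, the sum of sequence separations over all contacting pairs divided by L times N, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: contacts are judged here from Cα coordinates with a hypothetical 8 Å cutoff, the separation is taken as |i − j|, and the function name and parameters are invented for this example rather than taken from the dissertation.

```python
import math


def relative_contact_order(ca_coords, cutoff=8.0, min_separation=1):
    """Illustrative RCO: mean sequence separation of contacting pairs / chain length.

    ca_coords      -- list of (x, y, z) C-alpha positions, one per residue
    cutoff         -- assumed contact distance in Angstroms (hypothetical choice)
    min_separation -- smallest |i - j| counted as a contact (hypothetical choice)
    """
    length = len(ca_coords)
    total_separation = 0
    n_contacts = 0
    for i in range(length):
        for j in range(i + min_separation, length):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                total_separation += j - i  # sequence separation of this contact
                n_contacts += 1
    if n_contacts == 0:
        return 0.0  # no contacts: report zero rather than divide by zero
    return total_separation / (length * n_contacts)


# Toy example: four residues on a straight line, 3.8 Angstroms apart.
toy_chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
toy_rco = relative_contact_order(toy_chain)
```

In the toy chain, five pairs lie within the cutoff and their separations sum to 7, so the value is 7 / (4 × 5). A different contact definition (for example, one based on all heavy atoms) would change N and therefore the result.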
The Cα-distance matrix is defined as a symmetric, square matrix of dimension N × N, where N is the number of residues in the protein. The matrix element in the ith row and jth column is the distance between the α-carbons of the ith and jth residues. This matrix has been widely applied in structural biology to represent backbone topology as, for example, by the structural comparison program DALI, in the FSSP domain database. The distance matrix is also frequently plotted as a distance map to facilitate topological comparison.
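As a minimal sketch of the distance matrix just described, the following Python function fills a symmetric N × N table from a list of α-carbon coordinates. The function name and the toy coordinates are assumptions made for illustration; a real application would read the coordinates from a structure file.

```python
import math


def ca_distance_matrix(ca_coords):
    """Return the symmetric N x N matrix of pairwise C-alpha distances."""
    n = len(ca_coords)
    matrix = [[0.0] * n for _ in range(n)]  # diagonal stays at zero
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(ca_coords[i], ca_coords[j])
            matrix[i][j] = d  # fill both halves so the matrix is symmetric
            matrix[j][i] = d
    return matrix


# Toy coordinates chosen so the distances are easy to check by hand.
toy_matrix = ca_distance_matrix([(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)])
```

Because the matrix depends only on internal distances, it is unchanged by rotation or translation of the structure, which is what makes it convenient for comparing backbone topologies.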
Backbone torsion angles, φ and ψ, can also be used to capture protein topology. To simplify the structural representation of a protein using backbone torsion angles, we discretize φ,ψ-space into 60° × 60° bins, called mesostates. Each of the 36 possible mesostates is represented by a letter of the alphabet (see Figure 1.1). In this way, overall protein topology can be represented simply but coarsely by a one-dimensional mesostate sequence, where each character represents the approximate position of the corresponding residue in φ,ψ-space. The utility of this approximation will be discussed later in this dissertation.
1.1.3 Classification of protein domain structures
The classification of protein domains is usually based on their structures. Given that homologous proteins have similar structures, domain classification can be a key step when recognizing evolutionary relationships. Additionally, domain classification is an indispensable step in structural prediction methods like homology modeling and threading. The essential step in structural classification is structural comparison, which provides a distance between any pair of domains in domain space, and from which a hierarchic tree can be drawn.
Given the fact that a protein's three-dimensional structure is a consequence of its primary sequence, a set of structural comparison methods based on sequence alignment has been developed. Examples of such methods include BLAST and FASTA, which use dynamic programming and the Smith-Waterman algorithm. Domains with a sufficiently high degree of aligned sequence similarity are likely homologs, and thus sequence similarity is inversely related to structural distance. Information of this sort can be obtained from sequence alignment, with similarity between residues quantified by similarity matrices such as BLOSUM and PAM.
The success of these pairwise sequence alignment programs in detecting evolutionary relationships diminishes markedly when sequence identity falls below ~30%, where proteins with similar backbone topologies can have dissimilar sequences. In response, newer sequence-to-profile algorithms and profile-to-profile algorithms have been developed that boost detection sensitivity for distantly related proteins having similar structures. Although newer programs that incorporate multiple sequence alignment instead of pairwise alignment do improve detection sensitivity substantially, structure-based methods still outperform sequence-based methods in both sensitivity and specificity.
To overcome some of the shortcomings of sequence alignment, many investigators have developed domain classification approaches based on optimal three-dimensional alignment. Domains in SCOP, the most authoritative domain database, are classified hierarchically into CLASS, FOLD, SUPERFAMILY, and FAMILY, based on structural and evolutionary information as mediated by the subjective judgment of the human classifier. Proteins are first clustered into families based on their sequence and then grouped into superfamilies based on known evolutionary relationships. Superfamilies or families are then further grouped into folds by backbone topology, resulting finally in five classes: (a) all alpha proteins, (b) all beta proteins, (c) proteins with interspersed alpha-helices and beta-strands, (d) proteins with segregated alpha-helices and beta-strands, and (e) proteins composed of domains in different folds or domains with no known homologues. The SCOP database is generally accepted as the gold standard because classification is performed subjectively by experts in this field.
CATH, another well-known domain database, classifies domains hierarchically into CLASS, ARCHITECTURE, TOPOLOGY, and HOMOLOGOUS SUPERFAMILY, analogous to SCOP. Although classification is asserted to be "semi-automatic", the key levels, i.e. the structure-determined levels (architecture and topology), are assessed manually by human experts.
With the exponential increase of deposited structures in the Protein Data Bank (PDB), fully automatic classification methods become increasingly necessary. The FSSP database, based solely on domain structures, uses the DALI algorithm to recognize structural neighbours. DALI measures the structural similarity between two proteins by matching their Cα-distance matrices (described in Parameters to define topology above). In contrast, the ENTREZ database identifies protein neighbors using VAST, an alignment algorithm that compares superimposed arrays of vectors between the secondary structure elements within each respective domain.
The structural comparison methods described above are based on either sequence or direct three-dimensional structure alignment. Recently, Honig and coworkers introduced a profile-to-profile alignment program that improves the performance for remote homolog detection by combining both primary and secondary structure information.3
Earlier, Przytycka and Rose had already proposed that secondary structure alone may be sufficient to recognize tertiary structure. In their study, 183 proteins with less than 30% aligned sequence identity were represented as linear strings of secondary structure elements, including turns and loops. Using a simple scoring matrix, conventional pairwise sequence comparisons between these strings were performed and used to construct a Przytycka-tree (P-tree), in which the distance between any two nodes is proportional to the difference in score between their aligned secondary structure strings. The P-tree is generated completely automatically, and it reflects the global secondary structure relationships among the proteins used to construct it: the closer the nodes, the greater the similarity of secondary structure among their corresponding proteins. Surprisingly, the straightforward P-tree was found to be largely in agreement with the SCOP tree, although the latter is a complex construct based on structure, evolutionary knowledge, and human judgment. This result lends support to the hypothesis that successful fold recognition can be derived solely from knowledge of secondary structure.
In my graduate work, I sought to extend this idea by quantifying the degree to which approximate backbone conformation can determine the protein fold (chapter 3). A dynamic programming algorithm was devised to compare domain structures by aligning their approximate backbone torsion angles, represented as mesostate sequences (described above).
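To make the idea of aligning mesostate sequences by dynamic programming concrete, the sketch below scores a global (Needleman-Wunsch-style) alignment of two strings. It is not the algorithm of chapter 3: the flat match, mismatch, and gap values are hypothetical stand-ins for the substitution matrices developed there, and the function name is invented for this illustration.

```python
def alignment_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Best global alignment score of two mesostate strings (illustrative scoring)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two characters
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    return score[-1][-1]


identical = alignment_score("OOOLL", "OOOLL")   # five matches
one_deletion = alignment_score("OOLL", "OLL")   # three matches and one gap
```

Each table cell depends only on its three neighbors, so the score is obtained in time proportional to the product of the two sequence lengths. Replacing the flat `match` and `mismatch` values with a lookup in a 36 × 36 substitution matrix would be the natural next step.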
The specific hypothesis being tested is that domains with similar mesostate sequences have similar structures. The converse proposition is certainly true: similar structures always have similar mesostate sequences. Consequently, validation of our hypothesis means that structure recognition could be accomplished successfully by mesostate alignment.
Having shown that domain structures can indeed be recognized successfully from their approximate backbone conformations, I extended this work by designing an algorithm to actually rebuild the native structure from its mesostate sequence (chapter 4). The process was implemented by fragment assembly Monte-Carlo simulation starting from an extended polyalanine chain, with mesostate-constrained backbone torsion angles.
1.1.4 Structure Prediction
All information needed to encode a protein's structure is stored in its sequence, but the question of how to extract and utilize this information for successful prediction remains unanswered. The hierarchic architecture of proteins suggests that the initial predictive step should be focused on secondary structure. Historically, bioinformatics research started with secondary structure prediction. The earliest research in this area concentrated on mapping a protein's sequence into a three-state secondary structure model, comprising α-helices, β-strands, and coils (i.e. all other structures). In this approach, the operating assumption was that different residues would exhibit differing propensities to populate distinct local regions in φ,ψ-space. Both the Chou-Fasman and GOR algorithms are of this type. The most recent version of the GOR algorithm, GOR IV, has a prediction accuracy of about 65% and remains one of the most popular secondary structure prediction programs in current use.
A new generation of secondary structure prediction methods, however, has been developed to take advantage of machine learning strategies, such as neural networks, hidden Markov models, and support vector machines. Such methods capture local sequence patterns by multiple sequence alignment and extend prediction categories beyond the conventional three states. For example, PHDsec, one of the most successful among this new generation of methods, uses a neural network together with evolutionary information from related sequences to predict secondary structures with greater than 70% accuracy.
The current accuracy of secondary structure predictions has reached a plateau near 80%. A systematic analysis of several popular secondary structure prediction algorithms, including PHDsec, PSIPRED, Jnet, and PREDATOR, using a large dataset of 2777 non-homologous proteins, concluded that prediction accuracy correlates negatively with residue contact order and that inclusion of long-range interactions would be needed for any further improvement.
The accuracy of tertiary structure prediction still lags that of secondary structure prediction. The most accurate tertiary structure prediction method is comparative modeling, which utilizes known homologs to the unknown target protein to build a suitable structural model under the assumption that homologous proteins share the same backbone topology. Sidechain torsions are assigned once the backbone conformation is established, based on other work showing that sidechain and backbone conformation are tightly coupled. This method works well in those cases where a target protein homolog exists in the protein databank and can be identified successfully.
In cases where a homolog is not available, the only recourse is to resort to ab initio prediction, in which tertiary structure is predicted solely from the sequence. Two well known ab initio algorithms, LINUS and ROSETTA, will be discussed.
LINUS simulations attempt to capture the native conformation from first principles using the Metropolis Monte-Carlo algorithm to search conformational space (torsion angle space). The energy scoring function includes only hard-sphere sterics, hydrogen bonds, contact energy, and solvation energy.
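The Metropolis criterion that underlies such Monte-Carlo searches can be sketched generically as follows. This is a textbook form of the acceptance rule, not the LINUS implementation: the function name, the use of a temperature expressed directly in energy units, and the optional random-number source are all assumptions made for this illustration.

```python
import math
import random


def metropolis_accept(delta_energy, temperature, rng=random):
    """Decide whether to accept a trial move under the Metropolis criterion.

    delta_energy -- energy of the trial conformation minus the current one
    temperature  -- thermal energy kT, in the same units as delta_energy (assumed)
    rng          -- object with a random() method returning a float in [0, 1)
    """
    if delta_energy <= 0:
        return True  # downhill or neutral moves are always accepted
    # Uphill moves are accepted with Boltzmann probability exp(-dE / kT).
    return rng.random() < math.exp(-delta_energy / temperature)


# Estimate the acceptance frequency for an uphill move of exactly 1 kT.
seeded = random.Random(0)
trials = 20000
accepted = sum(metropolis_accept(1.0, 1.0, seeded) for _ in range(trials))
acceptance_rate = accepted / trials
```

Over many trials the acceptance frequency for a move that raises the energy by 1 kT should approach exp(−1), roughly 0.37. In a torsion-angle search, each trial move would perturb backbone angles, the scoring function would supply `delta_energy`, and this rule would decide whether the new conformation replaces the old one.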
ROSETTA employs a fragment assembly method, again using Monte-Carlo simulation. In each step of the simulation, a three- or nine-residue database fragment is substituted in the target, a strategy that avoids most local collisions by adopting viable local fragments directly from existing PDB structures. Recently, Baker and coworkers built several models with surprisingly high accuracy (sometimes less than 1 Å root-mean-square difference from the native conformation) by employing several homologous sequences.
1.2 Protein folding problem
Anfinsen and coworkers have shown that the native conformation of a protein is uniquely determined solely by information encoded within its residue sequence. Although Anfinsen's results show that proteins adopt their functional three-dimensional structure spontaneously, the mechanism by which this process occurs has remained elusive. This famous folding problem has been formulated as the reversible transition between folded and unfolded states: unfolded ⇌ folded, and it can be further decomposed into two views: (1) the thermodynamic and macroscopic view or (2) the dynamic and microscopic view. Lacking experimental methods that can track the folding pathway of single macromolecules, the focus of most current experiments is on thermodynamic and kinetic studies of protein ensembles. Protein folding is a highly cooperative process, as shown by numerous thermodynamic experiments on the folding transition obtained under different chemical and physical conditions. Kinetic experiments on folding rates following perturbations such as a temperature jump have been performed on many proteins. Macroscopic variables, such as the free energy, entropy, and enthalpy changes, and even reaction rate constants, can be obtained from such experiments. In contrast, microscopic data cannot be extracted from existing experiments readily, and must usually be obtained from computer simulations and/or theory.
1.2.1 Thermodynamics of the folding process
What stabilizes the native conformation?
The folded state, regarded as the ensemble of near-native conformations, has some degree of flexibility, without which proteins could not perform their biological functions. Although folded conformations have lower energies than unfolded ones, globular proteins are only marginally stable at room temperature. No covalent bonds are made or broken in the folding process except for disulfide bonds, and the major contributions stabilizing the native conformation are thought to be from nonbonding potentials, including: (1) electrostatics, (2) van der Waals interactions, (3) hydrogen bonding, and (4) hydrophobic interactions. However, despite the gain in solvent entropy, the huge loss in configurational entropy that accompanies protein folding counteracts these favorable contributions, resulting in the marginal stability of the native conformation. Against this backdrop of marginal stability, each of these favorable energy terms is important because even small differences could change the direction of the folding transition. These four energy terms are now discussed in greater detail.
Charged particles give rise to electrostatic interactions, which are long-range. According to Coulomb's law, the electrostatic potential is inversely proportional to both charge separation and the dielectric constant. Although these interactions are long-range, the high dielectric constant of water (~80) screens them substantially. The protein interior, however, is much more hydrophobic than bulk water, with a low dielectric constant (~12-20). Consequently, ionizable groups inside the protein are energetically significant. In most globular proteins, the charged groups are localized on the surface, as anticipated given the large energy cost of burying uncompensated charges in the interior. At physiological pH, ionizable groups are not uniformly positive or negative. Rather, the protein surface is "bristling with both positive and negative charges" that facilitate solubility and stabilize the native conformation. At extremes of pH, below the pK of acidic groups or above the pK of basic groups, the resultant repulsion between like charges can denature proteins. Although ion pairing may contribute ~1-3 kcal/mol to protein stability, it is clear that ion pairing is not a dominant force in protein stability. The folded conformation shows little dependence on pH or salt, both of which could influence the electrostatic potential substantially. Additionally, structural studies of protein homologs show that ion pairs are not well conserved in evolution.
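The effect of dielectric screening described above can be illustrated with a minimal sketch of Coulomb's law for a single ion pair. The function name, the 4 Å separation, and the interior dielectric constant of 15 (one value within the ~12-20 range quoted above) are assumptions chosen for illustration; 332.06 kcal·Å/(mol·e²) is the usual Coulomb conversion factor for these units.

```python
# Coulomb conversion factor in kcal*Angstrom/(mol*e^2).
COULOMB_KCAL = 332.06


def coulomb_energy(q1, q2, r_angstrom, dielectric):
    """Interaction energy (kcal/mol) of two point charges given in units of e."""
    if r_angstrom <= 0 or dielectric <= 0:
        raise ValueError("separation and dielectric constant must be positive")
    # Energy is inversely proportional to both separation and dielectric constant.
    return COULOMB_KCAL * q1 * q2 / (dielectric * r_angstrom)


# Hypothetical ion pair: +1 and -1 charges, 4 Angstroms apart (assumed geometry).
in_water = coulomb_energy(+1, -1, 4.0, 80.0)    # bulk water, dielectric ~80
in_protein = coulomb_energy(+1, -1, 4.0, 15.0)  # assumed interior dielectric of 15

print(f"ion pair in water:   {in_water:.2f} kcal/mol")
print(f"ion pair in protein: {in_protein:.2f} kcal/mol")
```

Under these assumptions the same ion pair is several times more favorable in the low-dielectric interior than in bulk water, which is the sense in which buried ionizable groups are energetically significant.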
The van der Waals force is a weak, short-range interaction occurring between all atoms, both polar and non-polar. It can be decomposed into two components: an attractive part arising from the interaction between transient dipoles, and a repulsive part that arises when the electron clouds of two non-bonded atoms overlap. The potential is frequently approximated by the Lennard-Jones function

$$E(r_{ij}) = \frac{A}{r_{ij}^{12}} - \frac{B}{r_{ij}^{6}} \qquad (1.2)$$

where $r_{ij}$ is the distance between two atoms, i and j. Van der Waals forces are expected to stabilize the native conformation rather than the unfolded state owing to short-range interactions within the close-packed protein core.
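As a rough illustration of the Lennard-Jones form in Equation (1.2), a repulsive A/r^12 term minus an attractive B/r^6 term, the sketch below evaluates the potential and locates its minimum analytically. The function names and the unit-free values A = 1 and B = 2 are assumptions for illustration, not parameters from this work.

```python
def lennard_jones(r, A, B):
    """Lennard-Jones energy: repulsive A/r^12 term minus attractive B/r^6 term."""
    if r <= 0:
        raise ValueError("distance must be positive")
    return A / r**12 - B / r**6


def lj_minimum(A, B):
    """Distance and energy at the minimum, obtained by setting dE/dr = 0."""
    # dE/dr = -12A/r^13 + 6B/r^7 = 0  =>  r^6 = 2A/B
    r_min = (2.0 * A / B) ** (1.0 / 6.0)
    e_min = -(B**2) / (4.0 * A)
    return r_min, e_min


# Hypothetical, unit-free parameters chosen so the minimum sits at r = 1.
r_min, e_min = lj_minimum(A=1.0, B=2.0)
for r in (0.9, 1.0, 1.5, 3.0):
    print(f"r = {r:.1f}  E = {lennard_jones(r, 1.0, 2.0):+.4f}")
```

The steep rise below the minimum and the rapid decay beyond it show why this interaction matters mainly between closely packed atoms.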
A hydrogen bond is a short-range interaction between a polarized hydrogen bond donor, D-H, where D is the hydrogen-donating atom, and the polarized nonbonding orbitals of an acceptor atom, A. Although hydrogen bonds have some degree of covalent bond character, they are viewed primarily as electrostatic dipole-dipole interactions. The optimum distance separating the donor and the acceptor ranges from 0.26 to 0.30 nm, with an almost collinear geometric arrangement of the three participating atoms: donor (D), hydrogen (H), and acceptor (A). It was Pauling who first proposed that hydrogen bonds play a significant role in macromolecular folding and stability. In seminal articles published in 1951, Pauling et al. hypothesized that hydrogen bonds provide the driving force to form α-helices and β-sheets, and estimated hydrogen bond strength to range between 2 and 10 kcal/mol. Soon after, Schellman proposed that an intrachain hydrogen bond is energetically favorable relative to a hydrogen bond with water by ~1.5 kcal/mol, based on the measured formation of urea dimers in solution. Experiments on helix-coil transitions and β-sheet formation provided additional support for Pauling's hypothesis that hydrogen bonds are the principal driving force in such processes.
Intramolecular hydrogen bond donors and acceptors are abundant in folded proteins. On average, there are 1.1 intramolecular hydrogen bonds per residue, which compensate for lost intra- and intermolecular hydrogen bonds in the unfolded protein. Accordingly, hydrogen bonds were considered to be a key factor in stabilizing the native conformation. However, Kauzmann questioned this conclusion, arguing that the energy of intrachain hydrogen bonds in the folded state would not differ significantly from the energy of corresponding peptide:water hydrogen bonds in the unfolded state. Reasoning from the strengths of hydrogen bond interactions in model compounds, he hypothesized that the hydrophobic effect, rather than hydrogen bonds and van der Waals interactions, would be the principal driving force in protein stability. Kauzmann's proposal was bolstered by several later studies. For example, Klotz and Franzen measured hydrogen bond formation in N-methylacetamide (NMA) and found that in water, the enthalpy of NMA dimerization is approximately zero, which implies that hydrogen bonds are not stabilizing once the entropic cost of intrachain hydrogen bond formation is included. Similar experiments on another small molecule, ε-caprolactam, reached the same conclusion. These experiments persuaded the field that hydrophobic interactions, rather than hydrogen bonds, are the dominant source of protein stabilization, although there remained some evidence to the contrary from studies of cyclic dipeptides and diketopiperazines.
In more recent work, the pendulum seems to be swinging back toward Pauling's direction. Pace and co-workers performed numerous site-directed mutagenesis studies and concluded that the enthalpic stabilization provided by hydrogen bonds is about 1.6 kcal/mol, larger than the hydrophobic effect. After inclusion of an entropy correction term, a hydrogen bond still stabilizes the native conformation by 0.6 kcal/mol; summed over the entire protein, this is comparable to the magnitude of the hydrophobic effect. Makhatadze and Privalov reached a similar conclusion from different proteins and model compound data. Analyses of protein x-ray structures confirm the importance of hydrogen bonds by showing that most buried groups in globular proteins are hydrogen-bonded. Fleming and Rose recently proposed that all peptide hydrogen bond donors or acceptors should be satisfied either by intrachain partners or by water molecules. Their conclusion was supported by careful examination of high resolution crystal structures, where it was found that apparent exceptions can be rationalized.
Nowadays, hydrogen bonds are accepted as key factors that influence protein folding. Although the question of whether they are the dominant contributor to folding stability remains controversial, it is widely accepted that hydrogen bonds are important in the specificity of protein folding. Along these lines, DePristo et al. implemented an explicit hydrogen bond potential in their x-ray refinement program and successfully generated an ensemble of conformations compatible with experimental diffraction data. In addition, Baker and colleagues successfully built high resolution protein structures using de novo methods by emphasizing hydrogen bonds and van der Waals interactions.
Kauzmann coined the term "hydrophobic bond" in connection with his famous "oil drop" model. A hydrophobic bond is not a direct physical interaction between atoms; rather, it describes the tendency of non-polar residues to reduce their solvent-accessible surface by clustering together, and in this sense it is not a conventional chemical bond.
Kauzmann's proposition that the hydrophobic interaction is the principal driving force for protein folding was later supported by at least four sets of experimental observations. (1) It was observed that non-polar solvents can denature proteins. (2) Kauzmann's proposal slightly predated successful protein x-ray crystallography. Soon after, as x-ray crystal structures became available, it could be seen that apolar residues are preferentially buried in the molecular interior and, conversely, non-hydrogen bonded polar residues are preferentially solvent-exposed. (3) Calorimetry experiments revealed the similarity between the temperature dependence of the free energy change upon protein folding and that of the transfer free energy of non-polar molecules from water into non-polar solvent. (4) Proteins were also found to unfold at low temperature, a process called cold denaturation.
In the initial model proposed by Kauzmann, the hydrophobic effect was measured by the transfer free energy of a non-polar compound from organic liquid into water. Ben-Naim and coworkers modified this simple model by introducing backbone atoms, arguing that non-polar sidechains connected to backbones are more appropriate models when studying the protein hydrophobic effect because the hydrophilic component also plays an important role in the free energy of transfer. Thus, solvation energy, rather than the traditional hydrophobic interaction, should be used to characterize this effect. Although Pace and coworkers have argued persuasively that the hydrophobic interaction and hydrogen bonds make comparable contributions to protein stability, it is still generally accepted that hydrophobic interactions are the principal driving force in protein stability. Of course, this view may change in time.
Entropic terms counteract these four weak stabilizing forces and destabilize the native conformation. The entropy change during folding can be partitioned into two parts, configurational entropy loss and solvent entropy gain. Solvent entropy gain is included in the hydrophobic interaction. The configurational entropy loss was assumed to be large because the denatured ensemble has been regarded as a featureless statistical coil. Recent studies, however, indicate that the unfolded state is not actually a featureless statistical coil with an astronomical number of available states; instead, it retains some degree of order. Thus, the entropy loss on folding will be smaller than previously expected. This topic will be discussed later in the Introduction (Unfolded State).
1.2.2 Dynamic view of folding
Levinthal paradox
Perhaps the most famous concept in protein folding theory is the "Levinthal paradox". In this back-of-the-envelope calculation, each residue in a protein molecule can adopt several different conformations, consistent with the two degrees of backbone freedom, φ and ψ, for each amino acid residue. For a polypeptide chain of 100 residues, the total number of conceivable conformations is then an astronomical number. Even if each residue were limited to only two states (the alpha and beta regions on a Ramachandran plot), the total number of possible conformations is still $2^{100} \approx 10^{30}$. Taking the rate of single-bond rotations into account, the native state could not be attained in a biologically relevant time frame by random search. In fact, the typical folding time for a protein molecule is in the range of microseconds to seconds. For Levinthal, this was no paradox at all but rather a demonstration that protein folding is not a random search. Instead, there must be a specific pathway or mechanism that guides the protein from the unfolded ensemble to its native conformation.
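The back-of-the-envelope estimate above can be reproduced with a short sketch. The two-state, 100-residue case comes from the text; the sampling rate of 10^13 conformations per second is an assumed, illustrative figure for fast single-bond rotation, and the function name is hypothetical.

```python
import math


def levinthal_search_time(n_residues, states_per_residue, samples_per_second):
    """Return (number of conformations, seconds needed to visit each one once)."""
    n_conformations = states_per_residue ** n_residues
    return n_conformations, n_conformations / samples_per_second


# 100 residues with two states each; 1e13 samples per second is an assumption.
n_conf, seconds = levinthal_search_time(100, 2, 1e13)
years = seconds / (365.25 * 24 * 3600)

print(f"conformations: about 10^{math.log10(n_conf):.1f}")
print(f"exhaustive search: about 10^{math.log10(years):.1f} years")
```

Even with this generous sampling rate, an exhaustive search would take billions of years, in contrast to observed folding times of microseconds to seconds.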
Energy Landscape Theory
To solve the Levinthal paradox, Wolynes and colleagues considered folding from the vantage point of the protein's energy surface. Using a minimalist model, they studied the behavior of random heteropolymers using statistical mechanics and computer simulations and found that the energy landscape looks like a smooth funnel. Later studies extended this work to proteins, in which the energy landscape is also funnel-shaped, but with a more rugged surface. In the protein funnel, a small set of conformations at the bottom corresponds to native conformers.
Seen from this vantage point, the resolution of the Levinthal paradox is to include those cooperative, energetically favorable interactions that accumulate as proteins fold and drag the molecules toward native conformations. As depicted in the Levinthal paradox, the energy landscape is analogous to a flat golf course, with a single hole representing the native conformation. The probability that a molecule would find such a hole via random search is negligibly small. In sharp contrast, protein energy landscape theory predicts a funnel-shaped energy surface, in which an unfolded molecule is dragged to its native conformation. The protein energy surface is rugged because polypeptide chains can sample multiple conformations during folding, some of which are non-native. Indeed, some low troughs on the rough surface may correspond to major off-pathway misfolded species found in experiments. The degree of roughness also affects the folding rate; the funnel landscape for small, topologically simple proteins is comparatively smooth, with correspondingly faster folding rates. Although this theory rationalizes the Levinthal paradox, it has not been validated by experiment and does not address the folding pathway in microscopic detail.
The next section discusses some other theoretical folding models that are more closely tied to experimental work.
Folding Models
In the Karplus and Weaver diffusion-collision model, proposed in 1976, the initial folding phase involves the rapid equilibrium formation and decomposition of fluctuating quasiparticles, called microdomains. These can be hydrophobic clusters or even elements of secondary structure. These microdomains diffuse and collide, sometimes productively, leading to coalescence into larger intermediates. The native conformation is then produced by iterative coalescence of such intermediates. The Karplus and Weaver model does not exclude parallel folding pathways, and it assumes that folding is a solvent-dependent process, in accord with later experiments.
The Kim and Baldwin framework model, proposed six years after the diffusion-collision model, emphasized hierarchic folding. In the framework model, simple, local structures form first and then assemble into more complex structures by longer-range interactions. The model derives its name from the idea that elements of hydrogen-bonded secondary structure form early and act as a framework for subsequent tertiary structure formation. The framework model and the diffusion-collision model are not mutually exclusive. If the microdomains of the diffusion-collision model correspond to hydrogen-bonded secondary structures, then the two models are essentially identical. Hierarchic folding is implicit in the diffusion-collision model, and appropriately so, as shown later by several experiments and computer simulations. Together, the diffusion-collision model and the framework model provide a way out of the Levinthal paradox by uncoupling the formation of secondary structure from that of tertiary structure and restricting the search process to the diffusion and collision of secondary structure elements.
In 1985, Dill proposed a hydrophobic collapse model based on the idea that collapse precedes formation of specific structures. In refolding experiments, solvent conditions change from "good" solvent, which favors unfolding, to "poor" solvent, which favors folding. In poor solvent, proteins collapse around their hydrophobic sidechains to minimize solvent access to hydrophobic residues. Collapse to a smaller volume reduces the number of accessible conformations, thereby partially resolving the Levinthal paradox.
The nucleation-condensation model, proposed by Fersht in 1997, was motivated by his Φ-value analysis of several small proteins, e.g., CI-2. The Φ-value is defined as the degree to which a residue of interest is native-like at the protein's transition state. Experimentally, the Φ-value is obtained by measuring how much a mutation affects the transition state relative to the extent to which it affects the native state. Based on Φ-value analysis, secondary and tertiary structure form in parallel in CI-2, and several long-range interactions are important in forming the transition state topology. In nucleation-condensation, the most stable residues in the transition state are considered to be a nucleus around which condensation occurs, followed by rapid folding to the native conformation. The rate-limiting step in this model is the formation of nuclei, so the search process is not a random one. The model is assumed to apply only to the folding of small, two-state proteins. For larger proteins, it is assumed that smaller modules form first by nucleation-condensation and then dock hierarchically.
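The experimental definition of the Φ-value given above can be sketched as a ratio of two free energy changes. This is a minimal illustration that assumes the commonly used kinetic estimate of the transition-state term, RT ln(k_wt/k_mut), computed from wild-type and mutant folding rate constants; the function name, the rate constants, and the 1 kcal/mol destabilization are hypothetical values, not data from this work.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def phi_value(kf_wild_type, kf_mutant, ddG_native, temperature=298.0):
    """Phi = change in transition-state stability / change in native-state stability.

    kf_wild_type, kf_mutant: folding rate constants in the same units.
    ddG_native: destabilization of the native state relative to the unfolded
    state caused by the mutation, in kcal/mol.
    """
    if ddG_native == 0:
        raise ValueError("Phi is undefined if the mutation leaves native stability unchanged")
    # Assumed kinetic estimate of how much the mutation destabilizes the transition state.
    ddG_transition = R_KCAL * temperature * math.log(kf_wild_type / kf_mutant)
    return ddG_transition / ddG_native


# Hypothetical mutant destabilized by 1.0 kcal/mol in the native state.
phi_unstructured = phi_value(50.0, 50.0, 1.0)  # folding rate unchanged
slowed_rate = 50.0 * math.exp(-1.0 / (R_KCAL * 298.0))
phi_native_like = phi_value(50.0, slowed_rate, 1.0)  # full penalty felt at the transition state

print(f"Phi (rate unchanged): {phi_unstructured:.2f}")
print(f"Phi (rate slowed):    {phi_native_like:.2f}")
```

In this sketch a Φ-value near 0 means the mutated site is not yet structured at the transition state, while a value near 1 means it is already native-like there.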
Two years later, Debe et al. proposed the topomer-sampling model. A topomer is defined as a cluster of conformations with the same degree of compaction, all of which are interchangeable under local backbone coordinate transformations that maintain the backbone covalent bonding. To test this idea, they exhaustively enumerated all topomers for a polypeptide chain of length N residues and found that a 100-residue protein could diffusively search the entire set and find the native topomer within a biologically realistic time frame (100 ms). Once found, the molecule is postulated to fold to its native conformation by rapid condensation. Plaxco and coworkers renamed this model the topomer search model and used it to successfully predict the folding rate of two-state proteins, operating on the assumption that relative folding rates are proportional to the probability of achieving the native topomer which, in turn, depends on the topological complexity of the protein in question.
We proposed our own model for two-state protein folding, a combination of the framework and topomer search models that proceeds in three distinct stages. In the first stage, marginally stable elements of secondary structure are formed. Collapse occurs in the second stage, which involves a diffusive search for the native topomer. The rate constant of this second stage is postulated to correlate negatively with the length of interconnecting loops between secondary structure elements formed in the first stage. Finally, after reaching the native topomer, there is a third stage in which condensation to the native fold occurs. For two-state proteins lacking cis-trans isomerization of prolines, the rate-limiting step would occur in the second stage. Consequently, the measured folding rate should correlate negatively with the length of linking loops, the larger the