
Doctoral dissertation: Adaptation and Use of Four-Body Statistical Potential to Examine Thermodynamic Properties of Proteins


DOCUMENT INFORMATION

Basic information

Title: Adaptation and Use of Four-Body Statistical Potential to Examine Thermodynamic Properties of Proteins
Author: Gregory M. Reck
Advisor: Iosif I. Vaisman, Associate Professor
Institution: George Mason University
Field: Bioinformatics
Type: Dissertation
Year: 2006
City: Fairfax, VA
Pages: 204
File size: 19.65 MB

Structure

  • 1.1 Computational Approaches to Folding and Stability Analysis
  • 1.2 Experimental Measurement
  • 1.3 Specific Aims and Dissertation Format
  • 2.1 GV
    • 2.3.2 Computational Hydration of the Protein Set
    • 2.3.3 Delaunay Tessellation
    • 2.3.4 DT Simplex Face Match Residue Classification
    • 2.3.5 Water Coordination Number Ratio
    • 2.3.6 DT Water Group Parameter and Residue Classification
    • 2.3.7 Accessible Surface Area
    • 2.3.8 Residue Depth
    • 2.3.9 Circular Variance
  • 2.4 Results and Discussion
    • 2.4.1 Examination of the Protein Hydrations
    • 2.4.2 Comparison of Tessellation Classification Methods
    • 2.4.3 Relationship of Simplex Face Matching Method to Location Parameters
    • 2.4.4 Relationship of Water Group Method to Location Parameters
    • 2.4.5 Application of the Residue Classification Methods to a Specific Protein
    • 2.4.6 Examination of the Water Group Parameter and the Water CNR
    • 2.4.7 Comparison of Water Group Parameter with Hydrophobicity Scales
  • 2.5 Conclusion
  • 3.2 Introduction
  • 3.3 Methods
  • 3.4 Results and Discussion
    • 3.4.1 Characteristics of the Potential Functions
    • 3.4.2 Decoy Discrimination Using the Tessellation Potential Functions
  • 3.5 Conclusions
  • 4.1 Abstract
  • 4.2 Introduction
  • 4.3 Methods
    • 4.3.1 Protein Datasets
    • 4.3.2 Tessellation
    • 4.3.3 Derivation of Statistical Potentials
    • 4.3.4 Application of Statistical Potential Functions to Target Proteins
    • 4.3.5 Machine Learning Tools
    • 4.3.6 Statistical Analysis
    • 4.4.1 Application of Tessellation Potential to the Study Proteins
    • 4.4.2 Comparison of Statistical Potential Strategies (CA, WG and SP)
    • 4.4.5 Examination of Transthyretin Mutant Residual Profiles for Amyloid Signal
  • 5.1 Residual Profile Searches
  • 5.2 Exploration of Amyloid Mutants with Machine Learning Tools
  • 5.3 Comparison of Water Group Parameter with Hydrophobicity Scales

Content


1.1 Computational Approaches to Folding and Stability Analysis

Biological proteins in the cellular environment have the remarkable ability to rapidly, reliably and repeatedly fold into exactly the same conformation [17]. However, the elegance and ease of nature’s processes belie the complexity of the underlying mechanisms. While the molecular physics principles that appear to guide the folding process are largely known, rigorously applying them to model even a modest-sized protein for barely a fraction of a second has proven very difficult and expensive. Penetrating the dynamics associated with protein behavior such as folding or instability has been even more elusive. Nonetheless, major advances in computational methods and computing capacity have enabled great strides in understanding protein folding. [...] alone, energy-based methods that model the interaction energies influencing the molecular behavior, and knowledge-based methods that utilize patterns or rules derived from databases of measured protein characteristics to model and predict behavior. Considerable progress has been made in ab initio methods, but the enormity of the conformation search space has been a limitation [18]. Since high-resolution protein structural data are growing rapidly and include many proteins of interest in stability studies, energy-based molecular dynamics simulations and knowledge-based methods have seen extensive application.

Energy-based methods comprise a range of methods that seek to minimize protein energy levels to find folding solutions, or that utilize computed energies to simulate protein molecular behavior. Molecular dynamics (MD) simulations employ detailed physics-based force-field models to represent the forces acting on each atom in a protein and then successively update the positions and velocities of each atom at extremely small time intervals [19,20]. With reasonably accurate expressions of the various force components, MD models can provide valuable insight into the unfolding stability problem [21-24]. The referenced efforts included a perturbation approach with an MD simulation to determine the unfolding free energy changes associated with individual site-specific mutations of trypsin, T4 lysozyme and barnase. This study derived the unfolding free energy by calculating the energy changes needed to transform the folded and unfolded states from wild type to mutant, and showed good agreement with measured [...] time steps typically used to secure accuracy and computational stability make them more amenable to investigations of short-term folding and interactions. Recent reports have documented progress using MD to simulate transition states or aggregation for precursors of several of the amyloid diseases [25-28]. MD studies have also suggested that certain transition states may be preferred and may represent common intermediate conformations for small protein or polypeptide systems [25].

Knowledge-based potentials are another category of tools frequently used as a scoring function to evaluate protein folding or to predict protein stability changes [29,30].

In these applications, the tools do not explicitly model the force-field interactions, but instead rely on statistical potential functions derived from databases of high-resolution protein structural information to implicitly capture force-field information. These models employ Boltzmann’s principle that frequently observed states correspond to low-energy states, and compute statistical potentials based on the frequency of occurrence of residue contacts in a large representative database. Typically, pairwise potentials are developed for residue pairs with respect to a conformation parameter such as separation distance or torsion angle [31]. This approach has shown success in modeling some effects that are difficult to address, such as hydrophobic or solvation effects [18,32,33]. These methods provide a score that can be used (a) to evaluate sequences “threaded” onto structural templates [30,32], or (b) to correlate with protein characteristics such as the effect of mutations on stability [34]. [...] but has the appeal of a much more rapid, if less exact, approach. Initial efforts to apply database-derived potentials examined the use of residue contact potentials [35], secondary structure potentials based on dihedral angle [36], residue accessible surface area (for hydrophobic residues) [37], and residue interaction potentials based on separation distance [29]. The study using residue interaction potentials achieved correlation coefficients of 0.91 for 19 amino acid replacements at a single alanine position, and 0.66 for 19 replacements at a single serine position [29]. A more recent study used both a torsion-based potential and a distance-based potential with variations in weighting based on solvent accessibility of the mutated residue [34]. Correlation coefficients of 0.80 and 0.71 were achieved for buried and partially buried residues, respectively. Mutations in surface residues with > 50% solvent accessibility yielded a correlation coefficient of 0.87 using the torsion potential alone. For proteins with mutations in residues with solvent accessibility of 40 to 50%, the results were quite variable depending on the protein and the relative dominance of torsion vs. distance. The maximum correlation coefficient in this range was 0.55. The same investigators also examined and compared several alternative strategies for deriving statistical potentials from protein datasets in applications of stability and folding prediction [31].

While the use of knowledge-based potentials as mean-force potentials has shown success in these applications, there have been questions regarding the ability to accurately represent some free energy terms. For example, components of free energy relating to residue contacts can be analyzed statistically as independent events without consideration of the specific environment of each [41]. However, the statistical potentials derived from pairwise potentials may relate directly to mean force for considering the effect of some changes or mutations made at a specific location within a protein [39].

The tessellation-based nearest-neighbor tool used in this study differs in several ways from the mean-force potentials described above. Instead of using an arbitrary distance criterion to identify contact between pairs of residues, the method defines the neighbors of a residue through a unique geometric construction in groups of four, so they convey a higher-order relationship. Also, the method does not develop an explicit mean-force potential, but develops a score based on the propensity of residues to appear together, drawn from a large representative sample of proteins. Specifically, the potential function is a statistical potential that represents the log likelihood of finding a quadruplet of four residues in a set of representative proteins compared to a random assortment of residues in quadruplets (the null assumption). If the technique is applied to the neighboring residues surrounding a particular residue in a protein, then the resulting likelihood score can be interpreted as a measure of the compatibility of that residue with its local environment.
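The log-likelihood score described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the dissertation's implementation: it assumes the score takes the form log(observed frequency / expected frequency), that the null model is a multinomial built from overall residue frequencies, and that the logarithm is base 10. The function name and the toy input are hypothetical.

```python
import math
from collections import Counter
from itertools import chain


def four_body_potential(quadruplets):
    """Map each observed residue quadruplet to log10(observed / expected).

    `quadruplets` is an iterable of 4-tuples of residue labels, one per
    tessellation simplex. Quadruplets never observed are not scored.
    """
    # Vertex order within a simplex carries no meaning, so sort the labels.
    quads = [tuple(sorted(q)) for q in quadruplets]
    quad_counts = Counter(quads)
    n_quads = len(quads)

    # Overall residue frequencies, counted over all simplex vertices.
    residue_counts = Counter(chain.from_iterable(quads))
    n_vertices = 4 * n_quads

    scores = {}
    for quad, count in quad_counts.items():
        observed = count / n_quads
        # Assumed null model: number of distinct orderings of the
        # quadruplet, 4! / prod(t_i!), times the residue frequencies.
        multiplicity = math.factorial(4)
        for repeats in Counter(quad).values():
            multiplicity //= math.factorial(repeats)
        expected = float(multiplicity)
        for residue in quad:
            expected *= residue_counts[residue] / n_vertices
        scores[quad] = math.log10(observed / expected)
    return scores
```

A positive score means the quadruplet occurs more often than the assumed random assortment predicts; a negative score means it occurs less often.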

Some attempts have been made to correlate tessellation-derived four-body potentials with protein stability. A recent effort focused specifically on single-point mutagenesis of hydrophobic core residues [42] of 5 well-characterized proteins. Individual correlation coefficients for the proteins ranged from 0.70 to 0.94. A [...] of the sequence length and the reciprocal of the mean potential score, that partially corrected for protein-specific effects and improved the correlation coefficient to 0.83, consistent with the pairwise potential described above.

Thus, correlation of database-derived potentials with folding stability as an empirical predictive tool for the effect of mutations has met with mixed results. The correlations with buried or partially buried mutations, or fully solvent-accessible residues, have been good for single-point mutations. Correlations with mutations of partially accessible surface residues have been less successful and inconsistent, although torsion-based potentials may be somewhat better than distance-based potentials. These results indicate that additional parameters such as accessible surface area or short-range interactions are important to achieve good predictability. The application of four-body contact potentials may address some of these issues if the method is adapted to better represent the protein surface interactions.

1.2 Experimental Measurement

Most bioinformatics efforts ultimately rely on experimental data that are time consuming, laborious and expensive to acquire, and this effort is no exception. Two general types of data were needed for this study: (a) a large number of high quality protein structural files, and (b) measurements of the change in stability associated with a large number of individual mutations in a small number of well characterized proteins. [...] through high resolution X-ray crystallography coupled with modeling and validation efforts. This need is adequately provided by the well documented and curated Protein Data Bank (PDB) [43]. PDB search facilities are invaluable in identifying the set of proteins that meets the requirements for inclusion in the reference set.

The second requirement, for stability data, was largely met through a database service as well. The ProTherm database [44] is available online and consists of protein mutant stability data collected from a number of sources. All sources are well documented, and in most cases the referenced reports were accessed in order to fully understand the nature of the data. A literature search for additional data did not uncover a significant number of additional mutant stability data points. Stability data are typically acquired through either thermal unfolding of the protein or subjecting the protein to a denaturant such as urea or guanidinium chloride, and employ a variety of techniques for measuring and monitoring thermodynamic properties and conditions. Fortunately, most of the data for each of the proteins in the study were taken at a limited number of laboratories, minimizing variations in procedures between groups, and examinations of duplicate data taken by different groups revealed very minor differences. Stability may be expressed in terms of the change in free energy of unfolding of the protein, ΔG (kcal/mol), or in terms of the temperature of transition (Tm) of the protein from its wild type (wt) structure to a thermally denatured structure [45]. Since most of the data are associated with changes in stability due to mutations of the wt protein, the data may be presented as the stability difference between the mutant and the wt forms of the protein, ΔΔG or ΔT. Many of the data sets were generated in support of a protein structural study that employed site-directed mutagenesis to create mutants at specific sites of interest.

1.3 Specific Aims and Dissertation Format

The primary aim of this effort has been to adapt and extend the tessellation-based statistical potential strategy for studies of protein stability and aggregation. Tessellation-derived contact potentials have been used effectively to investigate several aspects of protein structure and function [46-49]. However, attempts to utilize the method for analysis of protein stability and protein behavior that involves surface features or interactions have not been as productive. A possible explanation is that the formation of geometric cells as part of the tessellation process is incomplete at the boundary points that represent amino acid residues located on the surface of the protein [50]. As a result, the number and characteristics of the tessellation-derived elements used in the computation of contact potential are different for surface residues. A second consideration is that the environment surrounding the protein should be explicitly included in the model [51]. For most proteins the environment will be water. The protein molecular environment should be included both in the tessellation analysis and in the statistical development of the contact potential. These shortcomings are a particular concern if the method is to be applied to analyses of protein stability, surface features such as the effect of surface mutations, or surface functions such as analyses of protein-protein interactions.

The effort has progressed in three stages: (1) adaptation of the computations to incorporate environmental water, (2) development of hydrated potential functions, and (3) application of the extended method to stability studies of well documented proteins.

In the first stage, a technique was adopted for hydrating proteins and then including the water of hydration in the tessellation analysis. A reference set of 1375 proteins was hydrated, and the characteristics of the hydration layer and its influence on subsequent tessellation were investigated. A novel strategy for classifying surface residues based on tessellation was developed and compared with other topological techniques. These results were prepared and submitted for publication and are included as chapter 2 with only minor changes.

In the second stage, several strategies were identified for computing the statistical potential function for the reference set of hydrated proteins. The strategies were applied and the resulting potential functions were compared based on their ability to discriminate native protein conformations from large sets of decoys. This work was also prepared for publication and is included as chapter 3.

The focus of the third portion of the effort was to apply the best performing potential strategies from the previous work to several sets of protein mutants with well documented stability characteristics. These proteins had been the subject of extensive mutagenesis studies of protein stability and folding. The mutant potential scores were correlated with the reported stability data, and several machine learning tools were employed to examine signal content in the mutant potential profiles.
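Correlating mutant potential scores with reported stability data amounts to computing a correlation coefficient over paired values. The sketch below uses the Pearson coefficient as an assumption, since the text here does not name the statistic; the function name and any input values are hypothetical and are not taken from the dissertation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```

In this setting `xs` would hold the change in potential score for each mutant and `ys` the matching measured stability change.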

A new topological strategy is presented for characterizing the relationship between a globular protein and the surrounding water environment, using the results of Delaunay tessellation of computationally hydrated proteins. Topological parameters are especially informative in developing knowledge-based methods for analyzing and predicting protein characteristics, such as stability and functionality. The new parameter is based on the extent of inclusion of simulated water molecules into the Delaunay tetrahedra surrounding each residue. A non-redundant set of 1321 single-chain proteins has been computationally hydrated and is used to evaluate the method and compare the technique with other previously established methods for describing solvent interactions. The ability of the hydration approach to place water in crevices and internal cavities is also assessed. The new parameter can be employed in structural genomics methods that use Delaunay tessellation, and should be useful in considering hydrophobic effects.

Knowledge-based methods have been advanced as an alternative to more computationally intensive physics-based simulations for the study of protein stability and folding behavior [34,52-55]. Knowledge-based approaches typically employ empirical information derived from high resolution experimentally-determined protein structures such as those archived in the Protein Data Bank (PDB) [43]. These methods may include correlations of specific protein characteristics with key structural or topological parameters, or statistical analyses of the frequencies of spatial relationships (including contacts) for various classes of atoms or residues. The selection of appropriate correlating parameters or classification methods that capture the desired phenomena is a critical element in establishing knowledge-based strategies.

Since the role of solvent water is important in folding processes, the surface area of an atom or residue that is accessible to solvent is frequently used as a correlating parameter [34,54,56,57]. Accessible surface area (ASA) has been defined as the area traced by a probe of specified radius rolling over the exterior van der Waals surface of the protein [58]. If the probe radius is set equal to the solvent radius (typically 1.4 Å for water), then the surface area becomes the solvent ASA. The extent of burial of a residue can be determined by comparing the ASA for the residue in the protein to the ASA for the residue in an extended Ala-X-Ala tripeptide [58].
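The burial comparison above reduces to a ratio of two areas. The following sketch only illustrates that ratio; the function name is hypothetical, and the reference tripeptide area must be supplied by the caller because no reference values are given in this text.

```python
def relative_asa(asa_in_protein, asa_in_tripeptide):
    """Fraction of a residue's reference ASA that remains exposed.

    asa_in_protein: ASA of the residue within the folded protein.
    asa_in_tripeptide: ASA of the same residue type X in an extended
        Ala-X-Ala tripeptide (caller-supplied reference value).
    Both areas must use the same units, e.g. square angstroms.
    """
    if asa_in_tripeptide <= 0:
        raise ValueError("reference ASA must be positive")
    if asa_in_protein < 0:
        raise ValueError("ASA cannot be negative")
    return asa_in_protein / asa_in_tripeptide
```

A value near 0 indicates a buried residue, and a value near 1 indicates a residue about as exposed as in the extended tripeptide.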

While ASA has proven valuable in defining the extent of solvent accessibility, alternative parameters may be more useful in characterizing the local environment around residues. As an example, ASA is ineffective for discriminating residues or atoms buried just below the protein surface from those that are more deeply buried. Atom or residue depth has been proposed as an alternative for characterizing buried residues, and may be defined as the distance from the nearest surface solvent molecule [59], or the distance from the nearest solvent accessible neighbor [60]. Residue depth was shown to be more effective than ASA in correlating changes in protein stability resulting from mutations, and also in correlating hydrogen exchange of backbone amide protons [59].

Circular variance (CV) has also been proposed as a correlating topological parameter for proteins [61]. CV is the angular variance of all the vectors drawn from a query point (in this case a given residue location) to the set of points that represent the other residues in the protein. Values of CV range from near 1 for residues in the center of the protein to approximately 0.5 for residues at the surface. Initial applications to protein analysis confirmed the relationship between CV and residue depth and suggested that CV might aid in surface classification of residues [61].
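A small sketch may make the CV calculation concrete. It assumes the usual directional-statistics definition, one minus the length of the mean unit vector from the query point to the other points; whether reference [61] applies additional cutoffs or weighting is not stated here, so treat this as an illustration only. The function name is hypothetical.

```python
import math


def circular_variance(query, points):
    """Return 1 - |sum of unit vectors from query to points| / n.

    query: (x, y, z) coordinates of the residue of interest.
    points: iterable of (x, y, z) coordinates of the other residues.
    Points that coincide with the query are skipped.
    """
    sum_x = sum_y = sum_z = 0.0
    n = 0
    for px, py, pz in points:
        dx, dy, dz = px - query[0], py - query[1], pz - query[2]
        norm = math.sqrt(dx * dx + dy * dy + dz * dz)
        if norm == 0.0:
            continue  # direction is undefined for a coincident point
        sum_x += dx / norm
        sum_y += dy / norm
        sum_z += dz / norm
        n += 1
    if n == 0:
        raise ValueError("no points distinct from the query")
    resultant = math.sqrt(sum_x * sum_x + sum_y * sum_y + sum_z * sum_z)
    return 1.0 - resultant / n
```

When the other points surround the query evenly, the unit vectors cancel and the value approaches 1; when they all lie to one side, the value falls.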

Residues may be classified as either surface or buried by determining the surface accessible area for the residue or its constituent atoms and then applying specific criteria, such as a surface area limit, to establish the residue location. Several alternative methods for surface atom classification are described and compared by Deanda and Pearlman [62]. These include a variation of CV that simply computes the sum of the vectors from the given atom to other atoms (rather than the variance) and also a method that counts the number of neighboring atoms within a specified distance from the given atom. Both approaches seek a distinctive change in the parameter between the core and the surface. The technique of counting contacting or neighboring atoms or residues to identify surface positions has also been used previously in several knowledge-based models [55,63].

Voronoi tessellation has been investigated by Angelov et al. [64] for the analysis of protein topological features, including the determination of surface residues. This procedure defines a set of Voronoi cells surrounding each residue location such that all points within the cell are closer to the internal residue than to any other residue. The Voronoi cells are polyhedra, and the cells associated with residues that are nearest neighbors in the protein will share a common face. In the referenced study, the space surrounding the protein was filled with cells representing a model solvent using a procedure that involves relaxed random packing of spheres. When the solvent is included in the tessellation, any residue whose Voronoi cell shares a face with a solvent cell can be identified and defined as a surface residue [64]. This strategy has also been used to compute the protein Voronoi surface area by adding the area of all Voronoi polyhedral faces that contact a solvent cell [65]. Another surface classification method using Voronoi tessellation placed dummy residue positions at the corners of a large cube surrounding the protein and included the dummy points in the tessellation. This led to “solvent” cells with abnormally large volumes, and any residue contacting these cells was identified as a surface or solvent-exposed residue [66].

The current effort examines two strategies for characterizing residues that are both based on Delaunay tessellation of protein residue positions. Delaunay tessellation is the dual of Voronoi tessellation and fills the protein space with irregular tetrahedra instead of polyhedra, where the 4 vertices of each tetrahedron are the locations of individual residues. Each residue typically participates in a number of tetrahedra, so other residues that share a Delaunay tetrahedron with a specific residue can be unambiguously defined as neighbors of that residue without recourse to an arbitrary criterion, such as distance. The first strategy is a classification method similar to the Voronoi classification described above [64] that classifies residues as either (a) on the surface, (b) adjacent to the surface, or (c) deeper in the core, based on geometric features of the Delaunay tetrahedra. The second approach is a new strategy based on the association of residues with water molecules (water groups); it therefore requires hydrated protein structural data as an input, and the waters of hydration are included in the tessellation. Since PDB structure files normally include only limited crystal water locations, a computational method is used to position simulated water molecules around the protein. Then the Delaunay tetrahedra surrounding each residue are examined to provide a water group parameter value that reflects the extent of participation of water molecules in the tetrahedra around the residue. Both of these methods are presented and evaluated with alternative topological parameters. Residues in a large reference set of proteins are characterized using these methods, and the results are compared with ASA, residue depth, residue CV and water coordination number. Residue categories based on the Delaunay tessellation techniques show distinctive differences in these parameters. This suggests that these water group classification strategies can be used in concert with knowledge-based protein studies employing Delaunay tessellation of hydrated proteins.

The reference set of 1321 non-redundant proteins was culled from the PDB using the PISCES web server [67] in a manner similar to that described in reference [68]. The PISCES search was conducted for single-chain proteins with no greater than 30% sequence identity and a resolution of 2.2 Å or better using X-ray crystallography with an

GV

2.3.2 Computational Hydration of the Protein Set

Since high-resolution protein structure files typically identify only a few water molecules compared to the likely number of waters of hydration, a computational strategy was used to simulate the placement of water around each protein. While many approaches have been used to identify potential water positions, including grid-based (or lattice) strategies [50], packing models [64], use of hydration values for individual amino acid residues [70], Monte Carlo methods [62], and molecular dynamics methods [71], this study used the SOLVATE program developed by Grubmüller [72]. SOLVATE uses a grid with 1 Å spacing to progressively place water molecules on a non-interference basis around the protein solute, beginning with the position closest to the protein. A limited number of steepest-descent optimization steps are conducted after each placement based on van der Waals forces, but no electrostatic forces or effects are included in the water placement. The program is relatively fast and can limit the number of waters to a shell of specified minimum thickness around the protein. For this study a minimum shell thickness of 5 Å was used for all hydrations. This distance is consistent with a previous study of the distance dependence of water structure around model solutes [73] that found a strong influence in the first hydration shell, but very little effect in the second hydration shell and beyond. The resulting water positions are not expected to be as realistic as with other methods, but the positions of oxygen atoms should be sufficiently accurate for tessellation-based studies of the extent of water association with protein surface residues.

An important consideration in simulating water locations around proteins is how waters are placed in cavities inside the protein envelope, and also how waters are placed with respect to crevices and other irregular features on the surface of the protein. Hubbard et al. [74,75] examined the detection and characteristics of cavities in a variety of proteins and found both empty and solvated cavities. Since SOLVATE can potentially place waters in any region not occupied by protein atoms, SOLVATE includes an analysis that identifies and reports isolated water molecules or groups of molecules. This analysis begins with the water that is farthest from the protein, computes the distance to all other water molecules, and successively adds to the group any waters that are within a specified distance of a water already in the group. When none of the remaining waters can be added to the current group, a new group is started. For a typical large molecule this may lead to a number of water singlets, some doublets and triplets, occasionally a few larger groups, and the majority of waters in one very large group. For this study, two hydrations of each of the reference proteins were used: a full hydration that included all the waters identified by SOLVATE, and a bulk hydration in which all of the isolated singlets, doublets and triplets identified by SOLVATE were removed from the full hydration set. This definition differs from the “bulk” option available in SOLVATE, in which only the largest group is included in the hydration. Since the extent of penetration and association of water with protein residues is the central feature of the new topological characterization, the bulk hydration variation was used to examine the potential effects of a more restricted water placement. Internal water groups may be a factor in the stability of a protein, so their inclusion in the development of a knowledge-based tool may be useful.
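The grouping analysis and the bulk hydration described above can be sketched as a distance-based grouping of water positions. This is an illustration rather than SOLVATE's code: the function names and the `cutoff` parameter are hypothetical, the linking distance is not given in the text, and each group is seeded from the lowest unassigned index rather than the water farthest from the protein. The seed order affects only the order in which groups are reported, not their membership.

```python
import math


def group_waters(waters, cutoff):
    """Group waters so that each member lies within `cutoff` of another member.

    waters: list of (x, y, z) positions.
    Returns a list of groups, each a sorted list of indices into `waters`.
    """
    unassigned = set(range(len(waters)))
    groups = []
    while unassigned:
        seed = min(unassigned)
        unassigned.remove(seed)
        group = [seed]
        frontier = [seed]
        # Grow the group until no remaining water is close enough to join.
        while frontier:
            current = frontier.pop()
            near = [j for j in unassigned
                    if math.dist(waters[current], waters[j]) <= cutoff]
            for j in near:
                unassigned.remove(j)
            group.extend(near)
            frontier.extend(near)
        groups.append(sorted(group))
    return groups


def bulk_hydration(waters, cutoff, max_isolated_size=3):
    """Indices of waters left after dropping singlets, doublets and triplets."""
    kept = []
    for group in group_waters(waters, cutoff):
        if len(group) > max_isolated_size:
            kept.extend(group)
    return sorted(kept)
```

The full hydration corresponds to keeping every index; the bulk hydration as defined in this study is what `bulk_hydration` returns with the default `max_isolated_size` of 3.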

Another water placement program, which identifies probable positions for water molecules that are internal to the protein, was used to investigate the extent of SOLVATE water placements in internal locations. The DOWSER program, developed by Zhang and Hermans [76], first identifies potential sites for water molecules inside internal cavities, and then evaluates the energy associated with placing one or more waters in the cavity. The DOWSER authors defined an energy threshold to predict whether water is likely to occupy an internal cavity. Internal or buried crystal water positions are identified and compared with the DOWSER water predictions. DOWSER also includes an option that extends the energy analysis to include possible water locations in crevices on the external surface of the protein. In this study, these DOWSER options were used to predict both internal and external water positions for a limited number of proteins, and these positions were compared with the SOLVATE predictions and also with waters identified in the PDB crystallographic data, with emphasis on the isolated water groups identified by SOLVATE.

2.3.3 Delaunay Tessellation

Delaunay tessellation is a computational geometry tool that can be used to examine relationships among a set of three-dimensional points, such as those representing the residue positions in a protein structure [46]. The location of a residue may be defined as the location of any desired residue feature; in this study both the carbon-alpha atom location (Cα) and the center of mass (CM) of the residue are used. A two-dimensional illustration of Delaunay tessellation is shown in Figure 2.1 for a group of points that represent the locations of residues in a section of a protein. In two dimensions, the Voronoi cells are polygons that contain points closest to the interior residue, and the Delaunay simplices are triangles that connect residues whose Voronoi cells meet at a common vertex. In three dimensions, Delaunay tessellation of a set of protein residue locations decomposes the convex space around the protein into a set of space-filling, irregular tetrahedra (Delaunay simplices), where the vertices of each tetrahedron are formed by 4 of the residues (referred to as a quadruplet). As a result, the space around each residue is filled by a number of simplices, each with a common vertex at that residue. Thus the nearest neighbors of a specific residue can be objectively defined as all of the other residues that share simplices with that residue, rather than by a more arbitrary criterion, such as a maximum separation distance. The tessellations in this study were computed using the Qhull program distributed by the University of Minnesota Geometry Center, which implements the Quickhull algorithm developed by Barber et al. [77]. The software was written by Zhibin Lu.
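The neighbor definition above follows directly from the list of simplices. The sketch below assumes the tessellation has already been computed elsewhere (for instance by Qhull) and is supplied as 4-tuples of point indices; the function name and input layout are illustrative assumptions, not the software used in the study.

```python
from collections import defaultdict


def neighbors_from_simplices(simplices):
    """Map each vertex index to the set of vertices sharing a simplex with it.

    simplices: iterable of 4-tuples of point indices, one per tetrahedron.
    """
    neighbors = defaultdict(set)
    for simplex in simplices:
        for vertex in simplex:
            # Every other vertex of the same tetrahedron is a neighbor.
            neighbors[vertex].update(v for v in simplex if v != vertex)
    return dict(neighbors)
```

No distance threshold appears anywhere in the function: neighbor status depends only on shared simplex membership, which is the point made in the text.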

Figure 2.1 Illustration of a tessellated hydrated protein. Two-dimensional depiction of a portion of a tessellated hydrated protein (not to scale). The residue locations are represented by solid red circles, and the dashed blue lines represent the boundaries of Voronoi cells (polygons), which are shaded in yellow. All points inside a Voronoi cell are closer to the central residue than to any other residue. Delaunay simplices (triangles in 2-D) are formed by connecting the three residues whose Voronoi cells meet at a common vertex with red lines. Water locations are depicted as circles and are initially included in the tessellation, but all of the Delaunay simplices that include only water are discarded. This leaves only surface waters (shaded blue in this figure) that participate in Delaunay simplices with residues. In this illustration, the water in the crevice in the protein surface interrupts extended Delaunay simplices that would have linked the residues across the crevice.

This study extends the protein analysis to include the relationship with water molecules at the protein surface by including the locations of the hydration water molecules determined by SOLVATE in the tessellation as additional points. When the tessellation is completed, all quadruplets that contain 4 water locations are discarded, so all of the remaining waters are neighbors of protein surface residues and represent the first hydration layer around the protein. This results in a substantial reduction in the number of water molecules surrounding the protein. Waters of hydration are included in the two-dimensional illustration in Figure 2.1. After tessellation, the waters represented by open circles formed all-water simplices and were no longer considered. The waters depicted as blue shaded circles were linked to at least one protein residue by a simplex edge, so they form the first hydration shell around the protein.
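The filtering step can be sketched as follows. This is only an illustration under stated assumptions: simplices are given as 4-tuples of point indices, the set of indices that correspond to waters is known, and the function name `first_hydration_shell` and the toy indices are hypothetical (the toy list is not a geometrically realized tessellation).

```python
def first_hydration_shell(simplices, water_ids):
    """Discard all-water simplices and return (kept simplices, shell waters)."""
    water_ids = set(water_ids)
    # Keep a simplex only if at least one vertex is a residue.
    kept = [s for s in simplices if not set(s) <= water_ids]
    # Waters that survive share a simplex (hence an edge) with a residue.
    shell = {v for s in kept for v in s if v in water_ids}
    return kept, shell

# Hypothetical indices: residues are 0-3, waters are 4-8.
toy_simplices = [(0, 1, 2, 3), (1, 2, 3, 4), (2, 3, 4, 5),
                 (4, 5, 6, 7), (5, 6, 7, 8)]
kept, shell = first_hydration_shell(toy_simplices, {4, 5, 6, 7, 8})
```

Waters 6, 7 and 8 appear only in all-water simplices in this toy example, so they drop out of the shell.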

When the residues of an unhydrated protein are tessellated, some of the resulting Delaunay simplices at the surface of the protein may be highly distorted, with extreme values of edge lengths. This occurs as the simplices fill the volumes associated with crevices, depressions and other irregularities on the protein surface. Tessellation studies often limit the influence of these distortions by excluding simplices with an edge that exceeds a cutoff value such as 10 Å. By including hydration water locations in the tessellation, the water positions that fill the surface irregularities interrupt these distorted simplices and form more regular simplices with the residues, reducing the simplex edge lengths and obviating the need for a cutoff edge length. This effect is illustrated in Figure 2.1, where several water positions inside a simulated crevice form a number of small simplices in lieu of 2 larger simplices with only residue positions. Figure 2.2 quantifies this effect by comparing two histograms of non-redundant simplex edge lengths derived by tessellating first the unhydrated and then the hydrated set of reference proteins. Without hydration, the maximum edge length was 113 Å and the mean edge length was 8.20 Å. After hydration, the maximum edge length was reduced to 12.9 Å and the mean edge length was 4.85 Å.
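A sketch of how non-redundant edge lengths might be collected from a tessellation is given below. It assumes point coordinates in Å and simplices as 4-tuples of indices; the function name `edge_lengths` and the toy coordinates are hypothetical and are not taken from the reference protein set.

```python
import math
from itertools import combinations

def edge_lengths(points, simplices):
    """Return {edge: length}, counting an edge shared by several simplices once."""
    edges = set()
    for simplex in simplices:
        # Sorting makes (i, j) and (j, i) the same dictionary key.
        for i, j in combinations(sorted(simplex), 2):
            edges.add((i, j))
    return {e: math.dist(points[e[0]], points[e[1]]) for e in edges}

# Hypothetical coordinates and two tetrahedra sharing the face (1, 2, 3).
toy_points = [(0, 0, 0), (3, 0, 0), (0, 4, 0), (0, 0, 12), (3, 4, 12)]
toy_simplices = [(0, 1, 2, 3), (1, 2, 3, 4)]
lengths = edge_lengths(toy_points, toy_simplices)
longest = max(lengths.values())
mean_length = sum(lengths.values()) / len(lengths)
```

The two tetrahedra have 12 edges between them, but the three edges of the shared face are counted once, leaving 9 non-redundant edges for the histogram.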

DT Simplex Face Match Residue Classification

The Delaunay tessellation of an unhydrated protein provides a novel opportunity to define the surface residues of a protein [78]. Since the tessellation is completely space-filling within a convex hull enclosing the protein, any internal face (a triangle formed by three specific residues) of a given Delaunay simplex will be exactly matched by a triangular face (with the same three residues) of one other simplex in the tessellation. However, simplices lying at the surface of the protein will have at least one exterior triangular face that has no exact match in the complete set of simplex faces, and these can be found by searching for unmatched simplex faces. The three residues that determine each of these exterior simplex faces can be defined as surface residues. A second class of residues can be identified that are not found in surface simplex faces, but that participate in simplices with surface residues. These are connected to surface residues by a simplex edge, and can be described as undersurface residues. This classification of residues as surface, undersurface or the remaining buried residues corresponds approximately to classifications of surface, boundary or core residues. Figure 2.3 illustrates this in two dimensions.
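The face match procedure can be sketched as a face-counting pass over the simplices. This is an illustrative sketch, not the implementation used in the study: it assumes simplices are 4-tuples of residue indices, and the function name `classify_by_face_match` and the toy tessellation are hypothetical.

```python
from collections import Counter
from itertools import combinations

def classify_by_face_match(simplices):
    """Split residues into (surface, undersurface, buried) sets."""
    # Count how many simplices contain each triangular face.
    face_counts = Counter(
        frozenset(face) for s in simplices for face in combinations(s, 3)
    )
    surface = set()
    for face, count in face_counts.items():
        if count == 1:  # unmatched face lies on the exterior
            surface |= face
    undersurface = set()
    for s in simplices:
        if surface.intersection(s):
            # Non-surface vertices sharing a simplex with a surface residue.
            undersurface.update(v for v in s if v not in surface)
    all_points = {v for s in simplices for v in s}
    buried = all_points - surface - undersurface
    return surface, undersurface, buried

# Hypothetical toy case: tetrahedron 0-1-2-3 with point 4 in its interior.
toy_simplices = [(4, 0, 1, 2), (4, 0, 1, 3), (4, 0, 2, 3), (4, 1, 2, 3)]
surface, undersurface, buried = classify_by_face_match(toy_simplices)
```

Here every face containing point 4 occurs in two simplices, while the four outer faces occur once, so residues 0 to 3 are surface and residue 4 is undersurface.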

Figure 2.2 Effect of hydration on the non-redundant edge lengths of simplices. Data are shown for the reference set of 1321 proteins; the horizontal axis of each histogram is the simplex edge length (Å). Part (a) shows the results of tessellating the unhydrated proteins, while (b) shows the influence of hydration in eliminating the highly distorted simplices on the surface.

Figure 2.3 Illustration in two dimensions of tessellation-based residue classification methods. The face match method is shown in part (a): interior simplex faces are matched with another simplex face, but simplex faces at the exterior surface are not matched. Residues on unmatched faces are classified as surface (S), residues on a matched face that are linked to a surface residue by a simplex edge are undersurface (U), and remaining residues on a matched face are buried (B). The water group technique is illustrated in part (b), where each residue is considered individually. The simplices surrounding the target residue are counted to determine how many simplices have no HOH molecules, how many have one HOH, etc. Finally, the number of simplices is plotted against the water content as in part (c). The slope of the regression line through the points is the Water Group Parameter; if the parameter is positive the residue is classified as on the surface, and if the parameter is negative the residue is classified as a core residue. (In 3 dimensions, simplex faces are triangles with 3 residues and simplices can contain 0, 1, 2 or 3 HOH molecules.)

Water Coordination Number Ratio

A previous study [62] explored the strategy of counting neighboring atoms as a means of classifying surface atoms (an approach that could be extended to residues). Two atoms were classified as neighbors if the distance between their centers was less than the sum of their atomic radii (including the radius of a probe that might be used to define the surface). Interior atom positions were expected to have more neighbors than surface atoms, but this strategy was not found to be as effective as competing strategies [62]. However, tessellation of a hydrated protein provides an alternative definition of neighbors, namely the other residues or waters that share a simplex (or are connected by a simplex edge) with the query residue. Since the number of neighboring residues determined by tessellation can vary significantly, the ratio of the number of water neighbors to the total number of residue neighbors was examined in this study as a parameter for the various classification methods, and is described as the water coordination number ratio or water CNR. A specific water molecule will appear in more than one simplex around a residue, so care must be taken that the water count is non-redundant.
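A minimal sketch of the water CNR calculation is shown below, under the assumption that simplices are 4-tuples of point indices and that water indices are known. Using sets makes the water count non-redundant, as the text requires. The function name `water_cnr`, the choice to return `None` when a residue has no residue neighbors, and the toy indices are all hypothetical.

```python
def water_cnr(residue, simplices, water_ids):
    """Ratio of distinct water neighbors to distinct residue neighbors."""
    water_ids = set(water_ids)
    neighbors = set()
    for simplex in simplices:
        if residue in simplex:
            # A set keeps each neighbor once, however many simplices share it.
            neighbors.update(v for v in simplex if v != residue)
    waters = neighbors & water_ids
    residues = neighbors - water_ids
    if not residues:
        return None  # ratio undefined without residue neighbors (assumption)
    return len(waters) / len(residues)

# Hypothetical indices: residues 0-3, waters 10 and 11.
toy_simplices = [(0, 1, 2, 10), (0, 2, 3, 10), (0, 3, 10, 11)]
cnr = water_cnr(0, toy_simplices, {10, 11})
```

Water 10 appears in all three simplices around residue 0 but contributes only once, giving 2 water neighbors and 3 residue neighbors.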

DT Water Group Parameter and Residue Classification

While water CNR information indicates the exposure of a residue to water molecules, tessellation can provide additional data regarding the extent of association of water with each residue. This is obtained by first separating the simplices surrounding a residue into groups that contain no water, one water, two waters, or three water molecules, and then counting the number of simplices in each of the 4 water groups.

Residues located on or near the surface are expected to participate in more simplices with multiple waters than with one or none, and the converse is expected for residues buried deeper in the core. The water group parameter is defined as the slope of a least squares linear regression of the number of simplices around the residue in each water group against the number of water molecules in that group (0, 1, 2 or 3). If the slope is greater than 0, the residue is classified as a surface residue; otherwise it is classified as a core residue. This method is also illustrated in 2 dimensions in Figure 2.3. Note that this DT classification method differs from the face match method since it employs an arbitrary threshold value of the water group parameter to classify surface residues.
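The water group parameter can be sketched as an ordinary least squares slope over the four water groups. This sketch assumes the query point is a residue (so a simplex around it holds at most 3 waters) and that simplices are 4-tuples of indices; the function name `water_group_parameter` and the toy data are hypothetical.

```python
def water_group_parameter(residue, simplices, water_ids):
    """Return (slope, counts), where counts[k] = simplices with k waters."""
    water_ids = set(water_ids)
    counts = [0, 0, 0, 0]
    for simplex in simplices:
        if residue in simplex:
            n_water = sum(1 for v in simplex if v in water_ids)
            counts[n_water] += 1
    # Least squares slope of counts (y) against water number (x = 0..3).
    xs = [0, 1, 2, 3]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(counts) / len(counts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, counts))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx, counts

# Hypothetical simplices around residue 0; waters have indices >= 10.
toy_simplices = [(0, 1, 2, 3), (0, 1, 2, 10), (0, 1, 10, 11),
                 (0, 2, 10, 11), (0, 10, 11, 12), (0, 10, 12, 13)]
slope, counts = water_group_parameter(0, toy_simplices, {10, 11, 12, 13})
is_surface = slope > 0
```

In this toy case the simplex counts rise with water content, so the slope is positive and the residue would be classified as a surface residue under the rule described above.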

Accessible Surface Area

This study uses the NACCESS program developed by Hubbard and Thornton to compute ASA [79], based on the definition introduced by Lee and Richards [58]. The program computes the atomic exposure to solvent, then sums the values for the atoms in each residue to determine the residue exposure. A probe radius of 1.4 Å was used for all calculations. Values used in most correlations in this paper are relative accessibilities, found by dividing the residue ASA by the ASA of that residue (X) in an extended Ala-X-Ala conformation [58]. This should reduce the effect of alternative ASA definitions on the results.

Residue Depth

Residue depth has been suggested as an alternative to accessible surface area that provides better discrimination for residues that are located below the protein surface [59].

Residue depth was initially defined as the distance from the nearest surface water molecule [59]. In this reference, waters were placed around the protein using a Monte Carlo simulation, and the resulting ensemble was successively translated and rotated in a procedure to remove waters from internal voids. A subsequent investigation introduced an alternate definition, the distance from the nearest solvent accessible neighbor [60], that simplifies the computation by relating it to the accessible surface area. The current study uses the latter definition, and residue depths were calculated using the DPX program [80], which in turn employs the NACCESS code [81] to compute ASA. The DPX code computes atomic depths and provides a mean residue depth by averaging over the residue atoms.

Circular Variance

CV measures the angular spread of a group of vectors that are drawn from a specific point (representing an atom or a residue) to all the other points in the protein. The CV value can be used to determine the extent to which the specific point is inside or outside (nearer or on the surface) of the other points. For the current study, circular variance was calculated using the Simulaid program [61], which computes the CV of each atom or atom group identified in the PDB file. To reduce the computational load, only points within 10 Å of the query location are used in the CV calculation. For this study, the CV associated with the Cα atom was taken as the residue CV value.
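A sketch of a circular variance calculation is given below. It uses the common definition CV = 1 − |Σ uᵢ| / n, where the uᵢ are unit vectors from the query point to each of the n points within the cutoff; that Simulaid uses exactly this form is an assumption here, and the function name `circular_variance` and the toy coordinates are hypothetical.

```python
import math

def circular_variance(query, points, cutoff=10.0):
    """CV of unit vectors from `query` to all points within `cutoff` (in Å)."""
    units = []
    for p in points:
        d = math.dist(query, p)
        if 0.0 < d <= cutoff:  # skip the query point itself and far points
            units.append([(a - b) / d for a, b in zip(p, query)])
    if not units:
        return None  # undefined when no point lies within the cutoff
    resultant = [sum(u[k] for u in units) for k in range(3)]
    return 1.0 - math.hypot(*resultant) / len(units)

# Hypothetical buried point: neighbors spread evenly in all directions.
around = [(5, 0, 0), (-5, 0, 0), (0, 5, 0), (0, -5, 0), (0, 0, 5), (0, 0, -5)]
cv_buried = circular_variance((0, 0, 0), around)

# Hypothetical exposed point: all neighbors lie in one direction.
one_side = [(1, 0, 0), (2, 0, 0), (3, 0, 0)]
cv_exposed = circular_variance((0, 0, 0), one_side)
```

With this definition, values near 1 indicate a point surrounded on all sides, and values near 0 indicate a point whose neighbors all lie to one side of it.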

Results and Discussion

Examination of the Protein Hydrations

Several approaches were used to examine whether the SOLVATE water placements are representative of hydration waters, including water molecules in internal cavities and in crevices or clefts on the surface. Since all Delaunay simplices containing four waters were discarded following tessellation of the hydrated protein, the remaining waters represent a hydration shell in which each water is linked to at least one protein residue by a simplex edge. Figure 2.4(a) shows the relationship between the number of water placements in this surface hydration layer and the number of residues in the protein sequence. This figure also identifies the range of protein sizes that are included in the reference set of proteins. A parameter that has been used to characterize the first hydration shell is the protein ASA divided by the number of water molecules in the first shell. Figure 2.4(b) presents the distribution of values of this parameter for all proteins in the reference set. The mean value of 14.3 sq Å per HOH for the set of 1321 proteins is consistent with previously identified values of 15 sq Å per HOH [65,82]. However, 20 sq Å per HOH has been reported for the minimum hydration of lysozyme needed to achieve dilute solution values (considering IR spectrum, heat capacity, enzyme activity, and other parameters) and characterized as “barely enough for monolayer coverage” [83]. Also, a value of 9 to 10 sq Å per HOH has been used to estimate the number of water molecules that could be in contact with the protein surface [84,85]. Variations in this parameter are not surprising, since the definition of the first hydration layer or shell is not fixed, and Delaunay tessellation provides a new approach to the definition. Also, the inclusion of internal waters in the calculation will depress the value of the parameter relative to a calculation based on the surface waters only.

Figure 2.4 Characteristics of the 1321 proteins hydrated by SOLVATE. Subplot (a) is a correlation of the number of HOH in the first hydration shell (defined as HOH linked to a residue by a simplex edge) with the number of residues in the protein, and subplot (b) shows a histogram of the ratio of the total accessible surface area for each protein in the reference set divided by the number of water molecules placed by SOLVATE in the first hydration shell (sq Å).

The total number of water molecules in the tessellation hydration shell was also compared with the hydration model proposed by Durchschlag and Zipper [70]. This approach starts with normal vectors based at dot surface points on the protein and positions waters using several criteria, including hydration values for specific amino acid residues, distance between waters and distance to the surface. In the referenced study, citrate synthase was taken as a case study; model parameters such as the initial dot density and the degree of hydration were varied, several solution parameters were predicted, and the results were compared with experimental data. For comparison with the referenced study, SOLVATE was used to compute the tessellation hydration shell for dimeric citrate synthase (using PDB file 4CTS [86]). The tessellation hydration shell contained 2766 water molecules (11.5 sq Å per HOH), and this value compares approximately with the maximum hydration shell from ref. [70] that was derived using the maximum dot surface density and the maximum hydration factor. The tessellation shell corresponds to 0.509 grams HOH/gram protein, a value that is higher than the 0.38 grams HOH/gram protein suggested as a typical minimum hydration loading for proteins [83]. For citrate synthase, a hydration loading of 0.38 grams HOH/gram protein would represent 15.4 sq Å per HOH.
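The last conversion can be checked with a short calculation. The sketch assumes that the total accessible surface area and the protein mass are held fixed, so that the area per water scales inversely with the hydration loading; the function name `area_per_water_at_loading` is hypothetical.

```python
def area_per_water_at_loading(area_per_water, loading, new_loading):
    """Rescale area per HOH when the loading (g HOH / g protein) changes.

    Assumes fixed total ASA and protein mass, so the number of waters is
    proportional to the loading and the area per water is inversely
    proportional to it.
    """
    return area_per_water * loading / new_loading

# Values quoted in the text for dimeric citrate synthase.
shell_area_per_water = 11.5   # sq Å per HOH in the tessellation shell
shell_loading = 0.509         # g HOH / g protein for the tessellation shell
minimum_loading = 0.38        # g HOH / g protein, typical minimum loading

rescaled = area_per_water_at_loading(
    shell_area_per_water, shell_loading, minimum_loading
)
```

The rescaled value rounds to the 15.4 sq Å per HOH given in the text.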

The SOLVATE water placements were also compared with crystallographic water locations for all proteins with crystal water data in the PDB record (twenty-one files had missing or ambiguous crystal water data). The separation distance between each crystal water position and the nearest SOLVATE water location was computed for all the crystal water positions that are within 7 Å of any non-hydrogen atom in the protein (a total of 290,968 locations). The average separation distance is 1.215 Å, and the distribution of separation distances is shown in Figure 2.5. About 0.66% of the total separations are in the tail of the curve beyond 2.5 Å, extending up to a maximum separation of 9.92 Å. While the large separations are a small segment of the total, they are of interest since they likely indicate deeply buried crystal waters in regions that were not populated by SOLVATE.
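The comparison described above can be sketched as a brute-force nearest-neighbor search. This is an illustration only: coordinates are assumed to be (x, y, z) tuples in Å, the protein atom list is assumed to contain non-hydrogen atoms only, and the function name `nearest_separations` and the toy coordinates are hypothetical. A spatial index would be preferable for hundreds of thousands of waters.

```python
import math

def nearest_separations(crystal_waters, model_waters, protein_atoms,
                        max_protein_dist=7.0):
    """Distance from each retained crystal water to the nearest model water."""
    separations = []
    for water in crystal_waters:
        # Skip crystal waters farther than the cutoff from every protein atom.
        if min(math.dist(water, atom) for atom in protein_atoms) > max_protein_dist:
            continue
        separations.append(min(math.dist(water, m) for m in model_waters))
    return separations

# Hypothetical coordinates: one protein atom, two crystal waters, two model waters.
toy_protein = [(0.0, 0.0, 0.0)]
toy_crystal = [(3.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_model = [(3.0, 0.0, 1.0), (10.0, 0.0, 0.0)]
separations = nearest_separations(toy_crystal, toy_model, toy_protein)
tail_fraction = sum(1 for s in separations if s > 2.5) / len(separations)
```

The second crystal water lies 20 Å from the only protein atom, so it is excluded before any separation is computed.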

A closer examination of the proteins with large separations revealed 9 proteins with a total of 14 crystal waters separated by more than 7 Å from the nearest SOLVATE water. Each of these proteins was subsequently analyzed with the DOWSER program to study how SOLVATE populates cavities and crevices. DOWSER was used to identify both internal and external water positions, and comparisons of the DOWSER waters with the closest crystal and SOLVATE water positions are shown in Figure 2.6 and summarized in Table 2.1. The crystal water locations compare quite well with DOWSER internal positions (Figure 2.6(a)), exhibiting an average separation of 1.34 Å, although a few separations extend out to over 8 Å. However, the SOLVATE locations do not compare as well with internal DOWSER sites (Figure 2.6(b)), showing a larger number of separations of up to 8 Å and a mean separation of 3.02 Å. Examination of these proteins with visualization tools suggests that SOLVATE identifies fewer internal sites than DOWSER, and this observation is supported by the internal placement data in

Figure 2.5 Histogram of the separation distance between crystal and SOLVATE water positions. Each crystal water identified in the protein PDB records is compared with the closest SOLVATE water position for 1300 proteins from the reference set. Any crystal water positions > 7 Å from any (non-hydrogen) protein atom location are not included in this figure. The bin from 2.5 to 3.0 Å includes all data points for separations from 2.5 Å to the maximum of 9.92 Å, a total of 1928 or 0.66% of all data points.

Figure 2.6 Separation distance between DOWSER HOH placements and the nearest crystal and SOLVATE HOH locations for 9 selected proteins. The horizontal axes are the distance to the nearest crystal HOH (Å) and the distance to the nearest SOLVATE HOH (Å). Subplot (a) shows separation of crystal waters from internal (or buried) DOWSER waters, (b) shows separation of SOLVATE waters from internal DOWSER sites, (c) shows separation of crystal HOH from external DOWSER (crevice) waters, and (d) shows separation of SOLVATE sites from external DOWSER locations.

Table 2.1 Comparison of the proximity of DOWSER placements with the nearest crystal HOH and SOLVATE sites for 9 selected proteins. The protein PDB files are listed in the first column. The number of DOWSER internal placements is shown in the second column, followed in the next 2 columns by the number of crystal HOH sites and SOLVATE sites that are within a separation distance of 1.4 Å. The DOWSER external placements are compared with the nearest SOLVATE sites on the right side of the table. The middle 3 columns provide data on the “isolated water groups” identified by SOLVATE, listing the total number of HOH in these groups, the number of singlets (S), doublets (D), triplets (T), and the number of HOH in a larger group (+), and finally the number of water group HOH that are within 1.4 Å of a DOWSER placement.

Column headings: Protein PDB file; Internal placements (number of DOWSER HOH, crystal HOH < 1.4 Å, SOLVATE HOH < 1.4 Å); SOLVATE water group HOH (number of water group HOH, S/D/T(+), water group HOH < 1.4 Å); External placements (number of DOWSER HOH, SOLVATE HOH < 1.4 Å).


Date posted: 02/10/2024, 01:54