Comparative study on the sequence-structure-function relationship of human short-chain dehydrogenases/reductases

INTERNATIONAL UNIVERSITY
VIETNAM NATIONAL UNIVERSITY HCMC

COMPARATIVE STUDY ON SEQUENCE-STRUCTURE-FUNCTION RELATIONSHIP OF HUMAN SHORT-CHAIN DEHYDROGENASES/REDUCTASES

A thesis submitted to the School of Biotechnology, International University, in partial fulfillment of the requirements for the degree of MSc. in Biotechnology

Student name: TANG THI NGOC NU – MBT04011
Supervisor: Dr. LE THI LY
May/2013

ABSTRACT

The human short-chain dehydrogenase/reductase (SDR) family has been the subject of many recent studies because of its crucial roles in the human body. A growing number of single-nucleotide polymorphisms and a variety of heritable metabolic diseases have been identified in SDR genes. Here, we carried out a phylogenetic analysis of homologous SDR sequences, and subsequently used a series of bioinformatics and comparative analytical methods to investigate the sequence-structure-function relationships within the human SDR family. Our findings show that Tyrosine, Serine, and Lysine are not only present in all members of the human SDR family, but are also located in a conserved region of both the SDR protein sequence and structure. In contrast, we find a cluster of three residues (Serine-Alanine-Serine, Phenylalanine-Glycine-Valine, Cysteine-Serine-Serine, Cysteine-Histidine-Serine or Alanine-Alanine-Alanine) that differs in protein sequence and structure and appears to be specific to each group of the human SDR family. Finally, our analysis of correlated mutations within the human SDR family reveals residues that are distantly located but seem to interact with one another. We hypothesize that these long-distance interactions may be an adaptive mechanism that allows members of the human SDR family to cope with a changing environment and differing functional demands over evolutionary time.
Taken together, our results provide data that will be useful for designing inhibitors targeted at specific groups of human SDRs, such as those known to be involved in metabolic disorders.

Key words: multiple sequence alignments, consensus sequence, phylogeny, mutational variability and correlation.

ACKNOWLEDGEMENTS

First and foremost, I would like to give special thanks to my university advisor, Dr. Ly Le, for her support and encouragement while I was carrying out this thesis, for cheering me up, and for guiding me through temporary standstills. In addition, I am very grateful to Dr. Ly Le for providing me with this interesting topic and for straightening out many question marks concerning the bioinformatics part. I would also like to thank my best friend, Charlene Mccord Buxan, for taking the time to read my Master’s thesis and for sharing valuable comments. Last but not least, I am deeply thankful to my parents, who are always by my side. Without my parents’ support, I could not have completed my Master’s degree successfully.

PUBLICATION

Ngoc Nu Tang, Ly Le. “Comparative Study on 11β Hydroxysteroid dehydrogenase 1 (11βHSD1)”. Research Journal of Biotechnology, 2012. Accepted.
Ngoc Nu Tang, Jacek Leluk, Ly Le. “Comparative study on Sequence-structure-function of Human Short-chain dehydrogenases reductase family”. BMC Bioinformatics, 2013. Submitted.

SUPERVISOR’S APPROVAL

Dr. LE THI LY

THESIS CONTENTS

ABSTRACT
ACKNOWLEDGEMENT
PUBLICATION
1 INTRODUCTION
1.1 General Introduction about Bioinformatics
1.2 General introduction on Human Short-chain dehydrogenases/reductases (SDR) family
1.3 Aims and Objectives
2 SEQUENCE DATABASES
2.1 Data Collection
2.2 Bioinformatic tools
3 SEQUENCE ANALYSIS TOOLS
3.1 Sequence alignment of human SDR protein
3.1.1 Alignment of Pair Sequence
3.1.2 Local and global alignment
3.1.3 Why Sequence Alignment is performed?
3.1.4 Substitution Matrices and Gap Penalties
3.1.5 Multiple Sequence Alignment
3.1.5.1 ClustalW
3.1.5.2 MUSCLE (Multiple Sequence Comparison by Log-Expectation)
3.1.5.3 KALIGN
3.1.5.4 T-COFFEE (Tree-based Consistency Objective Function of Alignment Evaluation)
3.1.5.5 GEISHA 3
3.1.6 Multiple sequence alignments of human SDR protein and alignment verification
3.2 Consensus sequence construction and BLAST search
3.2.1 What is BLAST (Basic Local Alignment Search Tool)?
3.2.2 Construction of consensus of Human SDR protein family and BLAST search
3.3 Phylogenetic tree construction and comparison of consensus sequences
3.3.1 Phylogenetic Tree Prediction
3.3.2 Distance-based Method
3.3.3 Character-based Method
3.3.3.1 PHYLIP
3.3.3.2 SSSSg
3.3.4 Human SDR phylogenetic tree and comparison of consensus sequences
3.4 Mutational variability of human SDRs
Mutational Variability (Talana, Consurf)
3.4.1 Consurf
3.4.2 Talana
3.4.3 Mutational variability of human SDR protein family
3.5 Analysis of correlated mutations
3.6 Availability of original software generated by authors
4 RESULTS AND DISCUSSION
4.1 Multiple sequence alignment, consensus sequence generation, and analysis of human SDR specificity
4.2 Sequence specificity and interrelationships of the human SDR family
4.3 Mutational variability of human SDRs
4.4 Correlated mutations within the human SDR family
5 CONCLUSION
6 REFERENCES
7 SUPPLEMENTS

LIST OF FIGURES

Figure 1: In the motif, “a” stands for aromatic residues, “c” for charged residues, “h” for hydrophobic residues, “L” for aliphatic residues, “p” for polar residues and “x” for any residue. In motif TGxxGhLG the aliphatic residue before the last G has replaced the original aromatic residue, and the last motif has been changed from h[KR]xxNGP into h[KR]xxNxxG.
Figure 2: Illustration of a local and global alignment [Figure 2.2 in [22]]
Figure 3: Here A, B and C represent three highly conserved sequences of the same protein taken from three separate organisms. The phylogenetic tree gives a view of the substitutions that happened during evolution, when these sequences evolved from the same ancestor [21].
Figure 4: Part of the result of the multiple sequence alignment of 71 human SDR members
Figure 5: Completed consensus sequence of 71 human SDR members
Figure 6: Phylogenetic tree construction by PHYLIP
Figure 7: Phylogenetic tree construction by SSSSg. Both programs show that the human SDR family can be phylogenetically grouped into five distinct classes.
Figure 8: Comparison of the five consensus human SDR sequences
Figure 9: The active site (AS), substrate binding sites (BS), and three residues between AS and one of the BS in 5 human SDR groups identified by Talana.
Figure 10: The identification of functional regions within group 1 using Consurf and Talana.
Figure 11: The result of mutational variability (done by Talana)
Figure 12: Variability profiles for each of the five groups of human SDRs
Figure 13: The location of the conserved and variable residues in the template structure of group 1 of human SDR, as identified by Talana.

LIST OF TABLES

Table 1: PDB code and name of five representatives
Table 2: The core residues in five human SDR groups identified by Talana
Table 3: The surface residues in five human SDR groups identified by Talana
Table 4: The identification of correlated mutation sets and their core and surface characteristics for group 5
Table 5: Selected correlated mutations in human SDRs identified by Talana
1 INTRODUCTION

1.1 General Introduction about Bioinformatics

Bioinformatics conceptualizes biology in terms of molecules (in the sense of physical chemistry) and applies "informatics techniques" (derived from disciplines such as applied mathematics, computer science and statistics) to understand and organize the information associated with these molecules on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications [1].

Bioinformatics emerged in response to the need to handle the dramatically increasing quantities of biological data [2]. For example, as of August 2000, the GenBank repository of nucleic acid sequences contained 8,214,000 entries [3] and the SWISS-PROT database of protein sequences contained 88,166 [4]. On average, these databases double in size every 15 months [3]. Bioinformatics is therefore often defined as the application of computational techniques to understand and organize the information associated with biological macromolecules. This unexpected union between the two subjects is largely attributed to the fact that life itself is an information technology: an organism’s physiology is largely determined by its genes, which at their most basic can be viewed as digital information [1].

The aims of bioinformatics are threefold. The first aim is to organize data in a way that allows researchers to access existing information and to submit new entries as they are produced, as in the Protein Data Bank for 3D macromolecular structures [5,6]. Organizing data is only a starting point, however, and the purpose of bioinformatics extends much further. The second aim is to develop tools and resources that aid in the analysis of data. For example, having sequenced a particular protein, it is of interest to compare it with previously characterized sequences.
This requires more than a simple text-based search: programs such as FASTA [7] and PSI-BLAST [8,9] must consider what constitutes a biologically significant match. Developing such resources demands expertise in computational theory as well as a thorough understanding of biology. The third aim is to use these tools to analyze the data and interpret the results in a biologically meaningful manner. More specifically, bioinformatics can conduct global analyses of all the available data with the aim of uncovering common principles that apply across many systems and highlighting novel features.

Given the important contribution of bioinformatics to biology, especially in analyzing large biological data sets effectively, in this study I applied the third aim of bioinformatics to highlight the general and specific characteristics of the human SDR family. The study covers two bioinformatics topics: multiple sequence alignment algorithms and the identification of conserved motifs.

1.2 General introduction on Human Short-chain dehydrogenases/reductases (SDR) family

Short-chain dehydrogenases/reductases (SDRs) form one of the largest enzyme super-families, with over 46,000 members [10]. Among these, at least 140 different enzymes have been sequenced to date, and about 70 of them are known to belong to the human SDR family [11, 12]. Most SDRs are known to be NAD- or NADP-dependent oxidoreductases that share characteristic sequence motifs and mechanisms of action [13, 14]. The SDR enzyme super-family is present in all forms of prokaryotic and eukaryotic life [13], and plays an important role in a variety of key metabolic processes. Indeed, human SDRs have been extensively studied for their critical roles in lipid, amino acid, carbohydrate, cofactor, hormone and xenobiotic metabolism, as well as in redox sensor mechanisms [15].
In addition to their crucial roles in normal metabolic processes, the function of human SDRs in metabolic defects, such as type II diabetes, warrants continued research attention [16]. Given their part in proper physiological functioning, the human SDR protein family appears to be a suitable target for the development of novel drugs directed at influencing hormone metabolism [17].

Despite their importance to proper metabolic function and their potential use in the treatment and prevention of various human diseases, a standardized way of classifying SDRs has yet to be established. According to prior studies, SDR enzymes can be divided into two main types, denoted “Classical” and “Extended” [18,19]. The “Classical” type consists of about 250 amino acid residues, while the “Extended” type has an additional 100-residue domain forming the C-terminal region. Another study instead divided the family into three types, designated “Intermediate”, “Complex” and “Divergent”, which can be distinguished by their characteristic sequence motifs [15]. Even with conflicting ideas of how to group SDRs, it is clear that members of the human SDR family have diverged over evolutionary time, because they share only 15 to 30% overall sequence identity [16].

Despite this clear sequence diversification, human SDRs all have a common sequence motif that defines the cofactor binding site (TGxxxGxG) and the catalytic tetrad (N-S-Y-K) [20]. Moreover, the three-dimensional structures of all human SDRs share common features, such as an alpha/beta-folding motif characterized by a central beta-sheet. This central beta-sheet is typical of a Rossmann fold, with helices on either side [20]. Given these structural similarities, it is important to study the evolutionary history of the SDR super-family to better understand why its members have similar 3D structures despite sharing very little sequence identity.
It has been proposed that these common motifs were conserved through evolution because of their crucial function in differentiating the human SDR family from other enzyme families [13]. While mutations in biomolecules occur at the level of sequence, their effects are noticed at the level of function. Biomolecular function, in turn, is directly related to 3D structure. As such, by studying and comparing the sequences and 3D structures of the different human SDRs in a phylogenetic context, it may be possible to reveal more pertinent information about the evolutionary and functional diversification of the group.

The first enzymes of this type were analysed as early as the 1970s. These analyses gave the structures of prokaryotic ribitol dehydrogenase and Drosophila alcohol dehydrogenase. At the time the proteins were not known to form a family, but the Drosophila alcohol dehydrogenase turned out to differ from the previously known alcohol dehydrogenases of liver and yeast. When other dehydrogenases showed the same distinctive pattern as the two alcohol dehydrogenase types, the concept of a family of short-chain dehydrogenases was established [20]. This occurred in 1981, and since then the SDR family has grown enormously, both in the number of known members and in the variety of their functions. Currently at least 3000 members, including species variants, are known, with a substrate spectrum ranging from alcohols, sugars, steroids and aromatic compounds to xenobiotics [19, 20].

As can be expected from its broad variety of functions, the SDR family is very divergent. The residue identity in pair-wise comparisons is as low as 15-30%. However, although few residues are completely conserved, several sequence motifs (consensus patterns) are distinguishable within the families. The criterion for SDR membership is therefore the occurrence of typical sequence motifs, arranged in a specific manner.
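As a minimal sketch of how a motif-based membership criterion can be checked, the cofactor-binding motif TGxxxGxG cited earlier in this section can be written as a regular expression in which each "x" matches any residue. The function name `find_motif` and the toy sequence are hypothetical and chosen only for illustration; the fragment is not taken from any real SDR entry, and this is not the procedure used by the tools in this study.

```python
import re

# Cofactor-binding motif TGxxxGxG, where "x" stands for any residue.
GLY_MOTIF = re.compile(r"TG.{3}G.G")

def find_motif(sequence, pattern=GLY_MOTIF):
    """Return (1-based start position, matched text) for each non-overlapping hit."""
    return [(m.start() + 1, m.group()) for m in pattern.finditer(sequence)]

# Hypothetical fragment used only to exercise the function.
toy = "MKLVAVVTGASSGIGLAIAK"
print(find_motif(toy))  # prints [(8, 'TGASSGIG')]
```

A real screen would combine several motifs and their relative order, since the text states that membership depends on motifs "arranged in a specific manner".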
These motifs comprise Rossmann-fold elements for nucleotide binding and specific residues for the active site, and they reflect common folding patterns [18].

The SDR enzymes can be divided into two main classes, the Classical and the Extended families. The Classical family is the larger, with 218 of the sequences in the data set as opposed to 118 for the Extended family. What distinguishes the two classes from each other? One distinction is the length of the sequences: the classical SDRs are around 250 residues long, while the extended SDRs are around 350 residues long. Another difference, although with exceptions, is that the classical SDRs prefer the NADP(H) coenzyme and the extended SDRs prefer NAD(H). There are, however, NAD(H)-binding classical SDRs as well as NADP(H)-binding extended SDRs [19, 20].

The sequence motifs mentioned earlier also differ between the two main families. Because properties such as length can vary, these motifs are the most distinguishing feature of the two classes and can therefore be used to separate them. What do the motifs look like? They are placed in or near different secondary structure elements; for example, the Gly-motif, TGxxxGhG or TGxxGhLG, is placed in and adjacent to β1 + α1. There are seven motifs each for the Classical and the Extended families. These motifs are based on the motifs used by Bengt Persson et al. [20] and can be seen in Figure 1 below.

Figure 1: In the motif, “a” stands for aromatic residues, “c” for charged residues, “h” for hydrophobic residues, “L” for aliphatic residues, “p” for polar residues and “x” for any residue. In motif TGxxGhLG the aliphatic residue before the last G has replaced the original aromatic residue, and the last motif has been changed from h[KR]xxNGP into h[KR]xxNxxG.

It is undeniable that the human SDR family has important implications for medicine, especially through its involvement in metabolic defects such as type II diabetes.
Therefore, the identification and functional analysis of the human SDR family at the level of sequence and structure is the primary goal of this study, with a view to finding new targets for drug design and development.

1.3 Aims and Objectives

In this study, a rigorous comparative analysis of homologous sequences and sequence-structure-function relationships in the human SDR family was performed using bioinformatics. Our goal was to gain insight into the mechanisms of action of the human SDR family. Specifically, we sought to identify and compare the convergent and divergent residues of the human SDR nucleotide-binding pocket. We hypothesized that evolutionarily conserved regions in the human SDR family would appear at or near the location of the active and binding sites of the protein. This is because active and binding sites are responsible for the chemical and enzymatic reactions that take place in the protein molecule. Maintaining proper functioning at these sites is necessary for the protein-protein and protein-ligand interactions that are indispensable in regulating molecular processes.

In contrast, we expected to find variable regions in the human SDR family near the nucleotide binding sites, owing to the varying substrate-enzyme interactions that are characteristic of each group of human SDRs. Through periods of adaptive radiation over evolutionary time, these regions of variability allowed each group within the human SDR family to adopt its own specific features. These nuances in structural design, and potentially in function, are important to identify in order to facilitate the future design of inhibitors that are directly targeted at each subgroup of human SDRs [18,19, 20].

2 SEQUENCE DATABASES

2.1 Data Collection

75 sequences of human SDR enzymes were collected from the UniProtKB database (http://www.uniprot.org).
100 homologous sequences of human SDR enzymes were collected from NCBI-BLAST (http://blast.ncbi.nlm.nih.gov/).
Human SDR protein structures were collected from the RCSB Protein Data Bank (PDB) (http://www.rcsb.org/pdb/home/home.do).

2.2 Bioinformatic tools

Tools for multiple sequence alignment: ClustalW, MUSCLE, Kalign and T-COFFEE were available at the European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) (http://www.ebi.ac.uk/services/all). Geisha3 was available at (http://atama.wnb.uz.zgora.pl/~jleluk/linki.html).
Tools for constructing consensus sequences: Consensus sequence constructor was available at (http://atama.wnb.uz.zgora.pl/~jleluk/linki.html).
Tools for constructing phylogenetic trees: PHYLIP was available at (http://www.phylip.com). SSSSg was available at (http://atama.wnb.uz.zgora.pl/~jleluk/software/wlasne/ssssg/ssssg.zip).
Tools for studying mutational variability: Consurf was available at (consurf.tau.ac.il). Talana was available at (http://www.bioware.republika.pl/).
Tools for visualizing the results from Consurf and Talana: Rastop was available at (http://www.geneinfinity.org/rastop).
Tools for studying correlated mutations: Talana was available at (http://www.bioware.republika.pl/). Corm was available at (http://atama.wnb.uz.zgora.pl/~jleluk/software/wlasne/corm.jar).

3 SEQUENCE ANALYSIS TOOLS

3.1 Sequence alignment of human SDR protein

Several diseases are caused by disorders in genes or proteins. Understanding the sequences, how the genes or proteins are related to each other, and what functions they have can help in developing remedies for the diseases caused by these disorders. However, finding the gene or protein that is responsible for a disease is an enormous task. This is because, for example, a single gene can harbor several hundred or several thousand base-pair combinations that could carry the disorder. There can also be hundreds of gene sequence candidates for intensive study.
Hence, it is hard to choose a good candidate to investigate further as the possible origin of the disease. Appropriate candidates have previously been found by trial and error. However, finding a remedy for a disease in this way is time consuming, so new techniques and algorithms are needed to make it easier to discover the gene or protein that causes a particular disease [21].

Depending on the biological data available on websites such as UniProtKB and NCBI, several different techniques are available for performing the sequence analysis. For example, if we want to highlight the differences and similarities between two protein sequences, pair-wise sequence alignment is the best method. In contrast, if we want to describe the shared and distinct characteristics of all members of a certain family, multiple sequence alignment is the suitable tool.

3.1.1 Alignment of Pair Sequence

A pair-wise sequence alignment is performed when two protein sequences are available in the databases (UniProtKB or NCBI). The two sequences are compared for a series of characters or character patterns that lie in the same order in both sequences. For example, the two sequences, V and W, are written in a two-row matrix. The first row contains the characters of V and the second the characters of W. Matching characters in V and W are placed in the same column, and differing characters are placed in the same column as a mismatch. Another way to deal with differing characters is to insert a gap into one sequence, so that the character is placed opposite a gap in the other sequence. The reason for introducing a gap is that more matches can be achieved in this way. The column score is usually positive for matching characters and negative for dissimilar characters. The sum of all column scores makes up the score for the alignment [22].

3.1.2 Local and global alignment

A pair-wise alignment can be performed either as a global or as a local alignment, as shown in Figure 2 below. The global method was the first to be invented and compares the sequences over their entire length (Figure 2). Global alignment works properly when the sequences are conserved or closely related to each other. In contrast, the local method works well when the sequences are not closely related, so that they may share only a conserved region and are not similar over the whole sequence. In local alignment, only substrings of the sequences are aligned [22]. Therefore, the level of similarity between the sequences determines whether a local or a global alignment should be applied.

Figure 2: Illustration of a local and global alignment [Figure 2.2 in [22]]

3.1.3 Why Sequence Alignment is performed?

Sequence alignments are important for discovering functional, structural, and evolutionary information in biological sequences. With the help of an alignment, it is simple to illustrate whether protein sequences share a similar biochemical function and 3D structure. If proteins from different life forms are similar, they might have diverged from a common ancestor. In that case they are homologous, and they have undergone mutation and selection to evolve into the new sequences. The alignment also indicates the changes that have occurred between the sequences and their common ancestor; these are considered substitutions. Insertions of new residues and deletions of old residues are represented as gaps [23]. Therefore, the more similar the sequences are, the fewer changes have occurred and the more likely the proteins are to be related; the best alignment is thus the one that best represents the most likely evolutionary scenario. It is important to remember, however, that even though sequences are similar, they are not necessarily homologous.
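To make the column scoring of Section 3.1.1 and the global/local distinction of Section 3.1.2 concrete, the sketch below computes the best alignment score of two sequences by dynamic programming. It assumes a simple scheme (match +1, mismatch -1, gap -2) chosen only for illustration; the function name `alignment_score` is hypothetical, and the programs used in this study rely on substitution matrices and more elaborate gap penalties.

```python
def alignment_score(v, w, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of sequences v and w by dynamic programming.

    local=False: global alignment over the entire length of both sequences.
    local=True:  local alignment, i.e. the best-scoring pair of substrings.
    """
    rows, cols = len(v) + 1, len(w) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # In a global alignment, leading gaps must be paid for.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            same = v[i - 1] == w[j - 1]
            diag = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap      # gap in w
            left = score[i][j - 1] + gap    # gap in v
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)         # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]

print(alignment_score("ACGT", "AGT"))                       # global score
print(alignment_score("TTTACGTTT", "GGACGGG", local=True))  # local score
```

With these toy inputs the second pair shares only the short region "ACG", so the local score is higher than the global score of the same pair, which illustrates why local alignment suits sequences that are similar only in part.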
Similarity between short sequence fragments may have arisen by chance or as a result of evolutionary convergence, meaning that the similar regions have the same function but have developed independently from different ancestors [24]. This remains a limitation of sequence alignment. To reduce this problem, in this study I carried out five independent alignment approaches in order to increase the accuracy of the alignment [24].

3.1.4 Substitution Matrices and Gap Penalties

As mentioned earlier, substitutions happen during evolution. Certain amino acids are exchanged more commonly than others because they share similar physico-chemical properties; other changes take place too, but they are rarer. Knowing which substitutions are the most and least frequent in a large number of proteins can aid the prediction of alignments for any set of protein sequences [21]. Matrices that estimate the probabilities of all possible substitutions are therefore useful here. There are several methods for building these so-called substitution matrices, but the two most commonly used are PAM and BLOSUM. PAM is, for example, applied in the widely used global multiple alignment program CLUSTALW [22].

Aside from binary cost functions (0: match and 1: mismatch), a transformation matrix of substitution costs can be used, which assigns a separate penalty to each class of mismatch observed [23]. The minimum mutation distance matrix (Fitch, 1990) is based on the minimum number of nucleotides that must be changed in order to convert the codon for one amino acid into the codon for another amino acid. The most common type of transformation table is the log-odds matrix. Log-odds matrices contain the relative frequencies with which amino acids are assumed to replace one another over time.
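The idea behind a log-odds entry can be sketched in a few lines. The function below compares how often a pair of residues is observed aligned with how often it would be expected by chance. The function name `log_odds`, the factor of 2 (half-bit units, a common convention that is assumed here) and the frequencies in the example are all hypothetical and are not values from PAM or BLOSUM.

```python
import math

def log_odds(observed_pair_freq, expected_pair_freq, scale=2.0):
    """Scaled log2 ratio of observed to chance-expected pair frequency.

    > 0: the replacement is seen more often than expected by chance
    < 0: the replacement is seen less often than expected by chance
    """
    return scale * math.log2(observed_pair_freq / expected_pair_freq)

# Hypothetical frequencies, for illustration only.
print(log_odds(0.02, 0.01))   # pair observed twice as often as expected
print(log_odds(0.005, 0.01))  # pair observed half as often as expected
```

Published matrices round such values to integers, so an entry in a real matrix is an approximation of this ratio rather than the exact number.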
Positive values in the matrix indicate a replacement rate greater than expected by chance, whereas negative values indicate a replacement rate lower than expected by chance. The most relevant log-odds matrices are the PAM (point accepted mutation) matrices (Dayhoff et al., 1978). A PAM X matrix is calculated by multiplying the original PAM1 matrix by itself X times, which gives the substitution probabilities after X PAM1 steps. Low PAM matrices are used with closely related sequences, while high PAM matrices are used with distantly related sequences. The other family is the BLOSUM matrices (Henikoff, 1992), which are based on well-conserved blocks of multiply aligned sequence segments, or motifs, that represent the most conserved regions of an aligned family.

As explained earlier, an alignment may contain gaps. Gaps are introduced into the alignment in order to align as many identical characters as possible. There should not be too many gaps, however, because if gaps appear everywhere the alignment will show an unlikely pattern of amino acid changes [24, 25]. For this reason there are penalties for inserting gaps: one penalty for opening a gap and one penalty for extending it. There are several ways to decide the values of the penalties, but the gap-extension penalty is usually set lower than the gap-open penalty, so that long insertions and deletions are penalized less than they would be by a linear gap cost. This is desirable when gaps of a few residues are expected almost as frequently as gaps of a single residue [23]. In addition, gaps in the alignment represent implied insertions or deletions. The decision to place a gap in the alignment is a result of the gap cost calculation during the wave-front update of the matrix elements.

Several methods are widely used for pair-wise sequence alignment. The first method is the dot matrix analysis. This method shows a matrix of the two sequences.
One sequence is written horizontally across the top of the graph and the other along the left-hand side, and diagonal lines show alignments [21]. The second method is the dynamic programming algorithm, which solves a problem by combining solutions to sub-problems. It finds the optimal alignment by comparing all character pairs in the sequences [24] and is commonly used for sequence analysis. The third method is the word, or k-tuple, method. It starts by searching for identical short stretches of sequence, called words or k-tuples, which are then joined into an alignment with the dynamic programming algorithm. Examples of programs using this method are FASTA and BLAST. These programs are commonly used for database searches, when seeking the sequences that align best with an input query sequence [25].

These are a few of the most widely used methods for pair-wise sequence alignment. When more than two sequences are to be aligned, however, these methods are not a good choice. The next section, Multiple Sequence Alignment, discusses methods that are more appropriate in that case.

3.1.5 Multiple Sequence Alignment

As in a pair-wise alignment, in a multiple sequence alignment the sequences are searched for a series of individual characters or character patterns that occur in the same order. The alignment can also be performed as a global or a local alignment, and substitution matrices and gap penalties can be used. The difference is that a multiple sequence alignment contains more than two sequences. A multiple sequence alignment of a set of sequences can provide information on the most similar regions in the set. In proteins, such regions may represent conserved functional or structural domains. Multiple alignments are the basis for the most sensitive sequence searching algorithms. They are also useful for deciphering evolutionary information in biological sequences.
For example, they indicate which residues are important for the function and for the stabilization of the secondary and three-dimensional structures of the protein; that is, they illustrate which residues and regions represent conserved functional or structural domains. Even if only two sequences in a set are to be aligned, it can be worthwhile to align all sequences in the set in order to improve the accuracy of the alignment. In addition, it is difficult to identify the pattern of conserved residues when only two sequences are compared. Multiple sequence alignment is therefore the most common method for studying the conserved motifs of a protein family [26].

Standard protocol for traditional multiple sequence alignment (MSA):
• Pair-wise alignment (the most closely related sequences are aligned first, and the more distant ones are added gradually).
• A distance matrix (k-tuple) is calculated from the pair-wise alignments, based on the level of identity, giving the divergence of each sequence.
• A guide tree is calculated from the distance matrix by neighbor joining.
• The sequences are aligned progressively according to the branching order of the guide tree, using statistical matrices (PAM/BLOSUM), until the alignment is complete [22, 23].

However, the traditional progressive approach to MSA has two major problems: the local minimum problem and the choice of alignment parameters. The local minimum problem stems from the "greedy" nature of the alignment strategy. The algorithm greedily adds sequences together, following the initial tree, and there is no guarantee that the global optimal solution, as defined by some overall measure of multiple alignment quality, will be found. More specifically, any misalignment made early in the process cannot be corrected later as new information from other sequences is added.
This problem is mainly the result of an incorrect branching order in the initial tree. The initial trees are derived from a matrix of distances between separately aligned pairs of sequences and are much less reliable than trees derived from completed multiple alignments. Thus, misalignments that occur in the early alignment steps are carried through and cause the local minimum problem.

The problem of choosing alignment parameters arises because, traditionally, one weight matrix and two gap penalties (one for opening a new gap and one for extending an existing gap) are used, and these work well only for closely related sequences. All residue weight matrices give most weight to identities, so when identities dominate an alignment, almost any weight matrix will find approximately the correct solution. For distantly related or divergent sequences, however, this approach does not work well and leads to more mismatches. In addition, for closely related sequences the range of gap penalty values that will find the correct solution can be very broad, but as more and more divergent sequences are used, only a very narrow range of values may deliver the best alignment.

Many MSA methods have therefore been designed to address these problems, and several are currently available on the EBI website. The important question is which MSA tool is the best candidate for a given analysis. No single tool is best; the crucial points to consider when choosing a program are biological accuracy, execution time and memory usage. The most accurate programs according to benchmark tests [16, 17] are MUSCLE and T-COFFEE. In practice, accuracy claims can be difficult to validate because parameters are frequently tuned to optimize performance on one or more benchmarks, and benchmark scores are typically based on averages over many alignments [27].
We therefore employed four independent tools available from the EBI services, ClustalW, Kalign, MUSCLE and T-Coffee, to perform MSA. However, these four tools work on the basis of the Hidden Markov Model (HMM), which assumes that the probability of amino acid A being substituted by amino acid B is independent of what amino acid A was itself derived from. This model has the limitation that amino acids cannot be considered equal units in evolutionary substitution [20, 28], because some of them, such as serine, leucine and arginine, are encoded by six codons, while others, such as methionine, are encoded by only one codon. In order to overcome this limitation of the HMM, we applied a more recent model, genetic semi-homology, implemented in Geisha 3. This model takes into account which codons encode each amino acid, and so it accounts for cryptic mutations, which change the composition of the gene without affecting the amino acid sequence. Hence, genetic semi-homology is more sensitive than HMMs when working with non-homologous sequences [28, 36]. In this study, five independent MSA tools, each of which addresses these issues in its own way, were therefore applied in order to improve the accuracy of the MSA [29].

3.1.5.1 ClustalW

ClustalW introduced several improvements to the progressive multiple alignment method that greatly improve sensitivity without sacrificing the speed and efficiency that make this approach so practical. In ClustalW, the problem of choosing alignment parameters is addressed by varying the gap penalties in a position- and residue-specific manner [19]. All pairs of sequences are first aligned separately by dynamic programming, using two gap penalties and a full amino acid weight matrix to score the alignments. A guide tree is then built from the resulting scores by neighbor joining.
In ClustalW, all of the remaining modifications apply to the final progressive alignment stage:
• Initially, gap penalties are calculated depending on the weight matrix, the similarity of the sequences and the length of the sequences.
• Sensible local gap-open penalties are derived at every position in each pre-aligned group of sequences, and these vary as new sequences are added.
• The final modification delays the addition of very divergent sequences until the end of the alignment process, when all of the more closely related sequences have been aligned.
Initial values can be set by the user. The software then automatically attempts to choose appropriate gap penalties for each sequence alignment, depending on several factors: the weight matrix, the similarity of the sequences, the length of the sequences, the differences in length of the sequences, and position-specific gap penalties [30].

3.1.5.2 MUSCLE (Multiple Sequence Comparison by Log-Expectation)

Stage 1: Draft progressive. All unaligned sequences are first compared by k-mer counting (k-tuple word matching) to create the k-mer distance matrix D1. The distance matrix D1 is clustered using UPGMA (Unweighted Pair Group Method with Arithmetic Mean) to produce tree 1. A profile is constructed from an input sequence at each leaf of tree 1, and a pair-wise alignment is constructed at each internal node to create a new profile. Nodes in the tree are visited in prefix order (children before their parents). The progressive alignment following tree 1 produces multiple sequence alignment 1.

Stage 2: Improved progressive. The percentage identity of each pair of sequences in multiple sequence alignment 1 is computed and used to produce the Kimura distance matrix D2. Tree 2 is then produced from D2 using UPGMA. This step is optimized by computing alignments only for sub-trees whose branching order changed relative to tree 1. Finally, the progressive alignment is carried out to produce multiple sequence alignment 2.
Stage 3: Refinement. An edge is deleted from tree 2 to produce two sub-trees, and a profile is computed for each sub-tree. The two sub-tree profiles are re-aligned to produce a new multiple sequence alignment. Finally, the sum-of-pairs score is used to assess the new alignment: if the score improves, the new alignment is kept as the final multiple sequence alignment [20].

3.1.5.3 KALIGN

KALIGN employs the Wu-Manber string-matching algorithm to improve both the accuracy and the speed of MSA. Instead of k-tuple (word) matching, which looks for runs of identical residues, the Wu-Manber algorithm is used to calculate a distance between two strings measured as the Levenshtein edit distance. For example, two strings A and B have an edit distance d if A can be transformed into B by applying d edits (mismatches, insertions or deletions); these distances provide the scores used to build the guide tree. Because Wu-Manber allows mismatches, shared patterns can still be found readily, which enables the algorithm to report meaningful distances between highly divergent sequences. For matching patterns, however, many spurious matches (false matches that appear genuine) are reported. The Wu-Manber algorithm is also applied in the progressive alignment. At each internal node of the guide tree, two profiles are aligned. Optionally, KALIGN uses Wu-Manber matches as anchor points during the alignment phase, which requires two extra steps in the dynamic programming. KALIGN employs a global dynamic programming method with affine gap penalties, which means that residues are assigned to one of three states (aligned, gap in sequence A, or gap in sequence B). A gap in one sequence is not allowed to be followed immediately by a gap in the other sequence. When the state matrices are filled, the final cells contain the maximum alignment score, and a trace-back procedure (which requires the matrices) is used to retrieve the actual alignment.
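The three-state dynamic programming with affine gap penalties described above can be sketched as follows. This is a generic, score-only illustration and not KALIGN's implementation: the function name and the scoring values (match, mismatch, gap_open, gap_extend) are assumptions chosen for the example, and a simple match/mismatch score stands in for a substitution matrix.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=2, mismatch=-1, gap_open=-4, gap_extend=-1):
    """Best global alignment score of a and b with affine gap penalties.

    Three state matrices are filled:
      M[i][j]: a[i-1] is aligned to b[j-1]
      X[i][j]: a[i-1] is aligned to a gap (gap in sequence b)
      Y[i][j]: b[j-1] is aligned to a gap (gap in sequence a)
    A gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    n, m = len(a), len(b)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # An aligned pair may follow any of the three states.
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # A gap is either opened after an aligned pair or extended;
            # X never follows Y (or Y follows X), so a gap in one sequence
            # cannot be followed immediately by a gap in the other.
            X[i][j] = max(M[i - 1][j] + gap_open, X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open, Y[i][j - 1] + gap_extend)
    # The final cells of the three matrices hold the candidate scores.
    return max(M[n][m], X[n][m], Y[n][m])


identical = affine_global_score("ACGT", "ACGT")
one_long_gap = affine_global_score("ACGT", "AT")
```

With these assumed values, the two-residue gap in the second example costs -5, whereas two separately opened gaps would cost -8, which is why the affine scheme favors one longer gap over several short ones.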
The two extra steps in the dynamic programming are:
• Consistency check: this step sieves through the thousands of matches found between two sequences to find the largest set of matches that can be included in an alignment.
• Updating of pattern match positions: this step adjusts the absolute positions of matches found within sequences to their relative positions within the profiles generated by the dynamic programming step.
KALIGN uses a substitution matrix (BLOSUM, PAM) and affine gap penalties in the dynamic programming. A common idea is that similar sequences should be aligned with hard matrices (PAM50, BLOSUM80), while more distantly related sequences align better using soft matrices (PAM250, BLOSUM40) [31].

3.1.5.4 T-COFFEE (Tree-based Consistency Objective Function for Alignment Evaluation)

This method has two main features. First, it provides a simple and flexible means of generating MSAs using heterogeneous data sources (combining local and global sequence alignments). Second, an optimization method is used to find the multiple alignment that best fits the pair-wise alignments in the input library.

First, the ClustalW primary library is created by global pair-wise alignment, and the Lalign primary library is created by local pair-wise alignment. The next step is to combine the local and global alignments by addition: if a pair is duplicated between the two libraries, it is merged into a single entry whose weight equals the sum of the two weights; otherwise, a new entry is created for the pair being considered. The second step is weighting, or signal addition: T-Coffee assigns a weight to each pair of aligned residues in the library. The third step produces the primary library, a list of weighted pair-wise constraints. Each constraint receives a weight equal to the percent identity of the pair-wise alignment it comes from. The fourth step is extension.
For each pair of aligned residues in the library, T-Coffee assigns a weight that reflects the degree to which those residues align consistently with residues from all the other sequences. The fifth step produces the extended library. The final weight for any pair of residues reflects some of the information contained in the whole family. It is obtained by taking each aligned residue pair from the library and checking its alignment against the remaining sequences. Thus, the weight of a pair of residues is the sum of all the weights gathered through the examination of all the triplets involving this pair. The final step is progressive alignment, in which the weights in the extended library replace BLOSUM/PAM scores when aligning the residues of two sequences. This pair of sequences is then fixed, and any gaps that have been introduced cannot be shifted later. The next closest two sequences are then aligned to the existing alignment of the first two sequences. Finally, the completed MSA is created [32].

3.1.5.5 GEISHA 3

ClustalW, MUSCLE, KALIGN and T-COFFEE rely on statistical matrices or the Hidden Markov Model (HMM). Under the HMM, the probability that amino acid A is substituted by amino acid B is independent of what amino acid A was itself derived from. However, amino acids cannot be considered equal units in evolutionary substitution. Some of them, such as serine, arginine and leucine, are encoded by six codons, while others, such as methionine and tryptophan, are encoded by a single codon. In addition, the Markov model does not take into account cryptic mutations, which change the composition of the gene without affecting the amino acid sequence, or single point mutations, a common mechanism of protein variability. For example, Met (AUG) can change to Arg (AGG) and then to Lys (AAG); if Arg originated from Met, one step is needed to change Arg into Lys. However, Leu (CUR, where R denotes a purine, A or G) can change to Arg (CGR), then to Gln (CAR) and finally to Lys (AAR); in this case, where Arg originated from Leu, two steps are needed to change Arg into Lys.
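The step counts in the example above can be checked with a short sketch that counts the nucleotide differences between codons. It is only an illustration of the idea and not the Geisha3 algorithm; the function names are hypothetical, and the lysine codons are those of the standard genetic code written in the RNA alphabet.

```python
PURINES = {"A", "G"}
LYS_CODONS = ("AAA", "AAG")  # standard genetic code, RNA alphabet


def codon_differences(codon_a, codon_b):
    """Number of positions at which two codons differ."""
    return sum(x != y for x, y in zip(codon_a, codon_b))


def substitution_kind(base_a, base_b):
    """Classify a single nucleotide change.

    A transition stays within purines or within pyrimidines;
    a transversion switches between the two classes.
    """
    same_class = (base_a in PURINES) == (base_b in PURINES)
    return "transition" if same_class else "transversion"


def min_steps(codon, target_codons):
    """Fewest point mutations needed to turn codon into any target codon."""
    return min(codon_differences(codon, t) for t in target_codons)


met_to_arg = codon_differences("AUG", "AGG")      # Met -> Arg
arg_agg_to_lys = min_steps("AGG", LYS_CODONS)     # Arg derived from Met
arg_cga_to_lys = min_steps("CGA", LYS_CODONS)     # Arg derived from Leu (CUA)
```

The same arginine residue is therefore one point mutation away from lysine when it is encoded by AGG but two when it is encoded by CGA, which is the codon dependence that a matrix defined only on amino acids cannot represent.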
Thus, ClustalW, KALIGN, T-COFFEE and MUSCLE do not take the genetic code into account. To overcome this limitation, genetic semi-homology as implemented in Geisha3 is used in our study to increase the accuracy of the alignment; it also considers the close relationship between amino acids and their codons in related proteins. For instance, Met (AUG) is converted into Arg (AGG) and then into Lys (AAG). Each of these steps is a single point mutation at the second codon position, from U to G and then from G to A. However, if the Arg in this example is encoded by CGU, then mutating to Lys (AAA or AAG) can no longer be achieved by a single transition or transversion. Thus, Geisha3 not only helps to increase the accuracy of the alignment in our study but also helps to reduce the mismatches introduced during the alignment process.

In Geisha3, two residues are termed semi-homologous if their codons differ by only one substitution. There are three types of semi-homology. The first type concerns amino acids whose codons differ by one nucleotide of the same type, such as pyrimidine (T or C) to pyrimidine or purine (A or G) to purine. The second type concerns amino acids whose codons differ by one nucleotide of a different type, such as pyrimidine to purine. The last type is not an alternative to the former two: it concerns residues whose codons differ at the third codon position, which is known to be the most tolerant in encoding amino acids [33, 34, 35].

3.1.6 Multiple sequence alignments of human SDR proteins and alignment verification

Seventy-five sequences of human SDR enzymes were collected from the UniProtKB database (http://www.uniprot.org). Sequences were initially aligned with ClustalW, T-Coffee, MUSCLE and Kalign using the template sequence Q14376 (UDP-glucose 4-epimerase).
In order to create the most robust alignment possible, the initial alignments produced by each method were compared against one another, and the most divergent sequences, with a very low degree of shared identity, were removed before subsequent analyses. The potential evolutionary relationships between corresponding non-identical positions in the four multiple alignments were verified separately using the genetic semi-homology algorithm implemented in version 3 of the program Geisha [33, 34, 35]. Geisha3 is freely accessible from the website http://atama.wnb.uz.zgora.pl/~jleluk/linki.html. Verifying multiple sequence alignments with Geisha helps to identify and reduce potential mismatches that may occur during the initial alignment process. ClustalW, T-Coffee, MUSCLE and Kalign are based on the Hidden Markov Model. Geisha improves alignment accuracy by completing the alignment while considering point mutations. Unlike the programs used for the initial alignments, Geisha assumes that the probability of one amino acid being replaced by another depends significantly on which amino acids occupied that position in the past. Only the sequences that displayed the highest level of identity (equal to or higher than 80% in this case) were kept in the resulting MSA; the others were removed. These sequences were then used to construct the consensus sequence, which shows the most conserved motifs of the human SDR family.

3.2 Consensus sequence construction and BLAST search

3.2.1 What is BLAST (Basic Local Alignment Search Tool)?

The most widely used software for efficiently comparing biological sequences to a database is BLAST [26]. The BLAST computation is organized as a three-stage pipeline:

Stage 1: Word matching, which detects substrings of fixed length w in the database stream that perfectly match a substring of the query.
Stage 2: Ungapped extension. Each matching w-mer is forwarded to the second stage, which extends the w-mer to either side to identify a longer pair of sequences around it that match with at most a small number of mismatched characters. These longer matches are high-scoring segment pairs (HSPs).

Stage 3: Gapped extension. Every HSP that has both enough matches and sufficiently few mismatches is passed to the gapped extension stage, which uses the Smith-Waterman dynamic programming algorithm to extend it into a gapped alignment, a pair of similar regions that may differ by arbitrary edits [37, 38].

In this study, we applied NCBI BLAST [39] to search for homologous sequences of the human SDR family.

3.2.2 Construction of the consensus sequence of the human SDR protein family and BLAST search

To summarize the verified human SDR multiple sequence alignments, a single consensus sequence for the entire human SDR super-family was established. The consensus sequence was obtained using the Consensus Sequence Constructor [33, 34, 35] with default parameter values. Highly conserved positions (>70% identity) are marked with bold black letters, positions of intermediate conservation (>30% identity) are indicated with black characters corresponding to the most commonly occurring residue, and positions marked X are variable positions that are not occupied by any particular residue in more than 30% of the sequences. This is an original application designed by our Polish collaborators and is freely available for non-commercial academic purposes from the website http://atama.wnb.uz.zgora.pl/~jleluk/linki.html.
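The thresholding rule described above can be illustrated with a minimal sketch. This is not the Consensus Sequence Constructor itself, whose internals are not described here: the function name is hypothetical, uppercase and lowercase letters stand in for bold and plain type, and the handling of gap characters is an assumption made for the example.

```python
from collections import Counter


def consensus(alignment, high=0.70, low=0.30):
    """Build a consensus string from equal-length aligned sequences.

    For each column the most common residue is reported in uppercase when it
    occurs in more than `high` of the sequences, in lowercase when it occurs
    in more than `low`, and as 'X' otherwise. Gap characters ('-') are not
    counted as residues (an assumption of this sketch).
    """
    n_sequences = len(alignment)
    result = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        if not counts:
            result.append("-")  # column contains only gaps
            continue
        residue, count = counts.most_common(1)[0]
        fraction = count / n_sequences
        if fraction > high:
            result.append(residue.upper())
        elif fraction > low:
            result.append(residue.lower())
        else:
            result.append("X")
    return "".join(result)


# Toy alignment, invented for illustration only.
toy_alignment = ["TGAY", "TGAK", "TGSR", "TATN"]
toy_consensus = consensus(toy_alignment)
```

In the toy alignment, the first two columns exceed the 70% threshold, the third column has a residue present in half of the sequences, and the fourth has no residue above 30%, so it is reported as X.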
The most robust consensus sequence was then used to identify two types of specificity for the members of the human SDR super-family: 1) the general specificity, which indicates common features of the entire enzyme super-family, and 2) the individual specificity, which distinguishes the unique structural properties of each group within the human SDR super-family. Put another way, the general specificity concerns the more conserved regions of the human SDR protein sequence, while the individual specificity highlights the more variable regions. By investigating both types of specificity, our results may be more useful for future work on developing inhibitors that can be directed at only one or a few enzymes without affecting the activity of others. Lastly, the consensus sequence was also used in a BLAST search for potential new members of the human SDR family. The new sequences (about 100 additional sequences) supplemented the original 75 SDR family members and were aligned in the same way as described above.

3.3 Phylogenetic tree construction and comparison of consensus sequences

3.3.1 Phylogenetic Tree Prediction

A phylogenetic tree shows the inferred evolutionary relationships among various biological species or other entities, based on similarities and differences in their physical and/or genetic characteristics. The taxa joined together in the tree are implied to have descended from a common ancestor [25]. Phylogenetic tree prediction, used for structuring sequences, constitutes an important area of sequence analysis. It can be helpful when analyzing changes that have occurred in the evolution of different organisms, or when studying the evolution of a family of sequences. Based on these analyses, the most closely related sequences can be identified because they occupy neighboring branches on the tree.
When a phylogenetic analysis of a family of related nucleic acid or protein sequences is performed, the evolutionary history of the family is examined and the sequences are shown in the form of an evolutionary tree. The original ancestor sequence forms the root node of the tree. The branching pattern of the tree shows the degree to which the sequences are related. The most closely related sequences are placed as neighboring leaves and are joined to a common branch beneath them.

Phylogenetic analysis is closely related to multiple alignment, which is often the basis from which the phylogenetic analysis proceeds. One reason for building a phylogenetic tree from a multiple alignment is that the tree makes the relationships between the sequences clearer. Another reason is that, as the genes for the proteins in the different organisms developed during evolution, amino acids were substituted, and a phylogenetic tree can be of use when these substitutions are to be analyzed [21]. An illustration of a small phylogenetic tree with a few substitutions is given in Figure 3.

Figure 3: A, B and C represent three highly conserved sequences of the same protein taken from three separate organisms. The phylogenetic tree gives a view of the substitutions that occurred during evolution as these sequences evolved from the same ancestor [21].

How, then, is the phylogenetic analysis performed? First, a multiple alignment is built using one of the methods described in the section on multiple alignment, for example the CLUSTALW program. Then the substitution model is chosen. The choice of model is based on how similar the sequences are. If they are highly similar, the PAM matrix is often useful, since it is designed to track the evolutionary origins of proteins, but if they are less similar, the BLOSUM matrix might be superior, because it is designed to find the conserved domains of proteins [21].
After the substitution model has been chosen, the next step is to build the tree. There are several tree-building methods to choose from, which can be divided into two main groups: distance-based and character-based methods [25]. These two groups are only briefly explained below, because phylogenetic prediction was not considered as a solution for the classification problem addressed in this project. The reason is that the SDR proteins are distantly related. A phylogenetic analysis of very different sequences is difficult to carry out, as there are several possible evolutionary paths that could have given rise to the observed sequences. This results in a very complex problem that requires considerable expertise to solve [21].

3.3.2 Distance-based Method

Distance-based methods use the number of changes (the distance) between two aligned sequences to derive trees [22]. The sequence pairs that have the fewest changes between them are the most closely related. They are placed as neighbors in the tree and are both connected to their common ancestor node by a branch [21]. Several methods are classed as distance-based, for example the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), neighbor joining, and Fitch-Margoliash [21, 28].

3.3.3 Character-based Method

Character-based methods derive trees that optimize the distribution of the actual data patterns for each character. Pair-wise distances are therefore not fixed, as they are in distance-based methods, but are determined by the tree topology. This allows the reliability of each base position in an alignment to be assessed on the basis of all other base positions [25]. Examples of character-based methods are Maximum Parsimony and Maximum Likelihood. The last method for sequence analysis is secondary structure prediction, which is described in the paragraph below.
Although both distance-based and character-based methods can be used to construct a phylogenetic tree, in our study we preferred to construct trees with character-based methods, specifically Maximum Likelihood (ML) and Maximum Parsimony (MP). This is because distance-based methods such as UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and Neighbor-Joining (NJ) have several drawbacks: they work well on closely related sequences but fail on highly divergent sequences [28]. In our study, the level of identity within the SDR family is only 15% to 30%, so MP and ML were used to overcome the drawbacks of UPGMA and NJ and to produce the most accurate result [28].

3.3.3.1 PHYLIP

PHYLIP, used here with Maximum Parsimony, applies Fitch's algorithm to find the minimum number of mutations required to change one nucleotide into another. In other words, MP works from the observed similarities and differences among the data and seeks the smallest number of evolutionary changes among the Operational Taxonomic Units. PHYLIP selects the tree that minimizes the number of evolutionary steps (transformations of one character state into another) required to explain a given set of data. For each site, each leaf is labeled with a set containing the nucleotide observed at that position. Each internal node $i$ with children $j$ and $k$, labeled $S_j$ and $S_k$, is labeled with

\[ S_i = \begin{cases} S_j \cup S_k & \text{if } S_j \cap S_k = \emptyset \\ S_j \cap S_k & \text{otherwise} \end{cases} \]

The total number of changes necessary for a site is the number of union operations. Weakness: the method implicitly assumes that the rate of change along the branches is similar.

3.3.3.2 SSSSg

SSSSg implements maximum likelihood. It evaluates the possible ways of changing one amino acid into another. In other words, ML considers all the possible trees containing the set of organisms under study and uses statistics to evaluate the trees.
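Returning to the parsimony set rule given in the PHYLIP subsection above, the count of union operations for a single site can be sketched as follows. This is a minimal illustration and not PHYLIP's code: the function name is hypothetical, and the tree is assumed to be given as nested two-element tuples whose leaves are the observed nucleotides.

```python
def fitch(node):
    """Return (state_set, changes) for one site on a rooted binary tree.

    A leaf is a one-character string holding the observed nucleotide.
    An internal node is a tuple (left_subtree, right_subtree).
    """
    if isinstance(node, str):
        return {node}, 0
    left, right = node
    left_set, left_changes = fitch(left)
    right_set, right_changes = fitch(right)
    common = left_set & right_set
    if common:
        # Non-empty intersection: keep it, no extra change is needed.
        return common, left_changes + right_changes
    # Empty intersection: take the union and count one change.
    return left_set | right_set, left_changes + right_changes + 1


# Toy tree for one site, invented for illustration: ((A, C), (A, G)).
root_set, n_changes = fitch((("A", "C"), ("A", "G")))
```

For the toy tree, each of the two lower nodes needs one union operation and the root needs none, so the site requires two changes and the root state set is {A}.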
For example, under maximum likelihood, given data $D$ and a model $M$, the aim is to find the tree $T$ for which $\Pr(D \mid M, T)$ is maximized. Two independence assumptions are made:
• Different sites evolve independently.
• Divergent sequences evolve independently after diverging.

3.3.4 Human SDR phylogenetic tree and comparison of consensus sequences

The results of our multiple sequence alignments were used as input data for constructing phylogenetic trees that outline the interrelationships of the members of the human SDR super-family. In this study, two independent approaches were used to construct the phylogenetic trees: PHYLIP (Felsenstein, 1989) and SSSSg (database: UniProt; matrix: BLOSUM45; number of matches: 10; E upper value: 5.0). PHYLIP is a free package of programs for inferring phylogenies, accessible at http://www.phylip.com. SSSSg is our original software and is freely accessible at http://atama.wnb.uz.zgora.pl/~jleluk/software/wlasne/ssssg/ssssg.zip. PHYLIP uses Fitch's maximum parsimony algorithm and constructs the phylogenetic tree that requires the least amount of evolutionary change to fit the input data. To supplement our parsimony analyses, we also applied the maximum likelihood algorithm to our data using the program SSSSg. Maximum likelihood, like maximum parsimony, is an optimality criterion for the reconstruction of phylogenies. Maximum likelihood methods differ from the non-parametric parsimony approach because they use an explicit model of character evolution for tree construction. Both methods recovered the same five high-level branching events within the human SDR family, and lower-level topological differences were negligible. We therefore arbitrarily chose the maximum likelihood tree for all subsequent analyses. Using the Consensus Sequence Constructor, we identified a single consensus protein sequence for each of the five human SDR subgroups.
The five resulting consensus sequences were compared in order to identify the conserved and variable sequence regions in human SDR enzymes. To further elucidate patterns of conservation and variation, a comparative analysis of the 3D protein structures corresponding to the five consensus sequences was also conducted. Using the Protein Data Bank, we identified a representative structure for each of the five groups recovered in the phylogenetic reconstructions. The selection criteria were the maximum sequence identity in the alignment with all members of each group and the highest degree of similarity at the tertiary structural level.

3.4 Mutational variability of human SDRs

Mutational variability (Talana, ConSurf)

Mutational variability analysis was carried out to highlight the conserved and variable regions in the sequences and structures of human SDRs. In our study, we applied two independent programs, Talana and ConSurf. The ConSurf server is used for estimating the evolutionary conservation of amino acids based on the phylogenetic relationships between homologous sequences.

3.4.1 ConSurf

The first step in ConSurf is to find sequence homologues (using BLAST) based on protein structures and amino acid sequences. The sequences are clustered, and highly similar sequences are removed using CD-HIT with a 95% cut-off. A multiple sequence alignment is then created, and a phylogenetic tree is built from it by neighbor joining. In the second step, position-specific conservation scores are calculated by maximum likelihood (depending on the user's choice). In the third step, the conservation scores are divided into a discrete scale of nine grades for visualization: grade 1 marks the most variable positions and is colored turquoise, grade 5 marks intermediately conserved positions and is colored white, and grade 9 marks the most conserved positions and is colored maroon. The conservation score at a site corresponds to the site's evolutionary rate.
It is a measure of evolutionary conservation at each sequence site of the target chain. The color grades are assigned as follows: the conservation scores below the average (negative values, indicative of slowly evolving, conserved sites) are divided into 4.5 equal intervals, and the same 4.5 intervals are used for the scores above the average (positive values, indicative of rapidly evolving, variable sites).

3.4.2 Talana

Similar to Consurf, Talana calculates the number of different amino acids that occupy each position in a provided MSA. Charts and scripts are used to visualize the variability on a PDB profile. Talana divides the conservation scores into 12 grades. Grades 1 and 2 are the most conserved and are shown in the darkest blue, grades 3 and 6 are intermediately conserved and are shown in light blue and white, and grades 7 to 9 are the most variable and are shown in pink and red.

3.4.3 Mutational variability of human SDR protein family

We used the five representative structures (Table 1), together with all protein sequences available in each group identified in our phylogenetic analyses, to study the mutational variability within the five subgroups of the human SDR family. Consurf (available at consurf.tau.ac.il) and Talana (available at http://www.bioware.republika.pl/) were used to identify conserved and variable residues within functional regions in the aligned homologous sequences. Both programs estimate the evolutionary conservation of amino acids from the phylogenetic relationships between homologous sequences and produce conservation scores that correspond to the rate of evolution at each site.
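The nine-grade colouring described in section 3.4.1 can be sketched as follows. This is only an illustration, not the Consurf source code: it assumes scores normalized to an average of zero, an interval width equal to the most negative score divided by 4.5, and a hypothetical function name.

```python
import math


def consurf_like_grades(scores):
    """Map normalized conservation scores to grades 1 (variable) .. 9 (conserved).

    Assumptions (not taken from the Consurf code): scores have an average of 0,
    negative values are conserved sites, and the interval width is |min score| / 4.5.
    """
    lowest = min(scores)
    if lowest >= 0:
        raise ValueError("expected normalized scores with at least one negative value")
    width = abs(lowest) / 4.5
    grades = []
    for score in scores:
        # Grade 5 is centred on the average; each further interval shifts the grade by one.
        grade = 5 - math.floor(score / width + 0.5)
        grades.append(max(1, min(9, grade)))      # clamp very variable sites to grade 1
    return grades


example_grades = consurf_like_grades([-1.8, -0.9, 0.0, 0.9, 2.0])
```

With these toy scores the most negative value receives grade 9 and any score far above the average is clamped to grade 1, which mirrors the asymmetric treatment of conserved and variable sites described above.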
The scores are divided into nine grades for the visualization of differing rates of evolution in Consurf: grade 1 is the most variable position (turquoise), grade 5 is the intermediately conserved position (white), and grade 9 is the most conserved position (maroon). In Talana, the conservation scores are divided into 12 grades: grade 1 is the most conserved position (darkest blue), grade 6 is the intermediately conserved position (white), and grade 12 is the most variable position (darkest red). After the conservation score has been calculated for each site, both programs automatically project the values onto the protein structures. Results from Consurf and Talana were visualized using Rastop2.2 (http://www.geneinfinity.org/rastop) and compared with each other to verify their consistency.

Table 1: PDB code and name of the five representative structures

Group   PDB code and name of representative structure
1       3edm chain A, Uncharacterized Oxidoreductase SSP0419
2       1hdc chain A, Retinol Dehydrogenase 7
3       1yb1 chain A, 17-beta hydroxysteroid dehydrogenase 13
4       3rd5 chain A, Retinol dehydrogenase 11
5       1q7b chain A, 3-Oxoacyl-[acyl-carrier-protein] reductase FabG

3.5 Analysis of correlated mutations

Correlated mutations are mutations that occur together and depend on each other. According to the current hypothesis of positive Darwinian selection at the molecular level, correlated mutations are related to changes occurring in their neighborhood. They reflect protein-protein interactions and preserve the biological activity and structural properties of the molecules [40]. In this project, we studied mutational correlation among human SDR members to gain a better understanding of protein-protein interaction within this protein family. This information may be useful for further studies on inhibitor design.
Corm and Talana are the two programs used for this task. Lastly, we set out to investigate the tendency of different amino acids along human SDR proteins to mutate together. Many residues within the same protein have evolved to form specific molecular complexes, and the specificity of these interactions is essential for function. To maintain functionality, it is reasonable to assume that sequence changes accumulated during evolution in one of the interacting residues must be compensated by changes in the other [34,35,36]. In this way, the network of necessary inter-residue contacts may constrain the divergence of the protein sequence to some extent. Correlated mutations in the representative protein structures and the corresponding consensus sequences of each subgroup of human SDRs were identified, localized and analysed with the aid of Talana and Corm (freely available for non-commercial academic purposes at http://atama.wnb.uz.zgora.pl/~jleluk/software/wlasne/corm.jar). Corm implements the program FEEDBACK, which analyzes aligned protein sequences for the occurrence of correlated mutations. For each residue occurring at each position, it returns all residues that occur at every other position of the aligned proteins. Talana produces a similar set of results, but also highlights the correlated mutations in the corresponding protein structures. The candidate correlated mutations recovered by both software packages were compared and then visualized on the SDR template structures of the five groups using DSVisualizer1.7 of Accelrys (http://accelrys.com/products/discoverystudio/visualization-download.php) and/or Rastop2.2 (http://www.geneinfinity.org/rastop). Visualizing the correlation results from Talana and Corm provided an additional way to investigate potential correlated mutations in the protein structure.
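The kind of output described above (for a residue at a reference position, the residues observed at all other positions) can be sketched in a few lines. This is not the Corm or Talana code; the function names, the 0-based positions, the toy alignment and the "disjoint residue sets" criterion for flagging a candidate correlated position are all assumptions made for illustration.

```python
def residues_given(alignment, ref_pos, ref_residue):
    """For sequences carrying ref_residue at ref_pos, collect the residues
    seen at every other column. Positions are 0-based (an assumption)."""
    selected = [seq for seq in alignment if seq[ref_pos] == ref_residue]
    table = {}
    for pos in range(len(alignment[0])):
        if pos == ref_pos:
            continue
        table[pos] = "".join(sorted({seq[pos] for seq in selected}))
    return len(selected), table


def candidate_correlated(alignment, ref_pos):
    """Flag columns whose residue sets do not overlap between the different
    residues found at ref_pos (a simple, hypothetical criterion)."""
    ref_residues = sorted({seq[ref_pos] for seq in alignment})
    tables = [residues_given(alignment, ref_pos, r)[1] for r in ref_residues]
    hits = []
    for pos in tables[0]:
        sets = [set(t[pos]) for t in tables]
        disjoint = all(a.isdisjoint(b) for i, a in enumerate(sets) for b in sets[i + 1:])
        if len(sets) > 1 and disjoint:
            hits.append(pos)
    return hits


# Toy alignment, invented for illustration only.
toy_msa = ["EKA", "EKG", "DRA", "DRG"]
count_e, table_e = residues_given(toy_msa, 0, "E")
```

In the toy alignment, column 1 changes together with column 0 (E goes with K, D goes with R), while column 2 varies independently, so only column 1 would be reported as a candidate.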
3.6 Availability of original software generated by authors

The original applications Geisha 3, Consensus Constructor, SSSSg, Talana and Corm are freely available at the addresses listed above. They are also available directly upon request to the authors. Additionally, the authors are willing to assist in the appropriate, effective running of all applications.

4 RESULTS AND DISCUSSION

4.1 Multiple sequence alignment, consensus sequence generation, and analysis of human SDR specificity

After multiple sequence alignment and verification, we identified four sequences (P49327, P14060, P56159, and P56937) that shared very low sequence identity with the rest of the human SDR family; these were removed from subsequent analyses. We constructed the consensus sequence from the remaining 71 sequences and used it to identify features of general and individual specificity. Our comparative analyses revealed little general specificity and much individual specificity among human SDR sequences (figure 4). Of the 306 positions in the consensus sequence, only 5 (1.6%, bold letters) are occupied by the same residue in more than 70% of sequences, whereas 105 positions (34.3%) are occupied by the same residue in at least 30% of sequences. The remaining 196 positions (64.1%, marked X) are not occupied by any particular residue in more than 30% of sequences.

Figure 4: Part of the multiple sequence alignment of the 71 human SDR members

Figure 5: Complete consensus sequence of the 71 human SDR members. Consensus sequence for the human SDR family, constructed using Consensus Sequence Constructor. The highly conserved positions (>70% identity) are marked with bold black letters (M, G, G, V and L). Intermediate conservation (>30% identity) is indicated with black characters corresponding to the most commonly occurring residue. The positions marked X are the variable positions, which are not occupied by any particular residue in more than 30% of sequences.
As a whole, this figure displays the highly variable character of the human SDR family.

4.2 Sequence specificity and interrelationships of the human SDR family

We recovered five distinct subgroups within the human SDR family (figures 6 and 7).

Figure 6: Phylogenetic tree constructed by PHYLIP

Figure 7: Phylogenetic tree constructed by SSSSg. Both programs show that the human SDR family can be phylogenetically grouped into five distinct classes.

In this study, we sought to elucidate the evolution of sequence and structure within the human SDR family. Our results illustrate that the human SDR family possesses a low level of sequence conservation overall (figure 5). This indicates that evolutionary differentiation has led to narrow specificity in individual members of the family, rather than to the preservation of a common specificity for the family as a whole. This conclusion is further supported by the results of our phylogenetic analysis (figures 6 and 7). Low overall sequence identity leads to the grouping of the human SDR family into five distinct clusters (a result similar to previous studies on the phylogenetics of the human SDR family [37,38]), with each group potentially further classified into two sub-groups (conservative and variable) based on the outcome of the mutational variability and correlated mutation analyses. The consensus sequence for each of these five groups is shown in figure 8, and the positions that form the binding sites (K and S) and the active site (Y) of the enzyme are marked with red letters.
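The thresholds used for the consensus sequences in section 4.1 (more than 70% identity, at least 30% identity, otherwise X) can be sketched as below. This is not the Consensus Sequence Constructor code; the function name, the handling of gaps and the toy alignment are assumptions, and the exact boundary conventions of the original tool are not known here.

```python
from collections import Counter


def consensus(alignment, high=0.70, low=0.30, gap="-"):
    """Return one (residue, class) pair per alignment column.

    class is "conserved" (> high), "intermediate" (>= low) or "variable" (X).
    Frequencies are taken over all sequences, gaps included in the denominator
    (an assumption of this sketch).
    """
    result = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != gap)
        if not counts:
            result.append(("X", "variable"))
            continue
        residue, n = counts.most_common(1)[0]
        freq = n / len(column)
        if freq > high:
            result.append((residue, "conserved"))
        elif freq >= low:
            result.append((residue, "intermediate"))
        else:
            result.append(("X", "variable"))
    return result


# Toy alignment, invented for illustration only.
toy_alignment = ["GAKT", "GAKS", "GVRN", "GLEQ"]
toy_consensus = consensus(toy_alignment)
toy_string = "".join(residue for residue, _ in toy_consensus)
```

Counting the "conserved", "intermediate" and "variable" classes over all columns gives the kind of summary reported above for the 306 consensus positions.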
Figure 8: Comparison of the five consensus human SDR sequences

Group   Active site (AS)   Binding sites (BS)   Three residues between AS and BS
1       Y-156              S-143, K-160         S-157, A-158, S-159
2       Y-176              S-164, K-180         C-177, I-178, S-179
3       Y-185              S-172, K-189         C-186, S-187, S-188
4       Y-199              S-174, K-203         C-200, H-201, S-202
5       Y-151              S-138, K-155         A-152, A-153, A-154

Figure 9: The active site (AS), substrate binding sites (BS), and the three residues between the AS and one of the BS in the five human SDR groups, identified by Talana. On the right, the location of these residues is shown in the template structure of SDR group 1. Note that the AS and BS are highly conserved, in contrast to the three residues adjacent to them.

Red characters mark the binding sites (K, S) and the active site (Y) of the enzymes. These positions were found to be conserved and outline the common sequence features within the human SDR family (figure 8). In contrast, the clusters of three amino acids marked in yellow (such as SAS, FGV, CSS, CHS and AAA) indicate variable residues directly adjacent to the conserved residues. These positions determine the narrow specificity within each group. A comparison of the consensus sequences of the five groups shows that the binding and active sites are highly conserved and are occupied by the same type of residue in all five groups. In contrast, the cluster of three amino acids located directly adjacent to the active site and next to one of the binding sites (figures 8 and 9) shows substantial variability.

4.3 Mutational variability of human SDRs

The two programs (Talana and Consurf) used to analyze the mutational variability of both sequence and structure of the protein templates in each of the five human SDR groups yielded similar results (figure 10).
The conserved and variable sequence and structure regions identified within the human SDR family are presented in figures 9 and 10. They differ not only among the five human SDR groups but also within each group.

Figure 10: The identification of functional regions within group 1 using Consurf and Talana (panels: Consurf, group 1; Talana, group 1). Group 1 shows the full range of the Consurf coloring scheme: the continuous conservation scores are partitioned into a discrete scale of 9 bins for visualization, such that bin 9 contains the most conserved positions and bin 1 the most variable positions. The most conserved regions are shown in the darkest maroon and the most variable regions in the lightest turquoise. Similarly, group 1 shows the full range of the Talana coloring scheme, from grade 1 (the most conserved regions, darkest blue) to grade 12 (the most variable regions, lightest rose). Both tools therefore gave similar results for the identification of functional regions in the protein structure.

Figure 11: The result of mutational variability (done by Talana) for groups 1 to 5

Across the different groups of human SDRs, the protein structure of group 1 contains a mixture of conserved and variable regions, with variable regions (all grades of the Talana colour scheme) dominating the structure. In contrast, group 3 is the most conserved (grade 1 of the Talana colour scheme dominates the protein structure). Group 4 displayed an intermediate level of conservation, whereas group 2 displayed an intermediate level of variability (figure 11).
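The per-position variability measure attributed to Talana in section 3.4.2 (the number of different amino acids occupying each alignment position) can be sketched as follows. This is not Talana's implementation; the function name, the gap handling and the toy alignment are assumptions for illustration, and the mapping of counts to the 12 colour grades is not reproduced.

```python
def column_variability(alignment, gap="-"):
    """Number of distinct residues in each column of an alignment.

    Gaps are ignored (an assumption of this sketch); a value of 1 means the
    column is invariant, larger values mean more variable positions.
    """
    return [len({res for res in column if res != gap}) for column in zip(*alignment)]


# Toy alignment, invented for illustration only.
toy_alignment = ["GAKT", "GAKS", "GVRN", "G-EQ"]
toy_variability = column_variability(toy_alignment)
invariant_positions = [i for i, v in enumerate(toy_variability) if v == 1]
```

Projecting such per-column counts onto the residues of a representative structure is what allows groups to be compared in terms of dominant conserved or variable regions.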
Figure 12: Variability profiles for each of the five groups of human SDRs. Total, core and surface variability profiles are displayed for each group (groups 1 to 5), based on the distribution of residues on the protein structure.

Group 3 displayed the most conservative level (grade 1 of the color scheme is dominant in the entire structure) compared to groups 4 and 5. Group 1 showed the most variable level (the full range of the color scheme, from grade 1 to grade 12, in the structure) and group 2 showed an intermediate level of variability. In addition, conserved and variable regions could also be detected within each group. With few exceptions, the conserved residues occurred within the active and substrate binding sites, whereas the variable residues (the cluster of three amino acids located directly adjacent to the active site and next to one of the binding sites) were found at random locations in the protein structure (figure 12). For example, in group 1, the active site (Y-156) and the binding sites (S-143 and K-160) lie in a conserved region of the protein structure, whereas the cluster of three amino acids (S-157, A-158, S-159) lies in a variable region next to the conserved region (figure 13). Similar patterns exist in each of the other four groups of human SDRs, but involve clusters of different amino acids.

Group   Active site (AS)   Binding sites (BS)   Three residues between AS and BS
1       Y-156              S-143, K-160         S-157, A-158, S-159
2       Y-176              S-164, K-180         C-177, I-178, S-179
3       Y-185              S-172, K-189         C-186, S-187, S-188
4       Y-199              S-174, K-203         C-200, H-201, S-202
5       Y-151              S-138, K-155         A-152, A-153, A-154

Figure 13: The location of the conserved and variable residues in the template structure of group 1 of human SDRs, identified by Talana.
For example, the conserved residues include the active site and binding sites (Y-156, S-143 and K-160), all of which are located in a conserved region (grade 1 in the Talana color scheme). In contrast, the cluster of three residues (S-157, A-158 and S-159) is clearly located in a more variable region.

Conserved residues are found at the active and binding sites, which are located on the protein structure next to the binding pocket, such as Y-156 (active site) and S-143 and K-160 (binding sites) in group 1 (figure 13). Furthermore, our mutational variability results confirm that the conserved residues are located in the conserved region of the protein structure (such as Y, S and K of group 1 in figure 13). These results complement prior studies on the identification of the conserved residues Y, S and K. According to several previous studies, the Y, S and K residues together form a catalytic triad that is found at the active sites of all human SDR proteins [7, 14]. Tyrosine (Y) functions as the catalytic base, Serine (S) stabilizes the substrate, and Lysine (K) interacts with the nicotinamide ribose and lowers the pKa of the Tyr-OH [14]. We interpret the presence and location of these conserved residues (Y, S, K) as an evolutionary constraint at the level of sequence and structure that leads to the retention of similar physico-chemical characteristics, thus maintaining a given function in the human SDR family. It is these conserved residues that display the global specificity defining the common characteristics of the entire human SDR family. Variable residues, occurring essentially only in the clusters of three amino acids, are located directly adjacent to the binding pocket (between the active site and one of the binding sites, figure 12). Just as with the conserved residues, we find that these variable residues occur near a conserved region of the protein structure.
Additionally, the three amino acids form a narrow cluster on the binding pocket, for example S-157, A-158 and S-159 in group 1 (figure 13). In particular, we see several instances of Serine-Alanine exchanges via single point mutations. Serine is encoded by six codons (UCU, UCC, UCA, UCG, AGU and AGC) and Alanine by four codons (GCU, GCC, GCA and GCG). Hence, the simplest way to change Serine into Alanine is a single nucleotide substitution, for instance Serine (UCU) to Alanine (GCU). Thus, single point mutations could be the mechanism underlying the marked variability of group 1, which is the least conserved group overall. In contrast, the three-residue cluster in group 3 is Cysteine-Serine-Serine, and unlike Serine and Alanine, Cysteine is encoded by only two codons (UGU and UGC). Although a single point mutation could also be the main mechanism of mutation in group 3, Cysteine can form disulfide bridges, which may increase overall protein stability [39, 40]. This may explain why group 3 of human SDRs shows the most sequence conservation compared to the others. The differences and similarities among the five groups of human SDRs are likely related to the functional conservation each group requires to maintain the common metabolic functions of the human SDR family as a whole. Divergent, adaptive specificity, on the other hand, has permitted each group to adopt its own specific targets. These similarities, and especially the differences, in structure and function are important to consider in the future design of specific inhibitors that target only a particular group within the human SDR family.

4.4 Correlated mutations within the human SDR family

Our analyses of mutational correlation within the human SDR family using Corm and Talana gave similar outcomes.
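The codon argument made in section 4.3 for Serine and Alanine can be checked with a short sketch that lists every pair of codons differing by exactly one nucleotide. The codon lists are those given in the text; the function name and the transition/transversion labelling are additions made for illustration.

```python
from itertools import product

SER_CODONS = ["UCU", "UCC", "UCA", "UCG", "AGU", "AGC"]
ALA_CODONS = ["GCU", "GCC", "GCA", "GCG"]
PURINES = {"A", "G"}  # the pyrimidines are C and U


def single_substitutions(codons_a, codons_b):
    """List (codon_a, codon_b, kind) for codon pairs one nucleotide apart.

    kind is "transition" (purine<->purine or pyrimidine<->pyrimidine)
    or "transversion" (purine<->pyrimidine).
    """
    pairs = []
    for a, b in product(codons_a, codons_b):
        diffs = [(x, y) for x, y in zip(a, b) if x != y]
        if len(diffs) == 1:
            x, y = diffs[0]
            kind = "transition" if (x in PURINES) == (y in PURINES) else "transversion"
            pairs.append((a, b, kind))
    return pairs


ser_to_ala = single_substitutions(SER_CODONS, ALA_CODONS)
```

Only the four UCN codons of Serine are one step away from an Alanine codon, and each of these steps replaces U with G in the first codon position, which is a transversion in the usual nucleotide classification.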
Based on the distribution of mutations mapped onto the protein structure, the correlated mutations fall into two groups. The first, the core group, includes all mutations that show core molecular contact (table 2), most of which are located in conserved regions of the protein structures (the core variability in figure 12). The second, the surface group, includes all mutations that appear on the surface of the protein structure (table 3), most of which are located in variable regions (the surface variability in figures 11 and 12). Tables 4 and 5 outline the observed sets of correlated mutations for group 5 of human SDRs.

Table 2: The core residues in five human SDR groups identified by Talana. The residues in each group are located at the core of the protein structure. Valine and Isoleucine occur more frequently than other amino acids, suggesting that these hydrophobic amino acids play a more vital role in stabilizing the structure of the proteins.

Group 1: V11, I85, I87, M95, T119, H137, I141, V163, V181, T182, S183, I184, G187, A214
Group 2: I115, V117, G120, V133, V141, M166, G208, V213, L219, S245, M247
Group 3: V73, V89, V104, V117, A123, G147, I151, V173, C174, I211
Group 4: I42, V43, L67, G75, M129, K160, A195, T224, V227, T231, S234, Q248, V252
Group 5: V7, A57, V69, G79, V80, V158, V174, T178, A220

Table 3: The surface residues in five human SDR groups identified by Talana. These residues are located on the surface of the protein structures and are distant from each other.
Group 1: E3, Q5, V8, A20, S21, I22, T25, Q29, D39, S41, R42, E45, V46, K48, I50, Q51, N53, Q55, V57, E59, S61, I62, D64, H67, E69, T72, E73, E80, Q84, I87, M95, S98, A99, I100, E102, E109, A110, M111, D113, I116, K117, G118, T119, Y121, S129, N132, H137, I144, E148, V149, T150, L155, S157, A161, V163, I166, Q168, E171, R180, V181, T182, S183, G187, M188, S194, G195, T197, W199, K204, L205, K208, I210, E212, A213, A214, I215, Y216, Q219, Q220, H223, V224, N225, E228, T230, V231, R232, P233
Group 2: K64, R70, S71, D75, E78, I81, V91, E99, R100, N103, I115, V117, M119, N122, R126, F130, A131, S132, L134, D135, L139, N147, R153, M166, T195, Y196, G208, V213, T214, M216, S220, D221, L223, A230, V234, I237, K241, F242, D244, S245, M247, A249, E251, N255, C257, G259, D266, C275, H276, S282, W285
Group 3: T35, Q59, R62, V86, V89, N102, D105, Q106, R109, E115, A123, P126, L130, S131, K133, E135, E136, T138, I145, L155, S158, R161, R162, G177, I179, Y181, I183, P184, A201, D204, K208, V219, T226, R232, P235, L237, R244, S245, I247, N248, N253, Y262, N264, I268, K271
Group 4: Q35, L36, V43, E53, K56, L67, V72, D73, G75, L77, R80, Q83, A84, V85, G87, Q90, F92, K95, A99, D100, T101, K109, D110, H117, M129, S133, A136, H142, H155, K160, E163, L175, H178, L179, R181, I182, H183, H185, E190, F192, A195, L197, H201, K211, K218, S220, T224, Y225, V227, S234, S241, I242, M243, W245, W247, Q248, F251, V252, Q258, Y266, C267, L269
Group 5: E4, L24, R28, K31, E39, Q43, S46, D47, Y48, G50, A57, T61, N62, P63, K71, A72, T74, G79, M96, S104, I106, E108, M126, K128, Q130, A149, V174, V179, K190, A191, N193, D194, E195, A202, A206, D211, P212, R213, E226, I244

Table 4: The identification of correlated mutation sets and their core and surface characteristics for group 5
Group5 Positions 70 79 105 157 173 194 Mutal-Distribution of Correlated Mutational (CM) Core Surface Asn-193, Gln-130, Thr-74 Ala-72, Lys-71, Ala-57 Leu-24 Val-80 Val- Ala-202, Gln-130, Glu-108 69 Thr-74, Ala-72, Pro-63
Asp-47, Lys-31 Thr-178 Ala- Glu-226, Arg-213, Asp220 211 Ala-206, Asp-194, Gln-130 Lys-128, Met-126, Ile-106 Met-96,Thr-74, Thr-61 Gly-50, Tyr-48, Ser-46 Glu-39, Arg-28, Glu-4 Val-158 Asn-193, Lys-190, Gln-130 Thr-174, Ala-72 Val-174 Ala-57 Pro-212, Gly-79, Thr-74 Ala-72, Leu-24 Thr-74, Gly-79, Ser-104 29 211 Val-174 57 Ala- Gln-130, Glu-195, Ile-244 Pro-212, Gly-79, Thr-74 Ala-72, Leu-24 Table 5: Selected correlated mutations in human SDRs identified by Talana Correlated mutation in group 5 was analyzed by the Talana program, indicating that if a mutation happened at one specific location, it led to mutation in other positions. For example, if mutation occurred at position 6 (I), the other mutations occurred at the same time at positions 61 (D), 73 (EKNR), 78 (ADEP) and 129 (ACFGH). AA at Sequen ce Counts Refer Refere ence nce Positi Positio on n 70 E 5 79 105 148 157 173 194 211 Correlated Mutations and amino acids 23: EKT 56: EMV 71: EHKN Q 71: -AT 62: EKV 62: FP 38: E 73: KQRY 129: 192: CFHW PST 73: ELNT 68: ADLT 68: V 45: S 129: AQQS 71: EQT 71: -AKN 47: Y 192: NR 73: KLNY 73: ERT 49:G 107: DKNQ 107: AE 60: T 129: AFSW 129: CGHQ 73: KNRT 201: EQS 201: AD 95: M 125: M K 4 23: AL I 4 V 4 I 4 30: FTV 30: IK 3: -E 56: A 46: AE 46: DG 27: KR V 4 3: QT 27: 38: 45: ADLT AFGS ALV 47: EQT 49: -EIT 60: LQS 73: ELQY ALV 125: AILV I 4 V 4 225: DE 225: GILP 4 205: A 205: LM 201: AQ 201: DGS 219: AV 219: GLRS T L 4 V 5 I 4 V 5 D 4 E 4 A 4 201: AS 201: DEGQ 129: CHQS 129: AFGW 192: PS 192: NRT 78: DEQ 78: AGPR 243: QV 243: FIS 78: 212: KQR 212: DE 4 177: TV 177: A 73: LRT 73: EKNQ 73: QRY 73: EKLNT 56: MV 56: AE 78: AEP 78: DGR 56: 210: DEG 210: QRS A 129: ACFHQ 129: GSW 42: DKQ 42: AE 71: EHKN 71: -AQT 23: ET 23: AKL 73: ENR 73: KLTY 23: 193: 197: D AT 193: 197: AE -EQ 78: 107: EGR DE 78: 107: ADPQ ANQ 129: 189: CHW AR 129: 189: AFGQS EKM 71: 73: HKNQ KQR 71: 73: -AET ELNTY 103: 129: DN ACGH 103: 129: EFQS FQSW 71: 73: 30 211: A 211: P 173: 127: K 127: -AQ P 5 ET 23: AKL MV 
56: AE HKNQ KQR 71: 73: -AET ELNTY DEQ I 78: 173: AGPR V

Our analysis of mutational correlation at each position along the SDR protein sequences shows that particular fragments are highly variable. In particular, for surface variability, these positions are seldom in direct contact with each other but maintain contact with conserved positions. For example, according to the correlated mutation results for group 5 (table 5), a mutation at position 23 is accompanied by mutations at position 70 and other positions. There is no obvious relationship between the positions of correlated mutations and their contact with each other (surface variability in group 5, figures 11-13), because such correlated mutations generally lie at positions that are very distant from each other. According to the currently assumed model, positive mutations (those that improve fitness) do not occur independently; instead, the occurrence of one mutation depends on other locally occurring mutations. In this way, correlated mutations reflect protein-protein interactions and the necessity to preserve the biological activity and structural properties of the molecules [41]. The correlated mutations revealed in our study therefore provide useful information for further study of complex protein-protein interactions. Formerly, it was hypothesized that protein-protein interactions only happen between proteins in close proximity. However, our findings suggest that such interactions may also occur when proteins are positioned at some distance from one another. Thus, we conclude that the distantly occurring correlated mutations are due to interacting protein “forces” that optimize these interactions. These distant interactions may act as an adaptive mechanism within the human SDR family, allowing its members to change in response to fluctuating external conditions and functional demands through evolution.
We also found that in each human SDR group there are core residues that form a narrow correlation cluster on the protein structure, most of them in a conserved region (core variability, figures 11-13, table 2). There is evidence elsewhere to suggest that such core residues tend to mutate together to maintain proper functioning [15], and our results provide additional support for the claim that these centrally located residues tend to mutate together to preserve the biological function of the SDR proteins. Moreover, the differences in core variability may explain why members of the human SDR family share only a low level of sequence similarity (15-30%) while their protein structures remain similar. In contrast to core residues, surface residues are scattered randomly over the protein structure and are not in direct contact with each other (surface variability, figures 11-13, table 3). Interestingly, our results indicate that the surface residues of human SDR proteins do seem to interact with one another despite the distance between them (tables 2 and 3). The molecular mechanisms by which these distantly located correlated mutations occur have yet to be fully investigated and understood. Here, we suggest a few potential explanations for why these distant residues might be interacting. One possibility is that we have not yet uncovered the intermediate sequences containing linking residues that indirectly join the distant residues. A second possibility is that the mechanism of variability at these sites differs from a single point mutation. Although it has long been accepted that single point mutations are major contributors to the acquisition of beneficial mutations through evolution, the correlation of surface mutations does not seem to be adequately explained by single point mutations alone.
Using the data presented here as a springboard, further investigation of correlated mutations in distantly located proteins may help researchers to gain insight into the causes, prevention, and treatment of diseases that are caused by genetic or protein structure mutations.

5 CONCLUSION

In conclusion, this study of the human SDR family provides several results that are important for further studies on molecular docking, molecular dynamics simulation and inhibitor design. We identified two critical features of the human SDR family. The general characteristic comprises the conserved residues Serine, Lysine and Tyrosine, which play an important role in stabilizing protein function and thereby maintain the common features of the human SDR family. The specific characteristic, in contrast, comprises the clusters of three amino acids located next to the active and binding sites. These residues changed slightly during evolution through single point mutations and are therefore responsible for the adaptation of the protein molecules to changes in the surrounding environment.

6 REFERENCES

1) N.M. Luscombe, D. Greenbaum, M. Gerstein. “What is bioinformatics? An introduction and overview”. Department of Molecular Biophysics and Biochemistry, Yale University New Haven, USA. Yearbook of Medical Informatics 2001. 2) Reichhardt T. “It is sink or swim as a tidal wave of data approaches”. Nature 1999, 399(6736):517-20. 3) Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL. GenBank. Nucleic Acids Res 2000;28 (1):15-8. 4) Bairoch A, Apweiler R. “The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000”. Nucleic Acids Res 2000;28 (1):45-8. 5) Bernstein FC, Koetzle TF, Williams GJ, Meyer EF, Brice MD, Rodgers JR. “The Protein Data Bank. A computer-based archival file for macromolecules structures”. Eur J Biochem 1977;80(2):319-24. 6) Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H. “The Protein Data Bank”.
Nucleic Acids Res 2000;28(1):235-42. 7) Pearson WR, Lipman DJ. “Improved tools for biological sequence comparison”. Proc Natl Acad Sci USA 1988;85(8):2444-2448. 8) Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W. “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”. Nucleic Acids Res. 1997;25(17):3389-3402. 9) Benach J. “X-ray structure analysis of short chain dehydrogenases/reductases”. Karolinska Institute, Stockholm, 1999. 10) Persson B., Krook M., Atrian S., Gonzales-Duarte R., Jeffery J., Jornvall H., Ghosh D. “Short chain dehydrogenases/reductases”. Biochemistry, 34(18): 6003-6013, 1995. 11) Oppermann U., Jornvall H., Kallberg Y.., Persson B. “Short chain dehydrogenases/reductases relationships: a large family with eight 32 clusters common in human, animal and Plant genomes”. Protein Science, 2002. 12) Yvonne Kallberg, Udo Oppermann, Hans Jornvall, Bengt Persson. ”SDRs-coenzyme based functional assigments in completed genomes. Eur. J. Biochem. 269, 4409-4417 (2002). 13) Yvonne Kallberg, Udo Oppermann, Hans Jornvall, Bengt Persson. ”SDR relationship: A large family with 8 clusters common to human, animal and plant genomes”. Protein Science (2002), 22: 636-641 14) Keller B, Volkmann A, Wilckens T, Moeller G, Adamski J. “Bioinformatic identification and characterization pf new members of SDR super-family. Molecular and Cellular Endocrinology 248(2006), 56-60. 15) Filling C, Berndt K.D, Benach J, Knapp S, Prozorovski T, Nordling, Ladenstein R, Jornvall H, Oppermann U. “Critical residues of structures and catalysis in short-chain dehydogenases/reductases. J. Bio.CHem. 277(2002) 25677-25684). 16) Jame E Bray, Brian D Maroden, Udo Oppermann. “The Human SDR superfamily: A Bioinformatic Summary”. Chemico-Biological Interaction 178 (2009) 99-109. 17) Persson B, Kallberg Y. “Classification and nomenclature of the superfamily of short-chain dehydrogenases/reductases (SDRs)”.Chem Biol Interact. 2012 Nov 29. 
18) Xiaoqiu Wu, Petra Lukacik, Kathryn L. Kavanagh and Udo Oppermann. “Review: SDR-type human hydroxysteroid dehydrogenases involved in steroid hormone activation”. Mol. Cell. Endocrinology 265-266 (2007) 71-76.
19) Brigitte Keller, Marc Meier, Jerzy Adamski. “Comparison of predicted and experimental subcellular localization of two putative rat steroid dehydrogenases from SDR protein super family”. Molecular and Cellular Endocrinology 30 (2009) 43-46.
20) S.R. Eddy. “Profile hidden Markov models”. Bioinformatics, 14(9):755–763, 1998.
21) D.W. Mount. “Bioinformatics: Sequence and Genome Analysis”. Cold Spring Harbor Laboratory Press, 2001.
22) P. Baldi and S. Brunak. “Bioinformatics - The Machine Learning Approach”. Massachusetts Institute of Technology, 2nd edition, 2001.
23) R. Durbin et al. “Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids”. Cambridge University Press, 1998.
24) D. Higgins and W. Taylor. “Bioinformatics: sequence, structure and databanks”. Oxford University Press, 2000.
25) Aloysius Phillips, Daniel Janies, Ward Wheeler. “Review: Multiple sequence alignment in phylogenetic analysis”. Molecular Phylogenetics and Evolution Vol.16, No.3, 2000, 317-330.
26) A.D. Baxevanis and B.F.F. Ouellette. “Bioinformatics: A Practical Guide To The Analysis of Genes and Proteins”. John Wiley & Sons Inc., 2nd edition, 2001.
27) Robert C Edgar, Serafim Batzoglou. “Multiple sequence alignment”. Current Opinion in Structural Biology 16, 2006, 368-373.
28) Zhumur Ghosh, Bibekanand Mallick. “Bioinformatics: Principles and Applications”. Oxford University Press, 2008.
29) D. Gusfield. “Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology”. Cambridge University Press, 1997.
30) Julie D. Thompson, Desmond G. Higgins and Toby J. Gibson. “CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice”.
Nucleic Acids Research Vol.22, No.22 (1994): 4673-4680.
31) Robert C. Edgar. “MUSCLE: multiple sequence alignment with high accuracy and high throughput”. Nucleic Acids Research Vol.32, No.5 (2004): 1792-1797.
32) Timo Lassmann and Erik L.L. Sonnhammer. “KALIGN: an accurate and fast multiple sequence alignment algorithm”. BMC Bioinformatics (2005).
33) Cedric Notredame, Desmond G. Higgins and Jaap Heringa. “T-Coffee: A novel method for fast and accurate multiple sequence alignment”. J. Mol. Biol. 302 (2000): 205-213.
34) Jacek Leluk. “A non-statistical approach to protein mutational variability”. BioSystems 56 (2000): 83-93.
35) Jacek Leluk. “Regularities on mutational variability in selected protein families and the Markovian model of amino acids replacements”. Computers and Chemistry 24 (2000): 659-672.
36) Jacek Leluk, Beata Hanus-Lorenz and Aleksander F. Sikorski. “Application of genetic semihomology algorithm to theoretical studies on various protein families”. Acta Biochimica Polonica Vol.48, No.1 (2001).
37) Julie D. Thompson, Frédéric Plewniak, Olivier Poch. “BAliBASE: a benchmark alignment database for the evaluation of alignment programs”. Bioinformatics application notes. Vol.15, No.1, 1999, 87-88.
38) Joseph Lancaster, Jeremy Buhler, Roger D. Chamberlain. “Acceleration of ungapped extension in Mercury BLAST”. Microprocessors and Microsystems 33 (2009), 281-289.
39) Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. “Basic local alignment search tool”. J Mol Biol 1990, 215: 403-410.
40) Hall, B.G. “Spontaneous Point Mutations That Occur More Often When Advantageous Than When Neutral”. Genetics 126 (1990): 5-16. Web. 31 March 2011.
41) Nishikawa S, Adivinata J, Morioka H, Fujimura T, Tanaka T, Uesugi S, Hakoshima T, Tomika K, Nakagawa S, Ikehara M. “A thermoresistant mutant of Ribonuclease T1 having three disulfide bonds”. Protein Eng 1990, 3: 443-448.
42) Jacek Leluk, Monica Sobczyk, Lukasz Becella. “Correlated mutations in selected protein families”.
TASK Quarterly 6, No.2 (2002), 469-482.

7 SUPPLEMENTS

Figure 1: Consurf color scale used to code the multiple sequence alignment. The Consurf color is assigned according to the level of identity among the sequences. The most conserved residues, such as M in Figure 1, receive the highest score (9) on the scale, and the score decreases as the level of identity decreases. Talana works similarly, but uses a different color scale.

Figure 2: Talana color scale used to code the multiple sequence alignment. The most conserved residues (the highest alignment score) are shown in the darkest blue, whereas the most variable residues (the lowest alignment score) are shown in the darkest red.

Table 1: Correlated mutations among core and surface residues in group 1, identified by analyzing the representative protein sequence and structure with the aid of Talana. Core and surface residues were classified according to their distribution in the protein 3D structure: core residues are located inside the protein structure, whereas surface residues are located on its surface.
Group 1. Positions (of residues in protein sequence). Core and Surface Residues: Core, Surface.
6 9 20 Met-95 Val-11 Ile-141 39 49 83 116 His-137 119 Thr- Ile-85 Thr-182 36 Ile-166 121 Val-57 50 Val-8 Gly-195 157 Val-149 148 Met-111 72 Arg-42 Thr-197 194 Ser-129 53 Thr-25 22 Thr-230 197 Gly-195 144 Ser-129 98 Glu-80 64 Glu-59 41 Pro-233 230 Glu-228 220 Gln-219 148 His-137 119 Met-95 73 Gln-55 Gln-51 Gln-5 Glu-89 Pro-233 225 His-223 Ile-215 214 Lys-208 205 Lys-204 194 TyrIleSerGluThrSerAsnIleThrIleSerAspSerThrGlnGluThrGluAsn-53 Lys-48 AsnTyr-216 AlaLeuSer- Thr-182 180 Gln-168 155 Ile-144 118 Lys-117 113 Ala-110 Met-95 His-67 Val-46 39 Ser-21 20 130 179 182 185 Ala-214 184 Ile-87 95 IleMet- Gly-187 183 Val-181 163 SerVal- Ile-184 87 Gly-187 Ile- 37 ArgLeuGlyAspAla-99 Glu-69 Ser-61 AspAla- Thr-150 132 Ser-129 109 Met-95 87 Asp-39 3 Asn- Gln-219 212 Gly-195 194 Gly-187 183 Val-181 171 Gln-168 163 Ala-161 117 Met-95 Gln-55 Val-46 Thr-25 Glu- Glu-148 132 Pro-233 219 Glu-212 Gly-187 168 Lys-117 113 Glu-73 GluIleGlu- SerSerGluValLysAsp-64 Asn-53 Gln-29 Gln-5 AsnGlnGly-195 GlnAspGln-5 186 208 230 His-137 184 Ile- His-137 Pro-233 231 Val-224 223 Gln-219 216 Ile-215 213 Tyr-199 188 Asp-113 Ile-100 Gln-84 Ser-61 45 Ser-21 Ile-210 Pro-233 232 Thr-230 219 Gln-168 150 Glu-148 Gln-5 ValHisTyrAlaMetIle-116 Glu-102 Ile-87 Glu-73 GluGln-5 Ile-62 ArgGlnThrGlu-73

Table 2: Correlated mutations among core and surface residues in group 2, identified by analyzing the representative protein sequence and structure with the aid of Talana. Core and surface residues were classified according to their distribution in the protein 3D structure: core residues are located inside the protein structure, whereas surface residues are located on its surface.
Group 2 Positions (of residues in protein sequence) 53 Core and Surface Residues Core Met-166 208 Gly- 38 Surface Trp-285 282 His-276 275 Asp-266 259 Cys-257 251 Ala-249 245 Asp-244 241 Val-234 230 SerCysGlyGluSerLysAla- Met-216 214 Val-213 195 Leu-139 135 Leu-134 132 Arg-126 103 Glu-99 91 Ile-81 75 Ser-71 64 63 87 92 Ser-245 213 Val- Val-141 Trp-285 282 His-276 259 Cys-257 255 Glu-251 249 Ser-245 230 Leu-223 214 Val-213 208 Thr-195 139 Asp-135 134 Ser-132 103 Arg-100 99 Val-91 81 Glu-78 70 Glu-251 237 Ser-220 153 Leu-134 115 Val-133 120 Gly- 39 Ser-220 ThrThrAspSerAsnValAspLys- SerGlyAsnAlaAlaThrGlyLeuLeuAsnGluIleArg- IleArgIle- 102 106 Ser-245 Val-213 117 Leu-219 Val- Ile-115 213 Gly-208 141 ValVal- 119 Cys-275 259 Met-247 221 Met-216 196 Leu-139 135 Ser-132 131 Phe-130 122 Met-119 117 Trp-285 282 His-276 275 Gly-259 257 Glu-251 249 Ser-245 244 Lys-241 237 Val-234 230 Ser-220 214 Thr-195 166 Arg-153 139 Asp-135 134 Ser-132 103 Glu-99 91 Ile-81 Asp-75 Ser-71 70 Lys-64 GLy-259 216 Thr-195 147 40 GlyAspTryAspAlaAsnVal- SerCysCysAlaAspIleAlaThrMetLeuLeuAsnVal- Arg- MetAsn- 125 214 Ile-115 141 Val- Met-247 Glu-251 237 Ser-220 153 Leu-134 Ala-249 IleArgPhe-242

Table 3: Correlated mutations among core and surface residues in group 3, identified by analyzing the representative protein sequence and structure with the aid of Talana. Core and surface residues were classified according to their distribution in the protein 3D structure: core residues are located inside the protein structure, whereas surface residues are located on its surface.
Group 3 Positions (of residues in protein sequence) 6 Core and Surface residues Core 75 Val-104 77 Val-89 Val-117 147 Cys-174 151 Ile-211 123 Val-173 73 Ile-211 GlyIleAlaVal- 41 Surface Gly-177 Thr-35 Asn-253 Asp-105 Lys-271 Ile-265 264 Tyr-262 248 Ile-247 245 Arg-244 237 Pro-235 232 Thr-226 219 Lys-208 204 Ala-201 184 Ile-183 181 Ile-179 177 Arg-162 161 Ser-158 155 Ile-145 138 Glu-135 136 Lys-133 131 Arg-109 Lys-208 Ile-268 AsnAsnSerLeuArgValAspProTryGlyArgLeuThrGluSer- Leu-130 126 Ala-123 115 Gln-106 105 Asn-102 Val-86 62 ProGluArg-109 AspVal-89 ArgGln-59

Table 4: Correlated mutations among core and surface residues in group 4, identified by analyzing the representative protein sequence and structure with the aid of Talana. Core and surface residues were classified according to their distribution in the protein 3D structure: core residues are located inside the protein structure, whereas surface residues are located on its surface.

Group 4 Positions (of residues in protein sequence) 2 9 Core and Surface residues Core Surface Thr-224 Gln-248 67 Val-43 160 49 Ala-195 231 65 Gln-248 LeuLys- ThrVal- 42 Thr-156 241 Glu-190 178 Ala-136 117 Asp-110 36 Ser- Gln-258 Ile-242 Ser-220 211 Phe-192 182 Arg-181 160 Gly-87 85 Leu-77 73 Val-252 Lys-218 192 Leu-179 Try-266 HisHisLeu- LysHisLysValAspLys-56 PheGln-83 Asp- 108 186 193 252 Ile-42 Met-129 234 Gly-75 Ser- Leu-67 Ala-195 248 Val-43 Gln- Gln-248 Ser-234 Val-227 43 110 Ala-99 Leu-269 267 Val-252 248 Trp-247 241 Val-227 225 Thr-224 201 Phe-192 185 His-183 181 Leu-179 163 His-155 136 Ser-133 110 Thr-101 95 Phe-92 90 Gly-87 77 Gly-75 72 Lys-56 53 Gln-35 Gln-258 Trp-245 243 Ile-242 220 Lys-211 197 Ala-195 192 His-183 181 Leu-175 160 Lys-109 Val-85 Asp-73 Lys-56 Leu-269 267 Val-252 Phe-92 CysGlnSerTryHisHisArgGluAlaApsLysGlnLeuValGluVal-252 MetSerLeuPheArgLysGly-87 Leu-77 Leu-67 Val-43 CysGln- 208 232 Gln-208 67 Val-43 160 LeuLys- Gln-248 42 Ile- 44 248 Trp-247 241 Ser-234 227 Try-225 224 His-201 192 His-185
183 Arg-181 179 Glu-163 155 His-142 136 Ser-133 129 His-117 110 Thr-101 95 Phe-92 90 Gly-87 77 Gly-75 72 Lys-56 53 Gln-35 Gln-258 252 Gln-248 242 Ser-220 211 Phe-192 183 Lys-160 87 Val-85 Arg-80 77 Asp-73 Try-266 252 Phe-251 211 Asp-110 99 SerValThrPheHisLeuHisAlaMetAspLysGnlLeuValGlu- ValIleLysHisGlyAla-84 LeuLys-56 ValLysAla- Phe-92 85 Val-

Table 5: Correlated mutations among core and surface residues in group 5, identified by analyzing the representative protein sequence and structure with the aid of Talana. Core and surface residues were classified according to their distribution in the protein 3D structure: core residues are located inside the protein structure, whereas surface residues are located on its surface.

Group 5. Positions (of residues in protein sequence). Core and Surface residues: Core, Surface.
6 Val-7 Gln-130 Gly179 Thr-74 Asn62 70 Asn-193 Gln130 Thr-74 Ala72 Lys-71 Ala57 Leu-24 79 Val-80 Val- Ala-202 Gln69 130 Glu-108 Thr74 Ala-72 Pro-63 Asp-47 Lys31 105 Thr-178 AlaGlu-226 Arg220 213 Asp-211 Ala206 Asp-194 Gln130 Lys-128 Met126 Ile-106 Met96 Thr-74 Thr61 Gly-50 Tyr48 Ser-46 Glu39 Arg-28 Glu-4 148 Gly-79 Ala-202 Ala149 GLn-130 Glu- 45 157 173 Val-158 Val-174 Ala-57 194 211 Val-174 57 Ala- 46 108 Thr-74 43 Asn-193 190 Gln-130 174 Ala-72 Pro-212 Thr-74 72 Leu-24 Thr-74 79 Ser-104 130 Glu-195 244 Pro-212 79 Thr-74 72 Leu-24 GlnLysThrGly-79 AlaGlyGlnIleGlyAla-

[...]

... the evolutionary conservation of amino acids based on the phylogenetic relationships between homologous sequences. Both programs analyze the evolutionary conservation of amino acids based on the sequences and produce conservation scores that correspond to the rate of evolution at each site. The scores are divided into nine grades for the visualization of differing rates of evolution in Consurf: grade...
least 30% of sequences. 196 positions-X letters (64.1%) are occupied by any particular residue in more than 30% of sequences.

Figure 4: Part of the result of the multiple sequence alignment of 71 human SDR members.

Figure 5: Completed consensus sequence of 71 human SDR members. Consensus sequences for the human SDR family, constructed using Consensus Sequence Constructor. The highly conserved positions (>70%...

... variable sequence regions in human SDR enzymes. To further elucidate patterns of conservation and variation in human SDR enzymes, a comparative analysis of the 3D protein structure of each of the five consensus sequences was also conducted. We identified a representative structure for each of the five groups recovered in the phylogenetic reconstructions using the Protein Data Bank. The selection criteria...

... variable sequences and structures differ not only among the five human SDR groups, but also within each group.

Consurf-Group 1, Talana-Group 1. Figure 10: The identification of functional regions within group 1 using Consurf and Talana. Group 1 expressed the full range of the coloring scheme in Consurf: the continuous conservation scores are partitioned into a discrete scale of 9 bins for visualization, such...

... 4.3 Mutational variability of human SDRs

The two programs (Talana and Consurf) used to analyze the mutational variability of both the sequence and the structure of the protein templates in each of the five human SDR groups yielded similar results (figure 10). The identification of conservative and variable sequence and structure regions within the human SDR family is presented in figures 9 and 10. The conservative...
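The thresholds quoted above (a position occupied by one residue in more than 30% of sequences, and highly conserved positions above 70% identity) can be illustrated with a minimal Python sketch of majority-rule consensus building. This is not the Consensus Sequence Constructor program itself. The function name `consensus`, the toy alignment, the gap symbol `-`, and the use of `X` for positions below the threshold are assumptions made for illustration only.

```python
from collections import Counter


def consensus(alignment, min_fraction=0.3, gap="-", unknown="X"):
    """Build a consensus string from equal-length aligned sequences.

    A column contributes its most frequent residue only if that residue
    occurs in more than `min_fraction` of all sequences; otherwise the
    placeholder `unknown` is emitted (an assumption of this sketch).
    """
    n_seqs = len(alignment)
    result = []
    for col in zip(*alignment):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(res for res in col if res != gap)
        if not counts:
            result.append(gap)  # column made only of gaps
            continue
        residue, count = counts.most_common(1)[0]
        result.append(residue if count / n_seqs > min_fraction else unknown)
    return "".join(result)


# Hypothetical toy alignment, not taken from the thesis data.
toy_alignment = ["MKTA", "MKSA", "MRCA", "M-DG"]

relaxed = consensus(toy_alignment, min_fraction=0.3)
strict = consensus(toy_alignment, min_fraction=0.7)
print(relaxed, strict)
```

Raising `min_fraction` from 0.3 to 0.7 keeps only the highly conserved columns, which mirrors the distinction the text draws between the two thresholds.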
position, colored turquoise; grade 5, the intermediately conserved, colored white; and grade 9, the most conserved, colored maroon. The conservation score at a site corresponds to the site’s evolutionary rate. It measures the evolutionary conservation at each sequence site of the target chain. The color grades are assigned as follows: the conservation scores below the average (negative values, are indicative of...

... applied to the independent software tools Talana and Consurf. The Consurf server is used for estimating the evolutionary conservation of amino acids based on the phylogenetic relations between homologous sequences.

3.4.1 Consurf

The first step in Consurf is to find homologous sequences (using BLAST) based on protein structures and amino acid sequences. Sequences are clustered and highly similar sequences are removed...

... of these interactions are essential for their function. To maintain functionality, it is reasonable to assume that the sequence changes accumulated during the evolution of one of the interacting residues must be compensated by changes in the other [34,35,36]. In this way, the network of necessary inter-residue contacts may constrain divergence of the protein sequence to some extent. Correlated mutations...

... Arginine, while some of them are encoded by only one codon, such as Methionine. To overcome that limitation of HMM, we applied a more recent model, called genetic semi-homology, implemented in Geisha 3. This model focuses on which codons encode each amino acid, so it takes into account cryptic mutations, which change the gene composition without affecting the amino acid sequence. Hence, genetic semi-homology...
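The codon-level reasoning above can be made concrete with a small Python sketch based on the standard genetic code. It counts how many codons encode each amino acid and checks whether two amino acids can be interconverted by a single nucleotide substitution. This is only an illustration of the idea behind genetic semi-homology, not the algorithm implemented in Geisha 3; the helper names `codons_for` and `one_point_mutation_apart` are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one-letter amino acids in TCAG codon order
# ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every DNA codon (e.g. "ATG") to its amino acid.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(aa):
    """Return all codons that encode the one-letter amino acid `aa`."""
    return [codon for codon, residue in CODON_TABLE.items() if residue == aa]


def one_point_mutation_apart(aa1, aa2):
    """True if some codon of aa1 differs from some codon of aa2 at exactly
    one nucleotide position (a single point mutation)."""
    return any(
        sum(x != y for x, y in zip(c1, c2)) == 1
        for c1 in codons_for(aa1)
        for c2 in codons_for(aa2)
    )


print(len(codons_for("R")), len(codons_for("M")))  # Arginine vs Methionine
print(one_point_mutation_apart("M", "I"), one_point_mutation_apart("M", "D"))
```

Two different codons of the same amino acid, such as CGT and CGC for Arginine, correspond to the cryptic mutations mentioned in the text: the gene changes while the protein sequence does not.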
mutations within the human SDR family

Our analyses of mutational correlation within the human SDR family using both Corm and Talana reveal similar outcomes. Based on the distribution of mutations mapped onto protein structure, the correlated mutations can be broken into two groups. The first, the core group, includes all mutations that show core molecular contact (table 2), most of which are located in conserved ...

... 5: Completed consensus sequence of 71 human SDR members. Consensus sequences for the human SDR family, constructed using Consensus Sequence Constructor. The highly conserved positions (>70% identity)...

... Multiple sequence alignment, consensus sequence generation, and analysis of human SDR specificity
4.2 Sequence specificity and interrelationships of the human SDR family
4.3 Mutational...

... mutations occur at the level of sequence, the effects of these mutations are noticed at the level of function. Bio-molecule function, in turn, is directly related to 3D structure. As such, by studying
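To illustrate what it means for two alignment positions to mutate in a correlated way, the following Python sketch scores every pair of columns in a toy alignment by mutual information. Mutual information is used here as a generic stand-in; it is not the procedure implemented in Corm or Talana, and the alignment, the function names, and the 0.5-bit threshold are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    return sum(
        (count / n) * log2((count / n) / ((freq_a[a] / n) * (freq_b[b] / n)))
        for (a, b), count in freq_ab.items()
    )


def correlated_pairs(alignment, threshold=0.5):
    """Return (i, j, score) for column pairs scoring at least `threshold`."""
    columns = list(zip(*alignment))
    pairs = []
    for i, j in combinations(range(len(columns)), 2):
        score = mutual_information(columns[i], columns[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Hypothetical toy alignment: columns 0 and 2 change together,
# column 1 is fully conserved.
toy_alignment = ["AKV", "AKV", "GKL", "GKL"]
pairs = correlated_pairs(toy_alignment)
print(pairs)
```

A fully conserved column carries no information about any other column, so it never appears in a correlated pair; only positions that vary together are reported. Mapping such pairs onto a 3D structure is what would separate core contacts from distant surface pairs.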
