INTERNATIONAL UNIVERSITY
VIETNAM NATIONAL UNIVERSITY HCMC
COMPARATIVE STUDY ON SEQUENCE-STRUCTURE-FUNCTION RELATIONSHIP OF
HUMAN SHORT-CHAIN
DEHYDROGENASES/REDUCTASES
A thesis submitted to
the School of Biotechnology, International University
in partial fulfillment of the requirements for the degree of
MSc. in Biotechnology
Student name: TANG THI NGOC NU – MBT04011.
Supervisor:
Dr. LE THI LY
May/2013
ABSTRACT
The human short-chain dehydrogenases/reductases (SDRs) family has been the
subject of many recent studies due to their crucial roles in the human body. There
are a growing number of single-nucleotide polymorphisms and a variety of heritable
metabolic diseases that have been identified from the SDR genome. Here, we carried
out a phylogenetic analysis of homologous SDR sequences, and subsequently utilized
a series of bio-informatics and comparative analytical methods to investigate the
sequence-structure-function relationships within the human SDR family. Our findings
show that Tyrosine, Serine, and Lysine are not only present in all members of the
human SDR family, but are also located in a conserved region of both the SDR
protein sequence and structure. In contrast, we find a cluster of three residues
(Serine-Alanine-Serine, Phenylalanine-Glycine-Valine, Cysteine-Serine-Serine,
Cysteine-Histidine-Serine or Alanine-Alanine-Alanine) that differ in protein
sequence and structure and appear to be specific to each group of the human SDR
family.
Finally, our analysis of correlated mutations within the human SDR family
reveals the occurrence of residues that are distantly located, but seem to be
interacting with one another. We hypothesize that these long-distance interactions
may be an adaptive mechanism that allows members of the human SDR family to
cope with a changing environment and differing functional demands over
evolutionary time. Taken together, our results provide data that will be useful for
designing inhibitors targeted at specific groups of human SDRs, such as those that
are known to be involved in metabolic disorders.
Key words: multiple sequence alignments, consensus sequence, phylogeny,
mutational variability and correlation.
ACKNOWLEDGEMENTS
First and foremost, I would like to give special thanks to my university advisor, Dr.
Ly Le, for her support and encouragement during the time I was carrying out the
thesis, and for cheering me up and guiding me through temporary standstills. In
addition, I am very grateful to Dr. Ly Le for providing me with this interesting
topic, and for straightening out many question marks concerning the
Bioinformatics part.
I would also like to thank my best friend, Charlene Mccord Buxan, for taking the time
to read my Master thesis and for sharing valuable comments.
Last but not least, I would like to express my deep thanks to my parents, who are
always by my side. Without my parents’ support, I could not have completed my
Master successfully.
PUBLICATION
Ngoc Nu Tang, Ly Le. “Comparative Study on 11β Hydroxysteroid dehydrogenase 1
(11βHSD1)”. Research Journal of Biotechnology, 2012. Accepted.
Ngoc Nu Tang, Jacek Leluk, Ly Le. “Comparative study on Sequence-structure-function
of Human Short-chain dehydrogenases reductase family”. BMC Bioinformatics, 2013.
Submitted.
SUPERVISOR’S APPROVAL
Dr. LE THI LY
THESIS CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
PUBLICATION
1 INTRODUCTION
1.1 General Introduction about Bioinformatics
1.2 General introduction on Human Short-chain dehydrogenases/reductases (SDR) family
1.3 Aims and Objectives
2 SEQUENCE DATABASES
2.1 Data Collection
2.2 Bioinformatic tools
3 SEQUENCE ANALYSIS TOOLS
3.1 Sequence alignment of human SDR protein
3.1.1 Alignment of Pair Sequence
3.1.2 Local and global alignment
3.1.3 Why Sequence Alignment is performed?
3.1.4 Substitution Matrices and Gap Penalties
3.1.5 Multiple Sequence Alignment
3.1.5.1 ClustalW
3.1.5.2 MUSCLE (Multiple Sequence Comparison by Log-Expectation)
3.1.5.3 KALIGN
3.1.5.4 T-COFFEE (Tree-based Consistency Objective Function of Alignment Evaluation)
3.1.5.5 GEISHA 3
3.1.6 Multiple sequence alignments of human SDR protein and alignment verification
3.2 Consensus sequence construction and BLAST search
3.2.1 What is BLAST (Basic Local Alignment Search Tool)?
3.2.2 Construction of consensus of Human SDR protein family and BLAST search
3.3 Phylogenetic tree construction and comparison of consensus sequences
3.3.1 Phylogenetic Tree Prediction
3.3.2 Distance-based Method
3.3.3 Character-based Method
3.3.3.1 PHYLIP
3.3.3.2 SSSSg
3.3.4 Human SDR phylogenetic tree and comparison of consensus sequences
3.4 Mutational variability of human SDRs
3.4.1 Consurf
3.4.2 Talana
3.4.3 Mutational variability of human SDR protein family
3.5 Analysis of correlated mutations
3.6 Availability of original software generated by authors
4 RESULTS AND DISCUSSION
4.1 Multiple sequence alignment, consensus sequence generation, and analysis of human SDR specificity
4.2 Sequence specificity and interrelationships of the human SDR family
4.3 Mutational variability of human SDRs
4.4 Correlated mutations within the human SDR family
5 CONCLUSION
6 REFERENCES
7 SUPPLEMENTS
LIST OF FIGURES
Figure 1: In the motifs, “a” stands for aromatic residues, “c” for charged residues,
“h” for hydrophobic residues, “L” for aliphatic residues, “p” for polar residues and
“x” for any residue. In motif TGxxGhLG the aliphatic residue before the last G has
replaced the original aromatic residue, and the last motif has been changed from
h[KR]xxNGP into h[KR]xxNxxG.
Figure 2: Illustration of a local and global alignment [Figure 2.2, [22]]
Figure 3: Here A, B and C represent three highly conserved sequences of the
same protein taken from three separate organisms. The phylogenetic tree gives a
view of the substitutions that happened during evolution, when these sequences
evolved from the same ancestor [21].
Figure 4: Part of the result of the multiple sequence alignment of 71 human SDR members
Figure 5: Completed consensus sequence of 71 human SDR members
Figure 6: Phylogenetic tree construction by PHYLIP
Figure 7: Phylogenetic tree construction by SSSSg. Both programs show that the
human SDR family can be phylogenetically grouped into five distinct classes.
Figure 8: Comparison of the five consensus human SDR sequences
Figure 9: The active site (AS), substrate binding sites (BS), and three residues
between AS and one of the BS in 5 human SDR groups identified by Talana.
Figure 10: The identification of functional regions within group 1 using Consurf and
Talana.
Figure 11: The result of mutational variability (done by Talana)
Figure 12: Variability profiles for each of the five groups of human SDRs
Figure 13: The location of the conserved and variable residues in the template
structure of group 1 of human SDR, as identified by Talana.
LIST OF TABLES
Table 1: PDB code and name of five representative
Table 2: The core residues in five human SDR groups identified by Talana
Table 3: The surface residues in five human SDR groups identified by Talana
Table 4: The identification of correlated mutation sets and their core and surface
characteristics for group 5
Table 5: Selected correlated mutations in human SDRs identified by Talana
1 INTRODUCTION
1.1 General Introduction about Bioinformatics
Bioinformatics is the conceptualization of biology in terms of molecules (in the sense
of physical chemistry) and the application of "informatics techniques" (derived from
disciplines such as applied mathematics, computer science and statistics) to
understand and organize the information associated with these molecules on a large
scale. In short, bioinformatics is a management information system for molecular
biology and has many practical applications [1]. Bioinformatics arose in response to
the need to handle the large and rapidly growing quantities of biological data [2].
For example, as of August 2000, the GenBank repository of nucleic acid sequences
contained 8,214,000 entries [3] and the SWISS-PROT database of protein sequences
contained 88,166 [4]. On average, these databases are doubling in size every 15
months [3]. Bioinformatics is therefore often defined as the application of
computational techniques to understand and organize the information associated
with biological macromolecules. This unexpected union between the two subjects is
largely attributed to the fact that life itself is an information technology; an
organism’s physiology is largely determined by its genes, which at their most basic
can be viewed as digital information [1].
Basically, the aims of bioinformatics are threefold.
The first aim of bioinformatics is to organize data in a way that allows
researchers to access existing information and to submit new entries as they are
produced, as with the Protein Data Bank for 3D macromolecular structures [5,6].
Stored data are of little use until they are analysed, however, so the purpose of
bioinformatics extends much further.
The second aim of bioinformatics is to develop tools and resources that aid in the
analysis of data. For example, having sequenced a particular protein, it is of interest
to compare it with previously characterized sequences. This need goes beyond a
simple text-based search, and programs such as FASTA [7] and PSI-BLAST [8,9]
must consider what comprises a biologically significant match. Development of such
resources requires expertise in computational theory as well as a thorough
understanding of biology.
The third aim of bioinformatics is to use these tools to analyze the data and interpret
the results in a biologically meaningful manner. More specifically, bioinformatics can
conduct global analyses of all the available data with the aim of uncovering common
principles that apply across many systems and highlighting novel features.
Bioinformatics thus makes an important contribution to biology, especially in
analyzing very large biological data sets effectively. In this study, I pursued the third
aim of bioinformatics to highlight the general and specific characteristics of the human
SDR family, covering two bioinformatics topics: multiple sequence alignment
algorithms and the identification of conserved motifs.
1.2 General introduction on Human Short-chain dehydrogenases/reductases
(SDR) family
Short-chain dehydrogenases/reductases (SDRs) form one of the largest enzyme
super-families, with over 46,000 members [10]. Among these, at least 140
different enzymes have been sequenced to date, and about 70 of
them are known to belong to the human SDR family [11, 12]. Most SDRs are known
to be NAD or NADP-dependent oxidoreductases that share characteristic sequence
motifs and mechanisms of action [13, 14]. This SDR enzyme super-family is present
in all forms of prokaryotic and eukaryotic life [13], and plays an important role in a
variety of key metabolic processes.
Indeed, human SDRs have been extensively studied for their critical roles in lipid,
amino acid, carbohydrate, cofactor, hormone and xenobiotic metabolism, as well as
in redox sensor mechanisms [15]. In addition to their crucial roles in normal
metabolic processes, the function of human SDRs in metabolic defects, such as type
II diabetes, warrants continued research attention [16]. Given their part in proper
physiological functioning, the human SDR protein family appears to be a suitable
target for the development of novel drugs directed at influencing hormone
metabolism [17].
Despite their importance to proper metabolic function and potential use for the
treatment and prevention of various human diseases, a standardized way of
classifying SDRs has yet to be established. According to prior studies, SDR enzymes
can be divided into two main types, denoted as “Classical” and “Extended” [18,19].
The “Classical” type consists of about 250 amino acid residues, while the “Extended”
type has an additional 100-residue domain forming the C-terminal region. Another
study, alternatively, divided the family into three types, designated as
“Intermediate”, “Complex” and “Divergent”, which can be distinguished according to
their characteristic sequence motifs [15]. Even with conflicting ideas of how to group
SDRs, it is clear that members of the human SDR family have diverged over
evolutionary time because they share only 15 to 30% of overall sequence identity
[16]. Despite clear sequence diversification, human SDRs all have a common
sequence motif that defines the cofactor binding site (TGxxxGxG) and the catalytic
tetrad (N-S-Y-K) [20]. Moreover, the three-dimensional structures of all human SDRs
share common features, such as an alpha/beta-folding motif characterized by a
central beta-sheet. This central beta-sheet is typical of a Rossmann-fold with helices
on either side [20]. Given these interesting structural similarities, it is important to
study the evolutionary history of the SDR super-family to better understand why
they have similar 3D structure despite sharing very little sequence identity. It is
proposed that these common motifs might be conserved through evolution due to
their crucial function in differentiating the human SDR family from other enzyme
families [13]. While bio-molecule mutations occur at the level of sequence, the
effects of these mutations are noticed at the level of function. Bio-molecule function,
in turn, is directly related to 3D structure. As such, by studying and comparing the
sequences and 3D structures of the different human SDRs in a phylogenetic context,
it may be possible to reveal more pertinent information about the evolutionary and
functional diversification of the group.
The first enzymes of this type were analysed as early as the 1970s. These analyses
gave the structures of prokaryotic ribitol dehydrogenase and Drosophila
alcohol dehydrogenase. The proteins were not then known to form a family, but
the alcohol dehydrogenase turned out to differ from the previously known alcohol
dehydrogenases of liver and yeast. When other dehydrogenases showed the
same distinctive pattern as the two alcohol dehydrogenase types, the concept of
a family of short-chain dehydrogenases was established [20]. This occurred in
1981, and since then the SDR family has grown enormously, both in the number of
known members and in the variety of their functions. Currently at least
3000 members, including species variants, are known, with a substrate spectrum
ranging from alcohols, sugars, steroids and aromatic compounds to xenobiotics
[19, 20].
As can be expected, due to its broad variety of different functions, the SDR
family is very divergent. The residue identity in pair-wise comparisons is as low
as 15-30%. However, although few residues are completely conserved,
there are several sequence motifs, consensus patterns, which are distinguishable
within the families. The criterion for SDR membership is therefore the occurrence of
typical sequence motifs, arranged in a specific manner. These motifs comprise
Rossmann-fold elements for nucleotide binding and specific residues for the active
site and they reflect common folding patterns [18].
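The pair-wise residue identity quoted above can be computed directly from two aligned sequences. The following is a minimal sketch, assuming that identity is counted only over columns in which both sequences carry a residue (other conventions divide by the full alignment length). The function name and the toy fragments are illustrative and are not taken from the thesis data.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # columns containing a gap are not counted
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy aligned fragments (invented for illustration, not real SDR sequences).
print(percent_identity("TGASSGIG-KA", "TGGNTGIGYEA"))
```

Applied to every pair of sequences in an alignment, this gives the kind of identity figures (15-30%) that are reported for the SDR family.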
The SDR enzymes can be divided into two main classes, the Classical and the
Extended families. The Classical family is the larger, with 218 of the
sequences in the data set as opposed to 118 for the Extended family. Why are
there two classes, and what distinguishes them from each other? One distinction
is the length of the sequences. The classical SDRs have a sequence length of
around 250 residues, while the extended SDRs are around 350 residues long.
Another difference, although with exceptions, is that the classical SDRs prefer
the NADP(H) coenzyme and the extended SDRs prefer NAD(H). There are,
however, NAD(H)-binding classical SDRs as well as NADP(H)-binding extended
SDRs [19, 20]. The sequence motifs mentioned earlier also differ between
the two main families. These motifs are the most distinguishing feature of the two
classes, since the length, for example, can vary. The motifs can therefore be used
to separate the two main families.
What, then, do the motifs look like? They are placed in or near different
secondary structure elements; for example, the Gly-motif, TGxxxGhG or TGxxGhlG,
is placed in and adjacent to β1 + α1. There are seven motifs each for the
Classical and the Extended families. These motifs are based on the motifs used
by Bengt Persson et al. [20] and can be seen in Figure 1 below:
Figure 1: In the motifs, “a” stands for aromatic residues, “c” for charged
residues, “h” for hydrophobic residues, “L” for aliphatic residues, “p” for polar
residues and “x” for any residue. In motif TGxxGhLG the aliphatic residue before
the last G has replaced the original aromatic residue, and the last motif has
been changed from h[KR]xxNGP into h[KR]xxNxxG.
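Motifs written in this notation can be searched for with regular expressions. The sketch below is only an illustration: the residue sets assigned to each lowercase class letter are assumptions (the thesis does not list them), uppercase letters are treated as literal residues, and the lowercase "l" is used for the aliphatic class as in TGxxGhlG. The example sequences are invented.

```python
import re

# Assumed residue classes; these sets are illustrative, not from the thesis.
CLASSES = {
    "x": ".",           # any residue
    "h": "[AVLIMFWC]",  # hydrophobic (assumption)
    "l": "[AVLI]",      # aliphatic (assumption)
    "a": "[FWYH]",      # aromatic (assumption)
    "c": "[DEKRH]",     # charged (assumption)
    "p": "[STNQ]",      # polar (assumption)
}


def motif_to_regex(motif: str) -> str:
    """Translate a motif such as 'TGxxxGxG' into a regular expression."""
    parts = []
    in_bracket = False
    for ch in motif:
        if ch == "[":
            in_bracket = True
            parts.append(ch)
        elif ch == "]":
            in_bracket = False
            parts.append(ch)
        elif not in_bracket and ch in CLASSES:
            parts.append(CLASSES[ch])  # lowercase class letter
        else:
            parts.append(ch)           # literal residue
    return "".join(parts)


def find_motif(motif: str, sequence: str):
    """Return (start, matched_text) for each non-overlapping motif hit."""
    pattern = re.compile(motif_to_regex(motif))
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]


# Invented toy sequences for illustration.
print(find_motif("TGxxxGxG", "MKVALVTGASSGIGKA"))
print(find_motif("h[KR]xxNGP", "GGLKAANGPW"))
```

A search of this kind only flags candidate motif positions; whether a hit is a genuine cofactor-binding or active-site motif still has to be judged from the alignment and the structure.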
It cannot be denied that the human SDR family has important implications for
medicine, especially through its involvement in metabolic defects such as type II
diabetes. Therefore, the identification and functional analysis of the sequences and
structures of the human SDR family is the primary goal of this study, with a view to
finding new targets for drug design and development.
1.3 Aims and Objectives
In this study, a rigorous comparative analysis of homologous sequences and
sequence-structure-function relationships in the human SDR family was performed
using bioinformatics. Our goal was to gain insight into the mechanisms of action of
the human SDR family. Specifically, we sought to identify and compare the
convergent and divergent residues of the human SDR nucleotide-binding pocket.
We hypothesized that evolutionarily conserved regions in the human SDR family
would appear at or near the location of the active and binding sites of the protein.
This is because active and binding sites are responsible for the chemical and
enzymatic reactions that take place in protein molecules. Maintaining proper
functioning at these sites is necessary for the protein-protein and protein-ligand
interactions that are indispensable in regulating molecular processes. In contrast,
we expected to find variable regions in the human SDR family near the nucleotide
binding sites, due to the varying substrate-enzyme interactions that are
characteristic of each individual group of the human SDR family. Through
periods of adaptive radiation over evolutionary time, these regions of variability
allowed each group within the human SDR family to adopt its own specific features.
These nuances in structural design, and potentially in function, are important to
identify in order to facilitate the future design of inhibitors that are directly targeted
to each subgroup of human SDRs [18,19, 20].
2 SEQUENCE DATABASES
2.1 Data Collection
75 sequences of human SDR enzymes were collected from the UniProtKB database
(http://www.uniprot.org).
100 homologous sequences of human SDR enzymes were collected from NCBI-BLAST
(http://blast.ncbi.nlm.nih.gov/).
Human SDR protein structures were collected from the RCSB Protein Data Bank (PDB)
(http://www.rcsb.org/pdb/home/home.do).
2.2 Bioinformatic tools
Tools for multiple sequence alignment:
ClustalW, MUSCLE, Kalign and T-COFFEE were available at the European Molecular
Biology Laboratory-European Bioinformatics Institute (EMBL-EBI)
(http://www.ebi.ac.uk/services/all).
Geisha3 was available at (http://atama.wnb.uz.zgora.pl/~jleluk/linki.html).
Tools for constructing consensus sequences:
Consensus sequence constructor was available at
(http://atama.wnb.uz.zgora.pl/~jleluk/linki.html).
Tools for constructing phylogenetic trees:
PHYLIP was available at (http://www.phylip.com).
SSSSg was available at
(http://atama.wnb.uz.zgora.pl/~jleluk/software/wlasne/ssssg/ssssg.zip).
Tools for studying mutational variability:
Consurf was available at (consurf.tau.ac.il).
Talana was available at (http://www.bioware.republika.pl/).
Tools for visualizing the results from Consurf and Talana:
Rastop was available at (http://www.geneinfinity.org/rastop).
Tools for studying correlated mutations:
Talana was available at (http://www.bioware.republika.pl/).
Com was available at
(http://atama.wnb.uz.zgora.pl/~jleluk/software/wlasne/corm.jar).
3 SEQUENCE ANALYSIS TOOLS
3.1 Sequence alignment of human SDR protein
Several diseases are caused by disorders in genes or proteins. An understanding of
the sequences, of how the genes or proteins are related to each other, and of what
functions they have can help in the development of remedies for diseases that are
caused by these disorders. However, finding the gene or protein that is responsible
for a disease is an enormous task. This is because, for example, a single
gene can comprise several hundred or several thousand base pair
combinations that could bear the disorder. There can also be hundreds of candidate
gene sequences for intensive study. Hence, it is hard to choose a good
candidate to investigate further as the possible origin of the disease. In the past,
the appropriate candidate was found by trial and error.
However, finding a remedy for a disease in this way is time consuming, so new
techniques and algorithms are needed to make it easier to discover the gene or
protein that causes a particular disease [21].
Biological data are available from several websites, such as UniProtKB and NCBI,
and several different techniques are available for comparing the sequences they
hold. For example, if we want to highlight the differences and similarities between
two protein sequences, pair-wise sequence alignment is the best method to use.
In contrast, if we want to identify the similar and different characteristics of all
members of a certain family, multiple sequence alignment is the suitable tool.
3.1.1 Alignment of Pair Sequence
A pair-wise sequence alignment is performed when two protein sequences are
available in the databases (UniProtKB or NCBI). The two sequences are compared to
find a series of characters or character patterns that lie in the same order in both
sequences. For example:
The two sequences, V and W, are written in a two-row matrix. The first row contains
the characters of V and the second the characters of W. Matching characters, in V
and W, are placed in the same column and different characters are placed as a
mismatch in the same column. Another way to deal with different characters is to
insert a gap into the sequence, such that the character is placed opposite a gap in
the other sequence. The reason for introducing a gap is that more matches can be
achieved in this way. The column score is usually positive for corresponding
characters and negative for dissimilar characters. The sum of all column scores
makes up the score for the alignment. [22]
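The column-by-column scoring described above can be sketched as follows. The score values (+1 for a match, -1 for a mismatch, -2 for a gap) and the toy sequences are assumptions chosen for illustration; the thesis does not fix particular values.

```python
def alignment_score(row_v: str, row_w: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum the column scores of a two-row alignment ('-' marks a gap)."""
    if len(row_v) != len(row_w):
        raise ValueError("both rows of the alignment must have equal length")
    total = 0
    for a, b in zip(row_v, row_w):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            total += gap        # character placed opposite a gap
        elif a == b:
            total += match      # matching characters in the same column
        else:
            total += mismatch   # different characters in the same column
    return total


# Two ways of writing V = ATGTTAT and W = ATCGTAC in a two-row matrix.
print(alignment_score("ATGTTAT", "ATCGTAC"))    # no gaps
print(alignment_score("AT-GTTAT", "ATCGT-AC"))  # one gap in each row
```

With these particular penalties the gapped version gains matches but pays for two gaps, which shows why the choice of scores decides which alignment is preferred.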
3.1.2 Local and global alignment
A pair-wise alignment can be performed either as a global or as a local alignment,
as shown in Figure 2 below.
The global method was the first to be invented and is used to compare sequences
over their entire length (Figure 2). Global alignment works well when the sequences
are conserved or closely related to each other.
In contrast, the local method works well when the sequences are not closely related,
so that they may share only a conserved region and are not similar over the
whole sequence. In local alignment, only substrings of the sequences are aligned
[22].
Therefore, the level of similarity between the sequences determines
whether a local or a global alignment should be applied.
Figure 2: Illustration of a local and global alignment [Figure 2.2, [22]]
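The difference between the two modes can be made concrete with a small scoring routine. This is a minimal sketch in the style of the Needleman-Wunsch (global) and Smith-Waterman (local) recurrences, using a simple match/mismatch/linear-gap scheme whose values are assumptions; it returns only the best score, not the alignment itself.

```python
def best_score(v: str, w: str, local: bool = False,
               match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best global score, or best local score when local=True."""
    rows, cols = len(v) + 1, len(w) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global mode: leading gaps are charged in the first row and column.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if v[i - 1] == w[j - 1] else mismatch
            cell = max(table[i - 1][j - 1] + sub,
                       table[i - 1][j] + gap,
                       table[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            table[i][j] = cell
    # Global mode reads the corner cell; local mode reads the best cell.
    return best if local else table[-1][-1]


# Toy sequences sharing only the region GATTAC (invented for illustration).
v, w = "AAAGATTACA", "CCGATTACCC"
print(best_score(v, w))              # forced to span both full sequences
print(best_score(v, w, local=True))  # scores only the shared region
```

For sequences that share a single conserved region, the global score is pulled down by the unrelated flanks, while the local score reflects the conserved region alone.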
3.1.3 Why Sequence Alignment is performed?
Sequence alignments are important for discovering functional, structural, and
evolutionary information in biological sequences. With the help of an alignment, it is
simple to assess whether protein sequences share a similar biochemical function
and 3D structure. If proteins from different life forms are similar,
they might have diverged from a common ancestor. In that case they are homologous,
and they have evolved into new sequences through mutation and selection.
The alignment also indicates the changes that have occurred between the
sequences and their common ancestor; these are considered substitutions.
Insertions of new residues and deletions of old residues are represented
as gaps [23]. Therefore, the more similar the sequences are, the fewer
changes have occurred and the more likely the proteins are to be related; the best
alignment is thus the one that represents the most likely evolutionary scenario.
It is, however, important to remember that even though sequences are similar,
they might not necessarily be homologous. Similarity between short sequence
fragments may have arisen by chance or as a result of evolutionary convergence,
meaning that the similar regions have the same function but have
developed independently from different ancestors [24]. This remains a limitation of
sequence alignment. To mitigate this problem, in this study I applied
five independent alignment approaches in order to increase the accuracy of the
alignment [24].
3.1.4 Substitution Matrices and Gap Penalties
As mentioned earlier, substitutions happen during evolution. Certain amino acids
are exchanged for one another more commonly than others because they share
similar physico-chemical properties; other changes also take place, but they
are rarer. Knowing which substitutions are the most and least frequent in a large
number of proteins can aid the prediction of alignments for any set of protein
sequences [21]. Matrices that estimate the probabilities of all possible substitutions
are therefore useful here. There are several different methods for building these
so-called substitution matrices, but the two most commonly used are PAM and
BLOSUM. PAM is, for example, applied in the widely used global multiple alignment
program CLUSTALW [22].
Aside from binary cost functions (0: match and 1: mismatch), a transformation
matrix of substitution costs can be used, which assigns a separate penalty to
each class of mismatch observed [23].
The minimum mutation distance matrix (Fitch, 1990) is based on the minimum
number of nucleotides that must be changed in order to convert the
codon for one amino acid into the codon for another amino acid. The most common type
of transformation table is the log-odds matrix. Log-odds matrices contain the
relative frequencies with which amino acids are assumed to replace one another
over time. Positive values in the matrix indicate a replacement rate greater
than expected by chance, whereas negative values indicate a replacement rate less
than expected by chance.
The most widely used log-odds matrices are the PAM (point accepted mutation)
matrices (Dayhoff et al., 1978). A PAM X matrix is calculated from the original PAM1
matrix by multiplying PAM1 by itself X times, which gives the substitution
probabilities after X PAM1 mutation steps. Low PAM matrices are used with closely
related sequences, while high PAM matrices are used with distantly related
sequences. The other family is the BLOSUM matrices
(Henikoff, 1992), which are based on well-conserved blocks of multiply aligned
sequence segments, or motifs, that represent the most conserved regions of aligned
families.
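The two steps described above, raising PAM1 to a power and converting probabilities into log-odds scores, can be sketched with a toy matrix. Everything numerical here is an assumption made for illustration: the three-letter alphabet, the PAM1-like probabilities, the uniform background frequencies, and the scaling factor of 10 with base-10 logarithms. It is not the Dayhoff matrix.

```python
import math

ALPHABET = "ILK"  # toy three-residue alphabet (assumption)
# Toy PAM1-like matrix: row i gives the probabilities that residue i is
# kept or replaced after one mutation step (each row sums to 1).
PAM1_TOY = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]
FREQS = [1 / 3, 1 / 3, 1 / 3]  # assumed background frequencies


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Multiply a matrix by itself 'power' times (power 0 gives identity)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


def log_odds(m, freqs):
    """Scores 10*log10(observed / expected by chance), rounded to integers."""
    n = len(m)
    return [[round(10 * math.log10(m[i][j] / freqs[j])) for j in range(n)]
            for i in range(n)]


pam250_toy = mat_pow(PAM1_TOY, 250)
for residue, row in zip(ALPHABET, log_odds(pam250_toy, FREQS)):
    print(residue, row)
```

Positive entries mark replacements that occur more often than chance would predict and negative entries those that occur less often, which is how the published PAM tables are read.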
As earlier explained, there might contain gaps in an alignment. These gaps are
introduced into the alignment in order to align as many of the same characters as
possible. There should however not be to many gaps, because if gaps appear
everywhere the alignment will show an unlikely change of amino acids. [24, 25]
For this reason there exist penalties for inserting gaps. There is one penalty for
opening the gap and one penalty for extending the gap. There are several ways to
decide the value of the penalties, but the gap extension penalty is usually set to
something less than the gap-open penalty, allowing long insertions and deletions to
be penalized less than they would be by the linear gap cost. This is desirable when
gaps of a few residues are expected almost as frequently as gaps of a single residue
[23].
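The difference between the linear gap cost and the gap-open/gap-extension scheme can be sketched as follows. The function names and the penalty values are assumptions chosen for illustration; some programs charge the extension penalty for the first gap position as well.

```python
def linear_gap_cost(length, gap_penalty):
    """Every gap position costs the same amount."""
    return length * gap_penalty

def affine_gap_cost(length, gap_open, gap_extend):
    """The first gap position costs gap_open, each further position gap_extend."""
    if length == 0:
        return 0
    return gap_open + (length - 1) * gap_extend

# Assumed penalties: 10 for opening a gap, 1 for each extension.
for length in (1, 2, 5):
    print(length, linear_gap_cost(length, 10), affine_gap_cost(length, 10, 1))
```

With these values a gap of five residues costs 14 under the affine scheme but 50 under the linear one, so a long insertion or deletion is penalized only slightly more than a single-residue gap.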
In addition, gaps placed in the alignment represent implied insertions or
deletions. The decision to institute a gap in the alignment is a result of the gap cost
calculation during the wave-front update of the matrix elements.
The first method is the dot matrix analysis. This method shows a matrix of the two
sequences. It has one sequence written horizontally across the top of the graph and
the other along the left-hand side, and diagonal lines showing alignments [21].
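A dot matrix of this kind can be sketched in a few lines. The function name and the example sequences are hypothetical, and no window or stringency filter is applied, which real dot-plot programs usually offer.

```python
def dot_matrix(seq_top, seq_left):
    """Return one row per residue of seq_left; '*' marks identical residues."""
    return [["*" if a == b else "." for b in seq_top] for a in seq_left]

top, left = "GATTACA", "GATCACA"
print("  " + top)
for residue, row in zip(left, dot_matrix(top, left)):
    print(residue + " " + "".join(row))
```

Runs of `*` along a diagonal correspond to regions where the two sequences align without gaps.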
The second method is the dynamic programming algorithm, which solves a problem
by combining solutions to sub-problems. It finds the optimal alignment by comparing
all character-pairs in the sequences [24]. This algorithm is a commonly used
algorithm for sequence analysis.
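As a minimal sketch of the dynamic programming idea, the score of an optimal global alignment can be computed row by row in the style of Needleman and Wunsch. The scoring values (match +1, mismatch -1, linear gap -1) and the function name are assumptions for illustration; a substitution matrix and affine gap penalties would be used in practice.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * gap]
        for j, cb in enumerate(b, 1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(diagonal, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

print(global_alignment_score("ACGT", "AGT"))
```

Each cell reuses the solutions of three smaller sub-problems (the diagonal, upper and left neighbors), which is what makes the search over all character pairs tractable.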
The third method is the word or K-tuple method. This method starts by searching for
identical short stretches of sequences, called words or k-tuples. These are then
joined into an alignment with the dynamic programming algorithm. Examples of
programs using this method are FASTA and BLAST. These programs are commonly
used for database searches, when seeking for the sequences that align the best with
an input test sequence [25].
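The first step of the word method, finding identical k-tuples shared by two sequences, can be sketched as below. The function names and the example sequences are hypothetical; FASTA and BLAST add scoring, filtering and extension steps that are not shown.

```python
from collections import defaultdict

def index_words(seq, k):
    """Map every word of length k to the positions where it starts in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def shared_words(query, target, k):
    """Return (query position, target position) for every identical k-tuple."""
    index = index_words(target, k)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits

print(shared_words("ACGTAC", "TTACGT", 3))
```

Hits that share the same offset (target position minus query position) lie on one diagonal and are the natural candidates for joining into a longer alignment.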
These are a few of the most widely used methods for pair-wise sequence alignment. If
more than two sequences are to be aligned, however, these methods are not
a good choice. The next section, Multiple Sequence Alignment, discusses other
methods that are more appropriate in this case.
3.1.5 Multiple Sequence Alignment
As in a pair-wise alignment, in a multiple sequence alignment the sequences are
searched for a series of individual characters or character patterns that occur in the
same order in the sequences. The alignment can also be performed as a global or a
local alignment, and substitution matrices and gap penalties can be of use. The
difference is that a multiple sequence alignment contains more than two sequences.
A multiple sequence alignment of a set of sequences can provide information on the
most alike regions in the set. In proteins, such regions may represent conserved
functional or structural domains.
Multiple alignments are the basis for the most sensitive sequence searching algorithms.
They are also useful for deciphering evolutionary information in biological sequences.
For example, they provide information on which residues are important for the
function and for stabilizing the secondary and three-dimensional structures of the
protein. That is, they can illustrate which residues and regions represent conserved
functional or structural domains. Even if only two sequences in a set are to
be aligned, it can be meaningful to conduct a multiple alignment of all sequences in
the set in order to improve the accuracy of the alignment. In addition, it is difficult to
identify the pattern of conserved residues when comparing only two sequences.
Therefore, multiple sequence alignment is the most common method for studying the
conserved motifs of a given family [26].
Standard protocol of multiple sequence alignment (MSA) in the traditional way:
1. Pair-wise alignment (the most closely related sequences are aligned first, and
the more distant ones are then added gradually).
2. A distance matrix (K-tuple) is calculated from the pair-wise alignments (based on
the level of identity), giving the divergence of each sequence.
3. A guide tree is calculated from the distance matrix by applying Neighbor
Joining.
4. The sequences are progressively aligned according to the branching order in the
guide tree (scored with the statistical matrices, PAM/BLOSUM) until the
complete alignment is obtained [22, 23].
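The distance matrix step of this protocol can be sketched as follows, assuming the pair-wise alignments are already available as equal-length strings. The toy sequences, the distance definition (1 minus the fraction of identical positions) and the function names are assumptions for illustration; ClustalW itself offers several ways of computing these distances.

```python
from itertools import combinations

def identity(a, b):
    """Fraction of identical positions in two pre-aligned, equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def distance_matrix(seqs):
    """Map each pair of sequence names to a distance of 1 - identity."""
    return {(m, n): 1.0 - identity(seqs[m], seqs[n])
            for m, n in combinations(sorted(seqs), 2)}

def closest_pair(distances):
    """The pair with the smallest distance is aligned first."""
    return min(distances, key=distances.get)

toy = {"s1": "MKVLAT", "s2": "MKVLGT", "s3": "MRTLGS"}
d = distance_matrix(toy)
print(closest_pair(d))
```

The guide tree is then built from such a matrix, and the pair returned by `closest_pair` is the one a progressive method would align first.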
However, there are two major problems associated with the progressive approach used by
traditional MSA methods: the local minimum problem and the choice of
alignment parameters.
The local minimum problem stems from the “greedy” nature of the alignment strategy. The
algorithm greedily adds sequences together, following the initial tree. There is no
guarantee that the globally optimal solution, as defined by some overall measure of
multiple alignment quality, will be found. More specifically, any
misalignments made early in the alignment process cannot be corrected later as new
information from other sequences is added. This problem is mainly the result
of an incorrect branching order in the initial tree. The initial trees are derived from a
matrix of distances between separately aligned pairs of sequences and are much less
reliable than trees built from a complete multiple alignment. Thus,
misalignments that occur in the early alignment steps are carried through and cause
the local minimum problem.
The problem of choosing alignment parameters arises because, traditionally, one
weight matrix and two gap penalties (one for opening a new gap and one for
extending an existing gap) are chosen for the whole data set. This works well for
closely related sequences, because all residue
weight matrices give most weight to identities. When identities dominate an
alignment, almost any weight matrix will find approximately the correct solution.
However, this approach does not work well for distantly related or
divergent sequences, and leads to more mismatches. In addition, for closely
related sequences, the range of gap penalty values that will find the correct solution
can be very broad. As more and more divergent sequences are used, however,
only a very narrow range of values may deliver the best alignment.
Therefore, many MSA methods have been designed to address these problems,
and they are currently available on the EBI website. The most
important question, however, is which MSA tool is the best candidate for performing an MSA.
The answer is that there is no single best tool; the most crucial points to consider
when choosing a program are biological accuracy, execution time and
memory usage. The most accurate programs according to benchmark tests [16, 17]
are MUSCLE and T-COFFEE. In practice, accuracy claims can be difficult to validate
due to the frequent practice of parameter tuning to optimize performance on one or
more benchmarks. Benchmark scores are typically based on averages over many
alignments [27]. Thus, we employed four independent tools that are available from the EBI
services, namely ClustalW, Kalign, MUSCLE and T-Coffee, to perform the MSA. However, these
four tools are based on the Hidden Markov Model (HMM), which assumes that the
probability of amino acid A being substituted by amino acid B is independent of what
amino acid A was itself derived from. This model has the limitation that amino
acids cannot be considered equal units in evolutionary substitution [20, 28],
because some of them, such as Serine, Leucine and Arginine, are encoded by six codons,
while others, such as Methionine, are encoded by only one codon. In order to
overcome this limitation of the HMM, we applied a more recent model,
genetic semi-homology, as implemented in Geisha 3. This model focuses on which
codons encode each amino acid, and so it takes into account cryptic mutations, which
change the composition of the gene without affecting the amino acid sequence. Hence,
genetic semi-homology is more sensitive than HMMs when working with non-homologous
sequences [28, 36]. Therefore, in order to improve the accuracy of the MSA,
five independent MSA tools were applied in this study, each of which addresses these
issues in its own way [29].
3.1.5.1 ClustalW
In ClustalW, several improvements were made to the progressive multiple alignment
method which greatly improve its sensitivity without sacrificing any of the speed and
efficiency that make this approach so practical.
In ClustalW, the problem of choosing alignment parameters is addressed by
varying the gap penalties in a position- and residue-specific manner [19].
All pairs of sequences are first aligned separately; two gap
penalties and a full amino acid weight matrix are used in the dynamic programming
(the matrix is used to score the alignment).
A guide tree is then created from the resulting scores using Neighbor Joining. In ClustalW, all of the
remaining modifications apply only to the final progressive alignment stage:
- Initially, gap penalties are calculated depending on the weight matrix, the
similarity and the lengths of the sequences.
- Sensible local gap opening penalties are derived at every position in each
pre-aligned group of sequences, and vary as new sequences are added.
- The final modification allows the addition of very divergent
sequences to be delayed until the end of the alignment process, when all of the more
closely related sequences have been aligned.
Initial values can be set by the user. The software then automatically attempts to
choose appropriate gap penalties for each sequence alignment, depending on several
factors: the weight matrix, the similarity of the sequences, the lengths of the
sequences, differences in sequence length, and position-specific gap penalties [30].
3.1.5.2 MUSCLE (Multiple Sequence Comparison by Log-Expectation)
Stage 1: Draft progressive
All unaligned sequences are first compared by k-mer counting (K-tuple word
matching) to create the k-mer distance matrix D1. Based on the distance matrix D1,
tree 1 is calculated by clustering with UPGMA (Unweighted Pair Group Method with
Arithmetic Averages). At each leaf of tree 1, a profile is constructed from an input
sequence. At each internal node, a pair-wise alignment of the two child profiles is
constructed to create a new profile. Nodes in the tree are visited in
prefix order (children before their parents). In this way, the progressive alignment is
calculated based on tree 1 and multiple sequence alignment 1 is produced.
Stage 2: Improve progressive
The percentage identity of each pair of sequences in multiple sequence alignment 1 is
computed and the Kimura distance
matrix D2 is produced. Tree 2 is then built from D2 using UPGMA. This stage is
optimized by recomputing alignments only for sub-trees whose branching order changed
relative to tree 1. Finally, the progressive alignment is carried out to produce
multiple sequence alignment 2.
Stage 3: Refinement
An edge of tree 2 is deleted to produce two sub-trees, and the profile of each
sub-tree is computed. The two sub-tree profiles are then realigned
to produce a new multiple sequence alignment. Finally, the sum-of-pairs score
is used to assess the new multiple sequence alignment: if the sum-of-pairs
score improves, the new alignment is kept; otherwise it is discarded [20].
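The acceptance test used in the refinement stage can be illustrated with a simple sum-of-pairs score. The scoring values (match +1, mismatch -1, residue against gap -1, gap against gap 0) and the function names are assumptions for illustration; MUSCLE scores pairs with a substitution matrix and its own gap model.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one pair of characters taken from the same alignment column."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs(alignment):
    """Sum the pair scores over every column and every pair of rows."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total

current = ["ACGT", "AG-T", "ACG-"]
candidate = ["ACGT", "A-GT", "ACG-"]
if sum_of_pairs(candidate) > sum_of_pairs(current):
    current = candidate  # keep the realignment only if the score improves
print(current)
```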
3.1.5.3 KALIGN
KALIGN employs the Wu-Manber approximate string matching algorithm to improve both the
accuracy and the speed of MSA.
The Wu-Manber algorithm is used instead of K-tuple (word matching) matches
(runs of identical residues) to calculate the distance between two strings, which is
measured by the Levenshtein edit distance. For example, two strings A and B have
an edit distance d if A can be transformed into B by applying d mismatches,
insertions or deletions. These distance scores are then used to build the tree.
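The Levenshtein edit distance itself can be computed with a short dynamic programming routine. This sketch shows only the distance definition; it is not the Wu-Manber bit-parallel algorithm that KALIGN uses to search for approximate matches, and the function name is hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion from a
                           cur[j - 1] + 1,             # insertion into a
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))
```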
With Wu-Manber, patterns containing mismatches can still be readily found, which
enables the Wu-Manber algorithm to report meaningful distances between highly
divergent sequences.
However, among the matching patterns, many spurious (false but seemingly true) matches are
reported.
The Wu-Manber algorithm is also applied in the progressive alignment. At each internal node of
the guide tree, two profiles are aligned. Optionally, KALIGN uses Wu-Manber matches as
anchor points during the alignment phase, which requires two extra steps in addition to dynamic
programming. KALIGN employs a global dynamic programming method with affine
gap penalties, meaning that residues are assigned to one of three states (aligned, gap in
sequence A and gap in sequence B). It disallows a gap in one sequence from being
immediately followed by a gap in the other sequence. When the state matrices are
filled, the final cells contain the maximum alignment score, and a trace-back procedure
(requiring the matrices) is used to retrieve the actual alignment.
The two extra steps are:
- Consistency check: this step sieves through the thousands of matches found
between two sequences and finds the largest set of matches that can be included
in an alignment.
- Updating of pattern match positions: this step adjusts the absolute
positions of matches found within sequences to their relative positions within the
profiles generated by the dynamic programming step.
KALIGN uses a substitution matrix (BLOSUM, PAM) and affine gap penalties in
dynamic programming. A common idea is that similar sequences should be aligned
with hard matrices (PAM50, BLOSUM80), while more distantly related sequences align
better using soft matrices (PAM250, BLOSUM40) [31].
3.1.5.4 T-COFFEE (Tree-based Consistency Objective Function for Alignment
Evaluation)
This method has two main features. First, it provides a simple and flexible means of
generating an MSA from heterogeneous data sources (combining local and global sequence
alignments). Second, an optimization method is used to find the MSA that best fits
the pair-wise alignments in the input library.
First, the ClustalW primary library is created from global pair-wise alignments
and the Lalign primary library is created from local pair-wise alignments.
The next step is to combine the local and global alignments by addition. If a
pair is duplicated between the two libraries, it is merged into a single entry that has a
weight equal to the sum of the two weights. Otherwise, a new entry is created
for the pair being considered.
The second step is weighting, or signal addition. T-Coffee assigns a weight to each
pair of aligned residues in the library.
The third step is the primary library, a list of weighted pair-wise constraints. Each
constraint receives a weight equal to the percent identity within the pair-wise
alignment it comes from.
The fourth step is extension. For each pair of aligned residues in the library, T-Coffee
assigns a weight that reflects the degree to which those residues align consistently
with residues from all the other sequences.
The fifth step is the extended library. The final weight for any pair of residues reflects
some of the information contained in the whole family. It is obtained by taking each
aligned residue pair from the library and checking its alignment with residues from
the remaining sequences. Thus, the weight of
a pair of residues is the sum of all the weights gathered through the examination
of all the triplets involving this pair.
The final step is progressive alignment. Instead of BLOSUM/PAM, the weights in
the extended library are used to align the residues in the first two sequences. This pair of sequences is
then fixed, and any gaps that have been introduced cannot be shifted later. The next
closest sequences are then aligned to the existing alignment of the first two
sequences. Finally, the complete MSA is created [32].
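The library extension step described above can be sketched as follows. The data layout (residues as `(sequence, position)` tuples, library keys as unordered pairs), the example weights and the use of the minimum of the two weights along each triplet are assumptions made for this illustration.

```python
def pair(x, y):
    """Unordered key for a pair of residues, each given as (sequence, position)."""
    return frozenset((x, y))

def extended_weight(library, x, y):
    """Primary weight of (x, y) plus the contribution of every triplet x-z-y."""
    total = library.get(pair(x, y), 0)
    residues = {r for key in library for r in key} - {x, y}
    for z in residues:
        if z[0] in (x[0], y[0]):
            continue  # z must come from a third sequence
        # A triplet supports the pair only as strongly as its weaker link.
        total += min(library.get(pair(x, z), 0), library.get(pair(z, y), 0))
    return total

# Hypothetical primary library with percent-identity weights.
library = {
    pair(("A", 3), ("B", 3)): 88,
    pair(("A", 3), ("C", 4)): 77,
    pair(("C", 4), ("B", 3)): 100,
}
print(extended_weight(library, ("A", 3), ("B", 3)))
```

Here the direct alignment of the two residues (weight 88) is reinforced by the path through sequence C, so the pair ends up with a higher weight than any pair supported by a single pair-wise alignment only.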
3.1.5.5 GEISHA 3
ClustalW, MUSCLE, KALIGN and T-COFFEE perform their calculations based on statistical matrices or
Hidden Markov Models (HMMs). Under an HMM, the probability that amino acid A is substituted
by amino acid B is independent of what amino acid A was itself derived from. However,
amino acids cannot be considered equal units in evolutionary substitution. Some of
them, such as Serine, Arginine and Leucine, are encoded by six codons, while others,
such as Methionine and Tryptophan, are encoded by one codon. In addition, the Markov model does
not take into account cryptic mutations, which change the composition of the gene without affecting
the amino acid sequence, or the fact that single point mutation is the common mechanism of protein
variability. For example:
Met (AUG) is changed to Arg (AGG) and then to Lys (AAG). If the Arg originated
from Met, one step is needed to change Arg to Lys. However, Leu (CUR) is changed to
Arg (CGR), then to Gln (CAR) and finally to Lys (AAR). In this case, if the Arg
originated from Leu, two steps are needed to change Arg to Lys.
Thus, ClustalW, KALIGN, T-COFFEE and MUSCLE do not take the genetic code into account. In
order to overcome this limitation, genetic semi-homology as implemented in Geisha3 is
used in our study to increase the accuracy of the alignment; it also considers
the close relationship between amino acids and their codons in related proteins. For
instance:
Met (AUG) is converted into Arg (AGG) and then into Lys (AAG). Each of these
steps is a single point mutation, first from U to G
and then from G to A. However, if the Arg in this example is encoded by CGU, then
the mutation to Lys (AAA or AAG) can no longer be achieved by a single transition or transversion.
Thus, Geisha3 in our study not only helps to increase the accuracy of the
alignment but also helps to reduce the mismatches that occur in the alignment
process.
In Geisha3, the term semi-homology means that two residues are
semi-homologous if their codons differ by only one substitution. There are three
types of semi-homology:
The first type of semi-homology concerns amino acids whose codons differ by one
nucleotide of the same type (a transition), such as pyrimidine (T and C) to pyrimidine
or purine (A and G) to purine.
The second type of semi-homology concerns amino acids whose codons differ by one
nucleotide of a different type (a transversion), such as pyrimidine to purine.
The last type of semi-homology is not an alternative to the former two. It concerns
residues whose codons differ in the last codon position, which is known to be the most tolerant in
encoding amino acids [33, 34, 35].
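The definition of semi-homology can be checked against the standard genetic code with a short sketch. This is not the Geisha3 implementation: the function names are hypothetical, amino acids are given as one-letter codes, and codons are written in the DNA alphabet (T in place of U).

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated with the first base varying slowest.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
PURINES = {"A", "G"}

def codons_for(amino_acid):
    """All codons that encode the given one-letter amino acid."""
    return [c for c, aa in CODON_TABLE.items() if aa == amino_acid]

def single_step_changes(aa1, aa2):
    """List (codon1, codon2, position, kind) for codon pairs differing at one base."""
    changes = []
    for c1, c2 in product(codons_for(aa1), codons_for(aa2)):
        diffs = [i for i in range(3) if c1[i] != c2[i]]
        if len(diffs) == 1:
            i = diffs[0]
            same_type = (c1[i] in PURINES) == (c2[i] in PURINES)
            kind = "transition" if same_type else "transversion"
            changes.append((c1, c2, i + 1, kind))
    return changes

print(single_step_changes("M", "R"))  # Met -> Arg
print(single_step_changes("R", "K"))  # Arg -> Lys
print(single_step_changes("L", "K"))  # Leu -> Lys: no single-step path
```

Two residues are semi-homologous when the returned list is not empty; entries with position 3 correspond to the third type described above.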
3.1.6 Multiple sequence alignments of human SDR protein and
alignment verification
Seventy-five sequences of human SDR enzymes were collected from the UniProtKB database
(http://www.uniprot.org). Sequences were initially aligned with ClustalW, T-Coffee,
MUSCLE and Kalign using the template sequence Q14376 (UDP-glucose 4-epimerase). In
order to create the most robust alignment possible, the initial alignments
produced by each method were compared against one another, and the most divergent
sequences, with a very low degree of shared identity, were removed before
performing subsequent analyses.
The potential evolutionary relationships between corresponding non-identical positions
in the four different multiple alignments were verified separately using the genetic
semi-homology algorithm implemented in version 3 of the program Geisha
[33, 34, 35]. Geisha3 is freely accessible from the website
(http://atama.wnb.uz.zgora.pl/~jleluk/linki.html). Verifying multiple sequence
alignments using Geisha helps to identify and reduce potential mismatches that may
occur during the initial alignment process. ClustalW, T-Coffee, MUSCLE and Kalign
are based on the Hidden Markov Model. Geisha improves alignment accuracy by
completing the alignment while considering point mutations. Setting it apart from the
programs used for initial alignments, Geisha assumes that the probability of the
replacement of one amino acid by another depends significantly on which amino
acids occupied that position in the past.
Only the sequences that displayed the highest level of identity (equal to or higher
than 80% in this case) were kept in the resulting MSA; the others were
removed, because these sequences are the targets for constructing the consensus
sequence, which shows the most conserved motifs of the human SDR family.
3.2 Consensus sequence construction and BLAST search
3.2.1 What is BLAST (Basic Local Alignment Search Tool)?
The most widely used software for efficiently comparing biosequences to a database is
BLAST [26]. BLAST computation is organized as a three-stage pipeline:
Stage 1: Word matching, which detects substrings of fixed length w in the database stream
that perfectly match a substring of the query.
Stage 2: Ungapped extension. Each matching w-mer is forwarded to the second
stage, which extends the w-mer to either side to identify a
longer pair of sequences around it that match with at most a small number of
mismatched characters. These longer matches are high-scoring segment pairs (HSPs).
Stage 3: Gapped extension. Every HSP that has both enough matches and sufficiently few
mismatches is passed to the gapped extension stage. Gapped extension uses the
Smith-Waterman dynamic programming algorithm to extend the HSP into a gapped
alignment, a pair of similar regions that may differ by arbitrary edits [37, 38].
In this study, we apply NCBI BLAST [39] to search for homologous sequences of the
human SDR family.
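Stage 2 of the pipeline, ungapped extension of a word match, can be sketched as below. The scoring (match +1, mismatch -1), the drop-off threshold `x_drop` and the function names are assumptions for illustration; NCBI BLAST scores protein extensions with a substitution matrix and its own parameters.

```python
def extend_one_side(query, target, qi, ti, step, x_drop):
    """Walk from (qi, ti) in direction step; return (best score gain, length kept)."""
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= ti < len(target):
        score += 1 if query[qi] == target[ti] else -1
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # the score has fallen too far below the best seen so far
        qi += step
        ti += step
    return best, best_len

def ungapped_extend(query, target, qi, ti, w, x_drop=2):
    """Extend an exact word match of length w; return (score, query start, query end)."""
    right_gain, right_len = extend_one_side(query, target, qi + w, ti + w, 1, x_drop)
    left_gain, left_len = extend_one_side(query, target, qi - 1, ti - 1, -1, x_drop)
    return w + left_gain + right_gain, qi - left_len, qi + w + right_len

# The word "ACG" matches at position 2 of both hypothetical sequences.
print(ungapped_extend("AAACGTTTGCA", "CCACGTTAGCA", 2, 2, 3))
```

The returned segment tolerates one mismatch on the right of the word but stops on the left, where two consecutive mismatches bring no gain.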
3.2.2 Construction of consensus of Human SDR protein family and
BLAST search
As a way of summarizing the verified human SDR multiple sequence alignments, a
single consensus sequence for the entire human SDR super-family was established.
The consensus sequence was obtained using the Consensus Sequence Constructor
[33, 34, 35] with default parameter values. Highly conserved positions (>70%
identity) are marked with bold black letters, whereas intermediate conservation
(>30% identity) is indicated with black characters corresponding to the most
commonly occurring residue. Positions marked X are variable positions,
at which no particular residue occurs in more than 30% of the sequences.
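The thresholds described above can be illustrated with a minimal sketch. This is not the Consensus Sequence Constructor itself: the function name, the toy alignment, the use of upper case in place of bold type, lower case for intermediate conservation, and the treatment of gap-dominated columns as X are all assumptions.

```python
from collections import Counter

def consensus(alignment, high=0.70, low=0.30):
    """Build a consensus string from equal-length aligned sequences."""
    result = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        frequency = count / len(column)
        if residue == "-" or frequency <= low:
            result.append("X")              # variable position
        elif frequency > high:
            result.append(residue.upper())  # highly conserved (bold in the thesis)
        else:
            result.append(residue.lower())  # intermediate conservation
    return "".join(result)

toy_alignment = ["GYSKA", "GYAKA", "GYTKL", "GFCRL", "GFDRA"]
print(consensus(toy_alignment))
```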
The Consensus Sequence Constructor is an original application designed by our Polish
collaborators and is freely available for non-commercial academic purposes from the
website http://atama.wnb.uz.zgora.pl/~jleluk/linki.html. The most robust consensus
sequence was then used to identify two types of specificity for all members of the
human SDR super-family: 1) the general specificity, which indicates common
features of the entire enzyme super-family, and 2) the individual specificity, which
distinguishes the unique structural properties of each grouping within the human
SDR super-family separately. Put another way, the general specificity is concerned
with the more conservative regions of the human SDR protein sequence, while the
individual specificity highlights the more variable regions. By investigating both types
of specificity, our results may be of better use for future work on developing
inhibitors that can be directed to only one or a few enzymes without affecting the
activity of others. Lastly, the consensus sequence was also used in a BLAST search
for potential new members of the human SDR family. The new sequences
supplemented the original 75 SDR family members (about 100 additional sequences)
and were aligned in the same way as described above.
3.3 Phylogenetic tree construction and comparison of consensus sequences
3.3.1 Phylogenetic Tree Prediction
A phylogenetic tree shows the inferred evolutionary relationships among various
biological species or other entities, based upon similarities and differences in
their physical and/or genetic characteristics. The taxa joined together in the tree are
implied to have descended from a common ancestor [25].
Phylogenetic tree prediction, which is used for structuring sequences, constitutes an
important area of sequence analysis. It can be helpful when analyzing changes that
have occurred in the evolution of different organisms, or it can be of use when
studying the evolution of a family of sequences. Based on these analyses, the
sequences that are the most closely related can be identified by the fact that they
occupy neighboring branches on a tree.
When a phylogenetic analysis of a family of related nucleic acid or protein
sequences is performed, the evolutionary history of the family is examined and
the sequences are shown in the form of an evolutionary tree. The original ancestor
sequence then forms the root node of the tree. The branching relationships in the tree
show the degree to which the sequences are related. The most closely related sequences
are placed as neighboring leaves and are joined to a common branch beneath them.
Phylogenetic analysis is closely related to multiple alignment, which is often
the basis from which the phylogenetic analysis proceeds. One reason for building
a phylogenetic tree from the multiple alignment is that the tree makes the
relationships between the sequences clearer. Another reason is that, as the genes for
the proteins in the different organisms have developed during evolution, amino
acids have been substituted. A phylogenetic tree can be of use when these
substitutions are to be analyzed [21]. An illustration of a small phylogenetic
tree with a few substitutions is given in Figure 3.
Figure 3: Here A, B and C represent three highly conserved sequences
of the same protein taken from three separate organisms. The phylogenetic
tree gives a view of the substitutions that happened during evolution
as these sequences evolved from the same ancestor [21].
How, then, is the phylogenetic analysis performed? First, a multiple alignment
is built using one of the methods described in the section on multiple
alignment, for example the CLUSTALW program. Then the substitution
model is chosen. The choice of model is based on how similar the sequences are.
If they are highly similar, the PAM matrix is often useful, since it is designed to
track the evolutionary origins of proteins, but if they are less similar the BLOSUM
matrix might be superior, because it is designed to find the conserved
domains of proteins [21]. After choosing the substitution model, the next
step is to build the tree. Here there are several tree-building methods to choose
from. These methods can be divided into two main groups, namely distance-based
and character-based methods [25].
These two groups are explained only briefly below, because phylogenetic prediction
was not considered as a solution for the classification problem addressed in this
project. The reason for this is that the SDR proteins are distantly related. A
phylogenetic analysis of very different sequences is difficult to carry out, as there are
several possible evolutionary paths that could have given rise to the observed
sequences. This results in a very complex problem that requires considerable
expertise to solve [21].
3.3.2 Distance-based Method
Distance-based methods use the number of changes, the distance, between two
aligned sequences to derive trees [22]. The sequence pairs that have the smallest
number of changes between them are the most closely related. They are placed as
neighbors in the tree and are both connected to their common ancestor node by a
branch [21]. Several different methods are classed as distance-based
methods, for example the Unweighted Pair Group Method with Arithmetic Mean
(UPGMA), Neighbor-Joining, and Fitch-Margoliash [21, 28].
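As an illustration of a distance-based method, UPGMA can be sketched as repeated merging of the two closest clusters, with the distance to a merged cluster taken as the size-weighted average of the distances to its members. The function name, the toy distances and the Newick-like string output (without branch lengths) are assumptions made for this sketch.

```python
def upgma(leaf_distances):
    """leaf_distances maps (name, name) to a distance; returns the tree topology."""
    d = {frozenset(k): v for k, v in leaf_distances.items()}
    size = {leaf: 1 for key in d for leaf in key}  # cluster label -> number of leaves
    while len(size) > 1:
        a, b = sorted(min(d, key=d.get))           # closest pair of clusters
        merged = f"({a},{b})"
        for c in size:
            if c not in (a, b):
                # Average distance to the new cluster, weighted by cluster size.
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))

toy = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
       ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
print(upgma(toy))
```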
3.3.3 Character-based Method
The character-based methods derive trees that optimize the distribution of the
actual data patterns for each character. Pair-wise distances are therefore not, as
in distance-based methods, fixed as they are determined by the tree topology.
This allows the assessment of the reliability of each base position in an alignment on
the basis of all other base positions. [25] Examples of methods that belong to the
character-based methods are Maximum Parsimony and Maximum Likelihood. The last
method for sequence analysis is secondary structure prediction, which is described in
the paragraph below.
Although both distance-based and character-based methods can be used to construct
phylogenetic trees, in our study we preferred to construct trees with character-based
methods, specifically Maximum Likelihood (ML) and Maximum Parsimony (MP).
This is because distance-based methods such as UPGMA (Unweighted Pair Group Method with
Arithmetic Mean) and Neighbor-Joining (NJ) have several drawbacks: they
work well on closely related sequences but fail on distantly related, divergent
sequences [28]. In our study, the level of identity within the SDR family is only 15%
to 30%, so MP and ML were used to overcome the drawbacks of UPGMA and NJ and
to produce the most accurate result [28].
3.3.3.1 PHYLIP
PHYLIP, as used here with Maximum Parsimony, applies Fitch’s algorithm to find the
minimum number of mutations required to change one nucleotide into another.
In other words, MP works from the observed similarities and
differences among the data and seeks the smallest number of evolutionary changes among the
Operational Taxonomic Units. PHYLIP selects the tree that minimizes
the number of evolutionary steps (transformations of one character state into another)
required to explain a given set of data.
For each site, each leaf is labeled with a set containing the nucleotide observed at this
position. For each internal node i with children j and k, labeled S_j and S_k:
$$S_i = \begin{cases} S_j \cup S_k & \text{if } S_j \cap S_k = \emptyset \\ S_j \cap S_k & \text{otherwise} \end{cases}$$
The total number of changes necessary for a site is the number of union operations.
Weakness: it implicitly assumes that the rate of change along branches is similar.
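The labeling rule for a single site can be sketched as a short recursive function. The representation of the tree as nested pairs of leaf names, the function name and the example states are assumptions for illustration; this is not the PHYLIP code.

```python
def fitch(tree, leaf_states):
    """Return (state set at the root of tree, number of union operations)."""
    if isinstance(tree, str):                 # a leaf: the observed nucleotide
        return {leaf_states[tree]}, 0
    left, right = tree
    s_j, changes_j = fitch(left, leaf_states)
    s_k, changes_k = fitch(right, leaf_states)
    common = s_j & s_k
    if common:                                # non-empty intersection: no change
        return common, changes_j + changes_k
    return s_j | s_k, changes_j + changes_k + 1   # union counts as one change

tree = ((("A", "B"), "C"), "D")
site = {"A": "G", "B": "G", "C": "T", "D": "T"}
print(fitch(tree, site))
```

For this site one union operation is needed, so the tree explains the data with a single change; repeating the count over all sites and comparing trees gives the most parsimonious topology.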
3.3.3.2 SSSSg
SSSSg implements maximum likelihood. It evaluates the possible ways of changing
one amino acid into another. In other words, ML considers all the possible
trees containing the set of organisms under consideration, and uses statistics to evaluate the
trees.
For example, given data D and a model M, find the tree T for which
$\Pr(D \mid M, T)$ is maximized.
Two independence assumptions are made:
- Different sites evolve independently.
- Diverged sequences evolve independently after diverging.
3.3.4 Human SDR phylogenetic tree and comparison of consensus
sequences
The results of our multiple sequence alignments were used as input data for
constructing phylogenetic trees that would outline the interrelationships of the
various members of the human SDR super-family. In this study, two independent
approaches were used to construct the phylogenetic trees - PHYLIP (Felsenstein,
1989) and SSSSg (database: Uniprot, matrix: BLOSUM45, number of matches: 10
and E upper value: 5.0). PHYLIP is a free package of programs for inferring
phylogenies, accessible at http://www.phylip.com. SSSSg is our original software
and is freely accessible at
http://atama.wnb.uz.zgora.pl/~jleluk/software/wlasne/ssssg/ssssg.zip. PHYLIP
uses Fitch’s maximum parsimony algorithm, and constructs the phylogenetic tree
that requires the least amount of evolutionary change to fit the input data. To
supplement our parsimony analyses, we also applied the maximum likelihood
algorithm to our data using the program SSSSg. Maximum likelihood is an optimality
criterion, like maximum parsimony, for the reconstruction of phylogenies. Maximum
likelihood methods differ from the non-parametric parsimony approach because they
use an explicit model of character evolution for tree construction. Both maximum
parsimony and maximum likelihood methods recovered the same five, high-level
branching events within the human SDR family, and lower-level topological
differences were negligible. As such, we arbitrarily chose to use the maximum
likelihood tree for all subsequent analyses.
Using Consensus Sequence Constructor, we identified a single consensus protein
sequence for each of the five human SDR subgroups. A comparison was carried out
on the five resultant consensus sequences in order to identify the conservative and
variable sequence regions in human SDR enzymes.
To further elucidate patterns of conservation and variation in human SDR enzymes, a
comparative analysis of the 3D protein structure of each of the five consensus
sequences was also conducted. We identified a representative structure for each of
the five groups recovered in the phylogenetic reconstructions using the Protein Data
Bank. The selection criteria focused on the maximum identity of the sequence
alignment from all members in each group, and the highest degree of similarity at
the tertiary structural level.
3.4 Mutational variability of human SDRs
Mutational Variability (Talana, Consurf)
Mutational variability analysis was carried out to highlight the conserved and variable
regions in human SDR sequences and structures.
In our study, we applied two independent software tools, Talana and Consurf. The Consurf
server is used for estimating the evolutionary conservation of amino acid positions based on
the phylogenetic relations between homologous sequences.
3.4.1 Consurf
The first step in Consurf is to find homologous sequences (using BLAST) for the query
protein structure and amino acid sequence. The sequences are clustered with CD-HIT, and
highly similar sequences are removed using a 95% cut-off. A multiple sequence alignment
is then created, and a phylogenetic tree is built from the multiple sequence alignment
using the Neighbor Joining method.
In the second step, position-specific conservation scores are calculated using the
method chosen by the user (for example, maximum likelihood).
In the third step, the conservation scores are divided into a discrete scale of nine
grades for visualization. For example, grade 1 (the most variable positions) is colored
turquoise, grade 5 (intermediately conserved positions) is colored white, and grade 9
(the most conserved positions) is colored maroon.
The conservation score at a site corresponds to the site’s evolutionary rate; it is a
measure of evolutionary conservation at each sequence site of the target chain.
The color grades are assigned as follows: the conservation scores below the average
(negative values, indicative of slowly evolving, conserved sites) are divided into
4.5 equal intervals. The same 4.5 intervals are used for the scores above the average
(positive values, indicative of rapidly evolving, variable sites).
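The grading rule can be illustrated with a short sketch. It assumes that the scores are already centered on the average (so conserved sites are negative), that the interval width is the magnitude of the lowest score divided by 4.5, and that scores beyond the ninth interval are assigned to grade 1. The function name and example scores are hypothetical, and the code is not taken from the Consurf server.

```python
def conservation_grades(scores):
    """Map centered conservation scores to grades 9 (most conserved) .. 1 (most variable)."""
    lowest = min(scores)
    if lowest >= 0:
        raise ValueError("expected at least one below-average (negative) score")
    width = abs(lowest) / 4.5              # assumed interval width
    grades = []
    for score in scores:
        steps = int((score - lowest) // width)   # number of whole intervals above the lowest score
        grades.append(9 - min(steps, 8))         # clamp very variable sites to grade 1
    return grades


# Hypothetical scores; the lowest score (-4.5) gives an interval width of 1.0.
example = [-4.5, -3.2, 0.0, 0.7, 3.9, 10.0]
print(conservation_grades(example))
```

With this rule a score equal to the average (0.0) falls in the middle of the fifth interval, which corresponds to the white, intermediately conserved grade 5.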
3.4.2 Talana
Similar to Consurf, Talana estimates conservation, in this case by calculating the
number of different amino acids that occupy each position in a provided MSA. It also
produces charts and scripts that are used to visualize the variability on a PDB
structure. In addition, Talana divides the conservation scores into 12 grades: grades 1
and 2 are the most conserved and are shown in the darkest blue, grades 3 to 6 are
intermediately conserved and are shown in light blue and white, and grades 7 and above
are the most variable and are shown in pink and red.
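The per-position count described above can be sketched in a few lines. This is a minimal illustration, not Talana's code: the function name, the treatment of gaps and the toy alignment are assumptions.

```python
def column_variability(msa, gap="-"):
    """Count the distinct amino acids (gaps excluded) in every column of an MSA.

    msa : list of aligned sequences of equal length
    """
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    return [len({seq[i] for seq in msa} - {gap}) for i in range(length)]


# Toy alignment (not real SDR data).
toy_msa = ["MKVA", "MRVA", "MK-G"]
print(column_variability(toy_msa))
```

A count of 1 marks a fully conserved column; larger counts mark more variable positions, which would then be binned into grades for coloring.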
3.4.3 Mutational variability of human SDR protein family
We used the five representative structures we identified (Table 1) together with all
protein sequences available in each group identified in our phylogenetic analyses to
study the mutational variability within the five subgroups of the human SDR family.
ConSurf (available at consurf.tau.ac.il) and Talana (available at
http://www.bioware.republika.pl/) were used to identify conserved and variable
residues within functional regions in the aligned homologous sequences.
Both programs analyze the evolutionary conservation of amino acids based on the
sequences and produce conservation scores that correspond to the rate of evolution
at each site. In Consurf, the scores are divided into nine grades for the visualization
of differing rates of evolution: grade 1 is the most variable position and is colored
turquoise; grade 5 is the intermediately conserved position and is colored white; and
grade 9 is the most conserved position and is colored maroon. In Talana, the
conservation scores are divided into 12 grades: grade 1 is the most conserved position
(darkest blue); grade 6 is the intermediately conserved position (white); and grade 12
is the most variable position (darkest red). After the conservation score has been
calculated for each site, both programs automatically project the values onto the
protein structures. Results from Consurf and Talana were visualized using Rastop2.2
(http://www.geneinfinity.org/rastop) and compared with each other to verify their
consistency.
Table 1: PDB codes and names of the five representative structures
Group   PDB code and name of representative structure
1       3edm chain A, Uncharacterized Oxidoreductase SSP0419
2       1hdc chain A, Retinol Dehydrogenase 7
3       1yb1 chain A, 17-beta hydroxysteroid dehydrogenase 13
4       3rd5 chain A, Retinol dehydrogenase 11
5       1q7b chain A, 3-Oxoacyl-[acyl-carrier-protein] reductase FabG
3.5 Analysis of correlated mutations
Correlated mutations are mutations that occur together at several positions and depend
on each other. According to the current hypothesis of positive Darwinian selection at
the molecular level, correlated mutations are related to the changes occurring in their
neighborhood. They reflect protein-protein interactions and preserve the biological
activity and structural properties of the molecules [40].
In this project, we also studied mutational correlation among human SDR members
in order to gain a better understanding of protein-protein interactions within this
protein family. This information may be useful for further studies on inhibitor design.
Two programs, Corm and Talana, were used to accomplish this task.
Lastly, we set out to investigate the tendency of different amino acids along human
SDR proteins to mutate together. It is clear that many residues within the same
protein have evolved to form specific molecular complexes and that the specificity of
these interactions is essential for their function. To maintain functionality, it is
reasonable to assume that the sequence changes accumulated during the evolution
of one of the interacting residues must be compensated by changes in the other
[34,35,36]. In this way, the network of necessary inter-residue contacts may
constrain divergence of the protein sequence to some extent.
Correlated mutations in representative protein structures and corresponding
consensus sequences in each subgroup of human SDRs were identified, localized and
analysed with the aid of Talana and Corm (freely available for non-commercial
academic purposes at http://atama.wnb.uz.zgora.pl/~jleluk/software/wlasne/corm.jar).
The program FEEDBACK, which is designed to analyze aligned protein sequences for the
occurrence of correlated mutations, was implemented in Corm. For each residue occurring
at each position, it returns all residues that occur at all other sequence positions of
the aligned proteins. Talana produces a similar set of results, but also
highlights correlated sequence mutations in the corresponding protein structures.
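The kind of output described above can be sketched as follows. This is not the Corm or FEEDBACK source code: the function names, the 0-based positions and the disjointness criterion used to call two positions correlated are assumptions made for illustration only.

```python
from collections import defaultdict


def cooccurring_residues(msa, ref_pos):
    """For each residue seen at ref_pos, collect the residues seen at every other column."""
    table = defaultdict(lambda: defaultdict(set))
    for seq in msa:
        ref = seq[ref_pos]
        for pos, residue in enumerate(seq):
            if pos != ref_pos:
                table[ref][pos].add(residue)
    return table


def correlated_positions(msa, ref_pos):
    """Columns whose residue sets do not overlap between different reference residues.

    Assumed criterion: a column counts as correlated with ref_pos when every
    residue found there is tied to exactly one residue at the reference position.
    """
    table = cooccurring_residues(msa, ref_pos)
    refs = list(table)
    if len(refs) < 2:                      # a conserved reference column carries no signal
        return []
    hits = []
    for pos in range(len(msa[0])):
        if pos == ref_pos:
            continue
        sets = [table[ref][pos] for ref in refs]
        if sum(len(s) for s in sets) == len(set().union(*sets)):
            hits.append(pos)
    return hits


# Toy alignment: columns 1 and 2 change together with column 0, column 3 does not.
toy_msa = ["EAKT", "EAKS", "KGRT", "KGRS"]
print(correlated_positions(toy_msa, 0))
```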
The candidate correlated sequence and structure mutations that were recovered
using both software packages were compared and then visualized on the SDR
template structure of the five groups using DSVisualizer1.7 of Accelrys
(http://accelrys.com/products/discoverystudio/visualization-download.php) and/or
Rastop2.2 (http://www.geneinfinity.org/rastop). The visualization of the protein
sequence mutation correlation results from Talana and Corm provided an additional
method of investigating potential correlated mutations in protein structure.
3.6 Availability of original software generated by authors
The original applications of Geisha 3, Consensus Constructor, SSSSg, Talana and
Corm are freely available at the addresses listed above. They are also available
directly upon any request sent to the authors. Additionally, the authors are willing to
assist in the appropriate, effective running of all applications.
4 RESULTS AND DISCUSSION
4.1 Multiple sequence alignment, consensus sequence generation, and analysis of human SDR specificity
After multiple sequence alignment and verification, we identified four sequences
(P49327, P14060, P56159, and P56937) that shared very low sequence identity with
the rest of the members of the human SDR family; these were removed from
subsequent analyses. We constructed the consensus sequence from the remaining
71 sequences, and used it to identify features of general and individual specificity.
Our comparative analyses revealed little general specificity and much individual
specificity among human SDR sequences (figure 4). Among the 306 positions in the
consensus sequence, only 5 positions (1.6%, bold letters) are occupied by the
same residue in more than 70% of sequences, whereas 105 positions (34.3%) are
occupied by the same residue in at least 30% of sequences. The remaining 196 positions
(64.1%, marked X) are not occupied by any particular residue in more than 30% of
sequences.
Figure 4: Part of the multiple sequence alignment of the 71 human SDR members
Figure 5: Complete consensus sequence of the 71 human SDR members
Consensus sequence for the human SDR family, constructed using Consensus
Sequence Constructor. The highly conserved positions (>70% identity) are marked
with bold black letters (M, G, G, V and L). Intermediate conservation (>30%
identity) is indicated with black characters corresponding to the most commonly
occurring residue. The positions marked as X are the variable positions that are
not occupied by any particular residue in more than 30% of sequences. As a whole, this
figure displays the highly variable character of the human SDR family.
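The thresholding behind such a consensus can be sketched as follows. The code is a simplified illustration, not the Consensus Sequence Constructor program: the function name, the handling of gaps and the exact treatment of the 30% and 70% boundaries are assumptions.

```python
from collections import Counter


def consensus(msa, high=0.70, low=0.30, gap="-"):
    """Build a consensus string and a per-column conservation class.

    A column is 'high' if its most common residue exceeds `high`,
    'intermediate' if it reaches `low`, and 'variable' (shown as X) otherwise.
    """
    n_sequences = len(msa)
    symbols, classes = [], []
    for column in zip(*msa):
        counts = Counter(res for res in column if res != gap)
        if not counts:                      # column made only of gaps
            symbols.append("X")
            classes.append("variable")
            continue
        residue, count = counts.most_common(1)[0]
        frequency = count / n_sequences
        if frequency > high:
            symbols.append(residue)
            classes.append("high")
        elif frequency >= low:
            symbols.append(residue)
            classes.append("intermediate")
        else:
            symbols.append("X")
            classes.append("variable")
    return "".join(symbols), classes


# Toy alignment (not real SDR data).
toy_msa = ["MGAK", "MGAR", "MASE", "MVTD"]
sequence, classes = consensus(toy_msa)
print(sequence, classes)
```

Applied to the counts reported above, the three classes correspond to 5, 105 and 196 of the 306 consensus positions.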
4.2 Sequence specificity and interrelationships of the human SDR family
We recovered five distinct subgroups within the human SDR family (figures 6 and 7).
Figure 6: Phylogenetic tree construction by PHYLIP
Figure 7: Phylogenetic tree constructed with SSSSg. Both programs show
that the human SDR family can be phylogenetically grouped into five distinct
classes.
In this study, we sought to elucidate the evolution of sequence and structure within
the human SDR family. Our results illustrate that the human SDR family possesses a
low level of sequence conservation overall (figure 5). This indicates that evolutionary
differentiation has led to the formation of narrow specificity in individual members of
the family, rather than the preservation of common specificity for the family in its
entirety. This conclusion is further supported by the results of our phylogenetic
analysis (figure 6 and 7). Low overall sequence identity leads to the grouping of the
human SDR family into five distinct clusters (this result is similar to previous studies
on the phylogenetics of the human SDR family [37,38]), with each group potentially
further classified into two sub-groups (conservative and variable) based on the
outcome of mutational variability and correlated mutations analyses.
The consensus sequence for each of these five groups is shown in figure 8, in which the
positions that form the binding site of the enzyme (K and S) and the active site (Y)
are marked with red letters.
Figure 8: Comparison of the five consensus human SDR sequences
Group                           1      2      3      4      5
Active site (AS)                Y-156  Y-176  Y-185  Y-199  Y-151
Binding sites (BS)              S-143  S-164  S-172  S-174  S-138
                                K-160  K-180  K-189  K-203  K-155
Three residues between AS-BS    S-157  C-177  C-186  C-200  A-152
                                A-158  I-178  S-187  H-201  A-153
                                S-159  S-179  S-188  S-202  A-154
Figure 9: The active site (AS), substrate binding sites (BS), and the three
residues between the AS and one of the BS in the five human SDR groups, as identified
by Talana.
The location of these residues in the template structure of SDR group 1 is shown on the
right. Note that the AS and BS are highly conserved, in contrast to the three
residues adjacent to the AS and BS.
Red characters mark the binding sites (K, S) and the active site (Y) of the
enzymes. These locations were found to be conserved residues and outline the
common sequence features within the human SDR family (figure 8). In contrast, the
clusters of three amino acids marked in yellow (such as SAS, FGV, CSS, CHS and
AAA) indicate the presence of variable residues directly adjacent to the conserved
residues. These locations determine the narrow specificity within each group.
Based on a comparison of the consensus sequences for each of the five groups within
the human SDR family, the binding and active sites typically exhibit highly conserved
residues, and are occupied by the same type of residues in all 5 groups. In contrast
to the highly conserved nature of the active and binding sites, a cluster of three
amino acids, which is located directly adjacent to the active site and also next to
one of the binding sites (marked with red letters in figures 8 and 9), reveals
substantial variability.
4.3 Mutational variability of human SDRs
The two programs (Talana and Consurf) used to analyze the mutational variability of
both the sequence and the structure of the protein templates in each of the five
human SDR groups yielded similar results (figure 10). The identification of
conserved and variable sequence and structure regions within the human SDR
family is presented in figures 9 and 10. The conserved and variable sequences and
structures differ not only among the five human SDR groups, but also within each
group.
Figure 10: The identification of functional regions within group 1 using
Consurf and Talana (panels: Consurf, group 1; Talana, group 1).
Group 1 displayed the full range of the Consurf coloring scheme: the continuous
conservation scores are partitioned into a discrete scale of 9 bins for visualization,
such that bin 9 contains the most conserved positions and bin 1 contains the most
variable positions. The color grades (1-9) are assigned as follows: the most
conserved regions are shown in the darkest maroon and the most variable regions are
shown in the lightest turquoise. Similarly, group 1 displayed the full range of the
Talana coloring scheme, from grade 1 (the most conserved regions, shown in the darkest
blue) to grade 12 (the most variable regions, shown in the lightest rose). Therefore,
both tools gave similar results for the identification of functional regions in the
protein structure.
Figure 11: The result of mutational variability (done by Talana); panels: Group 1,
Group 2, Group 3, Group 4 and Group 5
Across the different groups of human SDRs, the protein structure of group 1 contains
a mixture of conserved and variable regions, with variable regions (the full range of
grades in the Talana colour scheme) dominating the whole structure. In contrast, group 3
is the most conserved of the groups (grade 1 in the Talana colour scheme dominates
the protein structure). Group 4 displayed an intermediate level of conservation,
whereas group 2 displayed an intermediate level of variability (figure 11).
Figure 12: Variability profiles for each of the five groups of human SDRs (panels:
total variability, core variability and surface variability for groups 1 to 5)
Total, core and surface variability profiles are displayed for each group based on the
distribution of residues on the protein structure. Group 3 displayed the most
conservative level (grade 1 of the color scheme is dominant in the entire structure)
compared to groups 4 and 5. Group 1 showed the most variable level (full grade of
color scheme, from grade 1 to grade 12 in the structure) and group 2 showed an
intermediate level of variability.
In addition, conservative and variable structures could also be detected within each
group. With few exceptions, the conserved residues occurred within active and
substrate binding sites, whereas the variable residues (a cluster of three amino acids
which are located directly adjacent to the active site and also next to one of the
binding sites) were found at random locations in the protein structure (figure 12).
For example, in group 1, the active site (Y-156) and the binding sites (S-143 and
K-160) are found in a conserved region of the protein structure, whereas a cluster of
three amino acids (S-157, A-158, S-159) is located in a variable region next to the
conserved region in the protein structure (figure 13). Similar patterns exist in each of
the other four groups of human SDRs, but involve clusters of different amino acids.
Group                           1      2      3      4      5
Active site (AS)                Y-156  Y-176  Y-185  Y-199  Y-151
Binding sites (BS)              S-143  S-164  S-172  S-174  S-138
                                K-160  K-180  K-189  K-203  K-155
Three residues between AS-BS    S-157  C-177  C-186  C-200  A-152
                                A-158  I-178  S-187  H-201  A-153
                                S-159  S-179  S-188  S-202  A-154
Figure 13: The location of the conserved and variable residues in the
template structure of group 1 of human SDRs, as identified by Talana.
For example, the conserved residues include the active site and binding sites
(Y-156, S-143 and K-160), all of which are located in a conserved region
(grade 1 in the Talana color scheme). In contrast, the cluster of three
residues (S-157, A-158 and S-159) is clearly located in a more variable
region.
Conserved residues are found at the active and binding sites, which are located
on the protein structure next to the binding pocket, such as Y-156 (active site) and
S-143 and K-160 (binding sites) in group 1 (figure 13). Furthermore, our mutational
variability results confirm that the conserved residues are located in the conserved
region of the protein structure (such as Y, S and K of group 1 in figure 13). The
result of the mutational variability analysis in our study complements prior studies on
the identification of the conserved residues Y, S and K. According to several previous
studies, the Y, S and K residues are considered, together, as a catalytic triad that is
found at the active sites of all human SDR proteins [7, 14]. Tyrosine (Y) functions as
the catalytic base, whereas Serine (S) stabilizes the substrate and Lysine (K)
interacts with the nicotinamide ribose and lowers the pKa of the Tyr-OH [14]. We
interpret the presence and location of these conserved residues (Y, S, K) as an
evolutionary constraint at the level of sequence and structure that leads to the
retention of similar physicochemical characteristics, thus maintaining a given function
in the human SDR family. It is these conserved residues that display the global
specificity that defines the common characteristics of the entire human SDR family.
Variable residues, essentially occurring only in the particular clusters of three amino
acids, are located directly adjacent to the binding pocket (between the active site and
one of the binding sites in figure 12). Just as with the conserved residues, we find
that these variable residues occur near a conserved region of the protein structure as
well. Additionally, the three amino acids form a narrow cluster on the
binding pocket, such as S-157, A-158, S-159 in group 1 (figure 13). In particular, we
see several instances of Serine and Alanine interchange via single point mutations.
Serine is encoded by six different codons (UCU, UCC, UCA, UCG,
AGU and AGC) and Alanine is encoded by four codons (GCU, GCC, GCA and GCG).
Hence, the simplest way of changing Serine into Alanine is a single nucleotide
substitution, for instance, Serine (UCU) changed to Alanine (GCU). Thus, single point
mutations could potentially be the mechanism underlying the marked variability of
group 1, which is the least conserved group overall. In contrast, the cluster of three
amino acids in group 3 is Cysteine-Serine-Serine, but unlike Serine and Alanine,
Cysteine is encoded by only two codons (UGU and UGC). Although a single point
mutation could also be the main mechanism of mutation in group 3, Cysteine residues
can form disulfide bridges, which may increase overall protein
stability [39, 40]. Hence, group 3 of human SDRs shows the most sequence
conservation compared to the others. The differences and similarities we see among
the five different groups of human SDRs are likely related to the functional
conservation used by each group in order to maintain the common metabolic
functions of the human SDR family as a whole. Particular divergent, adaptive
specificity, on the other hand, has permitted each group to adopt its own specific
targets. These similarities, and especially the differences, in structure and function
of the human SDR family are important to consider during the future design of specific
inhibitors that target only a particular group within the human SDR family.
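The codon argument in this section can be checked by enumeration. The sketch below lists every pair of codons that differ by exactly one nucleotide, using the codon sets quoted in the text; the function name is hypothetical. It also labels each substitution in the strict sense used in molecular genetics (transition: purine to purine or pyrimidine to pyrimidine; transversion: purine to pyrimidine or the reverse).

```python
SER = ["UCU", "UCC", "UCA", "UCG", "AGU", "AGC"]
ALA = ["GCU", "GCC", "GCA", "GCG"]
CYS = ["UGU", "UGC"]
PURINES = {"A", "G"}


def single_substitutions(source_codons, target_codons):
    """List (source, target, kind) for codon pairs that differ at exactly one position."""
    hits = []
    for src in source_codons:
        for dst in target_codons:
            diffs = [(x, y) for x, y in zip(src, dst) if x != y]
            if len(diffs) == 1:
                old, new = diffs[0]
                same_class = (old in PURINES) == (new in PURINES)
                hits.append((src, dst, "transition" if same_class else "transversion"))
    return hits


ser_to_ala = single_substitutions(SER, ALA)
cys_to_ser = single_substitutions(CYS, SER)
for src, dst, kind in ser_to_ala:
    print(src, "->", dst, kind)
```

The enumeration shows that the four UCN Serine codons each reach an Alanine codon through one change at the first codon position, whereas the AGU and AGC Serine codons need two changes.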
4.4 Correlated mutations within the human SDR family
Our analyses of mutational correlation within the human SDR family using both Corm
and Talana reveal similar outcomes. Based on the distribution of mutations mapped
onto the protein structure, the correlated mutations can be divided into two groups. The
first, the core group, includes all mutations that show core molecular contact (table 2),
most of which are located in conserved regions of the protein structures (the core
variability in figure 12). The second, the surface group, includes all mutations that
appear on the surface of the protein structure (table 3), with most mutations located
in variable regions of the protein structure (the surface variability in figures 11 and
12). Tables 4 and 5 outline the observed sets of correlated mutations for
group 5 of human SDRs.
Table 2: The core residues in five human SDR groups identified by Talana
The residues in each group are located at the core of the protein structure. The
occurrences of Valine and Isoleucine are more frequent compared to other amino
acids, showing that these hydrophobic amino acids potentially play a more vital role
in stabilizing the chemical structure of the proteins.
Group 1: V11, I85, I87, M95, T119, H137, I141, V163, V181, T182, S183, I184,
G187, A214
Group 2: I115, V117, G120, V133, V141, M166, G208, V213, L219, S245, M247
Group 3: V73, V89, V104, V117, A123, G147, I151, V173, C174, I211
Group 4: I42, V43, L67, G75, M129, K160, A195, T224, V227, T231, S234,
Q248, V252
Group 5: V7, A57, V69, G79, V80, V158, V174, T178, A220
Table 3: The surface residues in five human SDR groups identified by Talana
These residues are located on the surface of protein structures and are distant from
each other.
Group 1: E3, Q5, V8, A20, S21, I22, T25, Q29, D39, S41, R42, E45, V46, K48,
I50, Q51, N53, Q55, V57, E59, S61, I62, D64, H67, E69, T72, E73, E80,
Q84, I87, M95, S98, A99, I100, E102, E109, A110, M111, D113, I116,
K117, G118, T119, Y121, S129, N132, H137, I144, E148, V149, T150,
L155, S157, A161, V163, I166, Q168, E171, R180, V181, T182, S183,
G187, M188, S194, G195, T197, W199, K204, L205, K208, I210, E212,
A213, A214, I215, Y216, Q219, Q220, H223, V224, N225, E228, T230,
V231, R232, P233
Group 2: K64, R70, S71, D75, E78, I81, V91, E99, R100, N103, I115, V117, M119,
N122, R126, F130, A131, S132, L134, D135, L139, N147, R153, M166,
T195, Y196, G208, V213, T214, M216, S220, D221, L223, A230, V234,
I237, K241, F242, D244, S245, M247, A249, E251, N255, C257, G259,
D266, C275, H276, S282, W285
Group 3: T35, Q59, R62, V86, V89, N102, D105, Q106, R109, E115, A123, P126,
L130, S131, K133, E135, E136, T138, I145, L155, S158, R161, R162,
G177, I179, Y181, I183, P184, A201, D204, K208, V219, T226, R232,
P235, L237, R244, S245, I247, N248, N253, Y262, N264, I268, K271
Group 4: Q35, L36, V43, E53, K56, L67, V72, D73, G75, L77, R80, Q83, A84, V85,
G87, Q90, F92, K95, A99, D100, T101, K109, D110, H117, M129, S133,
A136, H142, H155, K160, E163, L175, H178, L179, R181, I182, H183,
H185, E190, F192, A195, L197, H201, K211, K218, S220, T224, Y225,
V227, S234, S241, I242, M243, W245, W247, Q248, F251, V252, Q258,
Y266, C267, L269
Group 5: E4, L24, R28, K31, E39, Q43, S46, D47, Y48, G50, A57, T61, N62, P63,
K71, A72, T74, G79, M96, S104, I106, E108, M126, K128, Q130, A149,
V174, V179, K190, A191, N193, D194, E195, A202, A206, D211, P212,
R213, E226, I244
Table 4: The identification of correlated mutation sets and their core and
surface characteristics for group 5
Group5
Positions
70
79
105
157
173
194
Mutual distribution of correlated mutations (CM)
Core
Surface
Asn-193, Gln-130, Thr-74
Ala-72, Lys-71, Ala-57
Leu-24
Val-80
Val- Ala-202, Gln-130, Glu-108
69
Thr-74, Ala-72, Pro-63
Asp-47, Lys-31
Thr-178
Ala- Glu-226, Arg-213, Asp220
211
Ala-206,
Asp-194, Gln-130
Lys-128, Met-126, Ile-106
Met-96,Thr-74, Thr-61
Gly-50, Tyr-48, Ser-46
Glu-39, Arg-28, Glu-4
Val-158
Asn-193, Lys-190, Gln-130
Thr-174, Ala-72
Val-174
Ala-57
Pro-212, Gly-79, Thr-74
Ala-72, Leu-24
Thr-74, Gly-79, Ser-104
211
Val-174
57
Ala-
Gln-130, Glu-195, Ile-244
Pro-212, Gly-79, Thr-74
Ala-72, Leu-24
Table 5: Selected correlated mutations in human SDRs identified by Talana
Correlated mutations in group 5 were analyzed with the Talana program, which indicates
that a mutation at one specific position is accompanied by mutations at other positions.
For example, if a mutation occurred at position 6 (I), other mutations occurred at
the same time at positions 61 (D), 73 (EKNR), 78 (ADEP) and 129 (ACFGH).
AA at
Sequen
ce
Counts
Refer Refere
ence
nce
Positi Positio
on
n
70
E
5
79
105
148
157
173
194
211
Correlated Mutations and amino acids
23: EKT 56:
EMV
71:
EHKN
Q
71:
-AT
62:
EKV
62:
FP
38: E
73:
KQRY
129: 192:
CFHW PST
73:
ELNT
68:
ADLT
68:
V
45:
S
129:
AQQS
71:
EQT
71:
-AKN
47: Y
192:
NR
73:
KLNY
73:
ERT
49:G
107:
DKNQ
107:
AE
60: T
129:
AFSW
129:
CGHQ
73:
KNRT
201:
EQS
201:
AD
95: M 125:
M
K
4
23: AL
I
4
V
4
I
4
30:
FTV
30:
IK
3:
-E
56:
A
46:
AE
46:
DG
27:
KR
V
4
3:
QT
27:
38:
45:
ADLT AFGS ALV
47:
EQT
49:
-EIT
60:
LQS
73:
ELQY
ALV
125:
AILV
I
4
V
4
225:
DE
225:
GILP
4
205:
A
205:
LM
201:
AQ
201:
DGS
219:
AV
219:
GLRS
T
L
4
V
5
I
4
V
5
D
4
E
4
A
4
201:
AS
201:
DEGQ
129:
CHQS
129:
AFGW
192:
PS
192:
NRT
78:
DEQ
78:
AGPR
243:
QV
243:
FIS
78:
212:
KQR
212:
DE
4
177:
TV
177:
A
73:
LRT
73:
EKNQ
73:
QRY
73:
EKLNT
56:
MV
56:
AE
78:
AEP
78:
DGR
56:
210:
DEG
210:
QRS
A
129:
ACFHQ
129:
GSW
42:
DKQ
42:
AE
71:
EHKN
71:
-AQT
23:
ET
23:
AKL
73:
ENR
73:
KLTY
23:
193: 197:
D
AT
193: 197:
AE
-EQ
78:
107:
EGR
DE
78:
107:
ADPQ ANQ
129: 189:
CHW AR
129: 189:
AFGQS EKM
71:
73:
HKNQ KQR
71:
73:
-AET ELNTY
103: 129:
DN
ACGH
103: 129:
EFQS FQSW
71:
73:
211:
A
211:
P
173:
127:
K
127:
-AQ
P
5
ET
23:
AKL
MV
56:
AE
HKNQ KQR
71:
73:
-AET ELNTY
DEQ I
78:
173:
AGPR V
Our analysis of mutational correlation at each position along the SDR protein
sequences shows that particular fragments are highly variable. In particular, for
surface variability, these positions are seldom in direct contact with each other, but
maintain contact with conserved positions. For example, according to the results of
correlated mutations in group 5 (table 5), a mutation at position 23 is accompanied
by mutations at position 70 and other positions. There is no obvious relationship
between the positions of correlated mutations and their contact with each other (surface
variability in group 5, figures 11-13) because such correlated mutations are generally
in positions that are very distant from each other. According to the currently
assumed model, positive mutations (ones that improve fitness) do not occur
independently. Instead, the occurrence of one mutation is dependent upon other
locally occurring mutations. In this way, the nature of correlated mutations reflects
protein-protein interactions and the necessity to preserve the biological activity
and structural properties of the molecules [41]. Therefore, the correlated mutations
revealed in our study provide useful information for further study of complex
protein-protein interactions. Formerly, it was hypothesized that protein-protein
interactions only happen between proteins in close proximity. However, our findings show
that such interactions may also occur when proteins are positioned at some distance from
one another. Thus, we conclude that the correlated mutations occurring at a distance are
due to interacting protein “forces” that optimize these interactions. These distant
protein interactions may act as a potential adaptive mechanism within the human SDR
family, allowing its members to change in response to fluctuating external conditions
and functional demands through evolution.
We also found that in each human SDR group, there are core residues that form a
narrow correlation cluster on protein structures, and most of them are in a conserved
region (core variability, figure 11-13, table 2). There is evidence elsewhere to
suggest that these core residues tend to mutate together to maintain proper
functioning [15], and our results provide additional support for the claim that these
centralized residues tend to mutate together to preserve the biological function of
the SDR proteins. Moreover, the differences in core variability may explain why the
human SDR family shares a low level of similarity in sequences (15-30%) but not in
protein structure. Contrary to core residues, surface residues are randomly scattered
over the protein structure and are not directly contacting each other (surface
variability, figure 11-13, table 3). Interestingly, our results indicate that the surface
residues of human SDR proteins do seem to be interacting with one another, despite
the distance between them (Table 2-3).
The molecular mechanisms by which these distantly located correlated mutations
occur have yet to be fully investigated and understood. Here, we suggest a few
potential explanations as to why these distant residues might be interacting. One
possibility is that we have not yet uncovered the intermediate sequences that contain
linking residues that indirectly join distant proteins. A second option is that the
mechanism of variability at these sites is different from a single point mutation.
Although it has long been accepted that single point mutations are major
contributors to the acquisition of beneficial mutations through evolution, the
correlation of surface mutations does not seem to be adequately explained by the
occurrence of single point mutations alone. Using the data presented here as a
springboard, further investigation of correlated mutations in distantly located
proteins may help researchers to gain insight into the causes, prevention, and
treatment of diseases that are caused by genetic or protein structure mutations.
5 CONCLUSION
In conclusion, this study of the human SDR family provides several important results for
further studies on molecular docking, molecular dynamics simulation and inhibitor
design, because we identified two critical features of the human SDR family. The
general characteristic comprises the conserved residues Serine, Lysine and Tyrosine,
which play an important role in stabilizing protein function in order to maintain the
common features of the human SDR family. In contrast, the specific characteristic
comprises a cluster of three amino acids located next to the active and binding sites
of the protein family. These residues changed slightly during evolution through single
point mutations, so they are responsible for the adaptation of the protein molecules
to changes in the surrounding environment.
6 REFERENCES
1) N.M. Luscombe, D. Greenbaum, M. Gerstein. “What is bioinformatics? An
introduction and overview”. Department of Molecular Biophysics and
Biochemistry, Yale University New Haven, USA. Yearbook of Medical
Informatics 2001.
2) Reichhardt T. “It is sink or swim as a tidal wave of data approaches”.
Nature 1999, 399(6736):517-20.
3) Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL.
GenBank. Nucleic Acids Res 2000;28 (1):15-8.
4) Bairoch A, Apweiler R. “The SWISS-PROT protein sequence database
and its supplement TrEMBL in 2000”. Nucleic Acids Res 2000;28 (1):45-8.
5) Bernstein FC, Koetzle TF, Williams GJ, Meyer EF, Brice MD, Rodgers JR. “The
Protein Data Bank. A computer-based archival file for macromolecules
structures”. Eur J Biochem 1977;80(2):319-24.
6) Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H. “The
Protein Data Bank”. Nucleic Acids Res 2000;28(1):235-42.
7) Pearson WR, Lipman DJ. “Improved tools for biological sequence
comparison”. Proc Natl Acad Sci USA 1988;85(8):2444-2448.
8) Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W. “Gapped
BLAST and PSI-BLAST: a new generation of protein database search
programs”. Nucleic Acids Res. 1997;25(17):3389-3402.
9) Benach J. “X-ray structure analysis of short chain
dehydrogenases/reductases”. Karolinska Institute, Stockholm, 1999.
10) Persson B., Krook M., Atrian S., Gonzales-Duarte R., Jeffery J.,
Jornvall H., Ghosh D. “Short chain dehydrogenases/reductases”.
Biochemistry, 34(18): 6003-6013, 1995.
11) Oppermann U., Jornvall H., Kallberg Y., Persson B. “Short chain
dehydrogenases/reductases relationships: a large family with eight
clusters common in human, animal and Plant genomes”. Protein
Science, 2002.
12)
Yvonne Kallberg, Udo Oppermann, Hans Jornvall, Bengt Persson.
”SDRs-coenzyme based functional assigments in completed genomes.
Eur. J. Biochem. 269, 4409-4417 (2002).
13)
Yvonne Kallberg, Udo Oppermann, Hans Jornvall, Bengt Persson.
”SDR relationship: A large family with 8 clusters common to human,
animal and plant genomes”. Protein Science (2002), 22: 636-641
14)
Keller B, Volkmann A, Wilckens T, Moeller G, Adamski J.
“Bioinformatic identification and characterization pf new members of
SDR super-family. Molecular and Cellular Endocrinology 248(2006), 56-60.
15)
Filling C, Berndt K.D, Benach J, Knapp S, Prozorovski T, Nordling,
Ladenstein R, Jornvall H, Oppermann U. “Critical residues of structures
and catalysis in short-chain dehydogenases/reductases. J. Bio.CHem.
277(2002) 25677-25684).
16) James E. Bray, Brian D. Marsden, Udo Oppermann. “The Human SDR
superfamily: A Bioinformatic Summary”. Chemico-Biological Interactions
178 (2009) 99-109.
17) Persson B, Kallberg Y. “Classification and nomenclature of the
superfamily of short-chain dehydrogenases/reductases (SDRs)”. Chem
Biol Interact. 2012 Nov 29.
18) Xiaoqiu Wu, Petra Lukacik, Kathryn L. Kavanagh and Udo Oppermann.
“Review: SDR-type human hydroxysteroid dehydrogenases involved
in steroid hormone activation”. Mol. Cell. Endocrinology 265-266 (2007)
71-76.
19) Brigitte Keller, Marc Meier, Jerzy Adamski. “Comparison of
predicted and experimental subcellular localization of two putative
rat steroid dehydrogenases from SDR protein super family”. Molecular
and Cellular Endocrinology. 30 (2009) 43-46.
20) S.R. Eddy. “Profile hidden Markov models”. Bioinformatics,
14(9):755–763, 1998.
21) D.W. Mount. “Bioinformatics: Sequence and Genome Analysis”.
Cold Spring Harbor Laboratory Press, 2001.
22) P. Baldi and S. Brunak. “Bioinformatics - The Machine Learning
Approach”. Massachusetts Institute of Technology, 2nd edition, 2001.
23) R. Durbin et al. “Biological Sequence Analysis: Probabilistic
Models of Proteins and Nucleic Acids”. Cambridge University Press, 1998.
24) D. Higgins and W. Taylor. “Bioinformatics: sequence, structure
and databanks”. Oxford University Press, 2000.
25) Aloysius Phillips, Daniel Janies, Ward Wheeler. “Review: Multiple
sequence alignment in phylogenetic analysis”. Molecular Phylogenetics
and Evolution Vol.16, No.3, 2000, 317-330.
26) A.D. Baxevanis and B.F.F. Ouellette. “Bioinformatics: A Practical
Guide To The Analysis of Genes and Proteins”. John Wiley & Sons Inc.,
2nd edition, 2001.
27) Robert C. Edgar, Serafim Batzoglou. “Multiple sequence alignment”.
Current Opinion in Structural Biology 16, 2006, 368-373.
28) Zhumur Ghosh, Bibekanand Mallick. “Bioinformatics: Principles and
Applications”. Oxford University Press, 2008.
29) D. Gusfield. “Algorithms on Strings, Trees and Sequences:
Computer Science and Computational Biology”. Cambridge University
Press, 1997.
30) Julie D. Thompson, Desmond G. Higgins and Toby J. Gibson.
“CLUSTAL W: improving the sensitivity of progressive multiple
sequence alignment through sequence weighting, position-specific
gap penalties and weight matrix choice”. Nucleic Acids Research Vol.22,
No.22 (1994): 4673-4680.
31) Robert C. Edgar. “MUSCLE: multiple sequence alignment with
high accuracy and high throughput”. Nucleic Acids Research Vol.32, No.5
(2004): 1792-1797.
32) Timo Lassmann and Erik L.L. Sonnhammer. “KALIGN: an accurate
and fast multiple sequence alignment algorithm”. BMC Bioinformatics
(2005).
33) Cedric Notredame, Desmond G. Higgins and Jaap Heringa.
“T-COFFEE: A novel method for fast and accurate multiple sequence
alignment”. J. Mol. Biol. 302 (2000): 205-213.
34) Jacek Leluk. “A non-statistical approach to protein mutational
variability”. BioSystems 56 (2000): 83-93.
35) Jacek Leluk. “Regularities on mutational variability in selected
protein families and the Markovian model of amino acids
replacements”. Computers and Chemistry 24 (2000): 659-672.
36) Jacek Leluk, Beata Hanus-Lorenz and Aleksander F. Sikorski.
“Application of genetic semihomology algorithm to theoretical studies
on various protein families”. Acta Biochimica Polonica Vol.48 No.1 (2001).
37) Julie D. Thompson, Frederic Plewniak, Olivier Poch. “BAliBASE: a
benchmark alignment database for the evaluation of alignment
programs”. Bioinformatics application notes. Vol.15, No.1, 1999, 87-88.
38) Joseph Lancaster, Jeremy Buhler, Roger D. Chamberlain.
“Acceleration of ungapped extension in Mercury BLAST”.
Microprocessors and Microsystems. 33 (2009), 281-289.
39) Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. “Basic local
alignment search tool”. J Mol Biol, 1990, 215: 403-410.
40) Hall, B.G. “Spontaneous Point Mutations That Occur More Often
When Advantageous Than When Neutral”. Genetics 126 (1990): 5-16.
Web. 31 March 2011.
41) Nishikawas S, Adivinata J, Morioka H, Fujimura T, Tanaka T, Uesugi S,
Hakoshima T, Tomika K, Nakagawa S, Ikehara M. “A thermoresistant
mutant of Ribonuclease T1 having three disulfide bonds”. Protein Eng
1990, 3: 443-448.
42) Jacek Leluk, Monica Sobczyk, Lukasz Becella. “Correlated mutations
in selected protein families”. TASK Quarterly 6, No.2 (2002), 469-482.
7 SUPPLEMENTS
Figure 1: Consurf color scale used to code the multiple sequence alignment. The
Consurf color is assigned according to the level of identity among the sequences.
The most conserved residues, such as M in Figure 1, receive the highest score on
the scale (9), and the score decreases as the level of identity decreases. The
Talana scale works in a similar way but uses a different set of colors.
Figure 2: Talana color scale used to code the multiple sequence alignment. The
most conserved residues (those with the highest alignment score) are shown in the
darkest blue, whereas the most variable residues (those with the lowest alignment
score) are shown in the darkest red.
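The identity-based grading described in the two captions above can be illustrated with a short Python sketch. This is only a simplified illustration under stated assumptions: the function names (`identity_grade`, `grade_alignment`) and the toy alignment are hypothetical, the nine-grade mapping is a plain linear binning of column identity, and it does not reproduce the actual scoring of Consurf (which relies on phylogeny-based evolutionary rate estimates) or of Talana.

```python
from collections import Counter


def identity_grade(column, n_grades=9):
    """Grade one alignment column from 1 (most variable) to n_grades (most conserved).

    Assumption: the grade is a linear binning of the share of the most
    frequent residue among the non-gap residues of the column.
    """
    residues = [r for r in column if r != "-"]  # ignore gap characters
    if not residues:
        return 1  # an all-gap column carries no conservation signal
    top_count = Counter(residues).most_common(1)[0][1]
    # Integer arithmetic avoids floating-point rounding at the bin edges.
    return 1 + (top_count * (n_grades - 1)) // len(residues)


def grade_alignment(sequences):
    """Return one grade per column for equal-length aligned sequences."""
    return [identity_grade(column) for column in zip(*sequences)]


# Hypothetical toy alignment (not taken from the SDR data set).
toy_alignment = ["MKTAY", "MRTGY", "MKSLF"]
print(grade_alignment(toy_alignment))  # [9, 6, 6, 3, 6]
```

In this toy case the fully identical first column (M in every sequence) receives grade 9, mirroring the example in Figure 1, while the column in which every sequence differs receives the lowest grade of the set.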
Table 1: Correlated mutations among core and surface residues in group 1,
identified by analyzing the representative protein sequence and structure with the
aid of Talana. Core and surface residues were classified according to their
distribution in the protein 3D structure: core residues are located inside the
protein structure, whereas surface residues are located on its surface.
Group 1
Positions (of residues in protein sequence)
Core and Surface Residues
Core
Surface
6
9
20
Met-95
Val-11
Ile-141
39
49
83
116
His-137
119
Thr-
Ile-85
Thr-182
Ile-166
121
Val-57
50
Val-8
Gly-195
157
Val-149
148
Met-111
72
Arg-42
Thr-197
194
Ser-129
53
Thr-25
22
Thr-230
197
Gly-195
144
Ser-129
98
Glu-80
64
Glu-59
41
Pro-233
230
Glu-228
220
Gln-219
148
His-137
119
Met-95
73
Gln-55
Gln-51
Gln-5
Glu-89
Pro-233
225
His-223
Ile-215
214
Lys-208
205
Lys-204
194
TyrIleSerGluThrSerAsnIleThrIleSerAspSerThrGlnGluThrGluAsn-53
Lys-48
AsnTyr-216
AlaLeuSer-
Thr-182
180
Gln-168
155
Ile-144
118
Lys-117
113
Ala-110
Met-95
His-67
Val-46
39
Ser-21
20
130
179
182
185
Ala-214
184
Ile-87
95
IleMet-
Gly-187
183
Val-181
163
SerVal-
Ile-184
87
Gly-187
Ile-
ArgLeuGlyAspAla-99
Glu-69
Ser-61
AspAla-
Thr-150
132
Ser-129
109
Met-95
87
Asp-39
3
Asn-
Gln-219
212
Gly-195
194
Gly-187
183
Val-181
171
Gln-168
163
Ala-161
117
Met-95
Gln-55
Val-46
Thr-25
Glu-
Glu-148
132
Pro-233
219
Glu-212
Gly-187
168
Lys-117
113
Glu-73
GluIleGlu-
SerSerGluValLysAsp-64
Asn-53
Gln-29
Gln-5
AsnGlnGly-195
GlnAspGln-5
186
208
230
His-137
184
Ile-
His-137
Pro-233
231
Val-224
223
Gln-219
216
Ile-215
213
Tyr-199
188
Asp-113
Ile-100
Gln-84
Ser-61
45
Ser-21
Ile-210
Pro-233
232
Thr-230
219
Gln-168
150
Glu-148
Gln-5
ValHisTyrAlaMetIle-116
Glu-102
Ile-87
Glu-73
GluGln-5
Ile-62
ArgGlnThrGlu-73
Table 2: Correlated mutations among core and surface residues in group 2,
identified by analyzing the representative protein sequence and structure with the
aid of Talana. Core and surface residues were classified according to their
distribution in the protein 3D structure: core residues are located inside the
protein structure, whereas surface residues are located on its surface.
Group 2
Positions (of residues in protein sequence)
53
Core and Surface Residues
Core
Met-166
208
Gly-
Surface
Trp-285
282
His-276
275
Asp-266
259
Cys-257
251
Ala-249
245
Asp-244
241
Val-234
230
SerCysGlyGluSerLysAla-
Met-216
214
Val-213
195
Leu-139
135
Leu-134
132
Arg-126
103
Glu-99
91
Ile-81
75
Ser-71
64
63
87
92
Ser-245
213
Val-
Val-141
Trp-285
282
His-276
259
Cys-257
255
Glu-251
249
Ser-245
230
Leu-223
214
Val-213
208
Thr-195
139
Asp-135
134
Ser-132
103
Arg-100
99
Val-91
81
Glu-78
70
Glu-251
237
Ser-220
153
Leu-134
115
Val-133
120
Gly-
Ser-220
ThrThrAspSerAsnValAspLys-
SerGlyAsnAlaAlaThrGlyLeuLeuAsnGluIleArg-
IleArgIle-
102
106
Ser-245
Val-213
117
Leu-219
Val-
Ile-115
213
Gly-208
141
ValVal-
119
Cys-275
259
Met-247
221
Met-216
196
Leu-139
135
Ser-132
131
Phe-130
122
Met-119
117
Trp-285
282
His-276
275
Gly-259
257
Glu-251
249
Ser-245
244
Lys-241
237
Val-234
230
Ser-220
214
Thr-195
166
Arg-153
139
Asp-135
134
Ser-132
103
Glu-99
91
Ile-81
Asp-75
Ser-71
70
Lys-64
Gly-259
216
Thr-195
147
GlyAspTryAspAlaAsnVal-
SerCysCysAlaAspIleAlaThrMetLeuLeuAsnVal-
Arg-
MetAsn-
125
214
Ile-115
141
Val-
Met-247
Glu-251
237
Ser-220
153
Leu-134
Ala-249
IleArgPhe-242
Table 3: Correlated mutations among core and surface residues in group 3,
identified by analyzing the representative protein sequence and structure with the
aid of Talana. Core and surface residues were classified according to their
distribution in the protein 3D structure: core residues are located inside the
protein structure, whereas surface residues are located on its surface.
Group 3
Positions (of residues in protein sequence)
6
Core and Surface residues
Core
75
Val-104
77
Val-89
Val-117
147
Cys-174
151
Ile-211
123
Val-173
73
Ile-211
GlyIleAlaVal-
Surface
Gly-177
Thr-35
Asn-253
Asp-105
Lys-271
Ile-265
264
Tyr-262
248
Ile-247
245
Arg-244
237
Pro-235
232
Thr-226
219
Lys-208
204
Ala-201
184
Ile-183
181
Ile-179
177
Arg-162
161
Ser-158
155
Ile-145
138
Glu-135
136
Lys-133
131
Arg-109
Lys-208
Ile-268
AsnAsnSerLeuArgValAspProTryGlyArgLeuThrGluSer-
Leu-130
126
Ala-123
115
Gln-106
105
Asn-102
Val-86
62
ProGluArg-109
AspVal-89
ArgGln-59
Table 4: Correlated mutations among core and surface residues in group 4,
identified by analyzing the representative protein sequence and structure with the
aid of Talana. Core and surface residues were classified according to their
distribution in the protein 3D structure: core residues are located inside the
protein structure, whereas surface residues are located on its surface.
Group 4
Positions (of residues in protein sequence)
2
9
Core and Surface residues
Core
Surface
Thr-224
Gln-248
67
Val-43
160
49
Ala-195
231
65
Gln-248
LeuLys-
ThrVal-
42
Thr-156
241
Glu-190
178
Ala-136
117
Asp-110
36
Ser-
Gln-258
Ile-242
Ser-220
211
Phe-192
182
Arg-181
160
Gly-87
85
Leu-77
73
Val-252
Lys-218
192
Leu-179
Try-266
HisHisLeu-
LysHisLysValAspLys-56
PheGln-83
Asp-
108
186
193
252
Ile-42
Met-129
234
Gly-75
Ser-
Leu-67
Ala-195
248
Val-43
Gln-
Gln-248
Ser-234
Val-227
43
110
Ala-99
Leu-269
267
Val-252
248
Trp-247
241
Val-227
225
Thr-224
201
Phe-192
185
His-183
181
Leu-179
163
His-155
136
Ser-133
110
Thr-101
95
Phe-92
90
Gly-87
77
Gly-75
72
Lys-56
53
Gln-35
Gln-258
Trp-245
243
Ile-242
220
Lys-211
197
Ala-195
192
His-183
181
Leu-175
160
Lys-109
Val-85
Asp-73
Lys-56
Leu-269
267
Val-252
Phe-92
CysGlnSerTryHisHisArgGluAlaAspLysGlnLeuValGluVal-252
MetSerLeuPheArgLysGly-87
Leu-77
Leu-67
Val-43
CysGln-
208
232
Gln-208
67
Val-43
160
LeuLys-
Gln-248
42
Ile-
44
248
Trp-247
241
Ser-234
227
Try-225
224
His-201
192
His-185
183
Arg-181
179
Glu-163
155
His-142
136
Ser-133
129
His-117
110
Thr-101
95
Phe-92
90
Gly-87
77
Gly-75
72
Lys-56
53
Gln-35
Gln-258
252
Gln-248
242
Ser-220
211
Phe-192
183
Lys-160
87
Val-85
Arg-80
77
Asp-73
Try-266
252
Phe-251
211
Asp-110
99
SerValThrPheHisLeuHisAlaMetAspLysGlnLeuValGlu-
ValIleLysHisGlyAla-84
LeuLys-56
ValLysAla-
Phe-92
85
Val-
Table 5: Correlated mutations among core and surface residues in group 5,
identified by analyzing the representative protein sequence and structure with the
aid of Talana. Core and surface residues were classified according to their
distribution in the protein 3D structure: core residues are located inside the
protein structure, whereas surface residues are located on its surface.
Group 5
Positions (of residues in protein sequence)
Core and Surface residues
Core
Surface
6
Val-7
Gln-130
Gly179
Thr-74
Asn62
70
Asn-193
Gln130
Thr-74
Ala72
Lys-71
Ala57
Leu-24
79
Val-80
Val- Ala-202
Gln69
130
Glu-108
Thr74
Ala-72
Pro-63
Asp-47
Lys31
105
Thr-178
AlaGlu-226
Arg220
213
Asp-211
Ala206
Asp-194
Gln130
Lys-128
Met126
Ile-106
Met96
Thr-74
Thr61
Gly-50
Tyr48
Ser-46
Glu39
Arg-28
Glu-4
148
Gly-79
Ala-202
Ala149
Gln-130
Glu-
157
173
Val-158
Val-174
Ala-57
194
211
Val-174
57
Ala-
108
Thr-74
43
Asn-193
190
Gln-130
174
Ala-72
Pro-212
Thr-74
72
Leu-24
Thr-74
79
Ser-104
130
Glu-195
244
Pro-212
79
Thr-74
72
Leu-24
GlnLysThrGly-79
AlaGlyGlnIleGlyAla-
[...]... the evolutionary conservation of amino acids based on the phylogenetic relationships between homologous sequences. Both programs analyze the evolutionary conservation of amino acids based on the sequences and produce conservation scores that correspond to the rate of evolution at each site. The scores are divided into nine grades for the visualization of differing rates of evolution in Consurf: grade...
... least 30% of sequences. 196 positions-X letters (64.1%) are occupied by any particular residue in more than 30% of sequences.
Figure 4: Part of the result of the multiple sequence alignment of 71 human SDR members.
Figure 5: Complete consensus sequence of 71 human SDR members. Consensus sequences for the human SDR family, constructed using Consensus Sequence Constructor. The highly conserved positions (>70%...
... variable sequence regions in human SDR enzymes. To further elucidate patterns of conservation and variation in human SDR enzymes, a comparative analysis of the 3D protein structure of each of the five consensus sequences was also conducted. We identified a representative structure for each of the five groups recovered in the phylogenetic reconstructions using the Protein Data Bank. The selection criteria...
... variable sequences and structures differ not only among the five human SDR groups, but also within each group.
Consurf-Group 1, Talana-Group 1
Figure 10: The identification of functional regions within group 1 using Consurf and Talana. Group 1 expressed the full range of the Consurf coloring scheme: the continuous conservation scores are partitioned into a discrete scale of 9 bins for visualization, such...
4.3 Mutational variability of human SDRs
The two programs (Talana and Consurf) used to analyze the mutational variability of both the sequence and the structure of the protein templates in each of the five human SDR groups yielded similar results (figure 10). The identification of conservative and variable sequence and structure regions within the human SDR family is presented in figures 9 and 10. The conservative...
... position, colored turquoise; grade 5, the intermediately conserved, colored white; and grade 9, the most conserved, colored maroon. The conservation score at a site corresponds to the site's evolutionary rate. It measures the evolutionary conservation at each sequence site of the target chain. The color grades are assigned as follows: the conservation scores below the average (negative values, are indicative of...
... applied to two independent software packages, Talana and Consurf. The Consurf server is used for estimating the evolutionary conservation of amino acids based on the phylogenetic relations between homologous sequences.
3.4.1 Consurf
The first step in Consurf is to find sequence homologs (using BLAST) based on protein structures and amino acid sequences. Sequences are clustered and highly similar sequences are removed...
... of these interactions are essential for their function. To maintain functionality, it is reasonable to assume that the sequence changes accumulated during the evolution of one of the interacting residues must be compensated by changes in the other [34,35,36]. In this way, the network of necessary inter-residue contacts may constrain divergence of the protein sequence to some extent. Correlated mutations...
Arginine, while some of them are encoded by only one codon, such as Methionine. To overcome that limitation of the HMM, we applied a more recent model, called genetic semi-homology, implemented in Geisha 3. This model focuses on which codons encode each amino acid, so it accounts for cryptic mutations, which change the gene composition without affecting the amino acid sequence. Hence, genetic semi-homology...
... mutations within the human SDR family. Our analyses of mutational correlation within the human SDR family using both Corm and Talana reveal similar outcomes. Based on the distribution of mutations mapped onto the protein structure, the correlated mutations can be broken into two groups. The first, the core group, includes all mutations that show core molecular contact (table 2), most of which are located in conserved ...
... 5: Complete consensus sequence of 71 human SDR members. Consensus sequences for the human SDR family, constructed using Consensus Sequence Constructor. The highly conserved positions (>70% identity)...
... Multiple sequence alignment, consensus sequence generation, and analysis of human SDR specificity
4.2 Sequence specificity and interrelationships of the human SDR family
4.3 Mutational...
... mutations occur at the level of sequence, the effects of these mutations are noticed at the level of function. Bio-molecule function, in turn, is directly related to 3D structure. As such, by studying