TOMOCOMD-CAMPS and protein bilinear indices – novel bio-macromolecular descriptors for protein research: I. Predicting protein stability effects of a complete set of alanine substitutions in the Arc repressor
Sadiel E. Ortega-Broche¹, Yovani Marrero-Ponce¹,²,³, Yunaimy E. Díaz¹, Francisco Torrens² and Facundo Pérez-Giménez³
1 Unit of Computer-Aided Molecular ‘Biosilico’ Discovery and Bioinformatics Research (CAMD-BIR Unit), Faculty of Chemistry–Pharmacy,
Central University of Las Villas, Santa Clara, Villa Clara, Cuba
2 Institut Universitari de Ciència Molecular, Universitat de València, Edifici d’Instituts de Paterna, Spain
3 Unidad de Investigación de Diseño de Fármacos y Conectividad Molecular, Departamento de Química Física, Facultad de Farmacia, Universitat de València, Spain
Keywords
arc repressor; bilinear indices; linear
discriminant analysis; linear multiple
regression; protein stability
Correspondence
Y. Marrero-Ponce, Unit of Computer-Aided
Molecular ‘Biosilico’ Discovery and
Bioinformatics Research (CAMD-BIR Unit),
Faculty of Chemistry–Pharmacy, Central
University of Las Villas, Santa Clara, 54830,
Villa Clara, Cuba
Fax: +53 42 281130; +53 42 281455;
+34 96354 3156
Tel: +53 42 281192; +53 42 281473;
+34 96354 3156
E-mail: ymarrero77@yahoo.es;
ymponce@gmail.com;
yovanimp@uclv.edu.cu
Website: http://www.uv.es/yoma/
(Received 3 March 2009, revised 15 April
2010, accepted 14 May 2010)
doi:10.1111/j.1742-4658.2010.07711.x
Descriptors calculated from a specific representation scheme encode only one part of the chemical information. For this reason, there is a need to construct novel graphical representations of proteins and novel protein descriptors that can provide new information about the structure of proteins. Here, a new set of protein descriptors based on computation of bilinear maps is presented. This novel approach to biomacromolecular design is relevant for QSPR studies on proteins. Protein bilinear indices are calculated from the kth power of nonstochastic and stochastic graph–theoretic electronic-contact matrices, $M_m^k$ and ${}^sM_m^k$, respectively. That is to say, the kth nonstochastic and stochastic protein bilinear indices are calculated using $M_m^k$ and ${}^sM_m^k$ as matrix operators of bilinear transformations. Moreover, biochemical information is codified by using different pair combinations of amino acid properties as weightings. Classification models based on a protein bilinear descriptor that discriminate between Arc mutants of stability similar or inferior to the wild-type form were developed. These equations permitted the correct classification of more than 90% of the mutants in both the training and test sets. To predict $t_m$ and $\Delta\Delta G_f^o$ values for Arc mutants, multiple linear regression and piecewise linear regression models were developed. The multiple linear regression models obtained accounted for 83% of the variance of the experimental $t_m$. Statistics calculated from internal and external validation procedures demonstrated robustness, stability and suitable predictive power for all models. The results achieved demonstrate the ability of protein bilinear indices to encode biochemical information related to those structural changes significantly influencing the stability of the Arc repressor when point mutations are induced.
Abbreviations
BOOT, bootstrapping; ECI, electronic charge index; HPI, hydropathy index; ISA, isotropic surface area; LDA, linear discrimination analysis;
LOO, leave-one out; MCC, Matthew’s correlation coefficient; QSAR, quantitative structure–activity relationship; QSPR, quantitative
structure–property relationship; SDEC, standard error in calculation.
FEBS Journal 277 (2010) 3118–3146 © 2010 The Authors. Journal compilation © 2010 FEBS
Introduction
The advent of the automatic-sequence techniques and the fast growing number of DNA and protein sequences available from diverse organisms have motivated the development of graphical representations of biopolymers as a method for the analysis and comparison of sequences [1]. Initially, this approach was applied in the inspection and visual analysis of nucleic acid sequences [2,3]. Subsequently, its usefulness for the numerical characterization of the similarity/dissimilarity degree among nucleotide sequences was demonstrated, and it then became an alternative to the alignment-based comparison methods [4].
The numerical characterizations of the biopolymer structure are also known as biomacromolecular descriptors. Combined with machine-learning techniques, they have proved to be effective in the prediction of physical–chemical and biological features [5–12], the interpretation of properties in structural terms, and the study of similarity/dissimilarity among biomolecules [13–17], amongst others.
A general strategy adopted in the design of biomacromolecular descriptors is the association of mathematical objects with diverse graphical representations of biopolymers [4]. One such strategy aims to represent the biomacromolecular structure by means of a graph and then calculates the invariants of the associated matrices. For example, Randić and Basak used the principal eigenvalues from matrices as invariants in an analysis of the similarity degree among DNA sequences [18]; Raychaudhury and Nandy considered graph mean-moments as descriptors of polynucleotide sequences [19]; Benedetti and Morosetti [16], Shu et al. [20], Bermúdez et al. [15] and Galindo et al. [21] also applied graph–theoretical invariants to numerically describe the structure of RNA molecules for different purposes.
When a mathematical invariant is calculated from a specific representation scheme, only a partial characterization of the chemical structure can be achieved because only a part of the chemical information can be encoded [22]. This can be overcome either by developing diverse graphical representations, because each of them captures different information from the biomolecular structures, or by calculating several mathematical invariants from the same representation scheme [22]. The construction of novel representation forms for biomolecules and the design of new descriptors that provide new information and better characterization is therefore necessary [22].
Marrero-Ponce et al. [23–25] have recently applied linear and quadratic forms on $\mathbb{R}^n$ to calculate graph–theoretical invariants of organic compound structures. These descriptors were successfully applied in the prediction of physical–chemical properties and rational drug design. Subsequently, the use of linear and quadratic forms was extended to obtain numerical characterizations of proteins and nucleic acids. Such descriptors were effectively applied in the modelling of the interaction between RNA and drugs [26,27] and for predicting the stability of proteins [6,28]. Bilinear forms have also been used in the definition of molecular descriptors [29], which have been applied appropriately in molecular modelling [30].
The successful application of linear and quadratic forms to obtain graph–theoretical invariants of the biopolymer structure has encouraged us to explore the use of bilinear forms on $\mathbb{R}^n$ as a logical–mathematical procedure for designing novel protein descriptors. More precisely, we used bilinear forms to transform the chemical information encoded by a graph-based representation of proteins, similar to that proposed by Marrero-Ponce et al. [6,28]. To validate the utility of these descriptors, we applied them in combination with multivariate analysis methods to predict the effects of a set of alanine substitutions on the stability of the Arc repressor. Arc is a small, homodimeric repressor of 53 amino acids encoded by P22, a temperate bacteriophage of Salmonella typhimurium [31]. This homodimer has been widely studied by Milla et al. [32], who determined the contribution of specific residues to stabilizing the native structure by means of alanine substitutions. The set of Arc mutants obtained in these experiments was used in subsequent studies to validate the usefulness of diverse schemes for the numerical characterization of proteins [5,28,33–35].
Numerical characterization of
polypeptide chains
Here, we describe the strategy proposed by us to numerically characterize the structure of peptides and proteins by means of bilinear transformations of their structural information. This information is encoded through elements of the $\mathbb{R}^n$ vector space and graph–theoretic representations of polypeptide chains. Accordingly, a background in amino acid-based macromolecular vectors and nonstochastic and stochastic graph–theoretic electronic-contact matrices will be described, followed by an outline of the mathematical definition of bilinear maps as well as a definition of our procedures.
Macromolecular vectors for representing amino
acid sequences
In analogy to the molecular vector $\bar{x}$ used to represent organic molecules [23,36–47], we introduce here the macromolecular vector ($\bar{x}_m$). The components of this vector are numeric values, which represent a certain side-chain amino acid property. These properties characterize each kind of amino acid (R group) within a protein. Such properties can be z-values [48], the side-chain isotropic surface area (ISA) and atomic charges (electronic charge index; ECI) of the amino acid [49], and the hydropathy index (Kyte–Doolittle scale; HPI) [50], as well as other hydrophobicity scales such as Hopp–Woods [51], and so on. For example, the $z_{1(AA)}$ scale of the amino acid, AA, takes the values $z_{1(V)} = -2.69$ for valine, $z_{1(A)} = 0.07$ for alanine, $z_{1(M)} = -2.49$ for methionine, and so on [48,49]. Table 1 depicts several side-chain descriptors for the natural amino acids [48–50].
Thus, a peptide (or protein) having 5, 10, 15, …, n amino acids can be represented by means of vectors with 5, 10, 15, …, n components, belonging to the spaces $\mathbb{R}^5, \mathbb{R}^{10}, \mathbb{R}^{15}, \ldots, \mathbb{R}^n$, respectively, where n is the dimension of the real set ($\mathbb{R}^n$).
This approach allows us to encode peptides such as SKEERN through the macromolecular vector $\bar{x}_m = [1.96\ \ 2.84\ \ 3.08\ \ 3.08\ \ 2.88\ \ 3.22]$ in the $z_1$-scale (Table 1). This vector belongs to the product space $\mathbb{R}^6$. The use of other scales defines alternative macromolecular vectors.
If we are interested in codifying the chemical information by means of two different macromolecular vectors, for example, $\bar{x}_m = [x_{m1}, \ldots, x_{mn}]$ and $\bar{y}_m = [y_{m1}, \ldots, y_{mn}]$, then different combinations of macromolecular vectors ($\bar{x}_m \neq \bar{y}_m$) are possible when a weighting scheme is used. In the present study, we characterized each amino acid with the biochemical parameters shown in Table 1. From this weighting scheme, fifteen (or thirty if $\bar{x}_{mw}$–$\bar{y}_{mz} \neq \bar{x}_{mz}$–$\bar{y}_{mw}$) combinations (pairs) of macromolecular vectors ($\bar{x}_m$, $\bar{y}_m$; $\bar{x}_m \neq \bar{y}_m$) can be computed: $\bar{x}_{mz1}$–$\bar{y}_{mz2}$, $\bar{x}_{mz1}$–$\bar{y}_{mz3}$, $\bar{x}_{mz1}$–$\bar{y}_{mHPI}$, $\bar{x}_{mz1}$–$\bar{y}_{mISA}$, $\bar{x}_{mz1}$–$\bar{y}_{mECI}$, $\bar{x}_{mz2}$–$\bar{y}_{mz3}$, $\bar{x}_{mz2}$–$\bar{y}_{mHPI}$, $\bar{x}_{mz2}$–$\bar{y}_{mISA}$, $\bar{x}_{mz2}$–$\bar{y}_{mECI}$, $\bar{x}_{mz3}$–$\bar{y}_{mHPI}$, $\bar{x}_{mz3}$–$\bar{y}_{mISA}$, $\bar{x}_{mz3}$–$\bar{y}_{mECI}$, $\bar{x}_{mHPI}$–$\bar{y}_{mISA}$, $\bar{x}_{mHPI}$–$\bar{y}_{mECI}$ and $\bar{x}_{mISA}$–$\bar{y}_{mECI}$. Here, we used the symbols $\bar{x}_{mw}$–$\bar{y}_{mz}$, where the subscripts w and z represent two amino acid properties from our weighting scheme and a dash (–) represents the combination (pair) of the two selected biochemical properties used as amino acid labels.
To illustrate this, let us consider the same peptide as in the example above, SKEERN, and the weighting scheme $z_1$ and $z_2$ ($\bar{x}_{mz1}$–$\bar{y}_{mz2}$ = $\bar{x}_{mz2}$–$\bar{y}_{mz1}$). The following macromolecular vectors, $\bar{x}_m = [1.96\ \ 2.84\ \ 3.08\ \ 3.08\ \ 2.88\ \ 3.22]$ and $\bar{y}_m = [-1.63\ \ 1.41\ \ 0.39\ \ 0.39\ \ 2.52\ \ 1.45]$, are obtained when we use $z_1$ and $z_2$ as chemical weights for codifying each amino acid in the example peptide in the $\bar{x}_m$ and $\bar{y}_m$ vectors, respectively (Table 2).
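The encoding step described above can be sketched in a few lines of Python. This is only an illustration: the dictionaries hold the $z_1$ and $z_2$ values of Table 1 for the five residue types of the example peptide, and the function name `macromolecular_vector` is hypothetical, not part of the TOMOCOMD-CAMPS software.

```python
# z1 and z2 values copied from Table 1, restricted to the residue types that
# occur in the example peptide SKEERN.
Z1 = {"S": 1.96, "K": 2.84, "E": 3.08, "R": 2.88, "N": 3.22}
Z2 = {"S": -1.63, "K": 1.41, "E": 0.39, "R": 2.52, "N": 1.45}


def macromolecular_vector(sequence, scale):
    """Return one side-chain property value per residue (hypothetical helper)."""
    return [scale[residue] for residue in sequence]


x_m = macromolecular_vector("SKEERN", Z1)  # weights taken from the z1 scale
y_m = macromolecular_vector("SKEERN", Z2)  # weights taken from the z2 scale
print(x_m)
print(y_m)
```

Swapping in another column of Table 1 (HPI, ISA or ECI) gives the alternative macromolecular vectors mentioned in the text.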
Graph-theoretic representations of polypeptide
chains
In molecular topology, molecular structure is expressed, generally, by the hydrogen-suppressed graph. That is, a molecule is represented by a graph. Informally, a graph G is a collection of vertices (points) and edges (lines or bonds) connecting these vertices [52–54]. In more formal terms, a simple graph G is defined as an ordered pair [V(G), E(G)], which consists of a nonempty set of vertices V(G) and a set E(G) of unordered pairs of elements of V(G), termed edges [52–54]. In this particular case, we are not dealing with a simple graph but with a so-called pseudograph (G). Informally, a pseudograph is a graph with multiple edges or loops between the same vertices or the same vertex. Formally, a pseudograph is a set V of vertices along with a set E of edges, and a function f from E to {{u,v} | u,v in V} (the function f shows which pair of vertices are connected by which edge). An edge is a loop if f(e) = {u} for some vertex u in V [23,55,56].
Table 1. Descriptors for the natural amino acids. The z-scale values (z1, z2, z3) are from [48,49], HPI from [50], and ISA and ECI from [49].

Amino acid     z1      z2      z3     HPI      ISA     ECI
Ala  A        0.07   -1.73    0.09    1.8     62.90   0.05
Val  V       -2.69   -2.53   -1.29    4.2    120.91   0.07
Leu  L       -4.19   -1.03   -0.98    3.8    154.35   0.01
Ile  I       -4.44   -1.68   -1.03    4.5    149.77   0.09
Pro  P       -1.22    0.88    2.23   -1.6    122.35   0.16
Phe  F       -4.92    1.30    0.45    2.8    189.42   0.14
Trp  W       -4.75    3.65    0.85   -0.9    179.16   1.08
Met  M       -2.49   -0.27   -0.41    1.9    132.22   0.34
Lys  K        2.84    1.41   -3.14   -3.9    102.78   0.53
Arg  R        2.88    2.52   -3.44   -4.5     52.98   1.69
His  H        2.41    1.74    1.11   -3.2     87.38   0.56
Gly  G        2.23   -5.36    0.30   -0.4     19.93   0.02
Ser  S        1.96   -1.63    0.57   -0.8     19.75   0.56
Thr  T        0.92   -2.09   -1.40   -0.7     59.44   0.65
Cys  C        0.71   -0.97    4.13    2.5     78.51   0.15
Tyr  Y       -1.39    2.32    0.01   -1.3    132.16   0.72
Asn  N        3.22    1.45    0.84   -3.5     17.87   1.31
Gln  Q        2.18    0.53   -1.14   -3.5     19.53   1.36
Asp  D        3.64    1.13    2.36   -3.5     18.46   1.25
Glu  E        3.08    0.39   -0.07   -3.5     30.19   1.31
On the other hand, Anfinsen’s experiments with small proteins demonstrated that the amino acid sequence of a protein encodes the folding of its peptidic backbone. However, at present, mere knowledge of the amino acid sequence of a protein does not provide us with its 3D structure. The primary structure of proteins consists of unbranched amino acid sequences, which are linked by amide bonds between the α-carboxyl group of one residue and the α-amino group of the next. The 3D distribution of all atoms in a protein is referred to as the protein’s tertiary structure. Whereas the term secondary structure refers to the spatial arrangement of amino acid residues that are adjacent in the primary structure, the tertiary structure includes longer-range aspects of the amino acid sequence. Lastly, individual polypeptidic chains in multi-subunit proteins are organized in 3D complexes reaching quaternary-structural levels. As previously outlined, essential information for protein folding is contained in the amino acid sequence and, more specifically, in the amino acid side-chains of the polypeptidic chain.
Taking the above statement into account, in the present study, we apply a graph–theoretic model, as developed and applied previously by Marrero-Ponce et al. [33], to represent the molecular structure of proteins. This is called a macromolecular graph. Here, the graph vertices are the Cα atoms in the polypeptide backbone and the edges are both covalent interactions between amino acids (peptidic bonds) and noncovalent interactions between amino acid side-chains in the same or a different subunit. Noncovalent interactions can also occur between an amino acid side-chain and its own main-chain, in which case this amino acid represents a pseudovertex (a vertex carrying a loop) in the macromolecular pseudograph. These interactions can be considered as contacts, which can exist among amino acids that are near (or far) in the polypeptide backbone (i.e. the contacts can be subdivided into short, medium and large contacts). Table 2 shows how to depict two interacting polypeptide chains by means of a macromolecular pseudograph; a pseudograph is required because the heterodimer (SKEERN) contains an amino acid with a hydrogen bond between its side-chain and its main-chain atom.
The n × n kth nonstochastic graph–theoretic electronic-contact matrix, $M_m^k$, is a square and symmetric matrix, where n is the number of amino acids in the protein [6,28]. The coefficients ${}^k m_{ij}$ are the elements of the kth power of $M_m$ and are defined as:

$$
m_{ij} =
\begin{cases}
1 & \text{if } i \neq j \text{ and } \exists\, e_k \in E(G_m) \\
1 & \text{if } i = j \text{ and the amino acid } i \text{ has a hydrogen bond between its side-chain and its main-chain atom} \\
0 & \text{otherwise}
\end{cases}
\qquad (1)
$$

where $E(G_m)$ represents the set of edges of $G_m$.
The matrix $M_m^k$ provides the number of walks of length k that link every pair of vertices $v_i$ and $v_j$. For this reason, each edge in $M_m^1$ represents a peptidic bond (covalent bond) or a hydrogen bond as well as a salt-bridge interaction (noncovalent bond) between amino acids i and j.
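As a minimal sketch of Eqn (1) and of the walk-counting property, the following Python code builds $M_m^1$ for the SKEERN example and raises it to the second power. The edge list is an assumption read off the k = 1 matrix of Table 3 (it does not distinguish peptidic bonds from noncovalent contacts), and the helper names are hypothetical.

```python
# Edges of the example pseudograph as 1-based residue pairs, read off the
# k = 1 matrix of Table 3; (3, 3) is the loop on Glu3.
EDGES = [(1, 2), (1, 5), (2, 3), (2, 6), (3, 3), (4, 5), (4, 6), (5, 6)]


def contact_matrix(n, edges):
    """Build the symmetric n x n matrix M_m from an edge list (Eqn 1)."""
    m = [[0] * n for _ in range(n)]
    for i, j in edges:
        m[i - 1][j - 1] = 1
        m[j - 1][i - 1] = 1
    return m


def matmul(a, b):
    """Plain product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_power(m, k):
    """Return M^k; k = 0 gives the identity matrix."""
    n = len(m)
    result = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, m)
    return result


M1 = contact_matrix(6, EDGES)
M2 = matrix_power(M1, 2)  # entry (i, j) counts the walks of length 2
for row in M2:
    print(row)
```

The rows printed for `M2` can be compared with the k = 2 nonstochastic matrix of Table 3.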
On the other hand, the kth stochastic graph–theoretic electronic-contact matrix of $G_m$, ${}^sM_m^k$, can be
Table 2. Representation of two interacting polypeptide chains and its associated pseudograph and macromolecular vector.

[Figure: two interacting polypeptide chains (chain 1 and chain 2, each with NH2 and COOH termini) formed by the residues Ser, Lys, Glu, Glu, Arg and Asn, numbered 1–6, together with the corresponding pseudograph of Cα atoms.]

Macromolecular ‘pseudograph’ ($G_m$) of the α-carbon atoms (polypeptide’s backbone):
Here, we consider both the covalent interaction (peptidic bond between amino acids, shown with a solid line) and the noncovalent interaction (salt-bridge and hydrogen bond, shown with a dashed line) between amino acid side-chains (R-groups) in the same polypeptidic chain or in different chains. The loop in the third position (Glu3) indicates a hydrogen bond between an amino acid main chain and its side-chain.

Macromolecular vector: $\bar{x}_m = [S\ K\ E\ E\ R\ N] \in \mathbb{R}^6$
In the definition of $\bar{x}_m$ as a macromolecular vector, the one-letter symbol of each amino acid indicates the corresponding side-chain amino acid property, e.g. $z_1$-values. That is to say, if we write S, it means $z_1(S)$, $z_1$-values or some amino acid property, which characterizes each side chain in the polypeptide. Therefore, if we use the canonical basis of $\mathbb{R}^6$, the coordinates of any vector $\bar{x}_m$ coincide with the components of that macromolecular vector.
$[X_m]^T = [S\ K\ E\ E\ R\ N]$
$[X_m]^T$: transpose of $[X_m]$; it is the vector of the coordinates of $\bar{x}_m$ in the canonical basis of $\mathbb{R}^6$ (a 1 × 6 matrix).
$[X_m]$: vector of coordinates of $\bar{x}_m$ in the canonical basis of $\mathbb{R}^6$ (a 6 × 1 matrix).
The components of $\bar{x}_m$ and $\bar{y}_m$ are $z_1$- and $z_2$-values, respectively:
$\bar{x}_m = [1.96\ \ 2.84\ \ 3.08\ \ 3.08\ \ 2.88\ \ 3.22]$
$\bar{y}_m = [-1.63\ \ 1.41\ \ 0.39\ \ 0.39\ \ 2.52\ \ 1.45]$
directly obtained from $M_m^k$. Here, ${}^sM_m^k = [{}^k sm_{ij}]$ is a square matrix of order n (n = number of Cα atoms) and the elements ${}^k sm_{ij}$ are defined as:

$$
{}^k sm_{ij} = \frac{{}^k m_{ij}}{{}^k SUM_i} = \frac{{}^k m_{ij}}{{}^k \delta_i} \qquad (2)
$$

where ${}^k m_{ij}$ are the elements of $M_m^k$ (the kth power of $M_m$) and the sum of the ith row of $M_m^k$ is named the k-order vertex degree of Cα atom i, ${}^k \delta_i$. It should be noted that the matrix ${}^sM_m^k$ in Eqn (2) has the property that the sum of the elements in each row is 1. An n × n matrix with nonnegative entries having this property is called a ‘stochastic matrix’ [57]. Table 3 shows the zero, first and second powers of the total nonstochastic and stochastic graph–theoretic electronic-contact matrices of the macromolecular pseudograph depicted in Table 2.
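The row normalization of Eqn (2) can be illustrated with exact fractions. The sketch below hard-codes the k = 1 nonstochastic matrix of Table 3; the function name `stochastic` is hypothetical, and the code assumes that no row sums to zero (every Cα atom takes part in at least one contact).

```python
from fractions import Fraction

# k = 1 nonstochastic matrix of the SKEERN example, copied from Table 3.
M1 = [
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
]


def stochastic(m):
    """Divide each row by its sum, the k-order vertex degree (Eqn 2)."""
    return [[Fraction(value, sum(row)) for value in row] for row in m]


SM1 = stochastic(M1)
for row in SM1:
    print([str(value) for value in row])
```

Using `Fraction` keeps entries such as 1/3 exact, which makes the comparison with the stochastic column of Table 3 straightforward.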
Mathematical bilinear forms: a theoretical
framework
In mathematics, a bilinear form in a real vector space is a mapping $b: V \times V \to \mathbb{R}$, which is linear in both arguments [58–63]. That is, this function satisfies the following axioms for any scalar $\alpha$ and any choice of vectors $\bar{v}, \bar{w}, \bar{v}_1, \bar{v}_2, \bar{w}_1$ and $\bar{w}_2$:
(1) $b(\alpha\bar{v}, \bar{w}) = b(\bar{v}, \alpha\bar{w}) = \alpha\, b(\bar{v}, \bar{w})$
(2) $b(\bar{v}_1 + \bar{v}_2, \bar{w}) = b(\bar{v}_1, \bar{w}) + b(\bar{v}_2, \bar{w})$
(3) $b(\bar{v}, \bar{w}_1 + \bar{w}_2) = b(\bar{v}, \bar{w}_1) + b(\bar{v}, \bar{w}_2)$
That is, b is bilinear if it is linear in each parameter, taken separately.
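Let V be a real vector space in $\mathbb{R}^n$ ($V \subseteq \mathbb{R}^n$) and consider that the following vector set, $\{\bar{e}_1, \bar{e}_2, \ldots, \bar{e}_n\}$, is a basis set of $\mathbb{R}^n$. This basis set permits us to write in unambiguous form any vectors $\bar{x}$ and $\bar{y}$ of V, where $(x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$ and $(y_1, y_2, \ldots, y_n) \in \mathbb{R}^n$ are the coordinates of the vectors $\bar{x}$ and $\bar{y}$, respectively. That is to say:

$$
\bar{x} = \sum_{i=1}^{n} x_i \bar{e}_i \qquad (3)
$$

and

$$
\bar{y} = \sum_{j=1}^{n} y_j \bar{e}_j \qquad (4)
$$

Subsequently,

$$
b(\bar{x}, \bar{y}) = b\Big(\sum_{i=1}^{n} x_i \bar{e}_i, \sum_{j=1}^{n} y_j \bar{e}_j\Big) = \sum_{i=1}^{n}\sum_{j=1}^{n} x_i y_j\, b(\bar{e}_i, \bar{e}_j) \qquad (5)
$$

if we take the $a_{ij}$ as the n × n scalars $b(\bar{e}_i, \bar{e}_j)$. That is:

$$
a_{ij} = b(\bar{e}_i, \bar{e}_j), \quad i = 1, 2, \ldots, n \text{ and } j = 1, 2, \ldots, n \qquad (6)
$$

Then:

$$
b(\bar{x}, \bar{y}) = \sum_{i,j}^{n} a_{ij} x_i y_j = [X]^T A [Y] =
\begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix}
\begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{bmatrix}
\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}
\qquad (7)
$$

As can be seen, the defined equation for b may be written as the single matrix equation [see Eqn (7)], where [Y] is a column vector (an n × 1 matrix) of the coordinates of $\bar{y}$ in a basis set of $\mathbb{R}^n$, and $[X]^T$ (a 1 × n matrix) is the transpose of [X], where [X] is a column vector (an n × 1 matrix) of the coordinates of $\bar{x}$ in the same basis of $\mathbb{R}^n$.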
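To make Eqn (7) and the bilinearity axioms concrete, here is a small Python sketch on a 2 × 2 integer example. The matrix `A` and the vectors are arbitrary illustrative values, not data from the study, and `bilinear_form` is a hypothetical helper name.

```python
def bilinear_form(a, x, y):
    """Evaluate b(x, y) = sum_ij a_ij * x_i * y_j, i.e. [X]^T A [Y] (Eqn 7)."""
    return sum(a[i][j] * x[i] * y[j]
               for i in range(len(x)) for j in range(len(y)))


# Arbitrary integer example (not data from the paper).
A = [[1, 2],
     [3, 4]]
v1, v2, w = [1, 0], [0, 2], [3, 4]
v_sum = [p + q for p, q in zip(v1, v2)]  # v1 + v2 = [1, 2]

# Axiom (2): additivity in the first argument.
print(bilinear_form(A, v_sum, w),
      bilinear_form(A, v1, w) + bilinear_form(A, v2, w))

# Axiom (1): a scalar factor can be pulled out of either argument.
alpha = 3
scaled = [alpha * p for p in v_sum]
print(bilinear_form(A, scaled, w), alpha * bilinear_form(A, v_sum, w))
```

Integers are used so that both sides of each axiom compare exactly, without floating-point tolerance.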
Finally, we introduce the formal definition of a symmetric bilinear form. Let V be a real vector space and b be a bilinear function in V × V. The bilinear function
Table 3. The zero (k = 0), first (k = 1) and second (k = 2) powers of the total nonstochastic and stochastic graph–theoretic electronic-contact matrices of $G_m$, respectively.

k = 0
Nonstochastic: $M_m^0 = \begin{bmatrix} 1&0&0&0&0&0 \\ 0&1&0&0&0&0 \\ 0&0&1&0&0&0 \\ 0&0&0&1&0&0 \\ 0&0&0&0&1&0 \\ 0&0&0&0&0&1 \end{bmatrix}$
Stochastic: ${}^sM_m^0 = \begin{bmatrix} 1&0&0&0&0&0 \\ 0&1&0&0&0&0 \\ 0&0&1&0&0&0 \\ 0&0&0&1&0&0 \\ 0&0&0&0&1&0 \\ 0&0&0&0&0&1 \end{bmatrix}$

k = 1
Nonstochastic: $M_m^1 = \begin{bmatrix} 0&1&0&0&1&0 \\ 1&0&1&0&0&1 \\ 0&1&1&0&0&0 \\ 0&0&0&0&1&1 \\ 1&0&0&1&0&1 \\ 0&1&0&1&1&0 \end{bmatrix}$
Stochastic: ${}^sM_m^1 = \begin{bmatrix} 0&1/2&0&0&1/2&0 \\ 1/3&0&1/3&0&0&1/3 \\ 0&1/2&1/2&0&0&0 \\ 0&0&0&0&1/2&1/2 \\ 1/3&0&0&1/3&0&1/3 \\ 0&1/3&0&1/3&1/3&0 \end{bmatrix}$

k = 2
Nonstochastic: $M_m^2 = \begin{bmatrix} 2&0&1&1&0&2 \\ 0&3&1&1&2&0 \\ 1&1&2&0&0&1 \\ 1&1&0&2&1&1 \\ 0&2&0&1&3&1 \\ 2&0&1&1&1&3 \end{bmatrix}$
Stochastic: ${}^sM_m^2 = \begin{bmatrix} 1/3&0&1/6&1/6&0&1/3 \\ 0&3/7&1/7&1/7&2/7&0 \\ 1/5&1/5&2/5&0&0&1/5 \\ 1/6&1/6&0&1/3&1/6&1/6 \\ 0&2/7&0&1/7&3/7&1/7 \\ 1/4&0&1/8&1/8&1/8&3/8 \end{bmatrix}$
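b is called symmetric if $b(\bar{x}, \bar{y}) = b(\bar{y}, \bar{x}),\ \forall\, \bar{x}, \bar{y} \in V$ [58–63]. Then:

$$
b(\bar{x}, \bar{y}) = \sum_{i,j}^{n} a_{ij} x_i y_j = \sum_{i,j}^{n} a_{ji} x_j y_i = b(\bar{y}, \bar{x}) \qquad (8)
$$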
Nonstochastic and stochastic amino acid-based
bilinear indices: total (global) definition
The kth nonstochastic and stochastic bilinear indices for a protein, $b_{mk}(\bar{x}_m, \bar{y}_m)$ and ${}^s b_{mk}(\bar{x}_m, \bar{y}_m)$, are computed from the kth nonstochastic and stochastic graph–theoretic electronic-contact matrices, $M_m^k$ and ${}^sM_m^k$, as shown in Eqns (9) and (10), respectively:

$$
b_{mk}(\bar{x}_m, \bar{y}_m) = \sum_{i=1}^{n}\sum_{j=1}^{n} {}^k m_{ij}\, x_m^i\, y_m^j \qquad (9)
$$

$$
{}^s b_{mk}(\bar{x}_m, \bar{y}_m) = \sum_{i=1}^{n}\sum_{j=1}^{n} {}^k sm_{ij}\, x_m^i\, y_m^j \qquad (10)
$$

where n is the number of amino acids (Cα atoms) in the protein, and $x_m^1, \ldots, x_m^n$ and $y_m^1, \ldots, y_m^n$ are the coordinates or components of the macromolecular vectors $\bar{x}_m$ and $\bar{y}_m$ in a canonical basis set of $\mathbb{R}^n$.
The defined Eqns (9) and (10) for $b_{mk}(\bar{x}_m, \bar{y}_m)$ and ${}^s b_{mk}(\bar{x}_m, \bar{y}_m)$ may also be written as the single matrix equations:

$$
b_{mk}(\bar{x}_m, \bar{y}_m) = [X_m]^T M_m^k [Y_m] \qquad (11)
$$

$$
{}^s b_{mk}(\bar{x}_m, \bar{y}_m) = [X_m]^T\, {}^sM_m^k\, [Y_m] \qquad (12)
$$

where $[Y_m]$ is a column vector (an n × 1 matrix) of the coordinates of $\bar{y}_m$ in the canonical basis set of $\mathbb{R}^n$, and $[X_m]^T$ is the transpose of $[X_m]$, where $[X_m]$ is a column vector (an n × 1 matrix) of the coordinates of $\bar{x}_m$ in the canonical basis of $\mathbb{R}^n$. Therefore, if we use the canonical basis set, the coordinates [$(x_m^1, \ldots, x_m^n)$ and $(y_m^1, \ldots, y_m^n)$] of any macromolecular vectors ($\bar{x}_m$ and $\bar{y}_m$) coincide with the components of those vectors [$(x_{m1}, \ldots, x_{mn})$ and $(y_{m1}, \ldots, y_{mn})$]. For that reason, those coordinates can be considered as weights (R-group on the Cα atom, that is to say ‘amino acid labels’) of the vertices of $G_m$, as a result of the fact that components of the macromolecular vectors are values of some amino acid property that characterizes each kind of R-chain in the protein. The calculation of the first three values of the bilinear indices for the example protein (Tables 2 and 3) is shown in Table 4.
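Eqns (9)–(12) can be checked numerically for the SKEERN example. The following sketch uses only the vectors of Table 2 and the k = 1 matrix of Table 3; the helper names are hypothetical and floating-point arithmetic is used, so the values are rounded to two decimals for comparison with Table 4.

```python
X = [1.96, 2.84, 3.08, 3.08, 2.88, 3.22]   # z1 values of S, K, E, E, R, N
Y = [-1.63, 1.41, 0.39, 0.39, 2.52, 1.45]  # z2 values of the same residues
M1 = [
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
]


def matmul(a, b):
    """Plain product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_power(m, k):
    """Return M^k; k = 0 gives the identity matrix."""
    n = len(m)
    result = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, m)
    return result


def stochastic(m):
    """Row-normalize a matrix as in Eqn (2)."""
    return [[value / sum(row) for value in row] for row in m]


def bilinear_index(m, x, y):
    """Double sum of Eqns (9)/(10), equal to [X]^T M [Y]."""
    n = len(x)
    return sum(m[i][j] * x[i] * y[j] for i in range(n) for j in range(n))


results = []
for k in range(3):
    mk = matrix_power(M1, k)
    b = round(bilinear_index(mk, X, Y), 2)
    sb = round(bilinear_index(stochastic(mk), X, Y), 2)
    results.append((k, b, sb))
    print(k, b, sb)
```

The three printed rows give, for k = 0, 1 and 2, the nonstochastic and stochastic total indices that Table 4 reports for this example.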
It should be noted that nonstochastic and stochastic bilinear indices are symmetric and nonsymmetric bilinear forms, respectively. Therefore, if, in the following weighting scheme, W and Z are used as amino acid weights to compute the protein bilinear indices, two different sets of stochastic bilinear indices, ${}^{W-Z\,s}b_{mk}(\bar{x}_m, \bar{y}_m)$ and ${}^{Z-W\,s}b_{mk}(\bar{x}_m, \bar{y}_m)$ [because $\bar{x}_{mW}$–$\bar{y}_{mZ} \neq \bar{x}_{mZ}$–$\bar{y}_{mW}$], can be obtained, whereas only one group of nonstochastic bilinear indices, ${}^{W-Z}b_{mk}(\bar{x}_m, \bar{y}_m) = {}^{Z-W}b_{mk}(\bar{x}_m, \bar{y}_m)$, can be calculated because, in this case, $\bar{x}_{mW}$–$\bar{y}_{mZ} = \bar{x}_{mZ}$–$\bar{y}_{mW}$.
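The symmetry argument can be verified on the SKEERN example by swapping the roles of the $z_1$ and $z_2$ vectors. This is a minimal sketch with hypothetical helper names, using the k = 1 matrix of Table 3.

```python
X = [1.96, 2.84, 3.08, 3.08, 2.88, 3.22]   # z1 weights (W)
Y = [-1.63, 1.41, 0.39, 0.39, 2.52, 1.45]  # z2 weights (Z)
M1 = [
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
]


def stochastic(m):
    """Row-normalize a matrix as in Eqn (2); the result is not symmetric."""
    return [[value / sum(row) for value in row] for row in m]


def bilinear_index(m, x, y):
    """Double sum of Eqns (9)/(10)."""
    n = len(x)
    return sum(m[i][j] * x[i] * y[j] for i in range(n) for j in range(n))


SM1 = stochastic(M1)
b_wz, b_zw = bilinear_index(M1, X, Y), bilinear_index(M1, Y, X)
sb_wz, sb_zw = bilinear_index(SM1, X, Y), bilinear_index(SM1, Y, X)
print(round(b_wz, 2), round(b_zw, 2))    # nonstochastic: both orders agree
print(round(sb_wz, 2), round(sb_zw, 2))  # stochastic: the two orders differ
```

Because row normalization divides rows with different degrees by different values, ${}^sM_m^1$ loses the symmetry of $M_m^1$, which is why the W–Z and Z–W stochastic indices are distinct descriptors.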
Nonstochastic and stochastic local bilinear
indices: definition of amino acid, amino
acid-type and peptide fragment bilinear indices
In the last decade, Randić [64] proposed a list of desirable attributes for a molecular descriptor. Therefore, this list can be considered as a methodological guide for the development of new topological indices. One of the most important criteria is the possibility of defining the descriptors locally. This attribute refers to the fact that the index could be calculated for the molecule (protein) as a whole but also over certain fragments of the structure itself.
Therefore, in addition to total bilinear indices computed for the whole protein, a local-fragment (peptide fragment) formalism can be developed. These descriptors are termed local nonstochastic and stochastic bilinear indices: $b_{mkL}(\bar{x}_m, \bar{y}_m)$ and ${}^s b_{mkL}(\bar{x}_m, \bar{y}_m)$, respectively. The definition of these descriptors is:

$$
b_{mkL}(\bar{x}_m, \bar{y}_m) = \sum_{i=1}^{n}\sum_{j=1}^{n} {}^k m_{ijL}\, x_m^i\, y_m^j \qquad (13)
$$

$$
{}^s b_{mkL}(\bar{x}_m, \bar{y}_m) = \sum_{i=1}^{n}\sum_{j=1}^{n} {}^k sm_{ijL}\, x_m^i\, y_m^j \qquad (14)
$$
where ${}^k m_{ijL}$ [${}^k sm_{ijL}$] is the element of row ‘i’ and column ‘j’ of the local matrix $M_{mL}^k$ [${}^sM_{mL}^k$]. This matrix is extracted from the $M_m^k$ [${}^sM_m^k$] matrix and contains information referring to the vertices of the specific protein fragments ($F_r$) and also to the molecular environment in step k. The matrix $M_{mL}^k$ [${}^sM_{mL}^k$] with elements ${}^k m_{ijL}$ [${}^k sm_{ijL}$] is defined as (Table 5):

$$
{}^k m_{ijL}\ [{}^k sm_{ijL}] =
\begin{cases}
{}^k m_{ij}\ [{}^k sm_{ij}] & \text{if both } v_i \text{ and } v_j \text{ are vertices (amino acids) contained within } F_r \\
\tfrac{1}{2}\, {}^k m_{ij}\ [{}^k sm_{ij}] & \text{if } v_i \text{ or } v_j \text{, but not both, is a vertex contained within } F_r \\
0 & \text{otherwise}
\end{cases}
\qquad (15)
$$
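A possible reading of Eqn (15) in code is given below. The fragment is passed as a set of 0-based vertex indices, which is an assumption of this sketch, as is the function name `local_matrix`; the input is the k = 1 matrix of Table 3.

```python
from fractions import Fraction

M1 = [
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
]


def local_matrix(m, fragment):
    """Apply Eqn (15): keep, halve or zero each entry of m.

    `fragment` is a set of 0-based vertex indices (an assumption of this
    sketch). An entry is kept when both i and j lie in the fragment, halved
    when exactly one of them does, and set to zero otherwise.
    """
    n = len(m)
    local = []
    for i in range(n):
        row = []
        for j in range(n):
            hits = (i in fragment) + (j in fragment)  # 0, 1 or 2
            row.append(Fraction(m[i][j]) * Fraction(hits, 2))
        local.append(row)
    return local


M1_SER = local_matrix(M1, {0})  # local matrix of Ser1 at k = 1
for row in M1_SER:
    print([str(value) for value in row])
```

The printed rows can be compared with the matrix $M^1(G_m, S)$ listed in Table 5.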
These local analogues can also be expressed in matrix form by the expressions:

$$
b_{mkL}(\bar{x}_m, \bar{y}_m) = [X_m]^T M_{mL}^k [Y_m] \qquad (16)
$$

$$
{}^s b_{mkL}(\bar{x}_m, \bar{y}_m) = [X_m]^T\, {}^sM_{mL}^k\, [Y_m] \qquad (17)
$$
It should be noted that the scheme above follows the spirit of a Mulliken population analysis [65]. It should also be noted that for every partitioning of a protein into Z macromolecular fragments, there will be Z local macromolecular fragment matrices. In this case, if a protein is partitioned into Z molecular fragments, the matrix $M_m^k$ [${}^sM_m^k$] can be correspondingly partitioned into Z local matrices $M_{mL}^k$ [${}^sM_{mL}^k$], L = 1, …, Z, and the matrix $M_m^k$ [${}^sM_m^k$] is exactly the sum of the Z local matrices. In this way, the total nonstochastic and stochastic bilinear indices are the sum of the nonstochastic and stochastic bilinear indices, respectively, of the Z macromolecular fragments:

$$
b_{mk}(\bar{x}_m, \bar{y}_m) = \sum_{L=1}^{Z} b_{mkL}(\bar{x}_m, \bar{y}_m) \qquad (18)
$$

$$
{}^s b_{mk}(\bar{x}_m, \bar{y}_m) = \sum_{L=1}^{Z} {}^s b_{mkL}(\bar{x}_m, \bar{y}_m) \qquad (19)
$$
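The additivity stated in Eqns (18) and (19) can be checked on the SKEERN example by taking each residue as its own fragment (Z = 6). The sketch below is illustrative only: helper names are hypothetical, fragments are sets of 0-based indices, and only the nonstochastic k = 1 case is shown.

```python
from fractions import Fraction

X = [1.96, 2.84, 3.08, 3.08, 2.88, 3.22]   # z1 values of S, K, E, E, R, N
Y = [-1.63, 1.41, 0.39, 0.39, 2.52, 1.45]  # z2 values of the same residues
M1 = [
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
]


def local_matrix(m, fragment):
    """Eqn (15): full weight, half weight or zero for each entry."""
    n = len(m)
    return [[Fraction(m[i][j]) * Fraction((i in fragment) + (j in fragment), 2)
             for j in range(n)] for i in range(n)]


def bilinear_index(m, x, y):
    """Double sum of Eqns (9) and (13)."""
    n = len(x)
    return sum(m[i][j] * x[i] * y[j] for i in range(n) for j in range(n))


FRAGMENTS = [{i} for i in range(6)]  # one fragment per residue
local_matrices = [local_matrix(M1, fragment) for fragment in FRAGMENTS]

# The local matrices add up, entry by entry, to the total matrix.
summed = [[sum(lm[i][j] for lm in local_matrices) for j in range(6)]
          for i in range(6)]
print(summed == M1)

# Consequently the local indices add up to the total index (Eqn 18).
local_indices = [bilinear_index(lm, X, Y) for lm in local_matrices]
print(round(sum(local_indices), 2), round(bilinear_index(M1, X, Y), 2))
```

The same check works for any other partition, for example the two chains taken as fragments, because every off-diagonal entry is either kept whole by one fragment or split in halves between two.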
In addition, the amino acid-type bilinear indices can also be calculated. Amino acid and amino acid-type bilinear indices are specific cases of local protein bilinear indices. In this sense, the kth amino acid-type bilinear indices are calculated by summing the kth amino acid bilinear indices of all amino acids of the same amino
Table 4. Values of nonstochastic and stochastic total bilinear indices for two interacting peptides (SKEERN) used as example above (see also Tables 2 and 3). In every row, $[X_m]^T = [1.96\ \ 2.84\ \ 3.08\ \ 3.08\ \ 2.88\ \ 3.22]$ and $[Y_m] = [-1.63\ \ 1.41\ \ 0.39\ \ 0.39\ \ 2.52\ \ 1.45]^T$.

Nonstochastic total bilinear indices

$b_{m0} = \sum_{i=1}^{n}\sum_{j=1}^{n} {}^0 m_{ij}\, x_m^i y_m^j = [X_m]^T M_m^0 [Y_m] = [X_m]^T \begin{bmatrix} 1&0&0&0&0&0 \\ 0&1&0&0&0&0 \\ 0&0&1&0&0&0 \\ 0&0&0&1&0&0 \\ 0&0&0&0&1&0 \\ 0&0&0&0&0&1 \end{bmatrix} [Y_m] = 15.14$

$b_{m1} = \sum_{i=1}^{n}\sum_{j=1}^{n} {}^1 m_{ij}\, x_m^i y_m^j = [X_m]^T M_m^1 [Y_m] = [X_m]^T \begin{bmatrix} 0&1&0&0&1&0 \\ 1&0&1&0&0&1 \\ 0&1&1&0&0&0 \\ 0&0&0&0&1&1 \\ 1&0&0&1&0&1 \\ 0&1&0&1&1&0 \end{bmatrix} [Y_m] = 40.59$

$b_{m2} = \sum_{i=1}^{n}\sum_{j=1}^{n} {}^2 m_{ij}\, x_m^i y_m^j = [X_m]^T M_m^2 [Y_m] = [X_m]^T \begin{bmatrix} 2&0&1&1&0&2 \\ 0&3&1&1&2&0 \\ 1&1&2&0&0&1 \\ 1&1&0&2&1&1 \\ 0&2&0&1&3&1 \\ 2&0&1&1&1&3 \end{bmatrix} [Y_m] = 98.84$

Stochastic total bilinear indices

${}^s b_{m0} = \sum_{i=1}^{n}\sum_{j=1}^{n} {}^0 sm_{ij}\, x_m^i y_m^j = [X_m]^T\, {}^sM_m^0\, [Y_m] = [X_m]^T \begin{bmatrix} 1&0&0&0&0&0 \\ 0&1&0&0&0&0 \\ 0&0&1&0&0&0 \\ 0&0&0&1&0&0 \\ 0&0&0&0&1&0 \\ 0&0&0&0&0&1 \end{bmatrix} [Y_m] = 15.14$

${}^s b_{m1} = \sum_{i=1}^{n}\sum_{j=1}^{n} {}^1 sm_{ij}\, x_m^i y_m^j = [X_m]^T\, {}^sM_m^1\, [Y_m] = [X_m]^T \begin{bmatrix} 0&1/2&0&0&1/2&0 \\ 1/3&0&1/3&0&0&1/3 \\ 0&1/2&1/2&0&0&0 \\ 0&0&0&0&1/2&1/2 \\ 1/3&0&0&1/3&0&1/3 \\ 0&1/3&0&1/3&1/3&0 \end{bmatrix} [Y_m] = 17.77$

${}^s b_{m2} = \sum_{i=1}^{n}\sum_{j=1}^{n} {}^2 sm_{ij}\, x_m^i y_m^j = [X_m]^T\, {}^sM_m^2\, [Y_m] = [X_m]^T \begin{bmatrix} 1/3&0&1/6&1/6&0&1/3 \\ 0&3/7&1/7&1/7&2/7&0 \\ 1/5&1/5&2/5&0&0&1/5 \\ 1/6&1/6&0&1/3&1/6&1/6 \\ 0&2/7&0&1/7&3/7&1/7 \\ 1/4&0&1/8&1/8&1/8&3/8 \end{bmatrix} [Y_m] = 14.57$
Table 5. The zero (k = 0), first (k = 1) and second (k = 2) powers of the local nonstochastic and stochastic graph–theoretic electronic-contact matrices of $G_m$, respectively.

The zero, first and second powers of the local (amino acid) nonstochastic matrices

Ser (position 1)
$M^0(G_m, S) = \begin{bmatrix} 1&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \end{bmatrix}$
$M^1(G_m, S) = \begin{bmatrix} 0&1/2&0&0&1/2&0 \\ 1/2&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 1/2&0&0&0&0&0 \\ 0&0&0&0&0&0 \end{bmatrix}$
$M^2(G_m, S) = \begin{bmatrix} 2&0&1/2&1/2&0&1 \\ 0&0&0&0&0&0 \\ 1/2&0&0&0&0&0 \\ 1/2&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 1&0&0&0&0&0 \end{bmatrix}$

Lys (position 2)
$M^0(G_m, K) = \begin{bmatrix} 0&0&0&0&0&0 \\ 0&1&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \end{bmatrix}$
$M^1(G_m, K) = \begin{bmatrix} 0&1/2&0&0&0&0 \\ 1/2&0&1/2&0&0&1/2 \\ 0&1/2&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&1/2&0&0&0&0 \end{bmatrix}$
$M^2(G_m, K) = \begin{bmatrix} 0&0&0&0&0&0 \\ 0&3&1/2&1/2&1&0 \\ 0&1/2&0&0&0&0 \\ 0&1/2&0&0&0&0 \\ 0&1&0&0&0&0 \\ 0&0&0&0&0&0 \end{bmatrix}$

Glu (position 3)
$M^0(G_m, E) = \begin{bmatrix} 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&1&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \end{bmatrix}$
$M^1(G_m, E) = \begin{bmatrix} 0&0&0&0&0&0 \\ 0&0&1/2&0&0&0 \\ 0&1/2&1&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \end{bmatrix}$
$M^2(G_m, E) = \begin{bmatrix} 0&0&1/2&0&0&0 \\ 0&0&1/2&0&0&0 \\ 1/2&1/2&2&0&0&1/2 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&1/2&0&0&0 \end{bmatrix}$

Glu (position 4)
$M^0(G_m, E) = \begin{bmatrix} 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&1&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \end{bmatrix}$
$M^1(G_m, E) = \begin{bmatrix} 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&1/2&1/2 \\ 0&0&0&1/2&0&0 \\ 0&0&0&1/2&0&0 \end{bmatrix}$
$M^2(G_m, E) = \begin{bmatrix} 0&0&0&1/2&0&0 \\ 0&0&0&1/2&0&0 \\ 0&0&0&0&0&0 \\ 1/2&1/2&0&2&1/2&1/2 \\ 0&0&0&1/2&0&0 \\ 0&0&0&1/2&0&0 \end{bmatrix}$

Arg (position 5)
$M^0(G_m, R) = \begin{bmatrix} 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&1&0 \\ 0&0&0&0&0&0 \end{bmatrix}$
$M^1(G_m, R) = \begin{bmatrix} 0&0&0&0&1/2&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&1/2&0 \\ 1/2&0&0&1/2&0&1/2 \\ 0&0&0&0&1/2&0 \end{bmatrix}$
$M^2(G_m, R) = \begin{bmatrix} 0&0&0&0&0&0 \\ 0&0&0&0&1&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&1/2&0 \\ 0&1&0&1/2&3&1/2 \\ 0&0&0&0&1/2&0 \end{bmatrix}$

Asn (position 6)
$M^0(G_m, N) = \begin{bmatrix} 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&1 \end{bmatrix}$
$M^1(G_m, N) = \begin{bmatrix} 0&0&0&0&0&0 \\ 0&0&0&0&0&1/2 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&1/2 \\ 0&0&0&0&0&1/2 \\ 0&1/2&0&1/2&1/2&0 \end{bmatrix}$
$M^2(G_m, N) = \begin{bmatrix} 0&0&0&0&0&1 \\ 0&0&0&0&0&0 \\ 0&0&0&0&0&1/2 \\ 0&0&0&0&0&1/2 \\ 0&0&0&0&0&1/2 \\ 1&0&1/2&1/2&1/2&3 \end{bmatrix}$
The zero, first and second powers ofthe local (amino acid) stochastic matrices
M
0
ðG
m
; SÞ¼
100000
000000
000000
000000
000000
000000
2
6
6
6
6
6
6
4
3
7
7
7
7
7
7
5
M
1
ðG
m
; SÞ¼
0
1
4
00
1
4
0
1
6
00000
000000
000000
1
6
00000
000000
2
6
6
6
6
6
6
4
3
7
7
7
7
7
7
5
M
2
ðG
m
; SÞ¼
1
3
0
1
12
1
12
0
1
6
1
6
00000
1
10
00000
1
12
00000
1
8
00000
000000
2
6
6
6
6
6
6
6
6
4
3
7
7
7
7
7
7
7
7
5
M
0
ðG
m
; KÞ¼
000000
010000
000000
000000
000000
000000
2
6
6
6
6
6
6
4
3
7
7
7
7
7
7
5
M
1
ðG
m
; KÞ¼
0
1
4
0000
1
6
0
1
6
00
1
6
0
1
4
0000
000000
000000
0
1
6
0000
2
6
6
6
6
6
6
4
3
7
7
7
7
7
7
5
M
2
ðG
m
; K Þ¼
000000
0
3
7
1
14
1
14
1
7
0
0
1
10
0000
0
1
12
0000
0
1
7
0000
000000
2
6
6
6
6
6
6
6
6
4
3
7
7
7
7
7
7
7
7
5
M
0
ðG
m
; EÞ¼
000000
000000
001000
000000
000000
000000
2
6
6
6
6
6
6
4
3
7
7
7
7
7
7
5
M
1
ðG
m
; EÞ¼
000000
00
1
6
000
0
1
4
1
2
000
000000
000000
000000
2
6
6
6
6
6
6
4
3
7
7
7
7
7
7
5
M
2
ðG
m
; EÞ¼
00
1
12
000
00
1
14
000
1
10
1
10
2
5
00
1
10
000000
000000
00
1
16
000
2
6
6
6
6
6
6
6
6
4
3
7
7
7
7
7
7
7
7
5
acid type in the protein. In the amino acid-type bilinear indices formalism,
each amino acid in the molecule is classified into an amino acid type
(fragment), such as apolar, polar uncharged, polar charged, positively charged,
negatively charged, aromatic, and so on. For all data sets, including those
with a common molecular scaffold as well as those with very diverse structures,
the kth amino acid-type bilinear indices provide important information. The
calculation of the first three values of the local (amino acid) bilinear
indices for the example protein (Tables 2 and 3) is shown in Table 6.
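The Table 6 values can be cross-checked with a short script. The sketch below is not the TOMOCOMD-CAMPS implementation. It assumes the adjacency matrix implied by the first-power matrices of Table 5, it assumes the side-chain weight vectors `X` and `Y` listed in the code (chosen so that the products X[i]·Y[i] give the k = 0 column of Table 6), and all function names (`local_matrix`, `bilinear`, `local_indices`) are hypothetical.

```python
# Sketch: local (amino acid) bilinear indices for the example SKEERN.
# Vertex order: S, K, E, E, R, N.

# Assumed side-chain weights (illustrative, not read from the software).
X = [1.96, 2.84, 3.08, 3.08, 2.88, 3.22]
Y = [-1.63, 1.41, 0.39, 0.39, 2.52, 1.45]

# Assumed total first-power matrix, inferred from the local matrices of Table 5
# (the first Glu carries a loop, hence the 1 on the diagonal).
M1 = [
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
]


def matmul(a, b):
    """Plain matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def power(m, k):
    """k-th power of m, with the identity matrix for k = 0."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, m)
    return result


def row_normalise(m):
    """Divide every row by its sum (stochastic scaling)."""
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def local_matrix(m, vertex):
    """Keep m[v][v] in full, halve the rest of row v and column v, zero elsewhere."""
    n = len(m)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == vertex and j == vertex:
                out[i][j] = m[i][j]
            elif i == vertex or j == vertex:
                out[i][j] = m[i][j] / 2
    return out


def bilinear(x, m, y):
    """Bilinear form x^T m y."""
    return sum(x[i] * m[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def local_indices(k, stochastic=False):
    """Local k-th bilinear index of every vertex, in vertex order."""
    mk = power(M1, k)
    if stochastic:
        mk = row_normalise(mk)
    return [bilinear(X, local_matrix(mk, v), Y) for v in range(len(X))]
```

Under these assumptions, `sum(local_indices(k))` gives the total index of order k, because each off-diagonal entry is split evenly between the two vertices it connects.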
Any local protein bilinear index has a particular meaning, especially for the
first values of k, where the information about the structure of the fragment
$F_R$ is contained. Higher values of k relate to the environment information of
the fragment $F_R$ considered within the macromolecular pseudograph.
In any case, a complete series of indices performs a specific characterization
of the chemical structure. The generalization of the matrices and descriptors
to 'superior analogues' is necessary for the evaluation of situations where
only one descriptor is unable to allow good structural characterization
[64,66]. The local macromolecular indices can also be used together with the
total ones as variables for quantitative structure–activity relationship
(QSAR)/quantitative structure–property relationship (QSPR) modelling of
properties or activities that depend more on a region or a fragment than on the
macromolecule as a whole.
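As a small illustration of how local indices combine, the sketch below sums the nonstochastic local values reported in Table 6 in two ways: over all residues (which gives the total index) and over residues of the same type (which gives amino acid-type indices). The grouping in `AA_TYPE` is an illustrative assumption, and the function names are hypothetical.

```python
# Nonstochastic local bilinear indices (k = 0, 1, 2) as reported in Table 6.
LOCAL_NONSTOCHASTIC = [
    ("S", (-3.1948, -0.8104, -13.0522)),
    ("K", (4.0044, 6.1215, 28.6812)),
    ("E", (1.2012, 3.9264, 5.8605)),
    ("E", (1.2012, 7.3033, 10.3029)),
    ("R", (7.2576, 10.71, 43.578)),
    ("N", (4.669, 13.3352, 23.4674)),
]

# Assumed residue classification, for illustration only.
AA_TYPE = {
    "S": "polar uncharged",
    "N": "polar uncharged",
    "K": "positively charged",
    "R": "positively charged",
    "E": "negatively charged",
}


def total_indices(local):
    """Sum the local indices of all residues, one value per order k."""
    orders = len(local[0][1])
    return [sum(values[k] for _, values in local) for k in range(orders)]


def type_indices(local, aa_type):
    """Sum the local indices of all residues that share an amino acid type."""
    orders = len(local[0][1])
    out = {}
    for residue, values in local:
        bucket = out.setdefault(aa_type[residue], [0.0] * orders)
        for k in range(orders):
            bucket[k] += values[k]
    return out
```

With the Table 6 entries, `total_indices(LOCAL_NONSTOCHASTIC)` recovers the heterodimer row up to rounding, and `type_indices(LOCAL_NONSTOCHASTIC, AA_TYPE)` returns one triple of values per residue class.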
Data preparation
Computation of protein bilinear indices
The calculation of total and local macromolecular
bilinear indices for any peptide or protein was
Table 5. (Continued).
M
0
ðG
m
; EÞ¼
000000
000000
000000
000100
000000
000000
2
6
6
6
6
6
6
4
3
7
7
7
7
7
7
5
M
1
ðG
m
; EÞ¼
000000
000000
000000
0000
1
4
1
4
000
1
6
00
000
1
6
00
2
6
6
6
6
6
6
4
3
7
7
7
7
7
7
5
M
2
ðG
m
; EÞ¼
000
1
12
00
000
1
14
00
000000
1
12
1
12
0
1
3
1
12
1
12
000
1
14
00
000
1
16
00
2
6
6
6
6
6
6
6
6
4
3
7
7
7
7
7
7
7
7
5
M
0
ðG
m
; RÞ¼
000000
000000
000000
000000
000010
000000
2
6
6
6
6
6
6
4
3
7
7
7
7
7
7
5
M
1
ðG
m
; RÞ¼
0000
1
14
0
000000
00000 0
0000
1
14
0
1
6
00
1
6
0
1
6
0000
1
6
0
2
6
6
6
6
6
6
4
3
7
7
7
7
7
7
5
M
2
ðG
m
; RÞ¼
000 0 0 0
000 0
1
7
0
000 0 0 0
000 0
1
12
0
0
1
7
0
1
14
3
7
1
14
000 0
1
16
0
2
6
6
6
6
6
6
6
6
4
3
7
7
7
7
7
7
7
7
5
M
0
ðG
m
; NÞ¼
000000
000000
000000
000000
000000
000001
2
6
6
6
6
6
6
4
3
7
7
7
7
7
7
5
M
1
ðG
m
; NÞ¼
000000
00000
1
6
000000
00000
1
4
00000
1
6
0
1
6
0
1
6
1
6
0
2
6
6
6
6
6
6
4
3
7
7
7
7
7
7
5
M
2
ðG
m
; NÞ¼
000 0 0
1
6
000000
000 0 0
1
10
000 0 0
1
12
000 0 0
1
14
1
8
0
1
16
1
16
1
16
3
8
2
6
6
6
6
6
6
6
6
4
3
7
7
7
7
7
7
7
7
5
Table 6. Values of amino acid-based (local) bilinear indices for the heterodimer SKEERN.

Local nonstochastic bilinear indices
Amino acid              b_0L(x_m, y_m)    b_1L(x_m, y_m)    b_2L(x_m, y_m)
Ser (S)                 −3.1948           −0.8104           −13.0522
Lys (K)                 4.0044            6.1215            28.6812
Glu (E)                 1.2012            3.9264            5.8605
Glu (E)                 1.2012            7.3033            10.3029
Arg (R)                 7.2576            10.71             43.578
Asn (N)                 4.669             13.3352           23.4674
Heterodimer (SKEERN)    15.1386           40.586            98.8378

Local stochastic bilinear indices
Amino acid              sb_0L(x_m, y_m)   sb_1L(x_m, y_m)   sb_2L(x_m, y_m)
Ser (S)                 −3.1948           0.37176667        −2.04034833
Lys (K)                 4.0044            2.6327            4.27309429
Glu (E)                 1.2012            1.8709            1.08062179
Glu (E)                 1.2012            3.4534            1.66443036
Arg (R)                 7.2576            4.6284            6.24537857
Asn (N)                 4.669             4.81723333        3.34964405
Heterodimer (SKEERN)    15.1386           17.7744           14.5728207
implemented in the TOMOCOMD-CAMPS software [67]. The
main steps for the application of this method in
QSAR/QSPR can be briefly summarized:
(1) Draw the macromolecular pseudographs for each protein of the data set,
using the software's drawing mode. This procedure is carried out by selecting
the active amino acid symbol belonging to the 'natural' amino acid code. Here,
we consider covalent (peptide bond) and noncovalent [hydrogen bond and other
electrostatic interactions (within a chain as well as between chains)]
interactions. Afterwards, we draw the mutants by changing an amino acid for
alanine, considering that this change only affects the ability of this region
of the protein to form a polar interaction (because we suppressed the hydrogen
interaction if the former amino acid had it).
(2) Use appropriate amino acid weights to differentiate the side-chain of each
amino acid. In the present study, we used some descriptors for the natural
amino acids as the amino acid property: the three z-values [48],
Kyte–Doolittle's hydrophobicity scale [50], ISA and ECI [49].
(3) Compute the nonstochastic and stochastic protein bilinear indices. This can
be performed in the software calculation mode, where it is possible to select
the side-chain properties and the descriptor family before calculating the
bio-macromolecular indices. The software generates a table in which the rows
and columns correspond to the compounds and the $b_{mk}(x_m, y_m)$,
respectively.
(4) Find a QSPR/QSAR equation by using statistical techniques, such as
multilinear regression analysis, neural networks, linear discriminant analysis
(LDA), and so on. That is to say, we can find a quantitative relationship
between a property P and the $b_{mk}(x_m, y_m)$ having, for example, the
appearance:

$$
P = a_0 b_{m0}(x_m, y_m) + a_1 b_{m1}(x_m, y_m) + a_2 b_{m2}(x_m, y_m) + \cdots + a_k b_{mk}(x_m, y_m) + c \tag{20}
$$

where P is the measurement of the property, $b_{mk}(x_m, y_m)$ [or
$b_{mkL}(x_m, y_m)$] is the kth total [or local] macromolecular nonstochastic
bilinear index, and the $a_k$ are the coefficients obtained by the statistical
analysis.
(5) Test the robustness and predictive power of the QSPR/QSAR equation by using
internal and external cross-validation techniques.
(6) Develop a structural interpretation of the obtained QSAR/QSPR model using
macromolecular bilinear indices as molecular descriptors.
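Steps (4) and (5) can be sketched in a few lines. The example below fits an equation of the form of Eqn (20) by ordinary least squares and estimates a leave-one-out cross-validated q². It is only a sketch under stated assumptions: the descriptor values and the property are synthetic (generated from a known linear relation, not taken from the paper), the solver is a plain Gaussian elimination rather than the statistical package used by the authors, and all function names are hypothetical.

```python
def solve(a, b):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit(descriptors, prop):
    """Least-squares coefficients [a_0, ..., a_k, c] via the normal equations."""
    rows = [list(d) + [1.0] for d in descriptors]  # trailing 1.0 is the intercept
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, prop)) for i in range(p)]
    return solve(xtx, xty)


def predict(coef, descriptor):
    """Evaluate P = sum(a_k * b_k) + c for one descriptor vector."""
    return sum(a * b for a, b in zip(coef, descriptor)) + coef[-1]


def loo_q2(descriptors, prop):
    """Leave-one-out q2 = 1 - PRESS / total sum of squares."""
    mean = sum(prop) / len(prop)
    press = 0.0
    for i in range(len(prop)):
        coef = fit(descriptors[:i] + descriptors[i + 1:], prop[:i] + prop[i + 1:])
        press += (prop[i] - predict(coef, descriptors[i])) ** 2
    total = sum((p - mean) ** 2 for p in prop)
    return 1.0 - press / total


# Synthetic descriptors (two orders of a bilinear index) for six made-up proteins.
TOY_DESCRIPTORS = [(15.1, 40.6), (14.2, 38.9), (16.0, 35.2),
                   (13.5, 42.1), (15.8, 39.7), (12.9, 36.4)]
# Synthetic property generated from a known relation, so the fit is exact.
TOY_PROPERTY = [0.5 * b0 - 0.2 * b1 + 50.0 for b0, b1 in TOY_DESCRIPTORS]
```

Because the synthetic property is an exact linear function of the descriptors, `fit` recovers the generating coefficients and `loo_q2` is essentially 1; with real, noisy data both would of course be lower.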
Database
Arc is a homodimer in which each monomer intertwines with the other to form a
single, globular domain with a well-defined core. Several side-chain hydrogen
bond and salt-bridge interactions are involved in the Arc crystal structure. An
exhaustive representation of these interactions is provided in detail elsewhere
[32]. Nevertheless, an overview of these electrostatic interactions in the Arc
repressor structure is given here. Hydrogen bond interactions take place [32]:
(1) Between side-chains in the same subunit (N29-E36) and between side-chains
in different subunits (R40-S44).
(2) Between a side-chain and a main-chain atom across subunits (W14-N34,
N34-R13) and between a side-chain and a main-chain atom within a subunit
(E17-E17, S32-S35, S44-R40).
On the other hand, salt-bridge interactions take place [32]:
(3) Between side-chains in the same subunit (R16-D20, D20-R23, R31-E36,
E36-R40, E43-K46, E43-K47) and between side-chains in different subunits
(E28-R50, R40-E48).
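The interactions listed above can be written down as a small edge list, which makes it easy to see which noncovalent contacts an alanine substitution would touch. The sketch below is a simplification: it treats every listed interaction involving the mutated residue as affected, without distinguishing which partner contributes the side-chain in the side-chain/main-chain hydrogen bonds. The `Interaction` type and the function name are hypothetical.

```python
from typing import NamedTuple


class Interaction(NamedTuple):
    first: str    # residue label as given in the text, e.g. "N29"
    second: str
    kind: str     # "hydrogen bond" or "salt bridge"
    context: str  # free-text description of the contact


# Interactions in the Arc repressor as enumerated in the text above.
INTERACTIONS = [
    Interaction("N29", "E36", "hydrogen bond", "side-chain/side-chain, same subunit"),
    Interaction("R40", "S44", "hydrogen bond", "side-chain/side-chain, different subunits"),
    Interaction("W14", "N34", "hydrogen bond", "side-chain/main-chain, intersubunit"),
    Interaction("N34", "R13", "hydrogen bond", "side-chain/main-chain, intersubunit"),
    Interaction("E17", "E17", "hydrogen bond", "side-chain/main-chain, intrasubunit"),
    Interaction("S32", "S35", "hydrogen bond", "side-chain/main-chain, intrasubunit"),
    Interaction("S44", "R40", "hydrogen bond", "side-chain/main-chain, intrasubunit"),
    Interaction("R16", "D20", "salt bridge", "same subunit"),
    Interaction("D20", "R23", "salt bridge", "same subunit"),
    Interaction("R31", "E36", "salt bridge", "same subunit"),
    Interaction("E36", "R40", "salt bridge", "same subunit"),
    Interaction("E43", "K46", "salt bridge", "same subunit"),
    Interaction("E43", "K47", "salt bridge", "same subunit"),
    Interaction("E28", "R50", "salt bridge", "different subunits"),
    Interaction("R40", "E48", "salt bridge", "different subunits"),
]


def interactions_involving(residue):
    """Return every listed interaction in which the given residue takes part."""
    return [item for item in INTERACTIONS if residue in (item.first, item.second)]
```

For example, `interactions_involving("R40")` returns the four contacts of R40 (two hydrogen bonds and two salt bridges), whereas a residue that does not appear in the list, such as a hypothetical query for "P8", returns an empty list.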
The data for the Arc repressor mutants were taken from the literature. In the
present study, alanine substitutions were constructed at each of the 51
non-alanine positions in the wild-type Arc sequence. To avoid intracellular
proteolysis and purification difficulties, the alanine substitution mutants
were constructed in backgrounds containing the carboxy-terminal extensions
(His)₆ (designated st6) or (His)₆-Lys-Asn-Gln-His-Glu (designated st11)
[68,69]. These tail sequences allow affinity purification, reduce degradation
and cause no significant changes in protein stability [70].
Milla et al. [32] subjected each purified mutant of Arc to thermal and urea
denaturation experiments. The stability of the proteins was assessed by the
melting temperature ($t_m$). The values of $t_m$ for 53 Arc homodimers reported
by these authors are given in Tables 7 and 8.
In equilibrium and kinetic unfolding–refolding studies, only native Arc dimers
and denatured monomers are significantly populated. Thus, folding and
dimerization are concerted processes [32,71,72]. For this reason, it is
important to note that $t_m$ refers to the unfolding of the Arc homodimer.
Accordingly, one must take into consideration that each single mutation changes
two side-chains in the Arc dimer, with stability effects being approximately
twice those observed for monomeric proteins. Moreover, changes in stability may
arise as a result of a mutation disrupting a native interaction, when the
native structure of
[...]... development of linear discriminant functions, which permits the classification of mutants as having near wild-type stability or reduced stability, and therefore describe the protein stability effects of a complete set of alanine substitutions in the Arc repressor. Here, we consider a general set of data that consists of 53 A-mutants, with 28 of them having near wild-type stability (128) and the remainder being... linear combinations of nonstochastic [Eqn (27)] and stochastic [Eqn (28)] protein bilinear descriptors account for 83% of the variance of the $t_m$ for the cases in the training series; the values of F-ratio for Eqns (27) Table 10. Results of the stochastic bilinear indices-driven LDA models of the Arc A-mutants in the training and test sets. Mutants with near wild-type stability a Mutant DP% 1 PA8-st6 a 2 SA35-st6... 0.96 and 1.03 for Eqns (31) and (32)]. In Tables 11 and 12, we depict the observed, calculated [by using Eqns (29) to (32)] and residual values of $t_m$ for cases in both training and test sets. Different protein folding may be the reason for the lack of linear correlation between protein bilinear indices and stability ($t_m$) for these mutants, leading to a nonlinear dependence between $t_m$ and the protein bilinear... information about the electrostatic interactions among amino acids appears to be necessary. Here, we analyze the relevance of the inclusion of this type of information for obtaining descriptors that encode relevant structural information correlating with the stability changes of the Arc mutants. Accordingly, we compared the accuracies of classification models based on nonstochastic protein bilinear indices...
analysis. This dataset was randomly divided into two subsets: one containing 39 mutants, which was used as a training set, and the other containing nine mutants (five having near wild-type stability and four having reduced stability), which was used as a test set. Combining nonstochastic and stochastic total protein bilinear indices with MLR analysis, we developed the QSSR linear models to describe $t_m$ for... derivation is straightforward, and it is easy to interpret the QSARs/QSPRs that include them. We have shown that the use of protein total bilinear indices can account for the thermodynamic parameters for both wild-type and mutant Arc proteins. The resulting quantitative models are significant from a statistical point of view. Concluding Remarks. In the present study, a new set of bio-macromolecular descriptors... the data set and the test set (full set), the accuracy was 98.11% (52/53) and 96.23% (51/53) for Eqns (25) and (26), respectively, by using nonstochastic and stochastic bilinear indices in that order. These statistical parameters suggest that linear combinations of protein bilinear indices are appropriate for the discrimination of near wild-type stability/reduced stability mutants studied here. Equations... dimer. These results suggest that Arc folding is a rather complicated process that depends on various processes, and the combinations of parameters (bilinear indices calculated with each pair of amino acid properties) are necessary to describe adequately the $t_m$ of these Arc mutants [Eqns (25) and (26)]. From a comparison of the accuracies of classification models based on nonstochastic protein bilinear indices...
importance of protein structural information for the numerical characterization of Arc mutants and its relationship with stability changes. It is well known that salt-bridges and hydrogen bonds play an important role in maintaining the 3D structure of proteins [87]. Therefore, to obtain a useful numerical characterization of proteins for the study of their properties (stability, folding, etc.), the use of information... based on the kind of method used for deriving the QSPR and their statistical parameters, the explored molecular descriptors, the overall accuracy (%), the Matthews correlation coefficient and the validation method used. Table 15 shows a comparison between nonstochastic and stochastic protein bilinear indices-based classification methods and other reported approaches for predicting the stability of the Arc repressor. macromolecular fragments: $b_{mk}(x_m, y_m) = \sum_{L=1}^{Z} b_{mkL}(x_m, y_m)$ (18) and ${}^{s}b_{mk}(x_m, y_m) = \sum_{L=1}^{Z} {}^{s}b_{mkL}(x_m, y_m)$ (19). In addition, the amino acid-type bilinear indices can also be calculated. Amino acid and amino acid-type bilinear indices are specific cases of local protein bilinear indices. In this sense,. TOMOCOMD-CAMPS and protein bilinear indices – novel bio-macromolecular descriptors for protein research: I. Predicting protein stability effects of a complete set of alanine substitutions in the Arc repressor Sadiel. the kth amino acid-type bilinear indices are calculated by summing the kth amino acid bilinear indices of all amino acids of the same amino Table 4. Values of nonstochastic and stochastic total bilinear