10 Graph-Based Analysis of Amino Acid Sequences

Luciano da Fontoura Costa

Copyright 2005 by Taylor & Francis Group, LLC. Medical Image Analysis.

CONTENTS
10.1 Introduction
10.2 Complex-Networks Concepts and Tools
10.2.1 Brief Historic Perspective
10.2.2 Basic Mathematical Concepts
10.2.2.1 Graph Theory Basics
10.2.2.2 Probabilistic Concepts
10.2.2.3 Random Graph Models
10.2.2.4 Small-World and Scale-Free Models
10.3 Complex-Networks Approaches to Bioinformatics
10.4 Sequences of Amino Acids as Weighted, Directed Complex Networks
10.5 Results
10.5.1 Zebra Fish
10.5.2 Xenopus
10.5.3 Rat
10.6 Discussion
10.7 Concluding Remarks and Future Work
Acknowledgments
References

10.1 INTRODUCTION

One of the most essential features underlying natural phenomena and dynamical systems is the multitude of connections, implications, and causalities between the elements and processes involved. For instance, the whole dynamics of gene activation can be understood as a highly complex network of interactions, in the sense that some genes are enhanced while others are inhibited by several environmental factors, including the current biochemical composition of the individual (such as the presence of specific genes/proteins) as well as external effects such as temperature and interaction with other individuals. Interestingly, such a network of effects extends much beyond the individual in time and space, in the sense that any living being is affected by history (i.e., evolutionary processes) and spatial interactions (i.e., ecology). Although biology can only be fully understood and explained by considering the whole of such an intricate network of effects, reductionist approaches can still provide many insights about biological phenomena that are more localized in time and space, such as the genetic dynamics during an individual lifetime or an infectious process.
The large masses of data produced by experimental works in biology, molecular biology, and genetics can only be properly organized, analyzed, and modeled by using computer concepts including databases, networks, parallel computing, and artificial intelligence, with special emphasis placed on signal processing and pattern recognition. The incorporation of such modern computer concepts and tools into biology and genetics has been called bioinformatics [1]. The applications of this new area to genetics are manifold, ranging from nucleotide analysis to animal development. Among the several signal-processing methods considered in bioinformatics [2], we have the application of Markov random fields to model the sequences of nucleotides, the use of correlation and covariance to characterize sequences of nucleotides and amino acids, and wavelets [2, 3].

One particularly important problem concerns the analysis of proteins, the basic blocks of life [4, 5]. Constituted by sequences of amino acids, proteins participate in all vital processes, acting as catalysts; providing the mechanical scaffolding for cells, organs, and tissues; and participating in DNA expression. Proteins are polymers of amino acids, determined from the DNA through the process of protein expression. Many of the properties of proteins derive from their spatial shape and electrical affinities, which are both defined by the specific sequences of constituent amino acids [4, 5]. Therefore, given the sequence of amino acids specified by the DNA, the protein folds into specific forms while taking into account the interactions between the amino acids and the external influence of chaperones. It remains an open problem how to determine the structural properties of proteins from the respective amino acid sequences, a problem known as protein folding [4, 5].
Beyond some basic motifs, such as alpha-helices and beta-sheets, which are structures that appear repeatedly in proteins, the prediction of protein shape remains an area of intense research. Experimentally, the sequences of amino acids underlying proteins can be obtained by using sequencing machines capable of reading the nucleotides, which are subsequently translated into amino acids by considering triples of nucleotides, the so-called codons, translated according to the genetic code [3–5].

By being inherently oriented toward representing connections and implications, graphs stand out as one of the most general and interesting data structures that can be used to represent biological systems. Basically, a graph is a representational structure composed of nodes, which are connected through directed or undirected edges. Any structure or phenomenon can be represented to varying degrees of completeness in terms of graphs, where each node would correspond to an aspect of the phenomenon and the edges to interactions. Such a potential for representation and modeling is greatly extended by the many types of graphs, including those with weighted edges, different types of coexisting nodes or edges, and hypergraphs, to name only a few. Interestingly, most biological phenomena can be properly represented in terms of graphs, including gene activation, metabolic networks, evolution (recall that hierarchical structures such as trees are special kinds of graphs), ecological interactions, and so on.

However, despite the natural potential of graphs for representing and studying natural phenomena, their application remained limited until the recent advent of the area of complex networks. One of the possible reasons is that graphs had often been understood as representations of static interactions, in the sense that the connections between nodes were typically assumed not to change with time.
Thus, the uses of graphs in biology, for instance, were mainly constrained to representing evolutionary hierarchies (in terms of trees) and metabolic networks. This situation underwent an important recent change sparked mainly by the pioneering developments in random networks by Rapoport [6] and Erdős and Rényi [7], the small-world models of Watts and Strogatz [8], and the scale-free networks of Barabási [9]. The research of such types of complex graphs became united under the name of complex networks [10–12]. Now, in addition to the inherent potential of graphs to nicely represent natural phenomena, important connections were established with dynamical systems, statistical physics, and critical phenomena, while many possibilities for multidisciplinary research were established between areas such as graph theory, statistical physics, nonlinear dynamical systems, and complexity theory.

Despite such promising perspectives, one of the often overlooked reasons why complex networks have become so important for modern science is that studies in this area tend to investigate the dynamical evolution of the graphs [10–12], which can provide key insights about the relationship between the topology and function of such complex systems. For example, one of the most interesting properties exhibited by random graphs is the abrupt appearance, as new edges are progressively added at random, of a giant cluster that dominates the graph structure and connections henceforth. Thus, in addition to being typically large (several studies in complex networks consider infinitely large graphs), the graphs were now used to model growing processes. Allied to the inherent vocation of graphs to represent connections, interactions, and causality, the possibility of modeling dynamical evolution in terms of complex networks has made this area into one of the most promising sets of scientific concepts and tools.
The present chapter is aimed at addressing how complex-network research has been applied to bioinformatics, with special attention given to the characterization and analysis of amino acid sequences in proteins. The text starts by reviewing the basic context, concepts, and tools of complex-network research and continues by presenting some of the main applications of this area in bioinformatics. The remainder of the chapter describes the more specific investigation of amino acid sequences in terms of complex networks obtained for graphs derived from subsequence strings.

10.2 COMPLEX-NETWORKS CONCEPTS AND TOOLS

10.2.1 BRIEF HISTORIC PERSPECTIVE

The beginnings of complex-network research can be traced back to the pioneering and outstanding works by Rapoport [6] and Erdős and Rényi [7], who concentrated attention on the type of networks currently known as random networks. This name is somewhat misleading in the sense that many other network models are also random. The essential property of random networks as understood in graph theory, therefore, is not merely that they are random, but that they follow a particular probabilistic model, namely the uniform random distribution [13]. In other words, given a set of N nodes, connections are established by choosing pairs of nodes according to the uniform probability density. In the case of undirected graphs, the edges are uniformly sampled out of the N(N − 1)/2 possible connections. Consequently, random networks correspond to the maximum entropy hypothesis of connectivity evolution, providing a suitable null hypothesis against which several real and theoretical models can be compared and contextualized. One of the most interesting features of random networks is the fact that the progressive addition of new edges tends to abruptly form a giant, dominating cluster (or connected component) in the graph.
Such a critical transition is particularly interesting not only because it represents a sudden change of the network connectivity, but because it provides a nice opportunity for connecting graph theory to statistical physics. Indeed, the appearance of the giant cluster can be understood as a percolation of the graph, similar to critical phenomena (phase transitions) underlying the transformation of ice into water. Basically, percolation corresponds to an abrupt change of some property of the analyzed system as some parameter is continually varied. This interesting connection between graph theory and statistical physics has provided unprecedented opportunities for multidisciplinary works and applications, nicely bridging the gap between areas such as complexity analysis, which is typical of graph theory, and the study of systems involving large numbers of elements, typical in statistical physics. In addition to such an exciting perspective, random networks attracted much interest as possible models of real structures and phenomena in nature, with special emphasis given to the Internet and the World Wide Web.

After the fruitful studies of Rapoport and Erdős and Rényi, the study of large networks (note that the term complex network was not typical at those times) went through a period of continuing academic investigation followed by few applications, except for promising investigations in areas such as sociology. Indeed, one of the next important steps shaping the modern area of complex networks was the investigation of personal interactions in society, of which the 1998 work by Watts and Strogatz [8] represents the basic reference. Basically, experimental investigations regarding social contacts led to the result that the average path length between any two nodes (i.e., persons) is rather small, hence the name small-world networks.
The typical mathematical model of such networks starts with a regular graph, which subsequently has a percentage of its connections rewired according to uniform probability. Although such investigations brought many insights to the area, the small-world property was later verified to be an almost ubiquitous property of complex networks.

The subsequent investigations of the topological properties of the Internet and WWW performed by Albert and Barabási [9] led to the important discovery that the statistical distribution of the node degrees (i.e., the number of connections of a node) in several complex networks tends to follow a power law, indicating scale-free behavior. Unlike the random model, this property favors the appearance of nodes concentrating many of the connections, the so-called hubs. Such an underlying structure has several implications, for instance regarding resilience to attack, with the network being particularly fragile when hubs are targeted. From then on, the developments in complex-network research boomed, covering several types of natural systems, from epidemics to economy. The interested reader is encouraged to check the excellent surveys of this area [10–12] for complementary information.

10.2.2 BASIC MATHEMATICAL CONCEPTS

This section provides a brief introductory review of basic concepts and measurements in graph theory, statistics, random graphs, and small-world and scale-free networks. Readers who are already familiar with such topics can proceed directly to Section 10.3.

10.2.2.1 Graph Theory Basics

Basically, a typical graph [14–17] in complex-network theory [10–12] involves a collection of N nodes i = 1, 2, …, N that are connected through edges (i,j) that can have weights w(i,j). Such a data structure is precisely and completely represented by the respective weight matrix W, where each entry W(j,i) represents the weight of edge (i,j). Nonexistent edges are represented as null entries in that matrix.
The adjacency matrix K of the graph is a matrix where the value 1 is assigned to an element (i,j) whenever there is an edge connecting node j to node i, and 0 otherwise. The adjacency matrix can be obtained from the weight matrix by setting each element larger than or equal to a specific threshold value T to 1, assigning 0 otherwise. Such adjacency matrices, henceforth represented as K_T, indicate the network structure defined by the weights that are at least as high as the threshold. Therefore, the adjacency matrix for high values of T can be understood as the strongest component, or "kernel," of the weighted graph. Observe that it is also possible to consider the complementary matrix of K_T with respect to K, which is defined as follows. Each element (i,j) of such a matrix, hence abbreviated as Q_T, receives value 1 iff K_T(i,j) = 0 and K(i,j) ≠ 0.

An undirected graph is characterized by undirected edges, so that K(j,i) = 1 iff K(i,j) = 1, i.e., K is symmetric. A directed graph, or digraph, is characterized by directed edges and not necessarily by a symmetric adjacency matrix. One of the most basic and interesting local features of a graph or network is the number of connections of a specific node i, which is called the node degree and often abbreviated as k_i. Observe that a directed graph has two types of such a degree, the indegree and the outdegree, corresponding to the number of incoming and outgoing edges, respectively.

Figure 10.1 illustrates the concepts introduced here with respect to an undirected graph G and a directed graph H, identifying the nodes, edges, and weights. This figure also shows the respective weight matrices W_G and W_H and adjacency matrices A_G and A_H. The degree of node 1 in G is 2, the outdegree of node 1 in H is 2, and the indegree of node 1 in H is 1. N is equal to 4 for both graphs.
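The matrix conventions above can be made concrete with a short sketch. It is only an illustration under stated assumptions: the matrix `W_H` is the weight matrix of graph H as it can be read from the extracted Figure 10.1, indices are 0-based in the code (1-based in the text), and the function names are hypothetical helpers rather than anything defined in the chapter.

```python
# Weight matrix of the directed graph H, as read from Figure 10.1 (assumed
# reconstruction). Convention from the text: W[j][i] holds the weight of the
# edge going from node i to node j, so columns are sources and rows targets.
W_H = [
    [0, 0, 2, 0],
    [2, 0, 0, 0],
    [3, 0, 0, 1],
    [0, 0, 0, 0],
]


def adjacency(W):
    """Return K: 1 wherever an edge exists (nonzero weight), 0 elsewhere."""
    return [[1 if w != 0 else 0 for w in row] for row in W]


def threshold_adjacency(W, T):
    """Return K_T: 1 where an edge exists and its weight is >= T."""
    return [[1 if (w != 0 and w >= T) else 0 for w in row] for row in W]


def complement(K, K_T):
    """Return Q_T: 1 where K has an edge that K_T does not retain."""
    return [[1 if (k == 1 and kt == 0) else 0 for k, kt in zip(rk, rkt)]
            for rk, rkt in zip(K, K_T)]


def in_degrees(K):
    """Indegree of node j: number of nonzero entries in row j."""
    return [sum(row) for row in K]


def out_degrees(K):
    """Outdegree of node i: number of nonzero entries in column i."""
    return [sum(col) for col in zip(*K)]


K = adjacency(W_H)
K_2 = threshold_adjacency(W_H, 2)   # keep only edges with weight >= 2
Q_2 = complement(K, K_2)            # edges dropped by that threshold

# Node 1 of H (index 0): outdegree 2 and indegree 1, as stated in the text.
print(out_degrees(K)[0], in_degrees(K)[0])
print(Q_2)
```

With this storage convention the outdegree is a column sum and the indegree a row sum; swapping the two is the most common mistake when W(j,i) rather than W(i,j) stores the weight of edge (i,j).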
A great part of the importance of graphs stems from their generality for representing, in an intuitive and explicit way, virtually any discrete structure while emphasizing the involved entities (nodes) and connections. Indeed, virtually every data structure (e.g., tree, queue, list) is a particular case of a graph. In addition, graphs can be used to represent the most general mesh of points used for numeric simulation of dynamic systems, from the regular orthogonal lattice used in image representation to the most intricate adaptive triangulations. As such, graphs are poised to provide one of the keys for connecting not only structure and function, but also several different biological areas and even the whole of science.

Several measurements or features have been proposed and used to express meaningful and useful global properties of the network structure. In similar fashion to feature selection in the area of pattern recognition (e.g., [13]), the choice of such features has to take into account the specific problem of interest. For instance, a problem of communication along the network needs to take into account the distance between nodes. It should be observed that, in most cases, the selected set of features is degenerate, in the sense that it is not enough to reproduce the original network structure. Therefore, great care must be taken when deriving general conclusions based on incomplete sets of measurements, as is almost always the case. Some of the more traditional network measurements are reviewed in the following paragraphs.

The global measurement usually derived from the node degree is its average value over the whole network. Observe that, for a digraph, the average indegree and outdegree are necessarily identical. The average node degree gives a first idea about the overall connectivity of the network.
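The statement that the average indegree and outdegree of a digraph coincide follows from a one-line counting argument, sketched here. The symbols E (total number of directed edges) and the superscripts "in" and "out" are notation introduced only for this illustration.

```latex
% Every directed edge leaves exactly one node and enters exactly one node,
% so both degree sums count each of the E edges exactly once.
\sum_{i=1}^{N} k_i^{\mathrm{in}} \;=\; \sum_{i=1}^{N} k_i^{\mathrm{out}} \;=\; E
\quad\Longrightarrow\quad
\langle k^{\mathrm{in}} \rangle \;=\; \langle k^{\mathrm{out}} \rangle \;=\; \frac{E}{N}
```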
Additional information about the network connectivity can be obtained from the average clustering coefficient. Given one specific node i, the immediately connected nodes are identified, and the ratio between the number of connections between them and the maximum possible value of those connections defines the clustering coefficient of node i, i.e., C_i. This feature tends to express the local connectivity around each node.

FIGURE 10.1 Basic concepts in graph theory: examples of undirected (G) and directed (H) graphs, with respective nodes, edges, and weights. The weight matrices of G and H are W_G and W_H, and the respective adjacency matrices considering threshold T = 1 are given as A_G and A_H.

W_G =        W_H =        A_G =        A_H =
0 2 4 0      0 0 2 0      0 1 1 0      0 0 1 0
2 0 0 0      2 0 0 0      1 0 0 0      1 0 0 0
4 0 0 1      3 0 0 1      1 0 0 0      1 0 0 0
0 0 1 0      0 0 0 0      0 0 0 0      0 0 0 0

Another interesting and frequently used network measurement is the length between any two nodes i and j, here denoted as L(i,j). This distance may refer either to the minimal sum of weights along a path from i to j, or to the number of edges along the shortest path between those two nodes. The present work is restricted to the latter. The respectively derived global feature is the average length considering all possible pairs of network nodes. This measurement provides an idea not only about the proximity between nodes, but also about the overall network connectivity, in the sense that low average-distance values tend to indicate a densely connected structure.

Another interesting measurement that has been used to characterize complex networks is the betweenness centrality. Roughly, the betweenness centrality of a specific network node in an undirected graph corresponds to the number of shortest paths between any pair of nodes in the network that cross that node [18].
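The clustering coefficient and the average length defined in this section can be sketched in a few lines. The graph below is a hypothetical four-node example, not one taken from the chapter, and assigning a coefficient of 0 to nodes with fewer than two neighbors is a convention assumed here because the text does not cover that case.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected graph (not from the chapter): a triangle 0-1-2
# with a pendant node 3 attached to node 2, stored as adjacency sets.
graph = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2},
}


def clustering_coefficient(g, i):
    """Ratio of existing to possible edges among the neighbors of node i."""
    neighbors = g[i]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than 2 neighbors
    links = sum(1 for a, b in combinations(neighbors, 2) if b in g[a])
    return links / (k * (k - 1) / 2)


def path_lengths_from(g, source):
    """Edge count of the shortest path from source to each reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in g[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_path_length(g):
    """Mean of L(i, j) over all ordered pairs of distinct, reachable nodes."""
    total, pairs = 0, 0
    for i in g:
        for j, d in path_lengths_from(g, i).items():
            if j != i:
                total += d
                pairs += 1
    return total / pairs


avg_c = sum(clustering_coefficient(graph, i) for i in graph) / len(graph)
avg_l = average_path_length(graph)
print(avg_c, avg_l)
```

Breadth-first search suffices because the length considered in this work is the number of edges; the weighted variant mentioned in the text would require a shortest-path algorithm such as Dijkstra's instead.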
10.2.2.2 Probabilistic Concepts

Any measurement whose outcome cannot be exactly predicted, such as the weight of an inhabitant of Chicago, can be represented in terms of a random variable [13, 19]. Such variables can be completely characterized in terms of the respective density functions, which can be approximated in terms of the respective relative frequency histogram. Alternatively, a random variable can also be represented in terms of its (possibly infinitely many) moments, including the mean, variance, and so on. Statistical density functions of special interest for this chapter include the uniform distribution, which assigns the same probability to any possible measurement, and the Poisson distribution, which is characterized in terms of a rate of event occurrence per length, area, or volume. For instance, failures in an electricity transmission cable may occur at a rate of one failure per 10,000 km. In this case, the event is equally likely to be observed at any point along the considered parameter (e.g., length or time) of the considered structure (e.g., the transmission cable).

Such concepts can be immediately extended to multivariate measurements by introducing the concept of random vector. For instance, the temperature and pressure of an inhabitant of Chicago can be represented as the two-dimensional random vector (T, P). Such statistical entities are also completely characterized, in statistical terms, by their respective multivariate densities. Statistical and probabilistic concepts and techniques are essential for representing and modeling natural phenomena and biological data because of the intrinsic variation of such measurements.

10.2.2.3 Random Graph Models

The first type of complex networks to be systematically investigated were the random graphs [6, 7, 10–12, 20]. In such graphs, one starts with N unconnected nodes and progressively adds edges between pairs of nodes chosen according to the uniform distribution.
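The growth process just described, together with the giant cluster introduced in Section 10.2.1, can be simulated with a short sketch. This is a minimal illustration under assumptions, not the procedure used in the chapter: the function name `grow_random_graph`, the union-find bookkeeping, and the values N = 200 and 300 edges are all choices made here for the example.

```python
import random


def grow_random_graph(n, n_edges, seed=0):
    """Add n_edges distinct undirected edges, each chosen uniformly at random,
    and record the size of the largest cluster after every addition."""
    rng = random.Random(seed)
    parent = list(range(n))   # union-find forest over the n nodes
    size = [1] * n            # cluster size, valid at each root

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # All N(N-1)/2 possible undirected edges; sampling without replacement
    # gives every possible edge the same chance of being selected.
    all_edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
    largest, history = 1, []
    for i, j in rng.sample(all_edges, n_edges):
        ri, rj = find(i), find(j)
        if ri != rj:                       # the edge merges two clusters
            if size[ri] < size[rj]:
                ri, rj = rj, ri
            parent[rj] = ri
            size[ri] += size[rj]
            largest = max(largest, size[ri])
        history.append(largest)
    return history


n = 200
history = grow_random_graph(n, n_edges=300)
for e in (50, 100, 150, 300):
    # fraction of nodes in the largest cluster after e edges
    print(e, history[e - 1] / n)
```

Plotting the recorded fraction against the number of edges, for several seeds, is one way to observe how abruptly the largest cluster grows; the exact values depend on the seed and on N.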
Although the measurements described in Section 10.2.2.1 are useful for characterizing the structure of such networks, it is also important to take into account parameters and measurements governing their dynamical evolution, including the critical phenomenon of percolation. As more connections are progressively added to a growing network, there is a definite tendency to form a giant cluster (percolation), which henceforth dominates the growing dynamics. Given a network, a cluster is understood as a set of nodes (and respective interconnecting edges) such that one can reach any node while starting from any other node in the cluster, i.e., the cluster is a connected component of the graph. The giant cluster corresponds to the cluster with the largest number of nodes at a given step of the network evolution. For an undirected random network, this phenomenon has been found to take place when the percentage of existing connections with respect to the maximum possible number of connections is about 1/N [5].

10.2.2.4 Small-World and Scale-Free Models

The types of complex networks known as small world and scale free were identified and studied years after Erdős and Rényi investigated random graphs. Small-world networks [8, 10] are characterized by short paths between any pair of their constituent nodes. A typical example of such a network is the social interactions within a given society, in the sense that there are just a few (about five or six) relations between any two persons. Characterized later than small-world models, the scale-free networks [10–12] are characterized by the fact that the statistical distribution of the respective node degrees follows a power law, i.e., the representation of such a density in a log-log plot produces a straight line.
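The link between a power law and a straight line in a log-log plot can be written out explicitly. The symbols P(k) for the degree density, c for the normalization constant, and the exponent written as gamma are notation assumed here for illustration.

```latex
% A power-law degree density and its logarithm: the second expression is
% linear in log k, with slope -gamma and intercept log c.
P(k) = c\,k^{-\gamma}
\quad\Longrightarrow\quad
\log P(k) = \log c \;-\; \gamma \log k
```

Under this notation, the slope of the line fitted to the log-log plot gives an estimate of the exponent, which is the kind of value quoted for the biological networks reviewed in Section 10.3.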
Such densities, unlike those observed for other types of networks, imply a substantially higher chance of having nodes of high degree, which are traditionally called hubs. As reviewed in the next section, such nodes have been identified as playing an especially important role in biological networks. Scale-free networks can be produced by using the preferential-attachment growth strategy [10–12], characterized by the progressive addition of new nodes with a fixed number of edges that are connected preferentially with nodes of higher degree, giving rise to the paradigm that has become known as "the rich get richer." At the same time, scale-free networks have also been shown to be less resilient to random node attachments than other types of networks, such as random graphs [10].

10.3 COMPLEX-NETWORKS APPROACHES TO BIOINFORMATICS

Several possibilities of using complex networks and statistical physics in biology have been described and reviewed by Bose in his interesting and extensive survey [21]. Special attention is given to relationships between the network's topology and functional properties, and the following three situations are covered in considerable depth:

1. The topology of complex biological networks, such as metabolic and protein-interaction networks
2. Nonlinear dynamics in gene expression
3. The effect of stochasticity on the network dynamics

While we review in the following some of the most representative works applying complex-network research to biology, the reader is encouraged to complement and extend our review by referring to Bose's survey.

Metabolic reactions, one of the key elements of life, were among the first to be studied by complex-network approaches. Such networks have their nodes representing the molecular compounds (or substrates), and the edges indicate the metabolic reactions connecting substrates.
Incoming links to a substrate are understood to correspond to the reactions of which that substrate is a product. The pioneering investigation by Jeong et al. [22] considered networks that are available for 43 organisms, yielding average node indegree and outdegree in the range from 2.5 to 4, with the respective distribution being understood as scale free with exponents close to 2.2. The metabolic reactions of E. coli have been studied as undirected graphs by Wagner and Fell [23], yielding an average node degree of 7 and a clustering coefficient (approximately 0.3) much larger than could be obtained for a random network.

An interesting investigation into whether the duplication of information in genomes can significantly affect the power-law exponents was reported by Chung et al. [24]. By using probabilistic methods as the means to analyze the evolution of graphs under duplication mechanisms, those authors were able to show that such mechanisms can produce networks with low power-law exponents, which are compatible with many biological networks [25].

The decomposition of biochemical networks into hierarchies of subnetworks, i.e., networks obtained by considering a subset of the nodes of the original graph and some of the respective edges, has been addressed by Holme and Huss [18]. These authors use the algorithm of Girvan and Newman [26] for tracing subnetworks, in a form adapted to bipartite representations of biochemical networks. The underlying principle of the algorithm is the fact that vertices lying between densely connected areas have high betweenness centrality, such that the removal of such vertices leads to the partition of the whole network into subnetworks that are contained in previous clusters, thereby producing a hierarchy of subnetworks.

Another extremely important type of biological network, corresponding to genomic regulatory systems (i.e., the set of processes controlling gene expression), has also been the subject of increasing attention in complex-network research.
This type of directed network is characterized by having nodes corresponding to components of the system, with the edges representing the gene-expression regulations [11]. An important type of network in this category is that obtained from protein–protein interactions. In this type of network, each node corresponds to a protein, and the directed edges represent the interactions. A model of regulatory networks has been described by Kuo and Banzhaf [27]. A pioneering approach in this area is the work of Jeong et al. [28], which considered protein–protein interaction networks of S. cerevisiae, containing thousands of edges and nodes. The degree distribution was interpreted as following scale-free behavior with an approximate exponent of 2.5. One of the most important conclusions of that investigation was that the removal of the most-connected proteins (i.e., hubs, the nodes of a complex network receiving a large number of connections) can have disastrous effects on the proper functioning of the individual. The issue of protein–protein interaction networks has also been considered in a number of other works, including those by Qin et al. [29], Wagner [30], and Pastor-Satorras et al. [31], as well as in studies of the properties and evolution of such networks. Another related work, described by Wuchty [32], considered graphs obtained by assigning a node to every protein domain (or module) and an edge whenever two such domains are found in the same protein.

The important problem of determining protein function has been addressed from the perspective of networks of physical interaction by Vazquez et al. [33]. Their method is based on the minimization of the number of interacting proteins with different categories, so that the function estimation can be performed on a global scale while considering the entire connectivity of the protein network.
The obtained results corroborate the validity of using protein–protein interaction networks as a means of inferring protein function, despite the unavoidable presence of imperfections and the incompleteness of protein networks.

The analysis of gene-expression networks in terms of embedded complex logistics maps (ECLM), a hybrid method blending some concepts from wavelets and coupled logistics maps, has been reported by Shaw [34]. That study considered 112 genes collected at nine different time instants along 25 days, with each time point being fitted to an ECLM model with high Pearson correlation coefficient, and the connections between genes were determined by considering models with high pairwise correlation. The obtained connections were interpreted as following scale-free behavior in both topology and dynamics.

A work by Bumble et al. [35] suggests that the study of pathways of network syntheses of genes, metabolism, and proteins should be extended to the investigation of the causes and treatment of diseases. Their approach involves methods capable of yielding, for a specific set of candidate reactions, a complete metabolic pathway network. Interesting results are obtained by investigating qualitative attributes, including relationships regarding the connectivity between vertices and the strength of connections, the relationship of interaction energies and chemical potentials with the coordination number of the lattice models, and how the stability of the networks is related to their topology.

An interesting approach to analyzing the amino acid sequences of a protein in terms of successively overlapping strings of length K has been described by Hao et al. [36]. The strings of amino acids are represented as graphs by associating each possible subsequence of length K with a graph node, and having the edges represent the observed successive transitions of subsequences.
Their investigation targeted the reconstruction of the original sequences from the overlapping string networks, which can be approached by counting the number of Eulerian loops (i.e., cyclic sequences of connected edges that are followed without repetition). More specifically, the sequences are reconstructed while starting with the same initial subsequence, using each of the subsequences the same number of times as observed in the original data, and respecting a fixed sequence length. It was verified that the reconstruction is unique for K ≥ 5 for the majority of the considered networks (PDB.SEQ database [37]).

The present work addresses co-occurrence strings of amino acids (or any other basic biological element) similar to the scheme described in the previous paragraph, but here the subsequences do not necessarily overlap, and the number of times a subsequence is followed by another is represented by the weight of the respective edge in the associated graph, following the same scheme used for concept association as described in the literature [38, 39]. More specifically, whenever a subsequence of amino acids B is followed by another subsequence C, the weight of the edge connecting the two nodes representing those subsequences is increased by 1. Therefore, such a weighted, directed graph provides information about the number of times a specific subsequence is followed by other possible subsequences, which can be related to the statistical concept of correlation, with the difference that the order of the data is, unlike in the correlation, taken into account. As such, the obtained graph can be explored to characterize and model sequences of amino acids according to varying subsequence sizes.
Moreover, by thresholding the weight matrix for successive threshold values, it is possible to identify subgraphs of the network corresponding to a strongly connected kernel of subsequences.

10.4 SEQUENCES OF AMINO ACIDS AS WEIGHTED, DIRECTED COMPLEX NETWORKS

A protein can be specified in terms of its respective sequence of amino acids, represented by the string $S = A_1 A_2 \ldots A_N$, where each element $A_i$ corresponds to one of the 20 possible amino acids, as indicated in Table 10.1. It is possible to subsume an amino acid sequence S by grouping subsequences of amino acids into new numerical codes with higher values, in a way similar to that described by Hao et al. [36]. The grouping scheme adopted in this work is illustrated in Figure 10.2, where the first and second groups contain m and n amino acids, respectively. While it is possible to consider m ≠ n, we henceforth adopt m = n. The groups are taken with an overlap of g positions, with 0 ≤ g ≤ m. For each reference position i, we have two numerical codes B and C, obtained as follows:

\[
B = (A_i - 1)\,20^{m-1} + \cdots + (A_{i+m-2} - 1)\,20 + A_{i+m-1} \tag{10.1}
\]

and

\[
C = (A_{i+m-g} - 1)\,20^{n-1} + \cdots + (A_{i+m+n-g-2} - 1)\,20 + A_{i+m+n-g-1} \tag{10.2}
\]

Therefore, we have that $1 \le B, C \le 20^m$.

FIGURE 10.2 The grouping scheme considered in this work, including two successive windows of size m and n, with overlap of g elements.

An example of this coding scheme is given in the following.
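Equations 10.1 and 10.2 can also be read as a short computation. The Python sketch below is an illustration only: the names `CODE`, `group_code`, and `window_codes` are hypothetical, positions are 0-based in the code (the chapter uses 1-based positions), and the amino acid order is the one listed in Table 10.1.

```python
# Order of Table 10.1: A = 1, R = 2, ..., V = 20.
AMINO_ACIDS = "ARDNCEQGHILKMFPSTWYV"
CODE = {aa: k for k, aa in enumerate(AMINO_ACIDS, start=1)}

def group_code(codes):
    """Code of one window, as in Equations 10.1 and 10.2.

    Every symbol except the last contributes (a - 1) times a power of 20;
    the last symbol contributes its own code, so results start at 1.
    """
    value = 0
    for a in codes[:-1]:
        value = (value + (a - 1)) * 20
    return value + codes[-1]

def window_codes(sequence, m, n, g):
    """List of (B, C) pairs for all reference positions of `sequence`."""
    codes = [CODE[aa] for aa in sequence]
    pairs = []
    # The second window ends at 0-based index i + m + n - g - 1.
    for i in range(len(codes) - (m + n - g) + 1):
        b = group_code(codes[i:i + m])
        c = group_code(codes[i + m - g:i + m - g + n])
        pairs.append((b, c))
    return pairs

print(window_codes("MEQW", m=2, n=2, g=0))  # [(246, 138)]
```

Here M = 13 and E = 6 give B = (13 - 1)·20 + 6 = 246, and Q = 7 and W = 18 give C = (7 - 1)·20 + 18 = 138.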
Let the original protein sequence in abbreviated amino acids be

S = MEQWPLLFVVALCI

or, in numerical codes,

S = (13)(6)(7)(18)(15)(11)(11)(14)(20)(20)(1)(11)(5)(10)

TABLE 10.1 Amino Acids and Respective Numerical Codes

Abbreviation   Numerical Code
A              1
R              2
D              3
N              4
C              5
E              6
Q              7
G              8
H              9
I              10
L              11
K              12
M              13
F              14
P              15
S              16
T              17
W              18
Y              19
V              20

For m = n = 2 and g = 0, we have:

i     B     C
1     246   138
2     107   355
3     138   291
4     355   211
5     291   214
6     211   280
7     214   400
8     280   381
9     400   11
10    381   205
11    11    90

Similarly, for m = n = 3 and g = 1, we obtain:

i     B      C
1     4907   2755
2     2138   7091
3     2755   5811
4     7091   4214
5     5811   4280
6     4214   5600
7     4280   7981
8     5600   7611
9     7981   205
10    7611   4090

Observe that the different ranges of i obtained in these two examples are a direct consequence of the fact that the larger size of the subsequences in the second example reduces the number of possible subsequence associations. Now, having defined the grouping scheme and the resulting sequences B and C, the graph representing the subsequent (with possible overlap) co-occurrences of numerical codes in this sequence is obtained as follows:

1. Each code in the sequences B and C is represented as one of the N nodes of the graph, whose number corresponds to the code produced for the respective sequence. For instance, the sequence (13)(6) implies a graph with two nodes identified as 13 and 6 containing a directed edge from node 13 to node 6. Therefore, for a given m = n, we have a maximum of $20^m$ nodes, numbered from 1 to $20^m$. Observe, however, that the resulting network does not necessarily include all possible nodes, allowing a reduction of the network size.
2. Every time a code B is followed by a code C, the weight of the edge from node B to node C is incremented by 1. In other words, the weight of the edge uniting two specific sequences B and C is equal to the number of times those two sequences are found to follow one another, in that same order, along the analyzed sequence of amino acids.

Figure 10.3 illustrates the graph obtained from the sequence (13)(6)(7)(18)(15)(11)(11)(14)(20)(20)(1)(11)(5)(10)(15)(11)(14) considering m = 1, where each node is represented by the respective code, and the edge weights (shown in italics) represent the number of successive subsequence (in this case a single amino acid) transitions.

In this sense, the obtained graph represents the "unidirectional" correlations between two subsequent (with possible overlap) subsequences of amino acids in the analyzed protein. Such a network can be understood as a statistical model of the original protein for the specific correlation length implied by m and g. As such, it is possible to obtain simulated sequences of amino acids following such statistical models by performing Monte Carlo simulation over the outdegrees of each node, in the sense that each outgoing edge is taken with frequency corresponding to its respective normalized weight (i.e., the normalized weights of the outgoing edges of each node add up to 1). Therefore, the transition probabilities are proportional to the respective weights. Observe that the statistically normalized weight matrix of the network corresponds to a Markov chain, as the sum of any of its columns will be equal to 1.

By thresholding the weight matrix for successive values of T (see Section 10.2.2), it is possible to obtain a family of graphs that can be understood as follows. The clusters defined for the highest values of T represent the kernels of the whole weighted network, corresponding to the subsequence associations that are most representative and most frequent along the whole protein. As the threshold is lowered, these kernels are augmented by incorporation of new nodes and merging of existing clusters.
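Two of the operations described above, the Monte Carlo generation of sequences and the thresholding of the weights, can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used to obtain the results in this chapter. It assumes that weights are indexed as (source, target), so that the outgoing probabilities of each node sum to 1 (the column-sum property mentioned above corresponds to the transposed indexing), that an edge survives a threshold T when its weight is at least T, and that clusters are taken as weakly connected groups of nodes. All function names are hypothetical.

```python
import random
from collections import defaultdict

def count_transitions(codes):
    """Weight of edge (b, c) = number of times code b is followed by code c."""
    weights = defaultdict(int)
    for b, c in zip(codes, codes[1:]):
        weights[(b, c)] += 1
    return dict(weights)

def transition_probabilities(weights):
    """Normalize the outgoing weights of every node so that they sum to 1."""
    totals = defaultdict(int)
    for (source, _), w in weights.items():
        totals[source] += w
    probs = defaultdict(dict)
    for (source, target), w in weights.items():
        probs[source][target] = w / totals[source]
    return dict(probs)

def simulate(probs, start, steps, rng=None):
    """Random walk in which each outgoing edge is taken with its probability."""
    rng = rng or random.Random()
    path = [start]
    for _ in range(steps):
        successors = probs.get(path[-1])
        if not successors:  # node without outgoing edges: stop early
            break
        targets = list(successors)
        chances = [successors[t] for t in targets]
        path.append(rng.choices(targets, weights=chances)[0])
    return path

def largest_cluster(weights, threshold):
    """Size of the largest cluster formed by edges with weight >= threshold
    (assumed rule), ignoring edge direction."""
    neighbors = defaultdict(set)
    for (b, c), w in weights.items():
        if w >= threshold:
            neighbors[b].add(c)
            neighbors[c].add(b)
    seen, best = set(), 0
    for node in list(neighbors):
        if node in seen:
            continue
        seen.add(node)
        stack, size = [node], 0
        while stack:  # depth-first traversal of one cluster
            current = stack.pop()
            size += 1
            for other in neighbors[current] - seen:
                seen.add(other)
                stack.append(other)
        best = max(best, size)
    return best

# The m = 1 example sequence used for Figure 10.3.
example = [13, 6, 7, 18, 15, 11, 11, 14, 20, 20, 1, 11, 5, 10, 15, 11, 14]
weights = count_transitions(example)
probs = transition_probabilities(weights)
walk = simulate(probs, start=13, steps=10, rng=random.Random(0))
print(walk)
print([largest_cluster(weights, t) for t in (1, 2, 3)])
```

For this short example the only edges with weight 2 are 15 → 11 and 11 → 14, so raising the threshold from 1 to 2 shrinks the largest cluster from all 11 nodes to the three nodes 15, 11, and 14, a small-scale version of the kernel behavior described above.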
Such a threshold-based evolution of the graph can be related to the evolutionary history of the protein formation, in the sense that the kernels would have appeared first and served as organizing structures around which the rest of the molecule evolved. At the same time, the strongest connections in the obtained network also reflect the repetition of basic protein motifs, such as alpha helices and beta sheets.

10.5 RESULTS

In the following investigations, we consider proteins from three animal species: zebra fish, Xenopus (frog), and rat. The gene sequencing data were obtained from the NIH Gene Collection repository (http://zgc.nci.nih.gov, files dr_mgc_cds_aa.fasta, xl_mgc_cds_aa.fasta, and rn_mgc_cds_aa.fasta). The raw data consisted of sequences of amino acids for the 2948, 1977, and 640 proteins (each containing on average about 400 amino acids) in each of those files. The obtained results, which consider m = n = 2 and g = 0, are presented respectively for each species in the following subsections.

The average node degree was obtained by adding all columns of the adjacency matrix. The clustering coefficient was obtained by identifying the n nodes connected to each node and dividing the number of existing edges between those nodes by n(n − 1)/2, i.e., the maximum number of edges between those nodes. The minimum distances were calculated by using Dijkstra's method [14].

FIGURE 10.3 The network obtained for m = 1 for the amino acid sequence (13)(6)(7)(18)(15)(11)(11)(14)(20)(20)(1)(11)(5)(10)(15)(11)(14). The weights of the edges are shown in italics.

10.5.1 ZEBRA FISH

The obtained 400 × 400 weight matrix (recall from the previous section that $400 = 20^m = 20^2$) had a maximum value of 487, obtained for the transition from SS to SS, and a minimum value of zero was obtained for 15,274 transitions.
The maximum weight for transition between different nodes was 170, observed for the transition from EE to ED. The performed measurements included the average node degree (Figure 10.4(a)), clustering coefficient (Figure 10.5(a)), average length (Figure 10.6(a)), and maximum cluster size (Figure 10.7(a)) for the series of thresholded matrices $K_T$ (solid lines) and $Q_T$ (dashed lines) obtained for T = 1, 2, …, 170.

We also calculated the indegree and outdegree densities, which are shown in Figure 10.8(a) and Figure 10.8(b), respectively, for T = 0. It is clear from this figure that both node degrees tend to be similar to one another, presenting a plateau for 6 < log(k) < 8.4 followed by a sharp decrease of node degree. The self-connections between nodes representing subsequences of two identical amino acids, immediately obtained from the diagonal of the respective adjacency matrices, are given in Table 10.2.

The initial kernel was also identified for T = 95, with the obtained digraph shown in Figure 10.9, where the edge widths correspond to the respective weights. Observe that although the original graph was thresholded at T to obtain the kernel in Figure 10.9, the graph in that figure incorporates all edges, including those with weight smaller than T, to provide a more comprehensive visualization of the obtained kernel. This fully connected (except self-connections, which were not considered in this case) digraph presents dominance of the E, D, and S amino acids, with strong connections obtained for the node EE. The maximum weight was 170, obtained for the transition from node EE to ED.

10.5.2 XENOPUS

The weight matrix had a maximum value of 293, obtained for the transition from EE to EE, and a minimum value of zero was obtained for 22,787 transitions. The maximum weight for transition between different nodes was 207, observed for the transition from GP to LQ.
The performed measurements included the average node degree (Figure 10.4(b)), clustering coefficient (Figure 10.5(b)), average length (Figure 10.6(b)), and maximum cluster size (Figure 10.7(b)) for the series of thresholded matrices $K_T$ (solid lines) and $Q_T$ (dashed lines) obtained for T = 1, 2, …, 170. The indegree and outdegree densities are shown in Figure 10.10(a) and Figure 10.10(b), respectively, for T = 0. Both densities again tend to be similar to one another, presenting a plateau for 6 < log(k) < 8 followed by a sharp decrease of node degree. The self-connections between nodes representing subsequences of two identical amino acids are given in Table 10.2. The initial kernel containing nine nodes was identified for T = 64, with the obtained digraph shown in Figure 10.11, which is dominated by the P and G amino acids.

10.5.3 RAT

The weight matrix had a maximum value of 98, obtained for the transition from LL to LL, and a minimum value of zero was obtained for 69,792 transitions. Such a large number of null transitions is a consequence of the smaller number of proteins available for this animal in the original data. The maximum weight for transition between different nodes was 35, observed for the transition from LL to AA. The performed measurements included the average node degree (Figure 10.4(c)), clustering coefficient (Figure 10.5(c)), average length (Figure 10.6(c)), and maximum cluster size (Figure 10.7(c)) for the series of thresholded matrices $K_T$ (solid lines) and $Q_T$ (dashed lines) obtained for T = 1, 2, …, 35.

FIGURE 10.4 The average node degree as a function of the weight threshold T (solid line = $K_T$, dashed line = $Q_T$) for (a) zebra fish data, (b) Xenopus, and (c) rat.
The indegree and outdegree densities are shown in Figure 10.12(a) and Figure 10.12(b), respectively, for T = 0. Both of the resulting node degrees were again similar to one another, presenting a plateau for 4 < log(k) < 6 followed by a moderate decrease of node degree. The self-connections between nodes representing subsequences of two identical amino acids are given in Table 10.2. The initial kernel was also identified for T = 22, with the obtained digraph shown in Figure 10.13. The dominant amino acids were L and A.

10.6 DISCUSSION

The number of proteins and the overall amino acid sequence lengths available differ among the three species. However, the clustering coefficient, average length, and maximum cluster size are determined from the respective adjacency matrices (not the weights) and are therefore statistically more significant, so that we can attempt a comparison between such measurements in the case of zebra fish and Xenopus.

It is clear from Figure 10.4 that, as expected, the average node degree of the graph $K_T$ decreases monotonically with the threshold value T, while the opposite happens for $Q_T$. The abrupt way in which the average node degree varies for the thresholded and complementary matrices suggests that a kind of phase transition (critical phenomenon) takes place as the values of T are increased.

As shown in Figure 10.5, the average clustering coefficient for $K_T$ tends to decrease steadily with the threshold values, undergoing a relatively abrupt transition (near T = 20 for zebra fish), while the clustering coefficient of $Q_T$ increases even more abruptly near T = 10, suggesting a phase transition also for this measurement.
Generally, the local connectivity reaches less than 10% of its maximum value after just one-third of the considered T excursion, which suggests that the network connectivity is dominated by a few strong connections surrounded by much smaller connection weights.

The average lengths of $K_T$ shown in Figure 10.6 suffer from the typical problem that such distances tend to fall as a consequence of the disappearance of connections. In other words, because nonexistent edges are not considered for the average length calculation, a network containing no connections has null average length, less than for a fully connected network, for which the average length would be 1 (overlooking self-connections). In any case, the average length presents a sharp discontinuity (near T = 80 for zebra fish, 60 for Xenopus, and 20 for rat), possibly indicating that a large number of edges are cut by thresholds larger than these values. At the same time, the maximum average lengths in each case are similar and relatively small. An abrupt increase of the average length is observed for $Q_T$ for small values of T, indicating that this matrix indeed undergoes an abrupt change in its connectivity for small threshold values.

The graphs in Figure 10.7 show that the maximum cluster size for $K_T$ decreases steadily for higher threshold values, as expected.

FIGURE 10.5 Average clustering coefficient as a function of the weight threshold T (solid line = $K_T$, dashed line = $Q_T$) for (a) zebra fish, (b) Xenopus, and (c) rat data.
The maximum cluster size for $Q_T$ remained fixed at 400, confirming that the complementary matrix is highly connected.

As indicated in Figure 10.8, Figure 10.10, and Figure 10.12, the node degree densities tend to present two distinct regions: one plateau portion at the left-hand side, followed by an abrupt descending portion at the right-hand side of the graph. While the indegree and outdegree densities also produced similar profiles for the three species, the respective kernels identified at different threshold levels (because of the different length of the amino acid sequences) were found to be rather different, with distinct pairs of amino acids dominating each kernel. While such a result may be strongly affected by the different amounts of data available for each of the considered species, it may also suggest different fundamental structures for the amino acid sequencing in those animals.

10.7 CONCLUDING REMARKS AND FUTURE WORK

This chapter has addressed the promising perspective of using modern complex-network concepts and tools as a means of characterizing, modeling, and analyzing biological sequences, with special attention given to amino acid sequences in proteins. After presenting a brief historic perspective of complex-network research and some of its most representative applications to bioinformatics, the basic concepts of complex networks and respective topological measurements were presented. The problem of characterizing proteins in terms of weighted digraphs obtained from consecutive (with possible overlap) subsequences of amino acids was addressed next, with respect to proteins from zebra fish, Xenopus, and rat.
This investigation included the calculation of the average node degree, the average clustering coefficient, the average length (in number of edges), and the size of the maximum cluster in the graph for a sequence of threshold values. The obtained curves were found to provide interesting insights about the structure of the overall protein, especially regarding the appearance of critical transitions of several of the considered measurements as T was increased. In addition, kernels were identified for each case, suggesting an interesting basic organization in the amino acid sequences. Despite the different sizes of the amino acid sequences, which do imply problems of statistical meaningfulness, some interesting trends have been identified regarding the comparison of the measurements obtained for the three different species, especially the general similarity between the topological properties for each species, while completely different kernels and dominant amino acids have been identified for those cases.

FIGURE 10.6 Average length as a function of the weight threshold T (solid line = $K_T$, dashed line = $Q_T$) for (a) zebra fish, (b) Xenopus, and (c) rat data.

TABLE 10.2 Self-Connections of Subsequences Composed of Two Identical Amino Acids

Subsequence   Zebra fish   Xenopus   Rat
AA            274          126       27
RR            85           41        10
DD            216          186       11
NN            23           13        2
CC            14           3         3
EE            467          293       49
QQ            216          95        11
GG            310          79        21
HH            67           48        0
II            8            8         2
LL            161          126       98
KK            188          104       29
MM            0            4         0
FF            24           9         0
PP            299          176       71
SS            487          233       61
TT            52           16        6
WW            0            0         0
YY            6            2         0
VV            13           19        3

Note: Entries are the numbers of self-connections for each species.
Future extensions of this work include the consideration of other m, n, and g configurations, the use of additional structural features such as betweenness centrality as well as the ratios suggested in the literature [40, 41], and the identification of the hierarchical backbone of the directed network, as suggested in the literature [39]. It would also be possible to consider the progressive merging of nodes and connected components into the initial kernel to obtain the hierarchical structure underlying the growth of the kernel, with possible applications to the complex problem of protein folding [42]. Finally, it would be interesting to use such measurements to compare proteins (in terms of amino acids and bases) from the same or distinct individuals, as well as to infer the phylogenetic evolution of the proteins. In the case of DNA analysis, the obtained topological measurements can provide a means for distinguishing between coding and noncoding regions.

FIGURE 10.7 Maximum cluster size of $K_T$ as a function of the weight threshold T for (a) zebra fish, (b) Xenopus, and (c) rat data.

ACKNOWLEDGMENTS

The author is grateful to Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP, proc. 99/12765-2), Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq, proc. 308231/03-1), and the Human Frontier Science Program for financial support.

FIGURE 10.8 Log-log plot of (a) in- and (b) outdegree distributions for T = 0, weighted by the intensity of the edges (zebra fish data).
FIGURE 10.9 The ten-node kernel obtained for T = 95 for zebra fish data. The weights are represented in terms of the edge widths. The maximum and minimum weights are 170 and zero, the latter corresponding to self-connections, as these have been excluded from the matrix used to obtain this picture. Nodes: EE, ED, EL, EK, LE, SD, SE, AE, DD, DE.

FIGURE 10.10 Log-log plot of (a) in- and (b) outdegree distributions for T = 0, weighted by the intensity of the edges (Xenopus data).

FIGURE 10.11 The nine-node kernel obtained for T = 64 for Xenopus data. The weights are represented in terms of the edge widths. Nodes: GE, GA, RG, AG, SG, PP, PG, GP, GL.

FIGURE 10.12 Log-log plot of (a) in- and (b) outdegree distributions for T = 0, weighted by the intensity of the edges (rat data).

FIGURE 10.13 The ten-node kernel obtained for T = 2 for the rat data. The weights are represented in terms of the edge widths. Nodes: LA, GL, EA, AL, AA, VL, LS, LF, LL, LG.

REFERENCES

1. Baldi, P. and Brunak, S., Bioinformatics, MIT Press, Cambridge, MA, 2001.
2. da F. Costa, L., Signal processing in bioinformatics, IEEE Proc. Digital Signal Process. Conf., New Jersey, 2002, pp. 23–27.
3. Durbin, R., Eddy, S., Krogh, A., and Mitchison, G., Biological Sequence Analysis, Cambridge University Press, Cambridge, U.K., 1998.
4. Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K., and Watson, J.D., Molecular Biology of the Cell, 3rd ed., Garland Publishing, New York, 1994.
5. Garrett, R.H. and Grisham, C.M., Biochemistry, Saunders College Publishing, Fort Worth, TX, 1995.
6. Rapoport, A., Contribution to the theory of random and biased nets, Bull. Math. Biophys., 19, 257–277, 1957.
7. Erdös, P. and Rényi, A., On the evolution of random graphs, Publicationes Mathematicae, 6, 290–297, 1959.
8. Watts, D.J. and Strogatz, S.H., Collective dynamics of small-world networks, Nature, 393, 440–442, 1998.
9. Albert, R., Jeong, H., and Barabási, A.L., The diameter of the world-wide web, Nature, 401, 130–131, 1999.
10. Albert, R. and Barabási, A.L., Statistical mechanics of complex networks, Rev. Mod. Phys., 74, 47–97, 2002.
11. Dorogovtsev, S.N. and Mendes, J.F.F., Evolution of networks, Adv. Phys., 51, 1079–1187, 2002.
12. Newman, M.E.J., The structure and function of complex networks, SIAM Rev., 45, 167–256, 2003.
13. da F. Costa, L. and Cesar, R.M., Jr., Shape Analysis and Classification: Theory and Practice, CRC Press, Boca Raton, FL, 2001.
14. Aldous, J.M. and Wilson, R.J., Graphs and Applications: An Introductory Approach, Springer-Verlag, London, 2000.
15. West, D.B., Introduction to Graph Theory, Prentice Hall, Upper Saddle River, NJ, 2001.
16. Harary, F., Graph Theory, Addison-Wesley, Reading, MA, 1995.
17. Bollobas, B., Modern Graph Theory, Springer-Verlag, Heidelberg, 1998.
18. Holme, P., Huss, M., and Jeong, H., Subnetwork hierarchies of biochemical pathways, Bioinformatics, 19, 532–538, 2003.
19. Alon, N. and Spencer, J.H., The Probabilistic Method, Wiley Interscience, New York, 2000.
20. Bollobas, B., Random Graphs, Cambridge University Press, Cambridge, U.K., 2001.
21. Bose, I., Biological networks, available online at http://arxiv.org/abs/cond-mat/0202192; last accessed March 2005.
22. Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N., and Barabási, A.L., The large-scale organization of metabolic networks, Nature, 407, 651–654, 2000.
23. Wagner, A. and Fell, D.A., The small world inside large metabolic networks, Proc. R. Soc. London B, 268, 1803–1810, 2001.
24. Chung, F., Lu, L., Dewey, T.G., and Galas, D.J., Duplication models for biological networks, Journal of Computational Biology, 10(5), 677–688, 2003.
25. Aiello, W., Chung, F., and Lu, L., in Proc. 32nd Annu. ACM Symp. Theory Computing, 171–180, 2000.
26. Girvan, M. and Newman, M.E.J., Community structure in social and biological networks, Proc. Nat. Acad. Sci. USA, 99, 7821–7826, 2002.
27. Kuo, P.D. and Banzhaf, W., Small world and scale-free network topologies in an artificial regulatory network model, Journal of Biological Physics and Chemistry, 4, 85–92, 2004.
28. Jeong, H., Mason, S.P., Barabási, A.L., and Oltvai, Z.N., Lethality and centrality in protein networks, Nature, 411, 41–42, 2001.
29. Qin, H., Lu, H.S.S., Wu, W.B., and Li, W.H., Evolution of the yeast protein interaction network, Proc. Nat. Acad. Sci., 100, 12,820–12,824, 2003.
30. Wagner, A., How large protein interaction networks evolve, Proceedings of the Royal Society of London Series B, 270, 457–466, 2003.
31. Pastor-Satorras, R., Smith, E., and Solé, R.V., Evolving protein interaction networks through gene duplication, J. Theor. Biol., 222, 199–210, 2003.
32. Wuchty, S., Scale-free behavior in protein domain networks, Mol. Biol. Evol., 18, 1697–1702, 2001.
33. Vazquez, A., Flammini, A., Maritan, A., and Vespignani, A., Global protein function prediction in protein-protein interaction networks, Nat. Biotech., 21, 697–700, 2003.
34. Shaw, S., Evidence of scale-free topology and dynamics in gene regulatory networks, in Proc. ISCA 12th International Conference on Intelligent and Adaptive Systems and Software Engineering, 37–40, 2003 (ISBN 1880843471).
35. Bumble, S., Friedler, F., and Fan, L.T., A toy model for comparative phenomenon in molecular biology and the utilization of biochemical applications of PNS in genetic applications, available online at http://arxiv.org/abs/cond-mat/0304348.
36. Hao, B., Xie, H., and Zhang, S., Compositional representation of protein sequence and the number of Eulerian loops, available online at http://arxiv.org/abs/physics/0103028.
37. PDB.SEQ, A Collection of SWISS-PROT Entries; available online at http://www.expasy.org/sprot.
38. da F. Costa, L., What's in a name?, International Journal of Modern Physics C, 15, 371–379, 2004.
39. da F. Costa, L., The hierarchical backbone of complex networks, Physical Review Letters, 93(9), paper 98702, 4 p., 2004.
40. da F. Costa, L., L-percolations of complex networks, Physical Review E: Statistical Physics, Plasmas, Fluids and Related Interdisciplinary Topics, 70, paper 056106, 8 p., 2004.
41. da F. Costa, L., Reinforcing the resilience of complex networks, Physical Review E: Statistical Physics, Plasmas, Fluids and Related Interdisciplinary Topics, 69, paper 066127, 7 p., 2004.
42. Crescenzi, P., Goldman, D., Papadimitriou, C., Piccolboni, A., and Yannakakis, M., On the complexity of protein folding, in Annu. Conf. Res. Computational Molecular Biol., ACM, New York, 1998, pp. 61–62.
2089_book.fm copy Page 363 Tuesday, May 10, 2005 9:34 PM 10 Graph-Based Analysis of Amino Acid Sequences Luciano da Fontoura Costa CONTENTS 10.1 Introduction 10.2 Complex-Networks Concepts and Tools 10.2.1 Brief Historic Perspective 10.2.2 Basic Mathematical Concepts 10.2.2.1 Graph Theory Basics 10.2.2.2 Probabilistic Concepts 10.2.2.3 Random Graph Models 10.2.2.4 Small-World and Scale-Free Models 10.3 Complex-Networks Approaches to Bioinformatics 10.4 Sequences of Amino Acids as Weighted, Directed Complex Networks 10.5 Results 10.5.1 Zebra Fish 10.5.2 Xenopus 10.5.3 Rat 10.6 Discussion 10.7 Concluding Remarks and Future Work Acknowledgments References 10.1 INTRODUCTION One of the most essential features underlying natural phenomena and dynamical systems are the many connections, implications, and causalities between the several involved elements and processes For instance, the whole dynamics of gene activation can be understood as a highly complex network of interactions, in the sense that some genes are enhanced while others are inhibited by several environmental factors, including the current biochemical composition of the individual (such as the presence of specific genes/proteins) as well as external effects such as temperature and interaction with other individuals Interestingly, such a network of effects extends much beyond the individual in time and space, in the sense that any living being is Copyright 2005 by Taylor & Francis Group, LLC 2089_book.fm copy Page 364 Tuesday, May 10, 2005 9:34 PM 364 Medical Image Analysis affected by history (i.e., evolutionary processes) and spatial interactions (i.e., ecology) Although biology can only be fully understood and explained by considering the whole of such an intricate network of effects, reductionist approaches can still provide many insights about biological phenomena that are more localized in time and space, such as the genetic dynamics during an individual lifetime or an infectious process The large masses 
of data produced by experimental works in biology, molecular biology, and genetics can only be properly organized, analyzed, and modeled by using computer concepts including databases, networks, parallel computing, and artificial intelligence, with special emphasis placed on signal processing and pattern recognition The incorporation of such modern computer concepts and tools into biology and genetics has been called bioinformatics [1] The applications of this new area to genetics are manifold, ranging from nucleotide analysis to animal development Among the several signal-processing methods considered in bioinformatics [2], we have the application of Markov random fields to model the sequences of nucleotides, the use of correlation and covariance to characterize sequences of nucleotides and amino acids, and wavelets [2, 3] One particularly important problem concerns the analysis of proteins, the basic blocks of life [4, 5] Constituted by sequences of amino acids, proteins participate in all vital processes, acting as catalysts; providing the mechanical scaffolding for cells, organs, and tissues; and participating in DNA expression Proteins are polymers of amino acids, determined from the DNA through the process of protein expression Many of the properties of proteins derive from their spatial shape and electrical affinities, which are both defined by the specific sequences of constituent amino acids [4, 5] Therefore, given the sequence of amino acids specified by the DNA, the protein folds into specific forms while taking into account the interactions between the amino acids and external influence of chaperones It remains an open problem how to determine the structural properties of proteins from the respective amino acid sequences, a problem known as protein folding [4, 5] Except for some basic motifs, such as alpha-helices and beta-sheets, which are structures that appear repeatedly in proteins, the prediction of protein shape constitutes an intense research 
area. Experimentally, the sequences of amino acids underlying proteins can be obtained by using sequencing machines capable of reading the nucleotides, which are subsequently translated into amino acids by considering triples of nucleotides, the so-called codons, according to the genetic code [3–5].

By being inherently oriented toward representing connections and implications, graphs stand out as one of the most general and interesting data structures that can be used to represent biological systems. Basically, a graph is a representational structure composed of nodes, which are connected through directed or undirected edges. Any structure or phenomenon can be represented, to varying degrees of completeness, in terms of graphs, where each node corresponds to an aspect of the phenomenon and the edges to interactions. Such a potential for representation and modeling is greatly extended by the many types of graphs, including those with weighted edges, different types of coexisting nodes or edges, and hypergraphs, to name only a few. Interestingly, most biological phenomena can be properly represented in terms of graphs, including gene activation, metabolic networks, evolution (recall that hierarchical structures such as trees are special kinds of graphs), ecological interactions, and so on.

However, despite the natural potential of graphs for representing and studying natural phenomena, their application remained limited until the recent advent of the area of complex networks. One possible reason is that graphs had often been understood as representations of static interactions, in the sense that the connections between nodes were typically assumed not to change with time. Thus, the uses of graphs in biology, for instance, were mainly constrained to representing evolutionary hierarchies (in terms of
trees) and metabolic networks. This situation underwent an important recent change, sparked mainly by the pioneering developments in random networks by Rapoport [6] and Erdős and Rényi [7], the small-world models of Watts and Strogatz [8], and the scale-free networks of Barabási [9]. The research on such types of complex graphs became united under the name of complex networks [10–12]. Now, in addition to the inherent potential of graphs to represent natural phenomena, important connections were established with dynamical systems, statistical physics, and critical phenomena, while many possibilities for multidisciplinary research opened up between areas such as graph theory, statistical physics, nonlinear dynamical systems, and complexity theory.

Beyond such promising perspectives, one often overlooked reason why complex networks have become so important for modern science is that studies in this area tend to investigate the dynamical evolution of the graphs [10–12], which can provide key insights about the relationship between the topology and function of such complex systems. For example, one of the most interesting properties exhibited by random graphs is the abrupt appearance, as new edges are progressively added at random, of a giant cluster that henceforth dominates the graph structure and connections. Thus, in addition to being typically large (several studies in complex networks consider infinitely large graphs), graphs were now used to model growing processes. Allied to the inherent vocation of graphs to represent connections, interactions, and causality, the possibility of modeling dynamical evolution in terms of complex networks has made this area one of the most promising sources of scientific concepts and tools.

The present chapter addresses how complex-network research has been applied to bioinformatics, with special attention given to the characterization and analysis of amino acid sequences in proteins. The text starts by reviewing the
basic context, concepts, and tools of complex-network research and continues by presenting some of the main applications of this area in bioinformatics. The remainder of the chapter describes the more specific investigation of amino acid sequences in terms of complex networks obtained for graphs derived from subsequence strings.

10.2 COMPLEX-NETWORKS CONCEPTS AND TOOLS

10.2.1 BRIEF HISTORIC PERSPECTIVE

The beginnings of complex-network research can be traced back to the pioneering works by Rapoport [6] and Erdős and Rényi [7], who concentrated attention on the type of networks currently known as random networks. This name is somewhat misleading in the sense that many other network models are also random. The essential property of random networks as understood in graph theory, therefore, is not merely that they are random, but that they follow a particular probabilistic model, namely the uniform random distribution [13]. In other words, given a set of N nodes, connections are established by choosing pairs of nodes according to the uniform probability density. In the case of undirected graphs, the edges are uniformly sampled out of the N(N − 1)/2 possible connections. Consequently, random networks correspond to the maximum-entropy hypothesis of connectivity evolution, providing a suitable null hypothesis against which several real and theoretical models can be compared and contextualized.

One of the most interesting features of random networks is the fact that the progressive addition of new edges tends to abruptly form a giant, dominating cluster (or connected component) in the graph. Such a critical transition is particularly interesting not only because it represents a sudden change of the network connectivity, but also because it provides an opportunity for connecting graph theory to statistical physics. Indeed, the appearance of the giant cluster
can be understood as a percolation of the graph, similar to the critical phenomena (phase transitions) underlying the transformation of ice into water. Basically, percolation corresponds to an abrupt change of some property of the analyzed system as some parameter is continuously varied. This connection between graph theory and statistical physics has provided unprecedented opportunities for multidisciplinary works and applications, bridging the gap between areas such as complexity analysis, which is typical of graph theory, and the study of systems involving large numbers of elements, which is typical of statistical physics. In addition to such an exciting perspective, random networks attracted much interest as possible models of real structures and phenomena in nature, with special emphasis given to the Internet and the World Wide Web.

After the fruitful studies of Rapoport and Erdős and Rényi, the study of large networks (note that the term complex network was not typical at those times) went through a period of continuing academic investigation followed by few applications, except for promising investigations in areas such as sociology. Indeed, one of the next important steps shaping the modern area of complex networks was the investigation of personal interactions in society, of which the 1998 work by Watts and Strogatz [8] represents the basic reference. Basically, experimental investigations regarding social contacts led to the result that the average length between any two nodes (i.e., persons) is rather small, hence the name small-world networks. The typical mathematical model of such networks starts with a regular graph, which subsequently has a percentage of its connections rewired according to uniform probability. Although such investigations brought many insights to the area, the small-world property was later verified to be an almost ubiquitous property of complex networks.

The subsequent investigations of the topological properties of the Internet and WWW
performed by Albert and Barabási [9] led to the important discovery that the statistical distribution of the node degrees (i.e., the number of connections of a node) in several complex networks tends to follow a power law, indicating scale-free behavior. Unlike the random model, this property favors the appearance of nodes concentrating many of the connections, the so-called hubs. Such an underlying structure has several implications, for instance regarding resilience to attack: such networks are particularly fragile when their hubs are attacked. From then on, the developments in complex-network research boomed, covering several types of natural systems, from epidemics to economy. The interested reader is encouraged to check the excellent surveys of this area [10–12] for complementary information.

10.2.2 BASIC MATHEMATICAL CONCEPTS

This section provides a brief introductory review of basic concepts and measurements in graph theory, statistics, random graphs, and small-world and scale-free networks. Readers who are already familiar with such topics can proceed directly to Section 10.3.

10.2.2.1 Graph Theory Basics

Basically, a typical graph [14–17] in complex-network theory [10–12] involves a collection of N nodes i = 1, 2, …, N that are connected through edges (i,j) that can have weights w(i,j). Such a data structure is precisely and completely represented by the respective weight matrix W, where each entry W(j,i) represents the weight of edge (i,j). Nonexistent edges are represented as null entries in that matrix. The adjacency matrix K of the graph is a matrix where the value 1 is assigned to an element (i,j) whenever there is an edge connecting node j to i, and 0 otherwise. The adjacency matrix can be obtained from the weight matrix by setting each element larger than or equal to a specific threshold value T to 1, and assigning 0 otherwise. Such adjacency matrices, henceforth
represented as KT, indicate the network structure defined by the weights that are not smaller than the threshold. Therefore, the adjacency matrix for high values of T can be understood as the strongest component, or “kernel,” of the weighted graph. Observe that it is also possible to consider the complementary matrix of KT with respect to K, which is defined as follows. Each element (i,j) of such a matrix, henceforth abbreviated as QT, receives value 1 iff KT(i,j) = 0 and K(i,j) = 1.

An undirected graph is characterized by undirected edges, so that K(j,i) = 1 iff K(i,j) = 1, i.e., K is symmetric. A directed graph, or digraph, is characterized by directed edges and not necessarily by a symmetric adjacency matrix. One of the most basic and interesting local features of a graph or network is the number of connections of a specific node i, which is called the node degree and often abbreviated as ki. Observe that a directed graph has two types of such a degree, the indegree and the outdegree, corresponding to the number of incoming and outgoing edges, respectively.

Figure 10.1 illustrates the concepts introduced here with respect to an undirected graph G and a directed graph H, identifying the nodes, edges, and weights. This figure also shows the respective weight matrices WG and WH and adjacency matrices AG and AH. The degree of node … in G is 2, the outdegree of node … in H is 2, and the indegree of node … in H is …. N is equal to … for both graphs.

A great part of the importance of graphs stems from their generality for representing, in an intuitive and explicit way, virtually any discrete structure while emphasizing the involved entities (nodes) and connections. Indeed, virtually every data structure (e.g., tree, queue, list) is a particular case of a graph. In addition, graphs
can be used to represent the most general mesh of points used for numeric simulation of dynamic systems, from the regular orthogonal lattice used in image representation to the most intricate adaptive triangulations. As such, graphs are poised to provide one of the keys for connecting not only structure and function, but also several different biological areas and even the whole of science.

FIGURE 10.1 Basic concepts in graph theory: examples of undirected (G) and directed (H) graphs, with respective nodes, edges, and weights. The weight matrices of G and H are WG and WH, and the respective adjacency matrices considering threshold T = … are given as AG and AH.

Several measurements or features have been proposed and used to express meaningful and useful global properties of the network structure. In similar fashion to feature selection in the area of pattern recognition (e.g., [13]), the choice of such features has to take into account the specific problem of interest. For instance, a problem of communication along the network needs to take into account the distance between nodes. It should be observed that, in most cases, the selected set of features is degenerate, in the sense that it is not enough to reproduce the original network structure. Therefore, great care must be taken when deriving general conclusions based on incomplete sets of measurements, as is almost always the case. Some of the more traditional network measurements are reviewed in the following paragraphs.

A global measurement usually derived from the node degree is its average value over the whole network. Observe that, for a digraph, the average indegree and outdegree are necessarily identical. The average node degree gives a first idea about the overall connectivity of the network. Additional information about the network connectivity can be obtained from the average clustering coefficient. Given one specific node i, the immediately connected nodes are identified, and the ratio between the number of
connections between them and the maximum possible value of those connections defines the clustering coefficient of node i, i.e., Ci. This feature tends to express the local connectivity around each node.

Another interesting and frequently used network measurement is the length between any two nodes i and j, here denoted as L(i,j). This distance may refer either to the minimal sum of weights along a path from i to j, or to the minimal number of edges along such a path. The present work is restricted to the latter. The respectively derived global feature is the average length considering all possible pairs of network nodes. This measurement provides an idea not only about the proximity between nodes, but also about the overall network connectivity, in the sense that low average-distance values tend to indicate a densely connected structure.

Another interesting measurement that has been used to characterize complex networks is the betweenness centrality. Roughly, the betweenness centrality of a specific network node in an undirected graph corresponds to the number of shortest paths between any pair of nodes in the network that cross that node [18].

10.2.2.2 Probabilistic Concepts

Any measurement whose outcome cannot be exactly predicted, such as the weight of an inhabitant of Chicago, can be represented in terms of a random variable [13, 19]. Such variables can be completely characterized in terms of the respective density functions, which can be approximated by the respective relative frequency histograms. Alternatively, a random variable can also be represented in terms of its (possibly infinitely many) moments, including the mean, variance, and so on. Statistical density functions of special interest for this chapter include the uniform distribution, which assigns the same probability to any possible measurement, and
the Poisson distribution, which is characterized in terms of a rate of event occurrence per length, area, or volume. For instance, the chance of having a failure in an electricity transmission cable may be equal to one failure per 10,000 km. In this case, the event is equally likely to be observed at any position along the considered parameter (e.g., the length of the transmission cable, or time).

Such concepts can be immediately extended to multivariate measurements by introducing the concept of random vector. For instance, the temperature and pressure of an inhabitant of Chicago can be represented as the two-dimensional random vector [T, P]. Such statistical entities are also completely characterized, in statistical terms, by their respective multivariate densities. Statistical and probabilistic concepts and techniques are essential for representing and modeling natural phenomena and biological data because of the intrinsic variation of such measurements.

10.2.2.3 Random Graph Models

The first type of complex network to be systematically investigated was the random graph [6, 7, 10–12, 20]. In such graphs, one starts with N unconnected nodes and progressively adds edges between pairs of nodes chosen according to the uniform distribution. Although the measurements described in Section 10.2.2.1 are useful for characterizing the structure of such networks, it is also important to take into account parameters and measurements governing their dynamical evolution, including the critical phenomenon of percolation. As more connections are progressively added to a growing network, there is a definite tendency to form a giant cluster (percolation), which henceforth dominates the growing dynamics. Given a network, a cluster is understood as a set of nodes (and respective interconnecting edges) such that one can
reach any node while starting from any other node in the cluster, i.e., the cluster is a connected component of the graph. The giant cluster corresponds to the cluster with the largest number of nodes at a given step of the network evolution. For an undirected random network, this phenomenon has been found to take place when the fraction of existing connections with respect to the maximum possible number of connections is about 1/N [5].

10.2.2.4 Small-World and Scale-Free Models

The types of complex networks known as small world and scale free were identified and studied years after Erdős and Rényi investigated random graphs. Small-world networks [8, 10] are characterized by a short path between any pair of their constituent nodes. A typical example of such a network is the set of social interactions within a given society, in the sense that there are just a few (about five or six) relations between any two persons. Characterized later than small-world models, scale-free networks [10–12] are distinguished by the fact that the statistical distribution of the respective node degrees follows a power law, i.e., the representation of such a density in a log-log plot produces a straight line. Such densities, unlike those observed for other types of networks, imply a substantially higher chance of having nodes of high degree, which are traditionally called hubs. As reviewed in the next section, such nodes have been identified as playing an especially important role in biological networks. Scale-free networks can be produced by using the preferential-attachment growth strategy [10–12], characterized by the progressive addition of new nodes with a fixed number of edges that are connected preferentially to nodes of higher degree, giving rise to the paradigm that has become known as “the rich get richer.” At the same time, scale-free networks have also been shown to be less resilient to attacks targeting their hubs than other types of networks, such as random graphs, while tolerating random node removal well [10].

10.3 COMPLEX-NETWORKS APPROACHES TO BIOINFORMATICS

Several possibilities of using complex networks and statistical physics in biology have been described and reviewed by Bose in his extensive survey [21]. Special attention is given to relationships between the network’s topology and functional properties, and the following three situations are covered in considerable depth: the topology of complex biological networks, such as metabolic and protein-interaction networks; nonlinear dynamics in gene expression; and the effect of stochasticity on the network dynamics. While we review in the following some of the most representative works applying complex-network research to biology, the reader is encouraged to complement and extend our review by referring to Bose’s survey.

Metabolic reactions, one of the key elements of life, were among the first to be studied by complex-network approaches. Such networks have their nodes representing the molecular compounds (or substrates), and the edges indicate the metabolic reactions connecting substrates. Incoming links to a substrate are understood to correspond to the reactions of which that substrate is a product. The pioneering investigation by Jeong et al. [22] considered networks that are available for 43 organisms, yielding average node indegree and outdegree in the range from 2.5 to 4, with the respective distribution being understood as scale free with exponents close to 2.2. The metabolic reactions of E. coli have been studied as undirected graphs by Wagner and Fell [23], yielding an average node degree of … and a clustering coefficient (approximately 0.3) much larger than could be obtained for a random network.

An interesting investigation into whether the duplication of information in genomes can significantly affect the power-law exponents was reported by Chung et al. [24]. By using probabilistic methods as
the means to analyze the evolution of graphs under duplication mechanisms, those authors were able to show that such mechanisms can produce networks with low power-law exponents, which are compatible with many biological networks [25].

The decomposition of biochemical networks into hierarchies of subnetworks, i.e., networks obtained by considering a subset of the nodes of the original graph and some of the respective edges, has been addressed by Holme and Huss [18]. These authors use the algorithm of Girvan and Newman [26] for tracing subnetworks, in a form adapted to bipartite representations of biochemical networks. The underlying principle of the algorithm is the fact that vertices lying between densely connected areas have high betweenness centrality, such that the successive removal of such vertices leads to the partition of the whole network into subnetworks that are contained in previous clusters, thereby producing a hierarchy of subnetworks.

Another extremely important type of biological network, corresponding to genomic regulatory systems (i.e., the set of processes controlling gene expression), has also been the subject of increasing attention in complex-network research. This type of directed network is characterized by having nodes corresponding to components of the system, with the edges representing the gene-expression regulations [11]. An important type of network in this category is that obtained from protein-protein interactions. In this type of network, each node corresponds to a protein, and the directed edges represent the interactions. A model of regulatory networks has been described by Kuo and Banzhaf [27]. A pioneering approach in this area is the work of Jeong et al. [28], which considered protein-protein interaction networks of S. cerevisiae, containing thousands of edges and nodes. The degree distribution was interpreted as following scale-free behavior with an approximate exponent of 2.5. One of the most important conclusions of that investigation was that the removal of the
most-connected proteins (i.e., hubs, the nodes of a complex network receiving a large number of connections) can have disastrous effects on the proper functioning of the individual. The issue of protein-protein interaction networks has also been considered in a number of other works, including Qin et al. [29], Wagner [30], and Pastor-Satorras et al. [31], as well as in studies of the properties and evolution of such networks. Another related work, described by Wuchty [32], considered graphs obtained by assigning a node to every protein domain (or module) and an edge whenever two such domains are found in the same protein.

The important problem of determining protein function has been addressed from the perspective of networks of physical interaction by Vazquez et al. [33]. Their method is based on minimizing the number of interacting proteins with different categories, so that the function estimation can be performed on a global scale while considering the entire connectivity of the protein network. The obtained results corroborate the validity of using protein-protein interaction networks as a means of inferring protein function, despite the unavoidable imperfections and incompleteness of protein networks.

The analysis of gene-expression networks in terms of embedded complex logistic maps (ECLM), a hybrid method blending some concepts from wavelets and coupled logistic maps, has been reported by Shaw [34]. That study considered 112 genes collected at nine different time instants along 25 days, with each time point being fitted to an ECLM model with high Pearson correlation coefficient, and the connections between genes were determined by considering models with high pairwise correlation. The obtained connections were interpreted as following scale-free behavior in both topology and dynamics.

A work by Bumble et al. [35] suggests that
the study of pathways of network syntheses of genes, metabolism, and proteins should be extended to the investigation of the causes and treatment of diseases. Their approach involves methods capable of yielding, for a specific set of candidate reactions, a complete metabolic pathway network. Interesting results are obtained by investigating qualitative attributes, including relationships regarding the connectivity between vertices and the strength of connections, the relationship of interaction energies and chemical potentials with the coordination number of the lattice models, and how the stability of the networks is related to their topology.

An interesting approach to analyzing the amino acid sequence of a protein in terms of successively overlapping strings of length K has been described by Hao et al. [36]. The strings of amino acids are represented as graphs by associating each possible subsequence of length K with a graph node, and having the edges represent the observed successive transitions between subsequences. Their investigation targeted the reconstruction of the original sequences from the overlapping-string networks, which can be approached by counting the number of Eulerian loops (i.e., cyclic sequences of connected edges that are followed without repetition). More specifically, the sequences are reconstructed while starting with the same initial subsequence, using each of the subsequences the same number of times as observed in the original data, and respecting a fixed sequence length. It was therefore verified that the reconstruction is unique for K ≥ … for the majority of the considered networks (PDB.SEQ database [37]).

The present work addresses co-occurrence strings of amino acids (or any other basic biological element) similar to the scheme described in the previous paragraph, but here the subsequences do not necessarily overlap, and the number of times a
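The overlapping-string representation described above for Hao et al. [36] can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy sequence, and the use of transition counts as edge weights are assumptions made here for illustration only. The thresholding step follows the definition of KT in Section 10.2.2.1, where an edge is kept when its weight is larger than or equal to T.

```python
from collections import Counter


def overlapping_string_graph(sequence, k):
    """Count transitions between successive overlapping strings of length k.

    Nodes are the length-k subsequences; the returned Counter maps each
    directed edge (u, v) to the number of times v immediately follows u.
    Using counts as weights is an illustrative assumption.
    """
    words = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return Counter(zip(words, words[1:]))


def threshold_edges(weights, t):
    """Return the edges whose weight is >= t (the nonzero entries of KT)."""
    return {edge for edge, w in weights.items() if w >= t}


# Hypothetical toy sequence in one-letter amino acid codes (not real data).
toy_sequence = "MKLLAAMKLL"
weights = overlapping_string_graph(toy_sequence, k=2)
kernel = threshold_edges(weights, t=2)

for (u, v), w in sorted(weights.items()):
    print(f"{u} -> {v}: {w}")
print("edges kept for T = 2:", sorted(kernel))
```

With K = 2 the toy sequence yields nine overlapping strings and eight observed transitions; two of these transitions occur twice and are therefore the only edges that survive the threshold T = 2.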
FIGURE 10.4 (continued) Panel (c), plotted against the weight threshold T.

The indegree and outdegree densities are shown in Figure 10.12(a) and Figure 10.12(b), respectively, for T = …. Both resulting node-degree densities were again similar to one another, presenting a plateau for … < log(k) < … followed by a moderate decrease. The self-connections between nodes representing subsequences of two identical amino acids are given in Table 10.2. The initial kernel was also identified for T = 22, with the obtained digraph shown in Figure 10.13. The dominant amino acids were L and A.

10.6 DISCUSSION

Despite the different number of proteins and overall amino acid sequence lengths available for each of the three species, the clustering coefficient, average length, and maximum cluster size are determined from the respective adjacency matrices (not the weights) and are therefore more significant statistically, so that we can attempt a comparison between such measurements in the case of zebra fish and Xenopus. It is clear from Figure 10.4 that, as expected, the average node degree of the graph KT decreases monotonically with the threshold value T, while the opposite happens for QT. The abrupt way in which the average node degree varies for the thresholded and complementary matrices suggests that a kind of phase transition (critical phenomenon) takes place as the values of T are increased. As shown in Figure 10.5, the average clustering coefficient for KT tends to decrease steadily with the threshold values, undergoing a relatively abrupt transition (near T = 20 for zebra fish), while the clustering coefficient of QT increases even more abruptly near T = 10, suggesting a phase transition also for this measurement. Generally, the local connectivity reaches less than 10% of its maximum value after just one-third of the considered T excursion, which suggests that the network connectivity is dominated by stronger
connections surrounded by much smaller connection weights.

FIGURE 10.5 Average clustering coefficient as a function of the weight threshold T (solid line = KT, dashed line = QT) for (a) zebra fish, (b) Xenopus, and (c) rat data.

The average lengths of KT shown in Figure 10.6 suffer from the typical problem that such distances tend to fall as a consequence of the disappearance of connections. In other words, because nonexistent edges are not considered for the average length calculation, a network containing no connections has null average length, less than that of a fully connected network, for which the average length would be 1 (overlooking self-connections). In any case, the average length presents a sharp discontinuity (near T = 80 for zebra fish, 60 for Xenopus, and 20 for rat), possibly indicating that a large number of edges are cut by thresholds larger than these values. At the same time, the maximum average lengths in each case are similar and relatively small. An abrupt increase of the average length is observed for QT for small values of T, indicating that that matrix indeed suffers an abrupt change of its connections for small threshold values.

Figure 10.7 shows that the maximum cluster size for KT decreases steadily for higher threshold values, as expected. The maximum cluster size for QT remained fixed at 400, confirming that the complementary matrix is highly connected. As indicated in Figure 10.8, Figure 10.10, and Figure
10.12, the node degree densities tend to present two distinct regions: a plateau at the left-hand side, followed by an abruptly descending portion at the right-hand side of the graph. While the indegree and outdegree densities also produced similar profiles for the three species, the respective kernels, identified at different threshold levels (because of the different lengths of the amino acid sequences), were found to be rather different, with distinct pairs of amino acids dominating each kernel. While such a result may be strongly affected by the different amounts of data available for each of the considered species, it may also suggest different fundamental structures for the amino acid sequencing in those animals.

FIGURE 10.6 Average length as a function of the weight threshold T (solid line = KT, dashed line = QT) for (a) zebra fish, (b) Xenopus, and (c) rat data.

10.7 CONCLUDING REMARKS AND FUTURE WORK

This chapter has addressed the promising perspective of using modern complex-network concepts and tools as a means of characterizing, modeling, and analyzing biological sequences, with special attention given to amino acid sequences in proteins. After presenting a brief historic perspective of complex-network research and some of its most representative applications to bioinformatics, the basic concepts of complex networks and respective topological measurements were presented. The problem of characterizing proteins in terms of weighted digraphs obtained from consecutive (with possible overlap) subsequences of amino acids was addressed next, with respect to a specific protein in zebra fish, Xenopus, and rat. This investigation included the calculation of the average node degree, average clustering
coefficient, the average length (in number of edges), and the size of the maximum cluster in the graph for a sequence of threshold values. The obtained curves were found to provide interesting insights about the structure of the overall protein, especially regarding the appearance of critical transitions in several of the considered measurements as T was increased. In addition, kernels were identified for each case, suggesting an interesting basic organization in the amino acid sequences.

TABLE 10.2 Self-Connections of Subsequences Composed of Two Identical Amino Acids
Subsequence, Number of Self-Connections (Zebra fish), Number of Self-Connections (Xenopus), Number of Self-Connections (Rat)
AA RR DD NN CC EE QQ GG HH II LL KK MM FF PP SS TT WW YY VV
274 85 216 23 14 467 216 310 67 161 188 24 299 487 52 13 126 41 186 13 293 95 79 48 126 104 176 233 16 19 27 10 11 49 11 21 98 29 0 71 61 0

FIGURE 10.7 Maximum cluster size of KT as a function of the weight threshold T for (a) zebra fish, (b) Xenopus, and (c) rat data.

Despite the different sizes of the amino acid sequences, which imply problems of statistical meaningfulness, some interesting trends have been identified regarding the comparison of the measurements obtained for the three different species, especially the general similarity of the topological properties across the species, whereas completely different kernels and dominant amino acids have been identified in each case. Future extensions
of this work include the consideration of other m, n, and g configurations, the use of additional structural features such as betweenness centrality as well as the ratios suggested in the literature [40, 41], and the identification of the hierarchical backbone of the directed network, as suggested in the literature [39]. It would also be possible to consider the progressive merging of nodes and connected components into the initial kernel to obtain the hierarchical structure underlying the growth of the kernel, with possible applications to the complex problem of protein folding [42]. Finally, it would be interesting to use such measurements to compare proteins (in terms of amino acids and bases) from the same or distinct individuals, as well as to infer the phylogenetic evolution of the proteins. In the case of DNA analysis, the obtained topological measurements can provide a means for distinguishing between coding and noncoding regions.

ACKNOWLEDGMENTS

The author is grateful to Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP, proc. 99/12765-2), Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq, proc. 308231/03-1), and the Human Frontier Science Program for financial support.

FIGURE 10.8 Log-log plot of (a) indegree and (b) outdegree distributions for T = 0, weighted by the intensity of the edges (zebra-fish data). Horizontal axis: Log (k).

FIGURE 10.9 The ten-node kernel obtained for T = 95 for zebra-fish data (nodes ED, DE, DD, EE, AE, EL, SE, EK, LE, and SD). The weights are represented in terms of the edge widths. The maximum and minimum weights are 170 and zero, the latter corresponding to self-connections, as these have been excluded from the matrix used to obtain this picture.

FIGURE 10.10 Log-log plot of (a) indegree and (b) outdegree distributions for T = 0, weighted by the intensity of the edges (Xenopus data). Horizontal axis: Log (k).

FIGURE 10.11 The nine-node kernel obtained for T = 64 for Xenopus data (nodes GA, GE, RG, GL, AG, GP, SG, PG, and PP). The weights are represented in terms of the edge widths.

FIGURE 10.12 Log-log plot of (a) indegree and (b) outdegree distributions for T = 0, weighted by the intensity of the edges (rat data). Horizontal axis: Log (k).

FIGURE 10.13 The ten-node kernel obtained for T = 2 for the rat data (nodes GL, EA, LA, AL, AA, LG, LL, VL, LF, and LS). The weights are represented in terms of the edge widths.

REFERENCES

1. Baldi, P. and Brunak, S., Bioinformatics, MIT Press,
Cambridge, MA, 2001.
2. da F. Costa, L., Signal processing in bioinformatics, IEEE Proc. Digital Signal Process. Conf., New Jersey, 2002, pp. 23–27.
3. Durbin, R., Eddy, S., Krogh, A., and Mitchison, G., Biological Sequence Analysis, Cambridge University Press, Cambridge, U.K., 1998.
4. Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K., and Watson, J.D., Molecular Biology of the Cell, 3rd ed., Garland Publishing, New York, 1994.
5. Garrett, R.H. and Grisham, C.M., Biochemistry, Saunders College Publishing, Fort Worth, TX, 1995.
6. Rapoport, A., Contribution to the theory of random and biased nets, Bull. Math. Biophys., 19, 257–277, 1957.
7. Erdös, P. and Rényi, A., On the evolution of random graphs, Publicationes Mathematicae, 6, 290–297, 1959.
8. Watts, D.J. and Strogatz, S.H., Collective dynamics of small-world networks, Nature, 393, 440–442, 1998.
9. Albert, R., Jeong, H., and Barabási, A.-L., The diameter of the world-wide web, Nature, 401, 130–131, 1999.
10. Albert, R. and Barabási, A.-L., Statistical mechanics of complex networks, Rev. Mod. Phys., 74, 47–97, 2002.
11. Dorogovtsev, S.N. and Mendes, J.F.F., Evolution of networks, Adv. Phys., 51, 1079–1187, 2002.
12. Newman, M.E.J., The structure and function of complex networks, SIAM Rev., 45, 167–256, 2003.
13. da F. Costa, L. and Cesar, R.M., Jr., Shape Analysis and Classification: Theory and Practice, CRC Press, Boca Raton, FL, 2001.
14. Aldous, J.M. and Wilson, R.J., Graphs and Applications: An Introductory Approach, Springer-Verlag, London, 2000.
15. West, D.B., Introduction to Graph Theory, Prentice Hall, Upper Saddle River, NJ, 2001.
16. Harary, F., Graph Theory, Addison-Wesley, Reading, MA, 1995.
17. Bollobás, B., Modern Graph Theory, Springer-Verlag, Heidelberg, 1998.
18. Holme, P., Huss, M., and Jeong, H., Subnetwork hierarchies of biochemical pathways, Bioinformatics, 19, 532–538, 2003.
19. Alon, N. and Spencer, J.H., The Probabilistic Method, Wiley Interscience, New York, 2000.
20. Bollobás, B., Random Graphs, Cambridge University Press, Cambridge, U.K., 2001.
21. Bose, I., Biological networks, available on-line at http://arxiv.org/abs/cond-mat/0202192, last accessed March 2005.
22. Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N., and Barabási, A.-L., The large-scale organization of metabolic networks, Nature, 407, 651–654, 2000.
23. Wagner, A. and Fell, D.A., The small world inside large metabolic networks, Proc. R. Soc. London B, 268, 1803–1810, 2001.
24. Chung, F., Lu, L., Dewey, T.G., and Galas, D.J., Duplication models for biological networks, Journal of Computational Biology, 10(5), 677–688, 2003.
25. Aiello, W., Chung, F., and Lu, L., in Proc. 32nd Annu. ACM Symp. Theory Computing, 171–180, 2000.
26. Girvan, M. and Newman, M.E.J., Community structure in social and biological networks, Proc. Nat. Acad. Sci. USA, 99, 7821–7826, 2002.
27. Kuo, P.D. and Banzhaf, W., Small world and scale-free network topologies in an artificial regulatory network model, Journal of Biological Physics and Chemistry, 4, 85–92, 2004.
28. Jeong, H., Mason, S.P., Barabási, A.-L., and Oltvai, Z.N., Lethality and centrality in protein networks, Nature, 411, 41–42, 2001.
29. Qin, H., Lu, H.S.S., Wu, W.B., and Li, W.H., Evolution of the yeast protein interaction network, Proc. Nat. Acad. Sci., 100, 12,820–12,824, 2003.
30. Wagner, A., How large protein interaction networks evolve, Proceedings of the Royal Society of London Series B, 270, 457–466, 2003.
31. Pastor-Satorras, R., Smith, E., and Solé, R.V., Evolving protein interaction networks through gene duplication, J. Theor. Biol., 222, 99–210, 2003.
32. Wuchty, S., Scale-free behavior in protein domain networks, Mol. Biol. Evol., 18, 1697–1702, 2001.
33. Vazquez, A., Flammini, A., Maritan, A., and Vespignani, A., Global protein function prediction in protein-protein interaction networks, Nat. Biotech., 21, 697–700, 2003.
34. Shaw, S., Evidence of scale-free topology and dynamics in gene regulatory networks, in Proc. ISCA 12th International Conference on Intelligent and Adaptive Systems and Software Engineering, 37–40, 2003 (ISBN 1-880843-47-1).
35. Bumble, S., Friedler, F., and Fan, L.T., A toy model for comparative phenomenon in molecular biology and the utilization of biochemical applications of PNS in genetic applications, available on-line at http://arxiv.org/abs/cond-mat/0304348.
36. Hao, B., Xie, H., and Zhang, S., Compositional representation of protein sequence and the number of Eulerian loops, available on-line at http://arxiv.org/abs/physics/0103028.
37. PDB.SEQ, A Collection of SWISS-PROT Entries; available on-line at http://www.expasy.org/sprot.
38. da F. Costa, L., What's in a name?, International Journal of Modern Physics C, 15, 371–379, 2004.
39. da F. Costa, L., The hierarchical backbone of complex networks, Physical Review Letters, 93 (9), paper 98702, 4p., 2004.
40. da F. Costa, L., L-percolations of complex networks, Physical Review E-Statistical Physics, Plasmas, Fluids and Related Interdisciplinary Topics, 70, paper 056106, 8p., 2004.
41. da F. Costa, L., Reinforcing the resilience of complex networks, Physical Review E-Statistical Physics, Plasmas, Fluids and Related Interdisciplinary Topics, 69, paper 066127, 7p., 2004.
42. Crescenzi, P., Goldman, D., Papadimitriou, C., Piccolboni, A., and Yannakakis, M., On the complexity of protein folding, in Annu. Conf. Res. Computational Molecular Biol., ACM, New York, 1998, pp. 61–62.