Here $NumConstrainedPaths(e, i, j)$ refers to the number of (global) shortest paths between $i$ and $j$ which pass through $e$, and $NumShortPaths(i, j)$ refers to the number of shortest paths between $i$ and $j$. Note that the value of $NumConstrainedPaths(e, i, j)$ may be 0 if none of the shortest paths between $i$ and $j$ contain $e$. The algorithm ranks the edges in order of their betweenness and deletes the edge with the highest score. The betweenness coefficients are then recomputed, and the process is repeated. The connected components which remain after repeated deletion form the natural clusters. A variety of termination criteria (e.g., fixing the number of connected components) can be used in conjunction with the algorithm.

A key issue is the efficient determination of edge-betweenness centrality. The number of paths between any pair of nodes can be exponentially large, and it would seem that the computation of the betweenness measure would be a key bottleneck. It has been shown in [36] that the network structure index can also be used in order to estimate edge-betweenness centrality effectively by pairwise node sampling.
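As a concrete illustration, the following sketch implements the basic edge-deletion loop with exact betweenness recomputation, using a fixed number of connected components as the termination criterion. The function name and the toy graph are illustrative; a scalable implementation would replace the exact centrality computation with the sampling-based estimate of [36].

```python
import networkx as nx

def betweenness_clustering(G, num_clusters):
    """Repeatedly delete the edge with the highest betweenness until the
    graph breaks into the desired number of connected components."""
    G = G.copy()  # edges are deleted destructively, so work on a copy
    while nx.number_connected_components(G) < num_clusters and G.number_of_edges() > 0:
        # Recompute edge betweenness after every deletion, as described above.
        betweenness = nx.edge_betweenness_centrality(G)
        worst_edge = max(betweenness, key=betweenness.get)
        G.remove_edge(*worst_edge)
    return list(nx.connected_components(G))

# Example: two triangles joined by a single bridge edge (2, 3)
G = nx.Graph([(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)])
print(betweenness_clustering(G, 2))  # -> [{0, 1, 2}, {3, 4, 5}]
```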
2.5 The Spectral Clustering Method

Eigenvector techniques are often used on multi-dimensional data in order to determine the underlying correlation structure in the data. It is natural to ask whether such techniques can also be used for the more general case of graph data. It turns out that this is indeed possible with the use of a method called spectral clustering.

In the spectral clustering method, we make use of the node-node adjacency matrix of the graph. For a graph containing $n$ nodes, let us assume that we have an $n \times n$ adjacency matrix, in which the entry $(i, j)$ corresponds to the weight of the edge between the nodes $i$ and $j$. This essentially corresponds to the similarity between nodes $i$ and $j$. This entry is denoted by $w_{ij}$, and the corresponding matrix is denoted by $W$. This matrix is assumed to be symmetric, since we are working with undirected graphs. Therefore, we assume that $w_{ij} = w_{ji}$ for any pair $(i, j)$. All diagonal entries of the matrix $W$ are assumed to be 0. As discussed earlier, the aim of any node partitioning algorithm is to minimize (a function of) the weights across the partitions. The spectral clustering method constructs this minimization function in terms of the matrix structure of the adjacency matrix, and another matrix which is referred to as the degree matrix.

The degree matrix $D$ is simply a diagonal matrix, in which all off-diagonal entries are zero. The diagonal entry $d_{ii}$ is equal to the sum of the weights of the edges incident on node $i$. In other words, the entry $d_{ij}$ is defined as follows:

\[
d_{ij} = \begin{cases} \sum_{k=1}^{n} w_{ik} & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}
\]

We formally define the Laplacian matrix as follows:

Definition 9.2. (Laplacian Matrix) The Laplacian matrix $L$ is defined by subtracting the weighted adjacency matrix from the degree matrix. In other words, we have:

\[
L = D - W \qquad (9.4)
\]

This matrix encodes the structural behavior of the graph effectively, and its eigenvector behavior can be used in order to determine the important clusters in the underlying graph structure. It can be shown that the Laplacian matrix $L$ is positive semi-definite, i.e., for any $n$-dimensional row vector $f = [f_1 \ldots f_n]$ we have $f \cdot L \cdot f^T \geq 0$. This can be easily shown by expressing $L$ in terms of its constituent entries, which are a function of the corresponding weights $w_{ij}$. Upon expansion, it can be shown that:

\[
f \cdot L \cdot f^T = \frac{1}{2} \cdot \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \cdot (f_i - f_j)^2 \qquad (9.5)
\]

We summarize as follows.

Lemma 9.3. The Laplacian matrix $L$ is positive semi-definite. Specifically, for any $n$-dimensional row vector $f = [f_1 \ldots f_n]$, we have:

\[
f \cdot L \cdot f^T = \frac{1}{2} \cdot \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \cdot (f_i - f_j)^2
\]

At this point, let us examine some interpretations of the vector $f$ in terms of the underlying graph partitioning. Consider the case in which each $f_i$ is drawn from the set $\{0, 1\}$; this determines a two-way partition by labeling each node either 0 or 1. The particular partition to which the node $i$ belongs is defined by the corresponding label. Note that the expansion of the expression $f \cdot L \cdot f^T$ in Lemma 9.3 is simply the sum of the weights of the edges across the partition defined by $f$. Thus, the determination of an appropriate value of $f$ for which the function $f \cdot L \cdot f^T$ is minimized also provides us with a good node partitioning. Unfortunately, it is not easy to determine the discrete values of $f$ which determine this optimum partitioning. Nevertheless, we will see later in this section that even when we restrict $f$ to real values, this provides us with the intuition necessary to create an effective partitioning.

An immediate observation is that the vector $f = [1 \ldots 1]$ is an eigenvector with a corresponding eigenvalue of 0. This follows because Equation 9.5 yields $f \cdot L \cdot f^T = 0$ for this vector, and since $L$ is positive semi-definite, any vector for which the quadratic form is 0 must lie in the null space of $L$, i.e., it must be an eigenvector with eigenvalue 0. This observation can be generalized further in order to determine the number of connected components in the graph. We make the following observation.

Lemma 9.4. The number of (linearly independent) eigenvectors with zero eigenvalue of the Laplacian matrix $L$ is equal to the number of connected components in the underlying graph.

Proof: Without loss of generality, we can order the vertices according to the connected components to which they belong. In this case, the Laplacian matrix takes on a block form, which is illustrated below for the case of three connected components:

\[
L = \begin{bmatrix} L_1 & 0 & 0 \\ 0 & L_2 & 0 \\ 0 & 0 & L_3 \end{bmatrix}
\]

Each of the blocks $L_1$, $L_2$ and $L_3$ is itself the Laplacian of the corresponding component. Therefore, the indicator vector of each component is an eigenvector with corresponding eigenvalue 0. The result follows. □

We observe that connected components are the most obvious examples of clusters in the graph. Therefore, the determination of eigenvectors corresponding to zero eigenvalues provides us with information about this (relatively rudimentary) set of clusters. Broadly speaking, it may not be possible to glean such clean membership behavior from the other eigenvectors. One of the problems is that, other than this particular rudimentary set of eigenvectors (which correspond to the connected components), the components of the other eigenvectors are drawn from the real domain rather than the discrete $\{0, 1\}$ domain. Nevertheless, because of the natural interpretation of $f \cdot L \cdot f^T$ in terms of the weights of the edges across nodes with very different values of $f_i$, it is natural to cluster together nodes for which the values of $f_i$ are, on the average, as similar as possible across any particular eigenvector. A small numerical illustration of the cut interpretation in Lemma 9.3 is given below.
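The following sketch, based on an illustrative four-node weighted graph, builds the matrices $W$, $D$ and $L$, and confirms that $f \cdot L \cdot f^T$ for a 0-1 indicator vector equals the total weight of the edges crossing the corresponding partition.

```python
import numpy as np

# Illustrative weighted adjacency matrix for a 4-node graph:
# edges (0,1)=2, (0,2)=1, (1,2)=3, (2,3)=4
W = np.array([[0., 2., 1., 0.],
              [2., 0., 3., 0.],
              [1., 3., 0., 4.],
              [0., 0., 4., 0.]])
D = np.diag(W.sum(axis=1))   # degree matrix: weighted degrees on the diagonal
L = D - W                    # Laplacian matrix (Equation 9.4)

print(np.allclose(L @ np.ones(4), 0))   # True: the all-ones vector has eigenvalue 0

# Indicator vector placing nodes {0, 1} in one partition and {2, 3} in the other
f = np.array([0., 0., 1., 1.])
print(f @ L @ f)   # 4.0 = total weight of the crossing edges (0,2) and (1,2)
```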
This provides us with the intuition necessary to define an effective spectral clustering algorithm, which partitions the data set into $k$ clusters for any arbitrary value of $k$. The algorithm is as follows:

- Determine the $k$ eigenvectors with the smallest eigenvalues. Note that each eigenvector has as many components as the number of nodes. Let the component of the $j$th eigenvector for the $i$th node be denoted by $p_{ij}$.

- Create a new data set with as many records as the number of nodes. The $i$th record in this data set corresponds to the $i$th node and has $k$ components. The record for this node is simply the set of eigenvector components for that node, which are denoted by $p_{i1} \ldots p_{ik}$.

- Since we would like to cluster nodes with similar eigenvector components, we use any conventional clustering algorithm (e.g., $k$-means) in order to create $k$ clusters from this data set. Note that the main purpose of the approach is to transform the structural clustering problem into a more conventional multi-dimensional clustering problem, which is easier to solve. The particular choice of the multi-dimensional clustering algorithm is orthogonal to the broad spectral approach.

The above algorithm provides a broad framework for spectral clustering. The input parameter for the above algorithm is the number of clusters $k$. In practice, a number of variations are possible in order to tune the quality of the clusters which are found. Some examples are as follows:

- It is not necessary to use the same number of eigenvectors as the input parameter for the number of clusters. In general, one should use at least as many eigenvectors as the number of clusters to be created. However, the exact number of eigenvectors needed for optimum results may vary with the particular data set, and can typically be determined only by experimentation.

- There are other ways of creating normalized Laplacian matrices which can provide more effective results in some situations. Some classic examples of such Laplacian matrices, expressed in terms of the adjacency matrix $W$, the degree matrix $D$ and the identity matrix $I$, are $L_A = I - D^{-1/2} \cdot W \cdot D^{-1/2}$ and $L_B = I - D^{-1} \cdot W$.

More details on the different methods which can be used for effective spectral graph clustering may be found in [9].

2.6 Determining Quasi-Cliques

A different way of determining dense subgraphs in the underlying data is that of determining quasi-cliques. This technique differs from many other partitioning algorithms in that it focuses on definitions which maximize edge densities within a partition, rather than minimizing edge densities across partitions. A clique is a graph in which every pair of nodes is connected by an edge. A quasi-clique is a relaxation of this concept, and is defined by imposing a lower bound on the degree of each vertex in the given set of nodes. Specifically, a $\gamma$-quasi-clique is defined as follows:

Definition 9.5. A $k$-graph ($k \geq 1$) $G$ is a $\gamma$-quasi-clique if the degree of each node in the corresponding sub-graph of vertices is at least $\gamma \cdot (k - 1)$.

The value of $\gamma$ always lies in the range $(0, 1]$. We note that by choosing $\gamma = 1$, this definition reverts to that of standard cliques. Choosing lower values of $\gamma$ allows for relaxations which are more realistic for real applications. This is because we rarely encounter complete cliques in real applications, and at least some edges within a dense subgraph are usually missing. A minimal membership check based on this definition is sketched below.
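As a small illustration of Definition 9.5, the following sketch checks whether a given set of nodes induces a $\gamma$-quasi-clique. The function name and the toy adjacency list are illustrative choices.

```python
def is_quasi_clique(adj, nodes, gamma):
    """Return True if the subgraph induced by `nodes` is a gamma-quasi-clique,
    i.e., every vertex has induced degree at least gamma * (k - 1)."""
    k = len(nodes)
    node_set = set(nodes)
    threshold = gamma * (k - 1)
    return all(sum(1 for u in adj[v] if u in node_set) >= threshold for v in nodes)

# Toy adjacency list: four nodes with the single edge (0, 3) missing
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
print(is_quasi_clique(adj, [0, 1, 2, 3], gamma=1.0))  # False: not a complete clique
print(is_quasi_clique(adj, [0, 1, 2, 3], gamma=0.6))  # True: every degree >= 0.6 * 3
```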
A vertex is said to be critical if its degree in the corresponding subgraph is equal to $\lceil \gamma \cdot (k - 1) \rceil$, i.e., the smallest integer which is at least $\gamma \cdot (k - 1)$.

The earliest piece of work on this problem is [1], which uses a greedy randomized adaptive search procedure (GRASP) to find a quasi-clique of maximum size. A closely related problem is that of finding frequently occurring cliques in multiple data sets. In other words, when multiple graphs are obtained from different data sets, some dense subgraphs may occur frequently together in the different data sets. Such graphs help in determining important dense patterns of behavior in different data sources. Such techniques find applicability in mining important patterns in graphical representations of customers. The techniques are also helpful in mining cross-graph quasi-cliques in gene expression data. A description of the application of the technique to the problem of gene-expression data may be found in [33]. An efficient algorithm for determining cross-graph quasi-cliques was proposed in [32]. The main restriction of the work proposed in [32] is that the support threshold for the algorithms is assumed to be 100%. This restriction has been relaxed in subsequent work [43], which examines the problem of mining frequent closed quasi-cliques from a graph database with arbitrary support thresholds. In [31], a multi-graph version of the quasi-clique problem was explored. However, instead of finding the complete set of quasi-cliques in the graph, the authors proposed an approximation algorithm to cover all the vertices in the graph with a minimum number of $p$-quasi-complete subgraphs. Thus, this technique is more suited to summarization of the overall graph with a smaller number of densely connected subgraphs.

2.7 The Case of Massive Graphs

A closely related problem is that of dense subgraph determination in massive graphs. This problem is frequently encountered in large graph data sets. For example, the problem of determining large dense subgraphs of web graphs was studied in [5, 22]. A min-hash approach was first used in [5] in order to determine syntactically related clusters. This paper also introduced the advantages of using a min-hash approach in the context of graph clustering. Subsequently, the approach was generalized to the case of large dense graphs with the use of recursive application of the basic min-hash algorithm.

The broad idea in the min-hash approach is to represent the outlinks of a particular node as a set. Two nodes are considered similar if they share many outlinks. Thus, consider a node $A$ with an outlink set $S_A$ and a node $B$ with outlink set $S_B$. Then the similarity between the two nodes is defined by the Jaccard coefficient $\frac{|S_A \cap S_B|}{|S_A \cup S_B|}$. We note that explicit computation of this coefficient over all pairs of nodes can be computationally inefficient. Rather, a min-hash approach is used in order to perform the estimation. This min-hash approach is as follows. We sort the universe of nodes in a random order. For any set of nodes in this random sorted order, we determine the first node $First(A)$ for which an outlink exists from $A$ to $First(A)$. We also determine the first node $First(B)$ for which an outlink exists from $B$ to $First(B)$.
It can be shown that the probability that $First(A)$ and $First(B)$ are the same node is exactly equal to the Jaccard coefficient; the indicator of this match therefore provides an unbiased estimate of the Jaccard coefficient. By repeating this process over different permutations of the universe of nodes, it is possible to estimate the Jaccard coefficient accurately. This is done by using a constant number $c$ of permutations of the node order. The actual permutations are implemented by associating $c$ different randomized hash values with each node. This creates $c$ sets of hash values of size $n$. The sort order of any particular set of hash values defines the corresponding permutation order. For each such permutation, we store the minimum node index of the outlink set. Thus, for each node, there are $c$ such minimum indices. This means that, for each node, a fingerprint of size $c$ can be constructed. By comparing the fingerprints of two nodes, the Jaccard coefficient can be estimated. This approach can be further generalized by considering every $s$-element subset contained entirely within $S_A$ and $S_B$. Thus, the above description is the special case in which $s$ is set to 1. By using different values of $s$ and $c$, it is possible to design an algorithm which distinguishes between two sets that are above or below a certain threshold of similarity.

The overall technique in [22] first generates a set of $c$ shingles of size $s$ for each node. The process of generating the $c$ shingles is straightforward. Each node is processed independently. We use the min-wise hash function approach in order to generate subsets of size $s$ from the outlinks at each node. This results in $c$ subsets for each node. Thus, for each node, we have a set of $c$ shingles. If the graph contains a total of $n$ nodes, the total size of this shingle fingerprint is $n \times c \times sp$, where $sp$ is the space required for each shingle. Typically $sp$ will be $O(s)$, since each shingle contains $s$ nodes. For each distinct shingle thus created, we can create a list of the nodes which contain it. In general, we would like to determine groups of shingles which occur in a large number of common nodes. In order to do so, the method in [22] performs a second-order shingling, in which meta-shingles are created from the shingles. This further compresses the graph into a data structure of size $c \times c$, which is essentially a constant-size data structure. We note that these meta-shingles have the property that they occur in a large number of common nodes. The dense subgraphs can then be extracted from these meta-shingles. More details on this approach may be found in [22].

The min-hash approach is frequently used for graphs which are extremely large and cannot be easily processed by conventional quasi-clique mining algorithms. Since the min-hash approach summarizes the massive graph in a small amount of space, it is particularly useful for leveraging this small-space representation in a variety of query-processing techniques. Examples of such applications include the web graph and social networks. In the case of web graphs, we wish to determine closely connected clusters of web pages with similar content. The related problem in social networks is that of finding closely related communities. The min-hash approach discussed in [5, 22] helps to achieve precisely this goal, because the summarized min-hash structure can be processed in a variety of ways in order to extract the important communities. More details of this approach may be found in [5, 22]. A minimal sketch of the basic fingerprint construction and Jaccard estimation is shown below.
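The following sketch illustrates the special case $s = 1$: each of the $c$ permutations is simulated by a seeded hash value per node, the fingerprint of a node stores the minimum under each permutation, and the fraction of matching fingerprint positions estimates the Jaccard coefficient. The outlink sets and the number of permutations are illustrative choices.

```python
import random

def minhash_fingerprint(outlinks, seeds):
    """One minimum per simulated permutation (the seeded hash value
    plays the role of the random node ordering)."""
    return [min(outlinks, key=lambda node: hash((seed, node))) for seed in seeds]

def estimated_jaccard(fp_a, fp_b):
    """Fraction of fingerprint positions on which the two nodes agree."""
    return sum(x == y for x, y in zip(fp_a, fp_b)) / len(fp_a)

# Illustrative outlink sets of two nodes A and B (true Jaccard = 3/7)
S_A = {1, 2, 3, 4, 5}
S_B = {3, 4, 5, 6, 7}

c = 500                                  # number of simulated permutations
seeds = [random.random() for _ in range(c)]
fp_a = minhash_fingerprint(S_A, seeds)
fp_b = minhash_fingerprint(S_B, seeds)
print(estimated_jaccard(fp_a, fp_b))     # close to 3/7 (about 0.43)
```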
3. Clustering Graphs as Objects

In this section, we will discuss the problem of clustering entire graphs in a multi-graph database, rather than the node clustering problem within a single graph. Such situations are often encountered in the context of XML data, since each XML document can be regarded as a structural record, and it may be necessary to create clusters from a large number of such objects. We note that XML data is quite similar to graph data in terms of how the data is organized structurally. The attribute values can be treated as graph labels and the corresponding semi-structural relationships as the edges. It has been shown in [2, 10, 28, 29] that this structural behavior can be leveraged in order to create effective clusters.

3.1 Extending Classical Algorithms to Structural Data

Since we are examining entire graphs in this version of the clustering problem, the problem simply boils down to that of clustering arbitrary objects, where the objects in this case have structural characteristics. Many of the conventional algorithms discussed in [24] (such as $k$-means type partitional algorithms and hierarchical algorithms) can be extended to the case of graph data. The main changes required in order to extend these algorithms are as follows:

- Most of the underlying classical algorithms typically use some form of distance function in order to measure similarity. Therefore, we need appropriate measures in order to define similarity (or distances) between structural objects.

- Many of the classical algorithms (such as $k$-means) use representative objects such as centroids in critical intermediate steps. While this is straightforward in the case of multi-dimensional objects, it is much more challenging in the case of graph objects. Therefore, appropriate methods need to be designed in order to create representative objects. Furthermore, in some cases it may be difficult to create representatives in terms of single objects. We will see that it is often more robust to use representative summaries of the underlying objects.

There are two main classes of conventional techniques which have been extended to the case of structural objects. These techniques are as follows:

Structural Distance-based Approach: This approach computes structural distances between documents and uses them in order to compute clusters of documents. One of the earliest works on clustering tree-structured data is the XClust algorithm [28], which was designed to cluster XML schemas for efficient integration of large numbers of Document Type Definitions (DTDs) of XML sources. It adopts the agglomerative hierarchical clustering method, which starts with clusters of single DTDs and gradually merges the two most similar clusters into one larger cluster. The similarity between two DTDs is based on their element similarity, which can be computed according to the semantics, structure, and context information of the elements in the corresponding DTDs. One of the shortcomings of the XClust algorithm is that it does not make full use of the structure information of the DTDs, which is quite important in the context of clustering tree-like structures. The method in [7] computes similarity measures based on the structural edit-distance between documents. This edit-distance is used in order to compute the distances between clusters of documents. S-GRACE [29] is a hierarchical clustering algorithm.
In [29], an XML document is converted to a structure graph (or s-graph), and the distance between two XML documents is defined according to the number of common element-subelement relationships, which can capture structural similarity better than the tree edit distance in some cases [29].

Structural Summary Based Approach: In many cases, it is possible to create summaries from the underlying documents. These summaries are used for creating groups of documents which are similar to these summaries. The first summary-based approach for clustering XML documents was presented in [10], where the XML documents are modeled as rooted, ordered, labeled trees, and a framework for clustering XML documents by using structural summaries of trees is presented. The aim is to improve algorithmic efficiency without compromising cluster quality. A second approach for clustering XML documents is presented in [2]. This technique is a partition-based algorithm. The primary idea in this approach is to use frequent-pattern mining algorithms in order to determine summaries of frequent structures in the data. The technique uses a $k$-means type approach in which each cluster center comprises a set of frequent patterns which are local to the partition for that cluster. The frequent patterns are mined using the documents assigned to a cluster center in the last iteration. The documents are then re-assigned to cluster centers based on the average similarity between each document and the newly created cluster centers derived from the local frequent patterns. In each iteration, the document assignments and the mined frequent patterns are refined, until the cluster centers and document partitions converge to a final state. It has been shown in [2] that such a structural summary based approach is significantly superior to the similarity function based approach presented in [7]. The method is also superior to the structural approach in [10] because of its use of more robust representations of the underlying structural summaries. Since the most recent algorithm is the structural summary method discussed in [2], we will discuss it in more detail in the next section.

3.2 The XProj Approach

In this section, we will present XProj, which is a summary-based approach for clustering of XML documents. The pseudo-code for clustering of XML documents is illustrated in Figure 9.1. The primary approach is to use a substructural modification of a partition-based approach, in which the clusters of documents are built around groups of representative sub-structures. Thus, instead of the single representative object of a traditional partition-based algorithm, we use a set of representative sub-structures for each cluster of the structural clustering algorithm. Initially, the document set $\mathcal{D}$ is randomly divided into $k$ partitions of equal size, and the sets of sub-structure representatives are generated by mining frequent sub-structures of size $l$ from these partitions. In each iteration, the sub-structural representatives (of a particular size, and a particular support level) of a given partition are the frequent structures from that partition. These structural representatives are used to partition the document collection, and vice-versa. We note that this can be a potentially expensive operation because of the determination of frequent substructures; in the next section, we will illustrate an interesting way to speed it up. In order to actually partition the document collection, we calculate the number of nodes in a document which are covered by each sub-structural set representative. A larger coverage corresponds to a greater level of similarity. A simplified sketch of this coverage-based assignment is given below.
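The following is a highly simplified sketch of the coverage-based assignment, assuming that a document is given as a set of labeled nodes and that each representative substructure is approximated by the set of labels it contains; a full implementation would match the frequent substructures against the document tree (or its sequential transformation) rather than against bare labels. All names and the toy data are illustrative.

```python
def coverage(doc_nodes, representative_set):
    """Number of document nodes covered by at least one substructure in the set.
    A document is a set of (node_id, label) pairs; each substructure is
    approximated here by the set of labels it contains."""
    covered = set()
    for substructure_labels in representative_set:
        covered |= {nid for nid, label in doc_nodes if label in substructure_labels}
    return len(covered)

def assign_document(doc_nodes, representative_sets):
    """Assign the document to the representative set with the largest coverage."""
    return max(range(len(representative_sets)),
               key=lambda i: coverage(doc_nodes, representative_sets[i]))

# Toy example: one document with four labeled nodes and two representative sets
doc = {(0, "article"), (1, "title"), (2, "author"), (3, "price")}
S_1 = [{"article", "title"}, {"author"}]   # covers nodes 0, 1, 2
S_2 = [{"price"}]                          # covers node 3
print(assign_document(doc, [S_1, S_2]))    # -> 0
```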
The aim of this approach is that the algorithm will determine the most important localized sub-structures over time. This is analogous to the projected clustering approach, which determines the most important localized projections over time. The overall procedure is summarized in Figure 9.1:

    Algorithm XProj(Document Set: D, Minimum Support: min_sup,
                    Structural Size: l, NumClusters: k)
    begin
      Initialize representative sets S_1 ... S_k;
      while (convergence criterion = false)
      begin
        Assign each document D in the collection to one of the sets in
          {S_1 ... S_k} using the coverage-based similarity criterion;
        /* Let the corresponding document partitions be denoted by M_1 ... M_k; */
        Compute the frequent substructures of size l from each set M_i
          using the sequential transformation paradigm;
        if (|M_i| x min_sup) >= 1, set S_i to the frequent substructures
          of size l from M_i;
        /* If (|M_i| x min_sup) < 1, S_i remains unchanged; */
      end;
    end

Figure 9.1. The Sub-structural Clustering Algorithm (High Level Description)

Once the partitions have been computed, we use them to re-compute the representative sets. These re-computed representative sets are defined as the frequent sub-structures of size $l$ from each partition. Thus, the representative set $\mathcal{S}_i$ is defined as the set of sub-structures from the partition $\mathcal{M}_i$ which have size $l$, and which have absolute support no less than $(|\mathcal{M}_i| \times min\_sup)$. Thus, the newly defined representative set $\mathcal{S}_i$ corresponds to the local structures which are defined from the partition $\mathcal{M}_i$. Note that if the partition $\mathcal{M}_i$ contains too few documents, so that $(|\mathcal{M}_i| \times min\_sup) < 1$, the representative set $\mathcal{S}_i$ remains unchanged.

Another interesting observation is that the similarity function between a document and a given representative set is defined by the number of nodes in the document which are covered by that set. This makes the similarity function more sensitive to the underlying projections in the document structures, and leads to more robust similarity calculations in most circumstances.

In order to ensure termination, we need to design a convergence criterion. One useful criterion is based on the increase of the average sub-structural self-similarity over the $k$ partitions of documents. Let the partitions of documents with respect to the current iteration be $\mathcal{M}_1 \ldots \mathcal{M}_k$, and let their corresponding frequent sub-structures of size $l$ be $\mathcal{S}_1 \ldots \mathcal{S}_k$ respectively. Then, the average sub-structural self-similarity at the end of the current iteration