A subgraph 𝐺′ of the graph 𝐺 = (𝑉, 𝐸) is an (additive) (𝛼, 𝛽)-spanner of 𝐺 if for every pair of vertices 𝑢, 𝑣 ∈ 𝑉, 𝑑𝑖𝑠𝑡_{𝐺′}(𝑢, 𝑣) ≤ 𝛼 ⋅ 𝑑𝑖𝑠𝑡_𝐺(𝑢, 𝑣) + 𝛽.

We describe the algorithm of [21] and its subroutine as follows: first the distributed version of the algorithm, and then its adaptation to the streaming model. As observed in [21], leaving space complexity aside, many distributed algorithms with time complexity 𝑇 translate directly into streaming algorithms that use 𝑇 passes. For example, a straightforward streaming adaptation of a synchronous distributed algorithm for constructing a BFS tree is the following: in each pass over the input stream, the BFS tree grows by one level, so an exploration of 𝑑 levels results in 𝑑 passes over the input stream. On the other hand, there are cases in which the running time of a synchronous algorithm does not translate directly into the number of passes of the streaming adaptation. In the BFS example, if two BFS trees are constructed in parallel, some edges may be explored by both constructions; the resulting congestion may increase the running time of the distributed algorithm, whereas a streaming algorithm can handle both explorations of the same edge in a single pass over the stream.

We follow the notation used in [21]. Let 𝑑𝑖𝑎𝑚(𝐺) denote the diameter of the graph 𝐺, i.e., 𝑑𝑖𝑎𝑚(𝐺) = max_{𝑢,𝑣∈𝑉} 𝑑𝑖𝑠𝑡_𝐺(𝑢, 𝑣). Given a subset 𝑉′ ⊆ 𝑉, denote by 𝐸_𝐺(𝑉′) the set of edges of 𝐺 induced by 𝑉′, i.e., 𝐸_𝐺(𝑉′) = {(𝑢, 𝑤) ∣ (𝑢, 𝑤) ∈ 𝐸 and 𝑢, 𝑤 ∈ 𝑉′}, and let 𝐺(𝑉′) = (𝑉′, 𝐸_𝐺(𝑉′)). Denote by Γ_𝑘(𝑣, 𝑉′) the 𝑘-neighborhood of vertex 𝑣 in the graph 𝐺(𝑉′), i.e., Γ_𝑘(𝑣, 𝑉′) = {𝑢 ∣ 𝑢 ∈ 𝑉′ and 𝑑𝑖𝑠𝑡_{𝐺(𝑉′)}(𝑢, 𝑣) ≤ 𝑘}. The diameter of a subset 𝑉′ ⊆ 𝑉, denoted 𝑑𝑖𝑎𝑚(𝑉′), is the maximum pairwise distance in 𝐺 between vertices of 𝑉′. For a collection ℱ of subsets 𝑉′ ⊆ 𝑉, let 𝑑𝑖𝑎𝑚(ℱ) = max_{𝑉′∈ℱ} 𝑑𝑖𝑎𝑚(𝑉′).

The spanner construction uses graph covers. For a graph 𝐺 = (𝑉, 𝐸) and two integers 𝜅, 𝑊 > 0, a (𝜅, 𝑊)-cover [5, 11, 18] 𝒞 is a collection of not necessarily disjoint subsets (or clusters) 𝐶 ⊆ 𝑉 that satisfy the following conditions. (1) ∪_{𝐶∈𝒞} 𝐶 = 𝑉. (2) 𝑑𝑖𝑎𝑚(𝒞) = 𝑂(𝜅𝑊). (3) The size of the cover, 𝑠(𝒞) = ∑_{𝐶∈𝒞} ∣𝐶∣, is 𝑂(𝑛^{1+1/𝜅}), and furthermore, every vertex belongs to polylog(𝑛) ⋅ 𝑛^{1/𝜅} clusters. (4) For every pair of vertices 𝑢, 𝑣 ∈ 𝑉 that are at distance at most 𝑊 from one another, there exists a cluster 𝐶 ∈ 𝒞 that contains both vertices, along with the shortest path between them. Note that many constructions of (𝜅, 𝑊)-covers also build one BFS tree for each cluster in the cover as a by-product; the BFS tree spans the whole cluster and is rooted at one vertex of the cluster.

Algorithm 13.5 shows the construction [11, 19, 21] of (𝜅, 𝑊)-covers; it is used as a subroutine in the spanner construction. Algorithm 13.5 builds a (𝜅, 𝑊)-cover in 𝜅 phases. A vertex 𝑣 of 𝐺 is called covered if there is a cluster 𝐶 ∈ 𝒞 such that Γ_𝑊(𝑣, 𝑉) ⊆ 𝐶. Let 𝑈_𝑖 be the set of uncovered vertices at phase 𝑖. At the beginning, 𝑈_1 = 𝑉; at each phase 𝑖, a subset of vertices becomes covered and is removed from 𝑈_𝑖.

Algorithm 13.5: Cover
Input: a graph 𝐺 = (𝑉, 𝐸) and two positive integer parameters 𝜅 and 𝑊.
1. 𝑈_1 ← 𝑉.
2. for 𝑖 = 1, 2, . . . , 𝜅 do
3.   Include each vertex 𝑣 ∈ 𝑈_𝑖 independently at random, with probability 𝑝_𝑖 = min{1, (𝑛^{𝑖/𝜅}/𝑛) ⋅ log 𝑛}, in the set 𝑆_𝑖 of phase 𝑖.
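As a point of reference, a minimal in-memory sketch of the phase structure of Algorithm 13.5 might look as follows. It assumes the whole adjacency list fits in memory (so each bounded-depth BFS is done directly rather than with the multi-pass streaming construction discussed next), `adj` maps a vertex to its neighbors, and all function and variable names are ours, not from [21].

```python
import math
import random
from collections import deque

def bounded_bfs(adj, allowed, source, depth):
    """Vertices of `allowed` within `depth` hops of `source` in the induced subgraph."""
    dist, queue = {source: 0}, deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == depth:
            continue
        for w in adj[u]:
            if w in allowed and w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

def build_cover(adj, kappa, W):
    """In-memory sketch of Algorithm 13.5: a list of clusters forming a (kappa, W)-cover."""
    n = len(adj)
    uncovered = set(adj)                              # U_1 = V
    clusters = []
    for i in range(1, kappa + 1):
        p_i = min(1.0, (n ** (i / kappa) / n) * math.log(n))
        centers = [v for v in uncovered if random.random() < p_i]   # S_i
        removed = set()                               # R_i: union of the core sets
        for s in centers:
            cluster = bounded_bfs(adj, uncovered, s, 2 * (kappa - i + 1) * W)
            core = bounded_bfs(adj, uncovered, s, 2 * (kappa - i) * W)
            clusters.append(cluster)
            removed |= core
        uncovered -= removed                          # U_{i+1} = U_i \ R_i
    return clusters
```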
4.   Each vertex 𝑠 ∈ 𝑆_𝑖 constructs a cluster by growing a BFS tree of depth 𝑑_{𝑖−1} = 2((𝜅 − 𝑖) + 1)𝑊 in the graph (𝑈_𝑖, 𝐸(𝑈_𝑖)). We call 𝑠 the center of the cluster and the set Γ_{2(𝜅−𝑖)𝑊}(𝑠, 𝑈_𝑖) the core set of the cluster Γ_{2((𝜅−𝑖)+1)𝑊}(𝑠, 𝑈_𝑖).
5.   Let 𝑅_𝑖 be the union of the core sets of the clusters constructed in step 4. Set 𝑈_{𝑖+1} ← 𝑈_𝑖 ∖ 𝑅_𝑖.
6. end

A streaming version of Algorithm 13.5 is also given in [21]. The streaming version proceeds in 𝜅 phases. In each phase 𝑖, the algorithm passes through the input stream 𝑑_{𝑖−1} times to build a BFS tree 𝜏(𝑣) of depth 𝑑_{𝑖−1} for each selected vertex 𝑣 ∈ 𝑆_𝑖; the cluster and its core set can be computed during the construction of these BFS trees. Note that 𝑑_{𝑖−1} ≤ 2𝜅𝑊 for every 𝑖. Therefore, with high probability, the streaming version of Algorithm 13.5 constructs a (𝜅, 𝑊)-cover using at most 2𝜅^2𝑊 passes over the input stream.

We now describe the distributed algorithm in [21] that constructs the spanner. Given a cluster 𝐶, let 𝒞(𝐶) be the cover constructed for the graph (𝐶, 𝐸_𝐺(𝐶)); for a cluster 𝐶′ ∈ 𝒞(𝐶), we define 𝑃𝑎𝑟𝑒𝑛𝑡(𝐶′) = 𝐶. An execution of the algorithm can be divided into ℓ stages (levels). The original graph is viewed as a cluster on level 0, and the algorithm starts level 1 by constructing a cover for this cluster. Recall that a cover is itself a collection of clusters. The clusters of ∪𝒞(𝐶), where the union is over all the clusters 𝐶 on level 0, are called the clusters on level 1, and we denote the set of these clusters by 𝒞_1. If a cluster 𝐶 ∈ 𝒞_1 satisfies ∣𝐶∣ ≥ ∣𝑃𝑎𝑟𝑒𝑛𝑡(𝐶)∣^{1−𝜈}, we say that 𝐶 is a large cluster on level 1; otherwise, we say that 𝐶 is a small cluster on level 1. We denote by 𝒞^𝐻_1 the set of large clusters on level 1 and by 𝒞^𝐿_1 the set of small clusters on level 1. Note that the cover-construction subroutine (Algorithm 13.5) builds a BFS spanning tree for each cluster in the cover. The algorithm includes all these BFS spanning trees in the spanner and then goes on to interconnect all pairs of clusters in 𝒞^𝐻_1 that are close to each other.

Algorithm 13.6: Additive Spanner Construction
Input: a graph 𝐺 = (𝑉, 𝐸) on 𝑛 vertices and four parameters 𝜅, 𝜈, 𝐷, and Δ, where 𝜅, 𝐷, and Δ are positive integers and 0 < 𝜈 < 1.
1. 𝒞^𝐿_0 ← {𝑉}, 𝒞^𝐻_0 ← ∅.
2. for level 𝑖 = 1, 2, . . . , ℓ = ⌈log_{1/(1−𝜈)} log_Δ 𝑛⌉ do
3.   Cover construction: For all clusters 𝐶 ∈ 𝒞^𝐿_{𝑖−1}, in parallel, construct (𝜅, 𝐷^ℓ)-covers using Algorithm 13.5 (invoked with parameters 𝜅 and 𝑊 = 𝐷^ℓ).
4.   Include the edges of the BFS spanning trees of all the clusters in the spanner. Set 𝒞_𝑖 ← ∪_{𝐶∈𝒞^𝐿_{𝑖−1}} 𝒞(𝐶), 𝒞^𝐻_𝑖 ← {𝐶 ∈ 𝒞_𝑖 ∣ ∣𝐶∣ ≥ ∣𝑃𝑎𝑟𝑒𝑛𝑡(𝐶)∣^{1−𝜈}}, and 𝒞^𝐿_𝑖 ← 𝒞_𝑖 ∖ 𝒞^𝐻_𝑖.
5.   Interconnection: For all clusters 𝐶′ ∈ 𝒞^𝐻_𝑖, in parallel, construct BFS trees in 𝐺(𝐶), where 𝐶 = 𝑃𝑎𝑟𝑒𝑛𝑡(𝐶′). For each cluster 𝐶′, the BFS tree is rooted at the center of the cluster, and the depth of the BFS tree is 2𝐷^ℎ + 𝐷^{ℎ+1}, where ℎ = ⌈log_{1/(1−𝜈)} log_Δ ∣𝑃𝑎𝑟𝑒𝑛𝑡(𝐶′)∣⌉.
6.   For every cluster 𝐶′′ whose center vertex is in the BFS tree, if 𝐶′′ ∈ 𝒞^𝐻_𝑖 and 𝑃𝑎𝑟𝑒𝑛𝑡(𝐶′′) = 𝑃𝑎𝑟𝑒𝑛𝑡(𝐶′), add to the spanner the shortest path between the center of 𝐶′ and the center of 𝐶′′.
7. end
8. Add to the spanner all the edges of the set ∪_{𝐶∈𝒞_{ℓ+1}} 𝐸_𝐺(𝐶).
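The level structure of Algorithm 13.6 can be summarized roughly as in the sketch below. It is only a skeleton under assumed interfaces: `build_cover(cluster, kappa, W)` stands for any routine returning the clusters of a (𝜅, 𝑊)-cover of the induced subgraph (for instance, a variant of the sketch above), and the BFS-tree and interconnection steps are indicated by comments rather than implemented.

```python
def spanner_level_loop(vertices, kappa, nu, D, ell, build_cover):
    """Sketch of the level structure of Algorithm 13.6."""
    small = [set(vertices)]                 # the single level-0 cluster is the whole vertex set
    large_by_level = {}
    for level in range(1, ell + 1):
        clusters = []
        for parent in small:
            for cluster in build_cover(parent, kappa, D ** ell):
                clusters.append((cluster, parent))
                # Algorithm 13.6 also adds the BFS spanning tree of `cluster` to the spanner here.
        large_by_level[level] = [c for c, p in clusters if len(c) >= len(p) ** (1 - nu)]
        small = [c for c, p in clusters if len(c) < len(p) ** (1 - nu)]
        # Interconnection (omitted): add a shortest path between the centers of every pair of
        # close large clusters on this level that share the same parent.
    # After level ell, the edges induced by the remaining small clusters are added verbatim.
    return large_by_level, small
```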
After these interconnections are completed, the algorithm enters level 2. For each cluster in 𝒞^𝐿_1, it constructs a cover, and we call the clusters in each of these covers the clusters on level 2. The union of all the level-2 clusters is denoted by 𝒞_2. If a cluster 𝐶 ∈ 𝒞_2 satisfies ∣𝐶∣ ≥ ∣𝑃𝑎𝑟𝑒𝑛𝑡(𝐶)∣^{1−𝜈}, we say that 𝐶 is a large cluster on level 2; otherwise, we say that 𝐶 is a small cluster on level 2. Again, we denote by 𝒞^𝐻_2 the set of large clusters on level 2 and by 𝒞^𝐿_2 the set of small clusters on level 2. The BFS spanning trees of all the clusters in 𝒞_2 are included in the spanner, and all close pairs of clusters in 𝒞^𝐻_2 are interconnected by the algorithm. The algorithm proceeds in the same fashion at levels 3 and above: at level 𝑖, it constructs covers for the small clusters in 𝒞^𝐿_{𝑖−1} and interconnects all close pairs of large clusters in 𝒞^𝐻_𝑖. As before, we denote by 𝒞_𝑖 the collection of all the clusters in the covers constructed at level 𝑖, by 𝒞^𝐻_𝑖 the set of large clusters of 𝒞_𝑖, and by 𝒞^𝐿_𝑖 the set of small clusters of 𝒞_𝑖. After level ℓ, each of the small clusters of level ℓ contains very few vertices, and the algorithm can include in the spanner all the edges induced by these clusters. A detailed description is given in Algorithm 13.6.

See Figure 13.2 for an example of the covers and clusters constructed by the algorithm; the circles in the figure represent the clusters.

Figure 13.2. Example of clusters in covers.

Here 𝒞_1 = {𝐶_1, 𝐶_2, 𝐶_3}, 𝒞^𝐻_1 = {𝐶_1}, and 𝒞^𝐿_1 = {𝐶_2, 𝐶_3}. Note that a cover is constructed for each cluster in 𝒞^𝐿_1. The union of the clusters in these covers forms 𝒞_2, i.e., 𝒞_2 = {𝐶_4, 𝐶_5, 𝐶_6, 𝐶_7, 𝐶_8, 𝐶_9, 𝐶_{10}}. The large clusters in 𝒞_2 form 𝒞^𝐻_2 = {𝐶_4, 𝐶_5, 𝐶_8, 𝐶_9}, and the small clusters in 𝒞_2 form 𝒞^𝐿_2 = {𝐶_6, 𝐶_7, 𝐶_{10}}. Also note that the pair of close large clusters 𝐶_8 and 𝐶_9 is interconnected by a shortest path between them.

The streaming version [21] of Algorithm 13.6 is recursive, and the recursion has ℓ levels. At level 𝑖, a cover is constructed for each of the small clusters in 𝒞^𝐿_{𝑖−1} using the streaming algorithm for constructing covers described above. Because the processes of building BFS trees for constructing covers are independent, they can be carried out in parallel: when the algorithm encounters an edge in the input stream, it examines its two endpoints, and for each of the clusters in 𝒞^𝐿_{𝑖−1} that contains both endpoints, and for each of the BFS-tree constructions in those clusters that has reached one of the endpoints, it checks whether the edge would extend that BFS tree; if so, the edge is added to the tree. After the construction of the covers is completed, the algorithm interconnects close, large clusters of each cover. Again, the constructions of the BFS trees invoked by the different interconnection subroutines are independent and can be performed in parallel.
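For illustration, the per-edge handling within one such pass can be sketched as below: each partially built BFS tree grows by exactly one level per pass. This is a simplified in-memory illustration (it ignores the restriction of each tree to its own cluster) with our own data-structure choices, not the code of [21].

```python
def grow_bfs_trees_one_level(edge_stream, trees, level):
    """One streaming pass: every BFS tree in `trees` grows by one level.
    `trees` maps a root to a dict {vertex: depth} of vertices reached so far."""
    for u, v in edge_stream:
        for reached in trees.values():
            # The edge extends this tree if one endpoint sits on the current frontier
            # (depth == level) and the other endpoint has not been reached yet.
            for a, b in ((u, v), (v, u)):
                if reached.get(a) == level and b not in reached:
                    reached[b] = level + 1
    return trees

# Usage sketch: d passes grow every tree to depth d.
# trees = {root: {root: 0} for root in centers}
# for level in range(d):
#     trees = grow_bfs_trees_one_level(edges, trees, level)
```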
It is shown in [21] that:

Theorem 13.3. Given an unweighted, undirected graph on 𝑛 vertices, presented as a stream of edges, and constants 0 < 𝜌, 𝛿, 𝜖 < 1 such that 𝛿/2 + 1/3 > 𝜌 > 𝛿/2, the streaming adaptation of Algorithm 13.6, with high probability, constructs a (1 + 𝜖, 𝛽)-spanner of size 𝑂(𝑛^{1+𝛿}). The adaptation accesses the stream sequentially in 𝑂(1) passes, uses 𝑂(𝑛^{1+𝛿} ⋅ log 𝑛) bits of space, and processes each edge of the stream in 𝑂(𝑛^𝜌) time.

The parameters for Algorithm 13.6 are determined as follows. Set Δ = 𝑛^{𝛿/2}, 1/(𝜅𝜈) = 𝛿/2, and 1/(𝜅𝜈) + 𝜈 = 𝜌. This gives 𝜈 = 𝜌 − 𝛿/2 > 0, 𝜈 = 𝑂(1), 𝜅 = 2/((𝜌 − 𝛿/2)𝛿) = 𝑂(1), and ℓ = log_{1/(1−𝜈)} log_Δ 𝑛 = log_{1/(1−𝜈)}(2/𝛿) = 𝑂(1), satisfying the requirement that 𝜅, 𝜈, and ℓ all be constants. Also set 𝐷 = 𝑂(𝜅ℓ/𝜖); then 𝛽 = 𝑂(𝜅𝐷^ℓ) = 𝑂(1).

Note that once the spanner is computed, the algorithm can compute all-pairs almost-shortest paths and distances in the graph by computing exact shortest paths and distances in the spanner, using the same space. This computation of shortest paths in the spanner requires no additional passes over the input, and no additional space if one does not need to store the paths found.

5.2 Distance Approximation in One Pass

A one-pass spanner construction is given by Feigenbaum et al. in [23]. The algorithm is randomized and constructs a multiplicative (2𝑡 + 1)-spanner of an unweighted, undirected graph in one pass. With high probability, it uses 𝑂(𝑡 ⋅ 𝑛^{1+1/𝑡} log^2 𝑛) bits of space and processes each edge of the stream in 𝑂(𝑡^2 ⋅ 𝑛^{1/𝑡} log 𝑛) time. It is also shown in [23] that, with 𝑂(𝑛^{1+1/𝑡}) space, the distance between two vertices cannot be approximated better than by a factor of 𝑡; therefore, this algorithm is close to optimal.

The algorithm labels the vertices of the graph while going through the stream of edges. A label 𝑙 is a positive integer. Given the two parameters 𝑛 and 𝑡, the set of labels 𝐿 used by the algorithm is generated in the following way. Initially, we have the labels 1, 2, . . . , 𝑛; we denote this set of labels by 𝐿_0 and call them the level-0 labels. Independently, and with probability 1/𝑛^{1/𝑡}, each label 𝑙 ∈ 𝐿_0 is selected for membership in the set 𝑆_0, and 𝑙 is marked as selected. From each label 𝑙 ∈ 𝑆_0 we generate a new label 𝑙′ = 𝑙 + 𝑛; we denote the set of newly generated labels by 𝐿_1 and call them the level-1 labels. We then apply the same selection and label-generation procedure to 𝐿_1 to get the set of level-2 labels 𝐿_2, and we continue in this way until the level-⌊𝑡/2⌋ labels 𝐿_{⌊𝑡/2⌋} are generated. If a level-(𝑖 + 1) label 𝑙 is generated from a level-𝑖 label 𝑙′, we call 𝑙 the successor of 𝑙′ and write 𝑆𝑢𝑐𝑐(𝑙′) = 𝑙. The set of labels used by the algorithm is the union of the labels of levels 0, 1, 2, . . . , ⌊𝑡/2⌋, i.e., 𝐿 = ∪ 𝐿_𝑖. Note that 𝐿 can be generated before the algorithm sees the edges in the stream. However, in order to generate the labels, except in the case 𝑡 = 𝑂(log 𝑛), the algorithm needs to know 𝑛, the number of vertices in the graph, before seeing the edges in the input stream. For 𝑡 = 𝑂(log 𝑛), a simple modification of the above method can be used to generate 𝐿 without knowing 𝑛, because the probability of a label being selected can then be any constant smaller than 1/2.

While going through the stream, the algorithm labels each vertex with labels chosen from 𝐿. Let 𝐶(𝑙) be the collection of vertices that are labeled with 𝑙. We call the subgraph induced by the vertices in 𝐶(𝑙) a cluster and say that the label of the cluster is 𝑙; each label thus defines a cluster. The algorithm may label a vertex 𝑣 with multiple labels; however, 𝑣 is labeled by at most one label from 𝐿_𝑖, for 𝑖 = 1, 2, . . . , ⌊𝑡/2⌋. Moreover, if 𝑣 is labeled by a label 𝑙, and 𝑙 is selected, the algorithm also labels 𝑣 with the label 𝑆𝑢𝑐𝑐(𝑙). Denote by 𝑙_𝑖 a label of level 𝑖, i.e., 𝑙_𝑖 ∈ 𝐿_𝑖.
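As an illustration, the label hierarchy can be generated up front along the following lines; the representation and names are our own, not those of [23], but the selection probability and the rule 𝑙′ = 𝑙 + 𝑛 follow the description above.

```python
import random

def generate_labels(n, t):
    """Label levels L_0, ..., L_{floor(t/2)} and the successor map Succ (a sketch).
    Level-0 labels are 1..n; each selected level-i label l spawns the level-(i+1) label l + n."""
    p = n ** (-1.0 / t)                     # selection probability 1 / n^{1/t}
    levels = [list(range(1, n + 1))]        # L_0
    succ = {}
    for _ in range(t // 2):                 # generate L_1, ..., L_{floor(t/2)}
        next_level = []
        for l in levels[-1]:
            if random.random() < p:         # l is marked as selected
                succ[l] = l + n             # Succ(l), a label of the next level
                next_level.append(l + n)
        levels.append(next_level)
    return levels, succ

# Usage sketch: levels, succ = generate_labels(n=1000, t=4); a vertex v_i starts with its
# level-0 label i and, as long as the current label is selected, also receives its successor.
```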
Let 𝐿(𝑣) = {𝑙_0, 𝑙_{𝑘_1}, 𝑙_{𝑘_2}, . . . , 𝑙_{𝑘_𝑗}}, with 0 < 𝑘_1 < 𝑘_2 < ⋅⋅⋅ < 𝑘_𝑗 < 𝑡/2, be the collection of labels that have been assigned to the vertex 𝑣. Let 𝐻𝑒𝑖𝑔ℎ𝑡(𝑣) = max{𝑗 ∣ 𝑙_𝑗 ∈ 𝐿(𝑣)} and let 𝑇𝑜𝑝(𝑣) be the label 𝑙_𝑘 ∈ 𝐿(𝑣) with 𝑘 = 𝐻𝑒𝑖𝑔ℎ𝑡(𝑣). At the beginning of the algorithm, the set 𝐿(𝑣_𝑖) contains only the label 𝑖 ∈ 𝐿_0; the set 𝐶(𝑙) equals {𝑣_𝑙} for 𝑙 = 1, 2, . . . , 𝑛 and is empty for all other labels. 𝐿(𝑣) and 𝐶(𝑙) grow while the algorithm goes through the stream and labels the vertices. For each 𝐶(𝑙), the algorithm stores a rooted spanning tree 𝑇𝑟𝑒𝑒(𝑙) on the vertices of 𝐶(𝑙); for 𝑙 ∈ 𝐿_𝑖, the depth of the spanning tree is at most 𝑖, i.e., the deepest leaf is at distance at most 𝑖 from the root. We say an edge (𝑢, 𝑣) connects 𝐶(𝑙) and 𝐶(𝑙′) if 𝑢 is labeled with 𝑙 and 𝑣 is labeled with 𝑙′. If there are edges connecting two clusters at level ⌊𝑡/2⌋, the algorithm stores one such edge for this pair of clusters; we denote by 𝐻 the set of these edges stored by the algorithm. Another small set of edges is also stored for each vertex; we denote by 𝑀(𝑣) the set of edges stored for the vertex 𝑣. The spanner constructed by the algorithm is the union of the spanning trees of all the clusters, the sets 𝑀(𝑣) for all vertices, and the set 𝐻. The detailed algorithm is given in Algorithm 13.7.

Algorithm 13.7: One-Pass Spanner Construction
Input: an unweighted, undirected graph 𝐺 = (𝑉, 𝐸), presented as a stream of edges, and two positive integer parameters 𝑛 and 𝑡.
1. Generate the set 𝐿 of labels as described. For every 𝑣_𝑖 ∈ 𝑉, label vertex 𝑣_𝑖 with the label 𝑖 ∈ 𝐿_0; if 𝑖 is selected, also label 𝑣_𝑖 with 𝑆𝑢𝑐𝑐(𝑖), and continue in this way until a label that is not selected is reached. Set 𝐻 ← ∅ and 𝑀(𝑣_𝑖) ← ∅ for all 𝑖.
2. for each edge (𝑢, 𝑣) in the stream do
3.   if 𝐿(𝑣) ∩ 𝐿(𝑢) = ∅ then
4.     if 𝐻𝑒𝑖𝑔ℎ𝑡(𝑣) = 𝐻𝑒𝑖𝑔ℎ𝑡(𝑢) = ⌊𝑡/2⌋ and there is no edge in 𝐻 that connects 𝐶(𝑇𝑜𝑝(𝑣)) and 𝐶(𝑇𝑜𝑝(𝑢)) then
5.       set 𝐻 ← 𝐻 ∪ {(𝑢, 𝑣)};
6.     else
7.       Assume, without loss of generality, that ⌊𝑡/2⌋ ≥ 𝐻𝑒𝑖𝑔ℎ𝑡(𝑢) ≥ 𝐻𝑒𝑖𝑔ℎ𝑡(𝑣). Consider the collection of labels 𝐿_𝑣(𝑢) = {𝑙_{𝑘_1}, 𝑙_{𝑘_2}, . . . , 𝑙_{𝐻𝑒𝑖𝑔ℎ𝑡(𝑢)}} ⊆ 𝐿(𝑢), where 𝑘_1 ≥ 𝐻𝑒𝑖𝑔ℎ𝑡(𝑣) and 𝑘_1 < 𝑘_2 < ⋅⋅⋅ < 𝐻𝑒𝑖𝑔ℎ𝑡(𝑢). Let 𝑙 = 𝑙_𝑖 ∈ 𝐿_𝑣(𝑢) be a label that is marked as selected such that no 𝑙_𝑗 ∈ 𝐿_𝑣(𝑢) with 𝑗 < 𝑖 is marked as selected.
8.       if such a label 𝑙 exists then
9.         label the vertex 𝑣 with the successor 𝑙′ = 𝑆𝑢𝑐𝑐(𝑙) of 𝑙, i.e., 𝐿(𝑣) ← 𝐿(𝑣) ∪ {𝑙′}, and incorporate the edge into the spanning tree 𝑇𝑟𝑒𝑒(𝑙′). If 𝑙′ is selected, label 𝑣 with 𝑙′′ = 𝑆𝑢𝑐𝑐(𝑙′) and incorporate the edge into the tree 𝑇𝑟𝑒𝑒(𝑙′′); continue in this way until a label that is not marked as selected is reached;
10.      else
11.        if there is no edge (𝑢′, 𝑣) in 𝑀(𝑣) such that 𝑢 and 𝑢′ are labeled with the same label 𝑙 ∈ 𝐿_𝑣(𝑢) then
12.          add (𝑢, 𝑣) to 𝑀(𝑣), i.e., set 𝑀(𝑣) ← 𝑀(𝑣) ∪ {(𝑢, 𝑣)};
13.        end
14.      end
15.    end
16.  end
17. end
18. After seeing all the edges in the stream, output the union of the spanning trees of all the clusters, the sets 𝑀(𝑣) for all vertices, and the set 𝐻 as the spanner.
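To give a feel for the per-edge work, the following is a deliberately simplified skeleton of the case analysis in Algorithm 13.7. It assumes labels produced by the generate_labels sketch above (so the level of a label 𝑙 is (𝑙 − 1) div 𝑛) and omits the spanning-tree bookkeeping, the duplicate checks on 𝐻 and 𝑀(𝑣), and the chain of promotions in step 9; it is an illustration, not the algorithm of [23].

```python
def process_edge(u, v, L, H, M, succ, n, t):
    """Simplified per-edge skeleton of Algorithm 13.7 (see the caveats above).
    L maps each vertex to its current set of labels; H and M are the stored edge sets."""
    level = lambda l: (l - 1) // n            # labels n*i + 1 .. n*(i + 1) have level i
    if L[u] & L[v]:
        return                                # endpoints already share a cluster: nothing to do
    hu, hv = max(map(level, L[u])), max(map(level, L[v]))
    if hu == hv == t // 2:
        H.add((u, v))                         # an edge connecting two top-level clusters
        return
    if hv > hu:                               # w.l.o.g. treat u as the higher endpoint
        u, v, hu, hv = v, u, hv, hu
    selected = [l for l in L[u] if level(l) >= hv and l in succ]
    if selected:
        l = min(selected, key=level)          # lowest selected label of u at level >= Height(v)
        L[v].add(succ[l])                     # v joins the successor cluster of that label
    else:
        M[v].add((u, v))                      # otherwise remember the edge for v
```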
In a later work [20], Elkin gave an improved algorithm that constructs a (2𝑡 − 1)-spanner in one pass over the stream. The size of the spanner is 𝑂(𝑡 ⋅ (log 𝑛)^{1−1/𝑡} ⋅ 𝑛^{1+1/𝑡}) with high probability, and the algorithm processes each edge of the stream in 𝑂(1) time.

6. Random Walks on Graphs

The construction of an actual random walk on a graph in the streaming model is considered by Sarma et al. in [40]. The algorithm of [40] that constructs a random walk from a single starting node is presented in Algorithm 13.8. The algorithm begins by sampling a set of nodes, each included independently with probability 𝛼. Using each sampled node as a starting point, it performs a short random walk of length 𝑤 (𝑤 is a parameter that will be set later); this can be done in 𝑤 passes over the stream. It then tries to stitch the short random walks together, one by one, to form a long walk and eventually produce a walk of the required length.

The stitching works if the 𝑤-length random walk from a node 𝑢 ends at a node 𝑣 such that 𝑢 and 𝑣 are both in the set 𝑇 of sampled nodes and the 𝑤-length random walk from 𝑣 has not been used previously in the stitching process. If the random walk from 𝑢 ends at a node outside of 𝑇, or at a node in 𝑇 whose random-walk path has already been used, the stitching process gets stuck. This situation is dealt with by the subroutine described in Algorithm 13.9.

Algorithm 13.8: Random Walk
Input: starting node 𝑢, walk length 𝑙, control parameter 0 < 𝛼 ≤ 1.
1. 𝑇 ← sample each node independently with probability 𝛼.
2. In 𝑤 passes, perform walks of length 𝑤 from every node in 𝑇. Let 𝑊[𝑡] be the end point of the 𝑤-length walk from 𝑡 ∈ 𝑇.
3. 𝑆 ← ∅.
4. Let ℒ_𝑢 be the random walk from 𝑢 to be constructed. Initialize ℒ_𝑢 to be 𝑢 and let 𝑥 ← 𝑢.
5. while ∣ℒ_𝑢∣ < 𝑙 do
6.   if 𝑥 ∈ 𝑇 and 𝑥 ∉ 𝑆 then
7.     Extend ℒ_𝑢 by appending the walk 𝑊[𝑥]. Set 𝑆 ← 𝑆 ∪ {𝑥} and 𝑥 ← 𝑊[𝑥].
8.   else
9.     HandleStuckNode(𝑥, 𝑇, 𝑆, ℒ_𝑢, 𝑙).
10.  end
11. end

Algorithm 13.9 first tries to extend the random walk by a length 𝑠 (𝑠 is another parameter whose value will be determined later). It does so by randomly sampling, with repetition, 𝑠 edges out of the node on which the stitching process is currently stuck and out of each node in 𝑇 whose 𝑤-length path has already been used in the stitching process. Let 𝑂 be the set of nodes for which we sample edges (𝑂 = 𝑆 ∪ 𝑅, where 𝑆 and 𝑅 are the notations used in Algorithm 13.9). The random walk is extended as far as possible using these edges. Let 𝑥 be the end node of this extension. If 𝑥 is one of the nodes in 𝑂, we repeat the sampling and the extension. If 𝑥 is outside 𝑂 but in 𝑇, and the 𝑤-length random-walk path from 𝑥 has not been used, we return to Algorithm 13.8 and continue the stitching process. Finally, if 𝑥 falls on a new node that is neither in 𝑇 nor in 𝑂, we add 𝑥 to 𝑂 and perform the sampling and the extension again.

Algorithm 13.9: HandleStuckNode
1. 𝑅 ← {𝑥}.
2. while ∣ℒ_𝑢∣ < 𝑙 do
3.   𝐸 ← sample 𝑠 edges (with repetition) out of each node in 𝑆 ∪ 𝑅.
4.   Extend ℒ_𝑢 as far as possible by walking along the edges in 𝐸.
5.   𝑥 ← new end point of ℒ_𝑢. One of the following cases arises:
     (1) if 𝑥 ∈ 𝑆 ∪ 𝑅, continue;
     (2) if 𝑥 ∈ 𝑇 and 𝑥 ∉ 𝑆 ∪ 𝑅, return;
     (3) if 𝑥 ∉ 𝑇 and 𝑥 ∉ 𝑆 ∪ 𝑅, set 𝑅 ← 𝑅 ∪ {𝑥}.
6. end

Each stitch extends the random walk by length 𝑤. When handling a stuck node, either an extension of length 𝑠 is made or the algorithm encounters a node outside of 𝑂; with probability 𝛼, this node is in 𝑇 (because 𝑇 is the set of nodes sampled with probability 𝛼), and the algorithm can then make an extension of length 𝑤. Therefore, after a pass over the stream to sample the edges for the nodes in 𝑂, Algorithm 13.9 makes progress of length at least min(𝑠, 𝛼𝑤) on average. Sarma et al. showed in [40] that, by setting 𝑤 = √(𝑙/𝛼) and 𝑠 = √(𝑙𝛼), the 𝑙-length random walk from a single starting node can be performed in 𝑂(√(𝑙/𝛼)) passes and 𝑂(𝑛𝛼 + √(𝑙/𝛼)) space, for any 0 < 𝛼 ≤ 1.
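The stitching idea can be illustrated by a small in-memory simulation (not the pass-based streaming version): it precomputes the 𝑤-step walks from the sampled nodes and, when stuck, falls back to single random steps, which stand in for the 𝑠-edge sampling of Algorithm 13.9. It assumes every node has at least one neighbor; all names are ours.

```python
import random

def random_walk_by_stitching(adj, start, length, alpha, w, rng=random.Random(0)):
    """In-memory sketch of the stitching idea behind Algorithms 13.8 and 13.9."""
    def short_walk(v):
        path = []
        for _ in range(w):
            v = rng.choice(adj[v])
            path.append(v)
        return path

    T = {v for v in adj if rng.random() < alpha}      # sampled nodes
    W = {v: short_walk(v) for v in T}                 # precomputed w-step walks
    used, walk, x = set(), [start], start
    while len(walk) - 1 < length:
        if x in T and x not in used:
            used.add(x)
            walk.extend(W[x])                         # stitch the stored walk from x
        else:
            walk.append(rng.choice(adj[x]))           # stuck: take a fresh random step instead
        x = walk[-1]
    return walk[:length + 1]
```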
This single-starting-point random walk is then extended to perform a large number 𝐾 of random walks. A naive extension would simply run 𝐾 copies of the single random walk in parallel; Sarma et al. introduced an extension that uses much less space. They estimate, for each node, the probability that its 𝑤-length walk will be used, and based on this probability they store an appropriate number of 𝑤-length walks for each sampled node for the 𝐾 executions of Algorithm 13.8. In this way, instead of 𝑂(𝐾(𝑛𝛼 + √(𝑙/𝛼))) space, one needs only Õ(𝑛𝛼 + 𝐾√(𝑙/𝛼) + 𝐾𝑙𝛼) space. (An alternative algorithm for running multiple random walks is also given in [40] that uses Õ(𝑛𝛼√(𝑙𝛼) + 𝐾√(𝑙/𝛼) + 𝑙) space. Combining the two, the space requirement for performing a large number of walks is Õ(min{𝑛𝛼 + 𝐾√(𝑙/𝛼) + 𝐾𝑙𝛼, 𝑛𝛼√(𝑙𝛼) + 𝐾√(𝑙/𝛼) + 𝑙}).) Sarma et al. further show that the algorithms can be used to estimate probability distributions and to approximate the mixing time.

In a later work [41], Sarma et al. modify and apply the above random-walk algorithms to compute sparse graph cuts.

Definition 13.4. The conductance of a cut 𝑆 is defined as Φ(𝑆) = 𝐸(𝑆, 𝑉∖𝑆) / min{𝐸(𝑆), 𝐸(𝑉∖𝑆)}, where 𝐸(𝑆, 𝑉∖𝑆) is the number of edges crossing the cut (𝑆, 𝑉∖𝑆) and 𝐸(𝑆) is the number of edges with at least one endpoint in 𝑆. The conductance of a graph 𝐺 = (𝑉, 𝐸) is defined as Φ = min_{𝑆 : 𝐸(𝑆) ≤ 𝐸(𝑉)/2} 𝐸(𝑆, 𝑉∖𝑆)/𝐸(𝑆). For 𝑑-regular graphs, Φ = min_{𝑆 : ∣𝑆∣ ≤ ∣𝑉∣/2} 𝐸(𝑆, 𝑉∖𝑆)/(𝑑∣𝑆∣), and the sparsity of a 𝑑-regular graph is related to the conductance by a factor of 𝑑.

It is well known that a sparse cut of a graph can be obtained by performing random walks [34, 42]. In particular, one can start from a random source and perform a random walk of length about 1/Φ. The random walk defines, for each node 𝑖, the probability 𝑝_𝑖 that the walk lands on node 𝑖. One can sort the nodes in decreasing order of 𝑝_𝑖; each prefix of this ordered sequence of nodes gives a cut, and Lovasz and Simonovits [34] showed that one of these 𝑛 cuts can be sparse. Sarma et al. extended the result to the case where only an estimate ˜𝑝_𝑖 of 𝑝_𝑖 is available. Let 𝜌_𝑝(𝑖) = 𝑝_𝑖/𝑑_𝑖, where 𝑑_𝑖 is the degree of the 𝑖-th node. They show in [41] that:

Theorem 13.5. Let ˜𝑝_𝑖 be an estimate for 𝑝_𝑖 with error ∣˜𝑝_𝑖 − 𝑝_𝑖∣ ≤ 𝜖(𝑝_𝑖 + √(𝑝_𝑖/𝑛) + 1/𝑛), for a source 𝑠 from 𝑈, where (𝑈, 𝑉∖𝑈) is a cut of conductance at most Φ with ∣𝑈∣ ≤ ∣𝑉∣/2, and for a random walk of length 𝑙. Order the nodes in decreasing order of 𝜌_{˜𝑝}(𝑖); each prefix of this ordered sequence gives a cut. If the source node 𝑠 is chosen randomly and 𝑙 is chosen randomly in the range {1, 2, . . . , 𝑂(1/Φ)}, then, with constant probability, one of the 𝑛 cuts 𝑆 gives Φ(𝑆) ≤ Õ(√Φ), provided 𝜖 ≤ 𝑜(Φ).

Following Theorem 13.5 and using a modified version of the random-walk algorithm of [40], Sarma et al. give an algorithm that finds, with high probability, a cut of conductance at most Õ(√Φ) in any 𝑑-regular graph that has a cut of conductance at most Φ and balance 𝑏. The algorithm makes Õ(√(1/(Φ𝛼))) passes over the graph stream and uses Õ(min{𝑛𝛼 + (1/𝑏)(𝑛𝛼/(𝑑Φ^3) + 𝑛/(𝑑√𝛼 Φ^{2.5})), (𝑛𝛼 + (1/𝑏) ⋅ 𝑛/(𝑑𝛼Φ^2)) ⋅ √(1/(Φ𝛼)) + 1/Φ}) space. In [41], they also give algorithms that compute sparse projected cuts.
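To illustrate the prefix sweep used in Theorem 13.5, the sketch below orders the nodes by ˜𝑝_𝑖/𝑑_𝑖 and returns the prefix cut of smallest conductance, computing Φ(𝑆) as in Definition 13.4. It is an in-memory illustration only, not the streaming algorithm of [41].

```python
def conductance(adj, S, total_edges):
    """Phi(S) = E(S, V\\S) / min{E(S), E(V\\S)} for an undirected graph given as adjacency lists."""
    inside = sum(1 for u in S for w in adj[u] if w in S) // 2        # edges with both endpoints in S
    crossing = sum(1 for u in S for w in adj[u] if w not in S)       # edges crossing the cut
    e_S, e_rest = inside + crossing, total_edges - inside            # E(S) and E(V \ S)
    return crossing / min(e_S, e_rest) if min(e_S, e_rest) > 0 else float("inf")

def best_sweep_cut(adj, p_est):
    """Order nodes by p~_i / d_i and return the best prefix cut and its conductance."""
    total_edges = sum(len(neigh) for neigh in adj.values()) // 2
    order = sorted(adj, key=lambda v: p_est.get(v, 0.0) / max(len(adj[v]), 1), reverse=True)
    best_phi, best_cut, prefix = float("inf"), set(), set()
    for v in order[:-1]:                                             # proper, nonempty prefixes only
        prefix.add(v)
        phi = conductance(adj, prefix, total_edges)
        if phi < best_phi:
            best_phi, best_cut = phi, set(prefix)
    return best_phi, best_cut
```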
7. Conclusions

Massive graphs that may be too large to fit into main memory have emerged in recent years. Streaming is considered as a computation model for handling massive data sets, including massive graphs. Despite the restrictions imposed by the model, there are streaming algorithms for many graph problems. We surveyed recent algorithms for computing graph statistics, matchings, and distances in a graph, and for performing random walks on a graph. Due to the limitations of the model, many of these algorithms output approximate results. Streaming algorithms remain a topic of considerable research interest, and efforts are being made to improve the approximation guarantees and to design algorithms for further problems arising from applications.
