Managing and Mining Graph Data

A Survey on Streaming Algorithms for Massive Graphs

Other variants of the streaming model also exist. For example, the W-stream model [15] allows the algorithm to write to (annotate) the stream during each pass. These annotations can then be used by the algorithm in successive passes. Another variant [1] augments the streaming model with a sorting primitive.

3. Statistics and Counting Triangles

In this section, we describe a set of problems that involve graphs but can essentially be reduced to problems whose input is an array presented as a stream of the array elements (or as a sequence of increments to the elements). For example, the array $a = [2, 1, 3]$ can be given as the stream $\{(a[1] + 1), (a[3] + 1), (a[3] + 1), (a[2] + 1), (a[1] + 1), (a[3] + 1)\}$. Assuming all entries of the array are 0 at the beginning of the stream, we obtain the array $a$ after applying the operations in the stream. Many streaming algorithms compute statistics of such an array, such as frequency moments [3, 24, 31] and heavy hitters [13, 10], or construct succinct data structures that support queries such as range queries [38]. These algorithms can be applied directly once the graph problem is reduced to the corresponding array problem.

We consider such problems for the degrees of the graph nodes. For an undirected graph, the degree of a node is the number of edges incident to the node. One may view each graph as having an associated virtual array $D$ such that $D[i]$ is the degree of the $i$-th node. In the streaming setting, a stream of edges translates to updates to the array $D$. For example, the stream $\{(1, 2), (4, 8), (2, 7), \ldots\}$ means the operation sequence $\{(D[1] + 1), (D[2] + 1), (D[4] + 1), (D[8] + 1), (D[2] + 1), (D[7] + 1), \ldots\}$. (The degree array can be extended to directed graphs, where we may have one out-degree array and one in-degree array.)
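As a minimal sketch of this reduction (exact, not space-efficient, and not one of the cited streaming algorithms), the Python code below applies an edge stream as increments to the degree array D and then evaluates a frequency moment, that is, the sum of D[i]^k over all nodes, directly. The names `degree_array` and `frequency_moment` are illustrative assumptions; a real streaming algorithm would replace the explicit dictionary with a small sketch.

```python
from collections import defaultdict


def degree_array(edge_stream):
    """Apply each streamed edge (u, v) as the updates D[u] += 1 and D[v] += 1."""
    degree = defaultdict(int)
    for u, v in edge_stream:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)


def frequency_moment(degree, k):
    """Exact k-th moment: sum of D[i]**k over the nodes with nonzero degree."""
    return sum(d ** k for d in degree.values())


# The first three edges of the example stream in the text.
stream = [(1, 2), (4, 8), (2, 7)]
D = degree_array(stream)
second_moment = frequency_moment(D, 2)
```

For these three edges, node 2 has degree 2 and every other endpoint has degree 1, so the first moment equals twice the number of edges.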
The frequency moment problem is to compute the $k$-th moment $f_k = \sum_{i=1}^{n} (D[i])^k$ of the node degrees. The heavy hitter problem is to report, after seeing the graph stream, the nodes having the largest degrees. The range query problem is to construct a succinct representation of the array (one much smaller than the array itself) from which $\sum_{i=j}^{k} D[i]$ can be calculated, given $j$ and $k$ as query input. Cormode and Muthukrishnan show [14] that these problems can be solved using the corresponding streaming algorithms for arrays. They further provide algorithms for these problems when the graph is a multigraph but the degree of a node is defined to count only distinct edges. (For example, if the stream for a multigraph has edges (1, 2), (2, 5), (1, 2), the degree of node 1 is 1, not 2, and the degree of node 2 is 2, not 3.) The details of the algorithms are beyond the scope of this survey. We refer readers to [14] and the aforementioned literature for streaming algorithms that compute statistics and answer other queries for an array.

The node degrees of a graph are also related to the entropy $H$ of an unbiased random walk on the graph [9]. In particular, $H$ is defined as

$$H = \frac{1}{2|E|} \sum_{i=1}^{n} D[i] \log D[i].$$

A streaming algorithm that computes the entropy of an array whose $i$-th entry represents the frequency of the $i$-th element in a set is given in [9]. The authors showed that the algorithm can be applied when the array is the node-degree array $D$ of a graph, and therefore the entropy of an unbiased random walk can be calculated for a graph stream. They also extended the algorithm to multigraphs where only distinct edges are counted for the degree.

Another problem that can be reduced to computing statistics of an array is the triangle counting problem, i.e., finding the number of triangles in an undirected graph. We describe here the reduction introduced by Bar-Yossef et al. [6].
Similar to the earlier problems, there is a virtual array $P$ associated with the graph. Each entry in the array corresponds to an (unordered) triple of graph nodes. For example, if $v_i, v_j, v_k$ are three nodes in the graph, the entry $P[(i, j, k)]$ corresponds to the triple $\{v_i, v_j, v_k\}$. The value of the entry counts how many of the three pairs $\{v_i, v_j\}$, $\{v_i, v_k\}$, and $\{v_j, v_k\}$ are actual edges in the graph. There are four possible values for an entry: 0, 1, 2, and 3. Let $T_0$, $T_1$, $T_2$, and $T_3$ be the number of entries that take the corresponding value. Clearly, $T_3$ is exactly the number of triangles in the graph. (We will abuse the notation and also use $T_i$ to denote the set of triples whose entry value is $i$.)

Unlike in the reduction described earlier, an edge in the graph stream here maps into updates of multiple entries in the array. If we see an edge $(u, v)$, it means $(P[(u, v, s)] + 1)$ for all nodes $s \neq u, v$.

Now consider the frequency moments of the array, $f_k = \sum_{t} (P[t])^k$. Each moment can be decomposed as $f_k = T_1 \cdot 1^k + T_2 \cdot 2^k + T_3 \cdot 3^k$, because each entry with value 1 contributes $1^k$ to $f_k$, each entry with value 2 contributes $2^k$, and each entry with value 3 contributes $3^k$. We therefore have the following equations:

$$\begin{pmatrix} f_0 \\ f_1 \\ f_2 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \\ 1 & 4 & 9 \end{pmatrix} \cdot \begin{pmatrix} T_1 \\ T_2 \\ T_3 \end{pmatrix}.$$

Using streaming algorithms, one can estimate $f_0$ and $f_2$; $f_1$ can easily be obtained from the stream. Solving the above equations then gives an estimate of $T_3$. (Although the virtual array is larger than the graph stream, e.g., a stream of $m$ edges translates into $m(n-2)$ array updates, the estimation algorithms often use space logarithmic in the size of the array. Therefore, the memory space needed is not significantly affected by the reduction.)

In [6], Bar-Yossef et al. also proposed improved streaming frequency-moment estimation algorithms.
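The algebra of this reduction can be checked exactly on a small graph. The sketch below assumes a simple graph on nodes 0..n-1 and stores the whole virtual array, so it illustrates the linear system rather than a streaming algorithm. It builds P, computes f0, f1 and f2, and inverts the system in closed form as T3 = f0 - (3/2) f1 + (1/2) f2. The function names are hypothetical.

```python
def triple_array(n, edges):
    """Exact virtual array: for each edge (u, v), add 1 to P[{u, v, s}] for all other nodes s."""
    P = {}
    for u, v in edges:
        for s in range(n):
            if s != u and s != v:
                key = frozenset((u, v, s))
                P[key] = P.get(key, 0) + 1
    return P


def triangles_from_moments(n, edges):
    """Recover T3 from the exact moments f0, f1, f2 of the virtual array."""
    P = triple_array(n, edges)
    f0 = len(P)                              # number of nonzero entries
    f1 = sum(P.values())                     # equals m * (n - 2)
    f2 = sum(x * x for x in P.values())
    # Third row of the inverse of [[1,1,1],[1,2,3],[1,4,9]] is (1, -3/2, 1/2).
    return (2 * f0 - 3 * f1 + f2) // 2


# Complete graph on 4 nodes with the edge (0, 3) removed: two triangles remain.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
t3 = triangles_from_moments(4, edges)
```

In a streaming setting, f0 and f2 would be replaced by estimates, so the recovered value would be an approximation whose error depends on how small T3 is relative to T1 and T2.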
Using the reduction and their frequency-moment estimation, they show that for $\epsilon, \delta > 0$, the number of triangles in a graph can be estimated within $\epsilon$ error (i.e., the estimate lies between $(1-\epsilon)T_3$ and $(1+\epsilon)T_3$) with probability at least $1 - \delta$. The algorithm uses space

$$s = O\left(\frac{1}{\epsilon^3} \cdot \log\frac{1}{\delta} \cdot \left(\frac{T_1 + T_2 + T_3}{T_3}\right)^3 \cdot \log n\right)$$

and poly($s$) processing time per edge. When the stream is an incidence stream, they show that the number of triangles can be $(\epsilon, \delta)$-estimated using space

$$O\left(\frac{1}{\epsilon^2} \cdot \log\frac{1}{\delta} \cdot \left(\frac{T_3 + T_2}{T_3}\right)^2 \cdot \log n + d_{max} \log n\right),$$

where $d_{max}$ is the maximum degree.

In a follow-up work, Jowhari and Ghodsi [33] introduced several estimators for the number of triangles. One estimator uses sequences of random numbers in a way similar to [3]. Let $R$ be an array of uniform, $\pm 1$-valued random numbers, i.e., $P(R[i] = 1) = P(R[i] = -1) = 0.5$ and $E(R[i]) = 0$. The random numbers in the array are 12-wise independent. A family of such random arrays can be constructed in log-space using the BCH code [3]. While the edges stream by, one computes $Z = \sum_{(i,j) \in E} R[i] R[j]$. Then $X = Z^3/6$ is an estimator for the number of triangles in the graph. This is so because $E(R^k[i]) = 0$ for odd $k$ and the numbers in $R$ are 12-wise independent. After the expansion of $X$, the expectations of all terms evaluate to zero except those of the form $6R^2[i]R^2[j]R^2[k]$, which correspond to the triangles. Jowhari and Ghodsi showed that the variance of the estimator can be controlled such that only

$$O\left(\frac{1}{\epsilon^2} \cdot \log\frac{1}{\delta} \cdot \left(\frac{m^3 + m C_4 + C_6}{T_3^2} + 1\right) \cdot \log n\right)$$

space and per-edge processing time is needed for an $(\epsilon, \delta)$-estimation. ($C_k$ is the number of cycles of length $k$ in the graph.) Two other sample-based estimators are also proposed in [33].

Buriol et al. also proposed sample-based algorithms for counting triangles in [8]. We present one of their algorithms in Algorithm 13.1. $\beta$ is a $\{0, 1\}$-valued random variable whose expectation is $\frac{3T_3}{T_1 + 2T_2 + 3T_3}$.
Because $T_1 + 2T_2 + 3T_3 = m(n-2)$, $T_3$ can be estimated using a set of samples of $\beta$. (To see the equality, consider the triples that consist of the two end nodes of an edge plus one of the other $n-2$ nodes. There are $m(n-2)$ such combinations. On the other hand, this way of counting counts each triple in $T_1$ once, each triple in $T_2$ twice, and each triple in $T_3$ three times.) For an $(\epsilon, \delta)$-estimation, this algorithm uses $O\left(\frac{1}{\epsilon^2} \cdot \log\frac{1}{\delta} \cdot \frac{T_1 + T_2 + T_3}{T_3}\right)$ memory space and constant expected per-edge processing time.

Algorithm 13.1: Sample Triangle
    1st pass: Count the number of edges in the graph.
    2nd pass: Sample an edge (u, v) uniformly. Choose a node w uniformly from V ∖ {u, v}.
    3rd pass:
        if both (u, w) and (v, w) are actual edges in the stream then
            β = 1
        else
            β = 0
        end
    return β

Buriol et al. further showed that Algorithm 13.1 can be modified into a one-pass algorithm. The uniform sampling of the edges can be done in one pass by reservoir sampling [43]. One difference here is that edges $(u, w)$ and $(v, w)$ may arrive before $(u, v)$ in the stream. When $(u, v)$ gets selected as a sample, we have already missed $(u, w)$ and $(v, w)$ and would not detect $u, v, w$ as a triangle. This happens when $(u, v)$ is not the first edge of the triangle in the stream, and it reduces the expectation of $\beta$ by a factor of 3. Sample-based algorithms are also proposed in [8] for incidence streams.

4. Graph Matching

A matching in a graph is a set of edges without common nodes. For an unweighted graph, the maximum matching problem is to find a matching having the largest cardinality (number of edges). For a weighted graph, the problem is to find a matching whose edges give the largest weight sum. We survey unweighted and weighted matching algorithms for graph streams in the following sections.

4.1 Unweighted Matching

An early algorithm for approximating unweighted bipartite matching in the streaming model is given in [22]. We describe the algorithm here.
It is easy to see that a maximal matching (a matching to which no more edges can be added, because every edge outside the matching shares a vertex with some edge in the matching) can be constructed in one pass over the graph stream. Given a matching $M$ for a bipartite graph $G = (L \cup R, E)$, a length-3 augmenting path for an edge $e = (u, v) \in M$, $u \in L$ and $v \in R$, is a quadruple $(w_l, u, v, w_r)$ such that $(u, w_l), (w_r, v) \in E$, and $w_l$ and $w_r$ are free vertices. We call $w_l$ and $w_r$ the wing-tips of the augmenting path, $(u, w_l)$ the left wing and $(w_r, v)$ the right wing. A set of simultaneously augmentable length-3 augmenting paths is a set of length-3 augmenting paths that are vertex disjoint.

Algorithm 13.2: Find Augmenting Paths
    Input: a graph G = (L ∪ R, E), a matching M for G and a parameter 0 < δ < 1.
    while true do
        In one pass, find a maximal set of disjoint left wings. If the number of left wings found is ≤ δ|M|, terminate.
        In a second pass, for the edges in M with left wings, find a maximal set of disjoint right wings.
        In a third pass, identify the set of vertices that are endpoints of a matched edge that got a left wing, or the wing-tips of a matched edge that got both wings, or endpoints of a matched edge that is no longer 3-augmentable. Remember these vertices and, in subsequent passes, ignore any edge incident on one of them.
    end

Given a bipartite graph and a matching in the graph, the subroutine in Algorithm 13.2 finds a set of simultaneously augmentable length-3 augmenting paths. It will be used in the main algorithm that computes the matching for a bipartite graph.

Let $X$ be a maximum-sized set of simultaneously augmentable length-3 augmenting paths for the maximal matching $M$. Let $\alpha = \frac{|X|}{|M|}$. It is shown in [22] that Algorithm 13.2 finds at least $\frac{\alpha|M| - 2\delta|M|}{3}$ simultaneously augmentable length-3 augmenting paths in $3/\delta$ passes.
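The sketch below illustrates the two building blocks under simplifying assumptions: a one-pass greedy maximal matching, and a single round of wing finding (one pass for left wings, one for right wings) followed by augmentation. It omits the δ-based termination test and the vertex bookkeeping of the third pass in Algorithm 13.2, so it is not the full subroutine of [22]. The function names and the explicit `left` set describing the bipartition are assumptions made for illustration.

```python
def maximal_matching(edges):
    """One pass: keep an edge whenever both endpoints are still unmatched."""
    matched, matching = set(), []
    for u, v in edges:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching


def find_length3_paths(edges, matching, left):
    """One simplified round of wing finding; `edges` is re-read once per pass."""
    def orient(a, b):
        # Return the edge as (left endpoint, right endpoint).
        return (a, b) if a in left else (b, a)

    mate = {}
    for a, b in matching:
        u, v = orient(a, b)
        mate[u], mate[v] = v, u

    left_wing, right_wing, used_tips = {}, {}, set()
    for a, b in edges:  # pass 1: left wings (u, w_l) with u matched and w_l free
        u, w = orient(a, b)
        if u in mate and w not in mate and u not in left_wing and w not in used_tips:
            left_wing[u] = w
            used_tips.add(w)
    for a, b in edges:  # pass 2: right wings (w_r, v) for edges that got a left wing
        w, v = orient(a, b)
        if (v in mate and mate[v] in left_wing and w not in mate
                and v not in right_wing and w not in used_tips):
            right_wing[v] = w
            used_tips.add(w)
    # Each path is the quadruple (w_l, u, v, w_r).
    return [(left_wing[mate[v]], mate[v], v, w) for v, w in right_wing.items()]


def augment(matching, paths):
    """Replace each matched edge (u, v) by the two wings (u, w_l) and (w_r, v)."""
    result = {frozenset(e) for e in matching}
    for w_l, u, v, w_r in paths:
        result.discard(frozenset((u, v)))
        result.add(frozenset((u, w_l)))
        result.add(frozenset((w_r, v)))
    return result


# Path 1 - 2 - 0 - 3 with L = {0, 1}; the stream order makes greedy pick only (0, 2).
stream = [(0, 2), (1, 2), (0, 3)]
M = maximal_matching(stream)
paths = find_length3_paths(stream, M, left={0, 1})
bigger = augment(M, paths)
```

In this example the greedy pass stops at a matching of size 1, and the single augmenting path found in the next two passes raises it to the maximum size of 2.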
The main matching algorithm increases the size of a matching by repeatedly finding a set of simultaneously augmentable length-3 augmenting paths and augmenting the matching using these paths. The for-loop in Algorithm 13.3 runs $\left\lceil \frac{\log 6\epsilon}{\log (8/9)} \right\rceil$ times. During each run, the subroutine described in Algorithm 13.2 makes $3/\delta$ passes over the input graph stream. Therefore, Algorithm 13.3 makes $O\left(\frac{\log (1/\epsilon)}{\epsilon}\right)$ passes over the stream in total. Each call to the subroutine finds a set of simultaneously augmentable length-3 augmenting paths, which increases the size of the matching. The final matching size reaches at least $(2/3 - \epsilon)$ of the maximum matching. The algorithm processes each edge in $O(1)$ time in each pass except the first pass, in which the bipartition is found. The storage space required by the algorithm is $O(n \log n)$.

Algorithm 13.3: Unweighted Bipartite Matching
    Input: a bipartite graph G = (L ∪ R, E) and a parameter 0 < ε < 1/3.
    In one pass, find a maximal matching M and the bipartition of G.
    for k = 1, 2, ..., ⌈log 6ε / log(8/9)⌉ do
        Run Algorithm 13.2 with G, M and δ = ε / (2 − 3ε).
        for each e = (u, v) ∈ M for which an augmenting path (w_l, u, v, w_r) is found by Algorithm 13.2 do
            remove (u, v) from M and add (u, w_l) and (w_r, v) to M.
        end
    end

Figure 13.1. Layered Auxiliary Graph. Left, a graph with a matching (solid edges); Right, a layered auxiliary graph. (An illustration, not constructed from the graph on the left. The solid edges show potential augmenting paths.)

In [35], McGregor introduced an improved algorithm to find augmenting paths in an unweighted graph for which a maximal matching has been constructed. Given the original input graph $G$ and a matching $M$, McGregor constructed an auxiliary graph $G_A$ to help search for augmenting paths. Figure 13.1 gives an example of an auxiliary graph. The auxiliary graph is a layered graph with a small number, $k+2$, of layers.
It is derived as follows. Let $L_0, L_1, \ldots, L_{k+1}$ be the layers in $G_A$. The free nodes in $G$, i.e., the nodes that are not covered by an edge in $M$, are randomly projected to nodes in $L_0$ or $L_{k+1}$. Each edge in $M$ is projected to a node in $G_A$, and this node is randomly assigned to one of the layers $L_1, L_2, \ldots, L_k$. There is an edge between a node $x \in L_i$ (corresponding to $(v_1, v_2) \in M$) and a node $y \in L_{i-1}$ (corresponding to $(v_3, v_4) \in M$) if $(v_2, v_3)$ is an edge of $G$. With this construction, an $(i+1)$-length path in $G_A$ can be mapped to a $(2i+1)$-length augmenting path for $M$ in $G$.

Identifying a set of augmenting paths for $M$ in $G$ is thus transformed into finding a set of node-disjoint paths in $G_A$. Because one does not have enough space to store the whole graph $G$ in the streaming model, the auxiliary graph $G_A$ normally cannot be stored as a whole either. However, the nodes in $G_A$ can be stored. While the algorithm passes through the input stream of $G$, the edges in $G_A$ also get revealed. Hence, the problem boils down to finding a near-maximal set of node-disjoint paths in $G_A$.

A search algorithm was proposed in [35] for this purpose. The algorithm finds a maximal matching between layers $L_{i-1}$ and $L_i$. Let $S_i \subseteq L_i$ be the set of nodes involved in this matching. The algorithm then goes on to find a maximal matching between $S_i$ and $L_{i+1}$. It continues in this fashion to grow a set of node-disjoint paths. Clearly, the size of $S_i$ may decrease as $i$ increases, and $S_i$ may become empty before the last layer is reached. To avoid this, the path growth process may backtrack if the size of $S_i$ becomes too small. The backtracking is done by marking the nodes in $S_i$ as dead ends, removing them from $G_A$, and continuing path growth in the remainder of $G_A$. For a particular $G_A$ construction and path growth, the resulting set of paths may be small.
However, the $G_A$ construction is random because the nodes corresponding to the edges in $M$ are randomly assigned to the layers. A matching algorithm is given in [35] that is similar to Algorithm 13.3 in structure but uses the $G_A$-based augmenting-path search. It is shown that, with high probability, this algorithm finds a matching whose size is at least $\frac{1}{1+\epsilon}$ of the maximum matching in $O_\epsilon(1)$ passes (a function of $\epsilon$, which is a constant if $\epsilon$ is constant).

4.2 Weighted Matching

The streaming version of the weighted matching problem was first studied in [22], where a streaming algorithm (Algorithm 13.4) was proposed. The algorithm uses only one pass over the stream and finds a matching whose weight is at least $\frac{1}{6}$ of the optimal weight.

Algorithm 13.4: Weighted Matching
    1   Maintain a matching M at all times.
    2   while there are edges in the stream do
    3       Let e be the next edge in the stream and w(e) be the weight of e;
    4       Let w(C) be the sum of the weights of the edges in C = {e′ | e′ ∈ M and e′ and e share an end point}. (w(C) = 0 if C is empty.)
    5       if w(e) > 2w(C) then
    6           update M ← M ∪ {e} ∖ C.
    7       else
    8           ignore e
    9       end
    10  end

The following property of Algorithm 13.4 is shown in [22].

Theorem 13.2. In 1 pass and $O(n \log n)$ storage, Algorithm 13.4 constructs a weighted matching whose weight is at least $\frac{1}{6}$ of the optimal weight.

Proof: For any set of edges $S$, let $w(S) = \sum_{e \in S} w(e)$. We say that an edge is selected if it is ever part of $M$. We say that an edge is dropped if it was selected earlier but later removed from $M$ (step 6 in Algorithm 13.4) by a new, heavier edge. This new edge replaces the dropped edge. We say an edge is a survivor if it is selected and never dropped. Let the set of survivors be $S$. The weight of the matching we find is therefore $w(S)$.

For each survivor $e$, let the Trail of Drops leading to this edge be $T(e) = C_1 \cup C_2 \cup \ldots$, where $C_0 = \{e\}$, $C_1 = \{\text{the edges replaced by } e\}$, and $C_i = \bigcup_{e' \in C_{i-1}} \{\text{the edges replaced by } e'\}$. We have $w(T(e)) \le w(e)$.
This is because for each replacing edge $e$, $w(e)$ is at least twice the cost of the replaced edges, and an edge has at most one replacing edge. Hence, for all $i$, $w(C_i) \ge 2w(C_{i+1})$ and

$$2w(T(e)) = \sum_{i \ge 1} 2w(C_i) \le \sum_{i \ge 0} w(C_i) = w(T(e)) + w(e).$$

Now consider the optimal solution that includes edges $\mathrm{opt} = \{o_1, o_2, \ldots\}$. We are going to charge the costs of the edges in opt to the survivors and their trails of drops, $\bigcup_{e \in S} (T(e) \cup \{e\})$. We hold an edge $e$ in this set accountable to $o \in \mathrm{opt}$ if either $e = o$ or $o$ was not selected because $e$ was in $M$ when $o$ arrived. Note that, in the second case, it is possible for two edges to be accountable to $o$. If only one edge $e$ is accountable for $o$, then we charge $w(o)$ to $e$. If two edges $e_1$ and $e_2$ are accountable for $o$, then we charge $\frac{w(o)\,w(e_1)}{w(e_1) + w(e_2)}$ to $e_1$ and $\frac{w(o)\,w(e_2)}{w(e_1) + w(e_2)}$ to $e_2$. In either case, the amount charged by $o$ to any edge $e$ is at most $2w(e)$.

We now redistribute these charges as follows: (for distinct $u_1, u_2, u_3$) if $e = (u_1, v)$ gets charged by $o = (u_2, v)$, and $e$ subsequently gets replaced by $e' = (u_3, v)$, we transfer the charge from $e$ to $e'$. Note that we maintain the property that the amount charged by $o$ to any edge $e$ is at most $2w(e)$ because $w(e') \ge w(e)$. What this redistribution of charges achieves is that now every edge in a trail of drops is only charged by one edge in opt. Survivors can, however, be charged by two edges in opt. We charge $w(\mathrm{opt})$ to the survivors and their trails of drops, and hence

$$w(\mathrm{opt}) \le \sum_{e \in S} \left(2w(T(e)) + 4w(e)\right).$$

Because $w(T(e)) \le w(e)$,

$$\sum_{e \in S} \left(2w(T(e)) + 4w(e)\right) \le 6w(S),$$

and the theorem follows. □

The condition on line 5 of Algorithm 13.4 can be generalized to $w(e) > (1 + \gamma)\,w(C)$, where $C = \{e' \mid e' \in M \text{ and } e' \text{ and } e \text{ share an end point}\}$.
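The replacement rule, including the generalized threshold, can be sketched in a few lines of Python. This is an illustrative single pass, assuming the stream yields `(u, v, weight)` tuples; the function name, the `gamma` parameter name and the `edge_at` dictionary are choices made here, not notation from [22] or [35]. Setting `gamma = 1.0` gives the test w(e) > 2w(C) of Algorithm 13.4.

```python
def streaming_weighted_matching(edge_stream, gamma=1.0):
    """One pass of the rule: accept e if w(e) > (1 + gamma) * w(C), evicting C."""
    edge_at = {}  # node -> the matched edge (u, v, w) currently covering it
    for u, v, w in edge_stream:
        # C: matched edges that share an endpoint with the new edge (at most two).
        conflicts = {edge_at[x] for x in (u, v) if x in edge_at}
        if w > (1 + gamma) * sum(c[2] for c in conflicts):
            for a, b, _ in conflicts:
                del edge_at[a], edge_at[b]
            edge_at[u] = edge_at[v] = (u, v, w)
        # Otherwise the edge is ignored.
    return set(edge_at.values())


stream = [(1, 2, 1), (2, 3, 3), (3, 4, 5), (1, 2, 2)]
strict = streaming_weighted_matching(stream, gamma=1.0)
loose = streaming_weighted_matching(stream, gamma=0.5)
```

On this stream the threshold 2 keeps only the edge of weight 3, because the later edge of weight 5 does not exceed 2 · 3, while the smaller threshold 1.5 accepts it and ends with total weight 7. The example shows how the choice of γ trades off the two kinds of loss and says nothing about worst-case guarantees.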
By setting $\gamma$ appropriately and repeating Algorithm 13.4 until the improvement yielded falls below some threshold, a matching can be constructed [35] in $O_\epsilon(1)$ passes whose weight is at least $\frac{1}{2+\epsilon}$ of the maximum-weight matching.

Another improvement for weighted matching was made recently by Zelke [46]. Zelke's algorithm is also based on Algorithm 13.4, but incorporates some improvements. In particular, the algorithm stores a few edges that have been in $M$ in the past but were replaced later, to potentially reinsert them into $M$ in the future. Such edges are called "shadow edges" in [46]. With shadow edges, when a new edge arrives in the stream, besides the (two) edges that share endpoints with the new edge, a few other edges (edges in $M$ as well as shadow edges) in the vicinity of the new edge can be examined to find a potential augmenting path. This improves the approximation from 1/5.82 (by an algorithm in [35]) to 1/5.58.

5. Graph Distance

We consider the shortest-path distance in a graph. The shortest path between two vertices in a graph is the path that has the smallest number of edges (for an unweighted graph) or the smallest sum of the weights of the path edges (for a weighted graph). There may be more than one such shortest path.

A structure often used in approximating graph distance is the graph spanner [39, 11, 18]. An undirected graph $G = (V, E)$ induces a metric space $\mathcal{U}$ in which the vertex set $V$ serves as the set of points, and the shortest-path distances serve as the distances between the points. The graph spanner $G' = (V, H)$, $H \subseteq E$, is a sparse skeleton of the graph $G$ whose induced metric space $\mathcal{U}'$ is a close approximation of the metric space $\mathcal{U}$ of the graph $G$. That is, the distance between two vertices in $G'$ is not far from the distance between the same two vertices in $G$.
For example, a subgraph $G' = (V, H)$, $H \subseteq E$, is a (multiplicative) $t$-spanner of the graph $G$ if, for every pair of vertices $u, v \in V$, $dist_{G'}(u, v) \le t \cdot dist_G(u, v)$ (where $dist_G(u, v)$ stands for the distance between the vertices $u$ and $v$ in the graph $G$). The stretch factor of a spanner is the parameter (or parameters) that determines how closely the spanner approximates the distances in the original graph, e.g., the parameter $t$ in the case of a $t$-spanner.

Clearly, if a spanner can be constructed for a massive graph, one can approximate the node distances in the graph using the spanner. Because the spanner is much smaller than the original graph, it can often be stored in main memory. In fact, an early application of spanners is to maintain a succinct representation of routing information [39, 11]. Instead of the original network graph, spanners are passed to and stored by the routers for calculating the routing paths. Besides distances, the diameter of a graph can be approximated using the spanner diameter.

In [22], Feigenbaum et al. gave a simple streaming algorithm for spanner construction by adapting the technique of [4]. It displays a certain connection between the girth of a graph and the spanner. (The girth of a graph is the length of the shortest cycle in the graph.) However, in the worst case, the algorithm needs more than $O(n)$ time to process an edge. Such a processing time is prohibitively high for the streaming model.

For an unweighted graph, the algorithm of [22] constructs a $(\log n / \log\log n)$-spanner $S$ in one pass. Because a graph whose girth is larger than $k$ has at most $\lceil n^{1 + 2/(k-2)} \rceil$ edges [7, 17, 2], the algorithm constructs $S$ by adding an edge in the stream to $S$ if the edge does not cause a cycle of length less than $\log n / \log\log n$ in the $S$ constructed so far. Otherwise, the edge is ignored.
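A minimal sketch of this edge-by-edge rule is given below, with the cycle-length threshold left as a parameter `k` rather than fixed to log n / log log n. Adding an edge (u, v) closes a cycle of length dist(u, v) + 1 in the current spanner, so the edge is skipped exactly when u and v are already within k - 2 hops. The bounded breadth-first search used for that test is an assumption of this sketch, not necessarily how [22] performs the check, and its cost is consistent with the high per-edge processing time noted above.

```python
from collections import deque


def within_distance(adj, src, dst, limit):
    """Bounded BFS: is there a path from src to dst with at most `limit` edges?"""
    if src == dst:
        return True
    seen = {src}
    frontier = deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist >= limit:
            continue
        for nxt in adj.get(node, ()):
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return False


def stream_spanner(edge_stream, k):
    """Keep an edge unless it would close a cycle of length less than k."""
    adj, kept = {}, []
    for u, v in edge_stream:
        if within_distance(adj, u, v, k - 2):
            continue  # ignored: a short path between u and v already exists
        kept.append((u, v))
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return kept


# Complete graph on 4 nodes, streamed with the edges at node 0 first.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
no_triangles = stream_spanner(k4, k=4)
```

With `k = 4` every later edge would close a triangle through node 0, so only the three edges at node 0 are kept.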
Note that for each ignored edge, there is a path $P$ of length at most $\log n / \log\log n$ in $S$ that connects the two endpoints of this edge. Any shortest path in the original graph that uses this edge can be replaced by a path in $S$ that uses $P$. Therefore, $S$ is a $(\log n / \log\log n)$-spanner of the original graph.

For a weighted graph, however, the construction in [4] requires sorting the edges according to their weights, which is difficult in the streaming model. Instead of sorting, a geometric grouping technique is used in [22] to extend the spanner construction for unweighted graphs to a construction for weighted graphs. This technique is similar to the one used in [12]. Let $\omega_{min}$ be the minimum weight and $\omega_{max}$ be the maximum weight. We divide the range $[\omega_{min}, \omega_{max}]$ into intervals of the form $[(1+\epsilon)^i \omega_{min}, (1+\epsilon)^{i+1} \omega_{min})$ and round all the weights in the interval $[(1+\epsilon)^i \omega_{min}, (1+\epsilon)^{i+1} \omega_{min})$ down to $(1+\epsilon)^i \omega_{min}$. For each induced graph $G_i = (V, E_i)$, where $E_i$ is the set of edges in $E$ whose weight is in the interval $[(1+\epsilon)^i \omega_{min}, (1+\epsilon)^{i+1} \omega_{min})$, a spanner can be constructed in parallel using the above construction for unweighted graphs. The union of the spanners for all the $G_i$, $i \in \{0, 1, \ldots, \log_{(1+\epsilon)} \frac{\omega_{max}}{\omega_{min}} - 1\}$, forms a spanner for the graph $G$.

Note that this can be done without prior knowledge of $\omega_{min}$ and $\omega_{max}$. The goal is to break the range $[\omega_{min}, \omega_{max}]$ into a small number of intervals. Given any value $\omega \in [\omega_{min}, \omega_{max}]$, we can use the set of intervals of the form $[(1+\epsilon)^i \omega, (1+\epsilon)^{i+1} \omega)$ and $\left[\frac{\omega}{(1+\epsilon)^{i+1}}, \frac{\omega}{(1+\epsilon)^i}\right)$. Therefore, we can determine the intervals without prior knowledge of $\omega_{min}$ and $\omega_{max}$.

5.1 Distance Approximation using Multiple Passes

Elkin and Zhang gave a multiple-pass streaming spanner construction in [21]. This algorithm builds an additive spanner. A subgraph $G' = (V, H)$
