Feldman, M. Algorithms for Big Data (World Scientific, 2020)
ALGORITHMS FOR BIG DATA

Moran Feldman
The Open University of Israel, Israel

World Scientific: New Jersey, London, Singapore, Beijing, Shanghai, Hong Kong, Taipei, Chennai, Tokyo

Published by World Scientific Publishing Co. Pte. Ltd., Toh Tuck Link, Singapore 596224. USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601. UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE.

Library of Congress Cataloging-in-Publication Data
Names: Feldman, Moran, author.
Title: Algorithms for big data / Moran Feldman, The Open University of Israel, Israel.
Description: New Jersey : World Scientific, 2020.
Identifiers: LCCN 2020011810 | ISBN 9789811204739 (hardcover) | ISBN 9789811204746 (ebook for institutions) | ISBN 9789811204753 (ebook for individuals)
Subjects: LCSH: Algorithms.
Classification: LCC QA9.58 F45 2020 | DDC 005.701/5181 dc23
LC record available at https://lccn.loc.gov/2020011810

British Library Cataloguing-in-Publication Data: A catalogue record for this book is available from the British Library.

Copyright © 2020 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the publisher. For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher. For any available supplementary material, please visit https://www.worldscientific.com/worldscibooks/10.1142/11398#t=suppl

Desk Editors: Anthony Alexander/Steven Patt. Typeset by Stallion Press (enquiries@stallionpress.com). Printed in Singapore.

Preface

The emergence of the Internet has allowed people, for the first time, to access huge amounts of data. Think, for example, of the graph of friendships in the social network Facebook and the graph of links between Internet websites. Both these graphs contain more than one billion nodes, and thus, represent huge datasets. To use these datasets, they must be processed and analyzed. However, their mere size makes such processing very challenging. In particular, classical algorithms and techniques, that were developed to handle datasets of a more moderate size, often require unreasonable amounts of time and space when faced with such large datasets. Moreover, in some cases it is not even feasible to store the entire dataset, and thus, one has to process the parts of the dataset as they arrive and discard each part shortly afterwards.

The above challenges have motivated the development of new tools and techniques adequate for handling and processing "big data" (very large amounts of data). In this book, we take a theoretical computer science view on this work. In particular, we will study computational models that aim to capture the challenges raised by computing over "big data" and the properties of practical solutions developed to answer these challenges. We will get to know each
one of these computational models by surveying a few classic algorithmic results, including many state-of-the-art results.

This book was designed with two contradicting objectives in mind, which are as follows: (i) on the one hand, we try to give a wide overview of the work done in theoretical computer science in the context of "big data", and (ii) on the other hand, we strive to do so with sufficient detail to allow the reader to participate in research work on the topics covered. While we did our best to meet both goals, we had to compromise in some aspects. In particular, we had to omit some important "big data" subjects such as dimension reduction and compressed sensing. To make the book accessible to a broader population, we also omitted some classical algorithmic results that involve tedious calculations or very advanced mathematics. In most cases, the important aspects of these results can be demonstrated by other, more accessible, results.

About the Author

Moran Feldman is a faculty member at the Computer Science Department of the University of Haifa. He obtained his B.A. and M.Sc. degrees from the Open University of Israel, and his Ph.D. from the Technion. Additionally, he spent time, as an intern, post-doctoral fellow and a faculty member, in Yahoo! Research, Google, Microsoft Research, EPFL and the Open University of Israel. Moran was a fellow of the Alon Scholarship and the Google European Fellowship in Market Algorithms. He was also awarded the Cisco Prize, the Rothblum Award and the SIAM Outstanding Paper Prize. Moran's main research interests lie in the theory of algorithms. Many of his works are in the fields of submodular optimization, streaming algorithms and online computation.

Contents

Preface  v
About the Author  vii

Part I: Data Stream Algorithms
Chapter 1. Introduction to Data Stream Algorithms
Chapter 2. Basic Probability and Tail Bounds  15
Chapter 3. Estimation Algorithms  51
Chapter 4. Reservoir Sampling  73
Chapter 5. Pairwise Independent Hashing  93
Chapter 6. Counting Distinct Tokens  109
Chapter 7. Sketches  133
Chapter 8. Graph Data Stream Algorithms  165
Chapter 9. The Sliding Window Model  197

[Chapter 18: Locality-Sensitive Hashing (excerpt, pp. 433-441)]

The OR-construction is very similar to an AND-construction, with the sole difference being that now a function g ∈ G corresponding to r functions f1, f2, ..., fr ∈ F outputs the same range item for two elements u and v if and only if at least one of the functions f1, f2, ..., fr outputs the same range item for both elements.

Exercise 8. Prove that the r-OR-construction G is (d1, d2, 1 − (1 − p1)^r, 1 − (1 − p2)^r)-sensitive.

As promised, the OR-construction increases the probability of any pair of elements to be mapped to the same range item. However, this increase is more prominent for close elements, as is mathematically demonstrated by the fact that (1 − p1)^r/(1 − p2)^r is a decreasing function of r. Thus, the OR-construction can be used to counterbalance the effect of the AND-construction on close elements. To make the use of the AND-construction and the OR-construction easier to understand, we now
demonstrate it on a concrete example. Recall the locality-sensitive hash functions family FJ described in Section 18.2 for the Jaccard distance between sets. Exercise 5 showed that for a uniformly random function f ∈ FJ, the probability that f maps two sets S1 and S2 at distance d from each other to the same range item is 1 − d. Figure 18.3(a) depicts this linear relationship between the distance and the probability of being mapped to the same range item. According to Exercise 6, the family FJ is (1/5, 2/5, 4/5, 3/5)-sensitive. Hence, if we consider sets at distance less than 1/5 as "close" and sets at distance of more than 2/5 as "far", then close sets are more likely than far sets to be mapped to the same range item, but not by much.

Let us now use the AND-construction and OR-construction to increase the gap between the probabilities of close and far sets to be mapped to the same range item. First, let us denote by F′ the 20-AND-construction of FJ. Using Exercise 7, one can show that F′ is (1/5, 2/5, 0.0115, 0.0000366)-sensitive (verify this!). Hence, the probability of close sets to be mapped to the same range item is now more than 300 times as large as the corresponding probability for far sets. However, even quite close sets are not very likely to be mapped to the same range item by F′, as is demonstrated in Figure 18.3(b), which depicts the probability of a pair of sets to be mapped to the same range item by F′ as a function of the distance between the sets. To fix that, we consider the 400-OR-construction of F′. Let us denote this 400-OR-construction by F′′. Using Exercise 8, one can show that F′′ is (1/5, 2/5, 0.99, 0.015)-sensitive (verify this also!), which means that close sets have a probability of 0.99 to be mapped by F′′ to the same range item, and for far sets this probability drops to as low as 0.015. A graphical demonstration of the nice properties of F′′ is given by Figure 18.3(c), which depicts the probability of a pair of sets to be mapped to the same range item by a random function from F′′ as a function of the distance between the sets. One can note that the shape of the graph in Figure 18.3(c) resembles the ideal shape described in Figure 18.1.

[Figure 18.3: The probability of two sets to be mapped to the same range item, as a function of the distance between the sets, by a random hash function from (a) FJ, (b) F′ and (c) F′′.]
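These sensitivity figures are pure arithmetic and are easy to check numerically. The short Python sketch below (our addition, not part of the book's text) evaluates the collision probability 1 − (1 − p^r1)^r2 of an r1-AND-construction followed by an r2-OR-construction, and reproduces the four numbers quoted above for F′ and F′′:

```python
# Sketch: collision probability after an r_and-AND followed by an r_or-OR
# construction, when a single base hash function collides with probability p.
def and_or_probability(p, r_and, r_or):
    p_and = p ** r_and                 # AND: all r_and base functions must agree
    return 1 - (1 - p_and) ** r_or     # OR: at least one of the r_or tuples agrees

# Close sets (distance <= 1/5): base collision probability p1 = 4/5.
print(and_or_probability(0.8, 20, 1))    # ~0.0115 (third parameter of F')
print(and_or_probability(0.8, 20, 400))  # ~0.99   (third parameter of F'')

# Far sets (distance >= 2/5): base collision probability p2 = 3/5.
print(and_or_probability(0.6, 20, 1))    # ~0.0000366 (fourth parameter of F')
print(and_or_probability(0.6, 20, 400))  # ~0.015     (fourth parameter of F'')
```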
Exercise 9. The following is a natural Map-Reduce procedure that uses a locality-sensitive hash functions family F to find pairs of input elements that are suspected to be close to each other.

(1) A central machine draws a uniformly random function f from F, and this function is distributed to all the input machines.
(2) Every input machine applies f to the element e it got, and then forwards e to the machine named f(e).
(3) Every non-input machine named M gets all the elements mapped by f to M. All these elements are reported as suspected to be close to each other (because they are all mapped by f to the same range item M).

The following parts of the exercise discuss some implementation details for the above procedure.

(a) Discuss the best way to distribute the random function f drawn by the central machine to all the input machines.
(b) The above procedure assumes that two outputs of f are considered equal if and only if they are identical. Unfortunately, this is not true for the OR-construction. Suggest a way to make the procedure work also for a hash functions family F obtained via an OR-construction.

18.4 Bibliographic Notes

The notion of the locality-sensitive hash functions family, and the formal way to quantify them using (d1, d2, p1, p2)-sensitivity, was first suggested by Indyk and Motwani (1998) and Gionis et al. (1999). The first of these works also noted that the hash functions family described in Section 18.2 for the Jaccard distance is locality sensitive. This hash functions family is often referred to as Min-Hashing, and it was first suggested by Broder et al. (1997, 1998). The locality-sensitive hash functions family described above for angular distance was suggested by Charikar (2002). The same work also suggested such a family for another common distance measure known as the earth mover distance. More information about locality-sensitive hashing, including the AND-construction and OR-construction described in Section 18.3, can be found in Leskovec et al. (2014).

A. Z. Broder, M. Charikar, A. M. Frieze and M. Mitzenmacher. Min-wise Independent Permutations. In Proceedings of the 30th ACM Symposium on Theory of Computing (STOC), 327-336, 1998.
A. Z. Broder, S. C. Glassman, M. S. Manasse and G. Zweig. Syntactic Clustering of the Web. Computer Networks, 29(8-13): 1157-1166, 1997.
M. S. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In Proceedings of the 34th ACM Symposium on Theory of Computing (STOC), 380-388, 2002.
A. Gionis, P. Indyk and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB), 518-529, 1999.
P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the 30th ACM Symposium on Theory of Computing (STOC), 604-613, 1998.
J. Leskovec, A. Rajaraman and J. D. Ullman. Finding Similar Items. In Mining of Massive Datasets, 73-130, 2014.

Exercise Solutions

Solution 1. Let c be a uniformly random value from the range [0, 10). Then the definition of the family F implies

Pr[f(x1) = f(x2)] = Pr[fc(x1) = fc(x2)] = Pr[⌊(x1 − c)/10⌋ = ⌊(x2 − c)/10⌋].

To understand the event ⌊(x1 − c)/10⌋ = ⌊(x2 − c)/10⌋, let us assume that the real line is partitioned into disjoint ranges (10i, 10(i + 1)] for every integer i. Given this partition, the last event can be interpreted as the event that x1 − c and x2 − c end up in the same range. If |x1 − x2| ≥ 10, then this can never happen because the distance between x1 − c and x2 − c is |x1 − x2| and the length of each range is 10. Thus, the case of |x1 − x2| < 10 remains to be considered. In this case, the event ⌊(x1 − c)/10⌋ = ⌊(x2 − c)/10⌋ happens if and only if the location of x1 − c within the range that includes it is at distance of at least |x1 − x2| from the end of the range. Note that the distribution of c guarantees that the distance of x1 − c from the end of the range including it is a uniformly random number from the range (0, 10], and thus, the probability that it is at least |x1 − x2| is given by (10 − |x1 − x2|)/10 = 1 − |x1 − x2|/10.
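To make the family analyzed in Solution 1 concrete, here is a small Python sketch (our reconstruction from the solution text; the exercise statement itself is not part of this excerpt) of the shifted bucket hash fc(x) = ⌊(x − c)/10⌋, together with an empirical check of the collision probability 1 − |x1 − x2|/10:

```python
import math
import random

# Sketch of the family from Solution 1: buckets of length 10 on the real
# line, shifted by a uniformly random offset c drawn from [0, 10).
def sample_f():
    c = random.uniform(0.0, 10.0)
    return lambda x: math.floor((x - c) / 10.0)

# Empirical check: for |x1 - x2| < 10 the collision probability
# should be close to 1 - |x1 - x2| / 10.
x1, x2 = 3.0, 7.0          # distance 4, so the prediction is 0.6
trials, hits = 100_000, 0
for _ in range(trials):
    f = sample_f()
    if f(x1) == f(x2):
        hits += 1
print(hits / trials)       # ~0.6
```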
Solution 2. Observe that fi(x) = fi(y) if and only if the vectors x and y agree on their i-th coordinate. Thus, when fi is drawn uniformly at random from FH (which implies that i is drawn uniformly at random from the integers between 1 and n), we get

Pr[fi(x) = fi(y)] = (# of coordinates in which x and y agree)/n = (n − distH(x, y))/n = 1 − distH(x, y)/n.

Solution 3. Figure 18.4 depicts the vectors x and y and two regions, one around each one of these vectors, where the region around each vector z includes all the vectors whose angle with respect to z is at most 90°. We denote the region around x by N(x) and the region around y by N(y). One can note that fz(x) = fz(y) if and only if the vector z is in both these regions or in neither of them. Thus,

Pr[fz(x) = fz(y)] = ({angular size of N(x) ∩ N(y)} + {angular size of R² \ (N(x) ∪ N(y))})/360°.

Let us now relate the two angular sizes in the last equality to distθ(x, y). Since the angle between the vectors x and y is distθ(x, y), the angular size of the intersection between N(x) and N(y) is 180° − distθ(x, y). Using the inclusion and exclusion principle, this implies

{angular size of R² \ (N(x) ∪ N(y))} = {angular size of R²} − {angular size of N(x)} − {angular size of N(y)} + {angular size of N(x) ∩ N(y)} = 360° − 180° − 180° + (180° − distθ(x, y)) = 180° − distθ(x, y).

Combining all the above equations, we get

Pr[fz(x) = fz(y)] = ([180° − distθ(x, y)] + [180° − distθ(x, y)])/360° = 1 − distθ(x, y)/180°.

[Figure 18.4: Vectors x and y in R². Around the vector x there is a region marked with dots that includes all the vectors whose angle with respect to x is at most 90°. Similarly, around y there is a region marked with lines that includes all the vectors whose angle with respect to y is at most 90°.]
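Solution 3 also lends itself to a quick simulation. The sketch below (ours; the concrete vectors in R² are illustrative assumptions) draws a uniformly random direction z and hashes a vector x to whether its angle with z is at most 90°, which is equivalent to the inner product of z and x being non-negative:

```python
import math
import random

# Sketch of the angular-distance family from Solution 3 (vectors in R^2):
# f_z(x) records whether the angle between z and x is at most 90 degrees,
# i.e., whether the inner product <z, x> is non-negative.
def sample_f():
    a = random.uniform(0.0, 2.0 * math.pi)
    z = (math.cos(a), math.sin(a))     # uniformly random direction in R^2
    return lambda x: z[0] * x[0] + z[1] * x[1] >= 0.0

x = (1.0, 0.0)
y = (math.cos(math.radians(60)), math.sin(math.radians(60)))  # 60 deg from x
trials, hits = 100_000, 0
for _ in range(trials):
    f = sample_f()
    if f(x) == f(y):
        hits += 1
print(hits / trials)       # ~ 1 - 60/180 = 0.667
```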
Solution 4. Note that fe(S1) = fe(S2) if and only if e belongs to both these sets or to neither one of them. Thus,

Pr[fe(S1) = fe(S2)] = Pr[e ∈ S1 ∩ S2] + Pr[e ∉ S1 ∪ S2] = |S1 ∩ S2|/|N| + (1 − |S1 ∪ S2|/|N|) = 1 − (|S1 ∪ S2|/|N|) · distJ(S1, S2).

To see that the last equality holds, plug in the definition of distJ.

Solution 5. Recall that fπ(S1) is the first element of S1 according to the permutation π, and fπ(S2) is the first element of S2 according to this permutation. Thus, if fπ(S1) = fπ(S2), then the element fπ(S1) is an element of S1 ∩ S2 that appears before every other element of S1 ∪ S2 in π. This proves one direction of the first part of the exercise. To prove the other direction, we need to show that if the element e that is the first element of S1 ∪ S2 according to π belongs to S1 ∩ S2, then fπ(S1) = fπ(S2). Hence, assume that this is the case, and note that this implies, in particular, that e is an element of S1 that appears in π before any other element of S1. Thus, fπ(S1) = e. Similarly, we also get fπ(S2) = e, and consequently, fπ(S1) = e = fπ(S2).

The second part of the exercise remains to be solved. Since π is a uniformly random permutation of N in this part of the exercise, a symmetry argument shows that the first element of S1 ∪ S2 according to π (formally given by fπ(S1 ∪ S2)) is a uniformly random element of S1 ∪ S2. Thus,

Pr[fπ(S1) = fπ(S2)] = Pr[fπ(S1 ∪ S2) ∈ S1 ∩ S2] = |S1 ∩ S2|/|S1 ∪ S2| = 1 − distJ(S1, S2).

Solution 6. Recall that by Exercise 5, for two sets S1 and S2 at Jaccard distance d from each other it holds that Pr[f(S1) = f(S2)] = 1 − d for a random hash function f from FJ. Thus, for d ≤ d1 = 1/5, we get Pr[f(S1) = f(S2)] = 1 − d ≥ 1 − 1/5 = 4/5 = p1. Similarly, for d ≥ d2 = 2/5, we get Pr[f(S1) = f(S2)] = 1 − d ≤ 1 − 2/5 = 3/5 = p2.

Solution 7. Consider a pair e1 and e2 of elements at distance at most d1. Since F is (d1, d2, p1, p2)-sensitive, Pr[f(e1) = f(e2)] ≥ p1 for a uniformly random function f ∈ F. Consider now a uniformly random function g ∈ G. Since G contains a function for every choice of r (not necessarily distinct) functions from F, the uniformly random choice of g implies that it is associated with r uniformly random functions f1, f2, ..., fr from F. Thus,

Pr[g(e1) = g(e2)] = Pr[fi(e1) = fi(e2) for every 1 ≤ i ≤ r] = ∏(i = 1 to r) Pr[fi(e1) = fi(e2)] ≥ ∏(i = 1 to r) p1 = p1^r.    (18.2)

It remains to be proved that Pr[g(e1) = g(e2)] ≤ p2^r when e1 and e2 are two elements at distance at least d2 and g is a uniformly random function from G. However, the proof of this inequality is very similar to the proof of Inequality (18.2), and is, thus, omitted.

Solution 8. The solution of this exercise is very similar to the solution of Exercise 7. However, for the sake of completeness, we repeat the following necessary arguments. Consider a pair e1 and e2 of elements at distance at most d1. Since F is (d1, d2, p1, p2)-sensitive, Pr[f(e1) = f(e2)] ≥ p1 for a uniformly random function f ∈ F. Consider now a uniformly random function g ∈ G. Since G contains a function for every choice of r (not necessarily distinct) functions from F, the random choice of g implies that it is associated with r uniformly random functions f1, f2, ..., fr from F. Thus,

Pr[g(e1) = g(e2)] = Pr[fi(e1) = fi(e2) for some 1 ≤ i ≤ r] = 1 − Pr[fi(e1) ≠ fi(e2) for every 1 ≤ i ≤ r] = 1 − ∏(i = 1 to r) (1 − Pr[fi(e1) = fi(e2)]) ≥ 1 − ∏(i = 1 to r) (1 − p1) = 1 − (1 − p1)^r.    (18.3)

It remains to be proved that Pr[g(e1) = g(e2)] ≤ 1 − (1 − p2)^r when e1 and e2 are two elements at distance at least d2 and g is a uniformly random function from G. However, the proof of this inequality is very similar to the proof of Inequality (18.3), and is, thus, omitted.
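The next sketch (ours; it assumes a small integer ground set N for illustration) implements Min-Hashing as the base family of Solutions 5 and 6 and composes it with the AND- and OR-constructions analyzed in Solutions 7 and 8. Note the point that Exercise 9(b) turns on: for the OR-construction, two outputs are considered equal when they agree on at least one coordinate, not when the tuples are identical.

```python
import random

# Min-Hashing: a random permutation pi of the ground set N maps a set s
# to the first element of s according to pi (Solutions 5 and 6).
def sample_min_hash(ground_set):
    pi = list(ground_set)
    random.shuffle(pi)
    rank = {e: i for i, e in enumerate(pi)}
    return lambda s: min(s, key=rank.__getitem__)

# r-AND-construction: output the r-tuple of base outputs; two elements
# collide iff the tuples are identical (Solution 7).
def sample_and(sample_base, r):
    fs = [sample_base() for _ in range(r)]
    return lambda s: tuple(f(s) for f in fs)

# r-OR-construction: the output is again an r-tuple, but two elements
# collide iff the tuples agree on at least one coordinate (Solution 8).
sample_or = sample_and   # identical output structure; only the equality test differs

def or_collision(t1, t2):
    return any(a == b for a, b in zip(t1, t2))

# The family F'' from the chapter: a 400-OR over a 20-AND over Min-Hashing.
N = range(100)
g = sample_or(lambda: sample_and(lambda: sample_min_hash(N), 20), 400)

S1, S2 = set(range(0, 60)), set(range(5, 65))   # Jaccard distance ~0.154 ("close")
print(or_collision(g(S1), g(S2)))               # True w.h.p. (close pairs collide
                                                # with probability at least 0.99)
```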
Solution 9.
(a) The most natural way to distribute the function f is via the following method, which uses two Map-Reduce iterations. In the first iteration, every input machine forwards its name to the pre-designated central machine. Then, in the second Map-Reduce iteration, the central machine forwards f to all the input machines. Unfortunately, using this natural method as is might result in large machine time and space complexities because it requires the central machine to store the names of all the input machines and send the function f to all of them.

To solve this issue, it is necessary to split the work of the central machine between multiple machines. More specifically, we will use a tree T of machines with k levels and d children for every internal node. The root of the tree is the central machine, and then k − 1 Map-Reduce iterations are used to forward f along the tree T to the leaf machines. Once the function f gets to the leaves, all the input machines forward their names to random leaves of T, and each leaf that gets the names of input machines responds by forwarding to these machines the function f.

Observe that, under the above suggested solution, an internal node of the tree needs to forward f only to its children in T, and thus, has a small machine time complexity as long as d is kept moderate. Additionally, every leaf of the tree gets the names of roughly n/d^(k−1) input machines, where n is the number of elements, and thus, has small machine time and space complexities when d^(k−1) is close to n. Combining these observations, we get that the suggested solution results in small machine time and space complexity whenever d^(k−1) = Θ(n) and d is small. While these requirements are somewhat contradictory, they can be made to hold together (even when we want d to be a constant) by setting k = Θ(log n).

The above paragraphs were quite informal in the way they described the suggested solution and analyzed it. For the interested readers, we note that a more formal study of a very similar technique was done in the solution of an exercise in Chapter 16.

(b) The procedure described by the exercise forwards every element e to the range item f(e) it is mapped to by f, and then detects that two elements are mapped to the same range item by noting that they end up on the same machine. As noted by the exercise, this works only when range items are considered equal exactly when they are identical, which is not the case in the OR-construction.

The range of a hash functions family obtained as an r-OR-construction consists of r-tuples, where two tuples are considered equal if they agree on some coordinate. Thus, detecting that two tuples are equal is equivalent to detecting that their values for some coordinate are identical. This suggests the following modification to the procedure described by the exercise. Instead of forwarding the element e mapped to an r-tuple (t1, t2, ..., tr) to a machine named (t1, t2, ..., tr), we forward it to the r machines named (i, ti) for every 1 ≤ i ≤ r. Then, every machine (i, t) that gets multiple elements can know that the tuples corresponding to all these elements had the value t in the i-th coordinate of their tuple, and thus, can declare all these elements as suspected to be close.

One drawback of this approach is that a pair of elements might be declared as suspected to be close multiple times if their tuples agree on multiple coordinates. If this is problematic, then one can solve it using the following trick. A machine (i, t) that detects that a pair of elements e1 and e2 might be close to each other should forward a message to a machine named (e1, e2). Then, in the next Map-Reduce iteration, each machine (e1, e2) that got one or more messages will report e1 and e2 as suspected to be close. This guarantees, at the cost of one additional Map-Reduce iteration, that every pair of elements is reported at most once as suspected to be close.
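To make the suggested modification concrete, here is a minimal sketch (ours; the function names and pairing scheme are illustrative assumptions, not the book's notation) of the resulting map and reduce steps:

```python
# Map step: an element e whose OR-construction output is the r-tuple
# (t_1, ..., t_r) is forwarded to the r machines named (i, t_i), so that
# two elements agreeing on any coordinate meet on at least one machine.
def map_step(e, g):
    return [((i, t_i), e) for i, t_i in enumerate(g(e))]

# Reduce step at machine (i, t): all received elements had value t in the
# i-th coordinate of their tuples, so every pair among them is reported
# as suspected to be close (elements are sorted so that each unordered
# pair is formed exactly once on this machine).
def reduce_step(machine_name, elements):
    elements = sorted(elements)
    return [(e1, e2) for k, e1 in enumerate(elements) for e2 in elements[k + 1:]]
```

As described in the solution, globally de-duplicating pairs that agree on several coordinates would take one further Map-Reduce iteration that routes every reported pair (e1, e2) to a machine named (e1, e2).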
Index

A: adjacency list, 267-308; adjacency matrix, 256, 267, 309-330; All-Prefix-Sums, 381-389, 396, 399-400; AMS Algorithm, 109-111; AND-construction, 432-433, 435, 438-439; angular distance, 428-429, 435-437; approximate median, 77-81; approximation: algorithm, 51, 170-173, 180-181, 186, 276, 280-281; ratio, 172-173, 176, 178, 186, 190, 280, 284-285; averaging technique, 61, 63, 183-184

B: Bernoulli trial, 30-31, 48; BFS (breadth-first search), 269, 292-294, 298; binary search, 246-249; binomial distribution, 30-31, 34-35, 38, 48, 63, 124, 228, 380, 394; Boolean function, 331-334, 337-341, 343, 345, 347-351; Boolean hypercube, 338-346, 349-351; bounded degree graph representation, 268, 288-289

C: Chebyshev's inequality, 34-35, 38, 48, 59-60, 66, 72, 112, 116, 126, 128-129, 150, 184; Chernoff bound, 35-38, 50, 63, 79, 90, 121, 124, 151, 161, 229, 242, 258, 271, 286, 306, 380, 394, 407; combination algorithm, 133, 138-139, 141-143, 145, 147, 153, 159, 161; communication round (local distributed algorithm), 276-278, 281, 284, 301-303; computer cluster, 355-357, 368, 370; conditional probability, 17, 41, 55, 86; conditional expectation, 25-26; connected components (counting), 267-272, 274, 291; connected components (other uses), 188, 237, 257, 273, 289-293; Count sketch, 145-152, 154, 161-164; Count-Median sketch, 144-146, 151, 153-154, 159-164; Count-Min sketch, 137-144, 151, 154, 157, 164

D: data stream: algorithm, 4-5, 9-11, 63, 73-74, 87, 113, 119-121, 130, 133, 135, 138, 157, 206-208, 212-216; graph algorithm (see also graph data stream algorithm); model (see also model (data stream): plain vanilla); plain vanilla model (see also model (data stream): plain vanilla); diameter (estimation), 230-236; discrete probability space, 15-16, 38-39; duplication testing, 240-244, 256

E: earth mover distance, 435; edge cover, 415; estimation algorithm, 51-52, 61, 64, 170; Euclidean distance, 231; events: disjoint, 18-23, 26-27, 40-42, 92, 96, 103, 106; independence, 18-20, 83, 85; expectation: conditional (see also conditional expectation); law of total expectation (see also law of total expectation); linearity, 22, 24, 28-31, 44-46, 59-62, 112, 125, 140, 147-148, 296; exponential histograms method, 216

F: false negative, 426-427; false positive, 426-427; filtering technique, 401-410, 415; forest, 166-168, 200-206, 217-218, 401-409, 416-420; frequency: words, 357-361, 363-368, 377-381; vector, 11, 133-135, 145, 150, 152, 157; frequent tokens (finding) (see also sketch), 5-9, 11, 136-137, 155-156

G: graph: 2-connected, 205-206, 218-220; bipartite, 166-169, 186, 310-325; connected, 169, 186, 188, 200-205, 231, 237, 239, 256, 267, 272-274, 288-295, 310, 316-319, 325, 401-404; graph data stream algorithm, 165-195; greedy algorithm, 170-173, 180, 186, 189-190, 192

H: half-plane testing, 249-256, 261-265; Hamming distance, 428, 436; hash: family, 93-107, 110-119, 122, 125-129, 138-149, 153, 157, 159-163, 425-435; function (see also hash: family); k-wise independent, 94-95, 98-102; locality sensitive (see also locality sensitive hash functions family); pairwise independent, 94-98, 100, 111-115, 126, 139-142, 145-146, 149; hypergraph (k-uniform), 180-181, 192-194

I: image (see also model (property testing): pixel model); impossibility result, 51, 119-123, 168-169, 181, 186-187, 227, 275, 325, 370; independence: random variables, 26-31, 35-36, 62-63, 79, 90, 121, 124, 147, 149-150, 161, 184, 242-243, 271, 286, 313; events (see also events: independence); k-wise, 20-21, 27; pairwise, 20-21, 27, 29, 112, 126; indexing, 387-389, 399-400; indicator, 29-31, 46, 62, 78-79, 89-90, 112, 121, 124-129, 140, 147, 150-151, 157, 160-161, 228, 242, 258, 286, 305-306, 309, 335, 380, 394

J: Jaccard distance, 420-421, 433, 435; Jensen's inequality, 24-25, 57, 70

K: key-value pair, 358-363, 365, 370-371; Kruskal's algorithm, 273, 299, 403, 405, 409, 418

L: law of total probability, 19-20, 26, 41-42, 54, 68, 96, 117, 298, 407; law of total expectation, 26, 45, 183; linear function (testing), 332-338, 347-349; linear equation, 100; linear sketch, 152-154, 163-164; local distributed algorithm, 276-282, 295, 303-304; local computation algorithm, 281-285, 295; locality sensitive hash functions family, 425-435 (amplifying, 431-435); locality sensitive hashing (see also locality sensitive hash functions family)

M: machine space complexity, 365-366, 373, 381, 385-389, 396, 400-402, 409; machine time complexity, 367, 372, 381, 385-388, 395, 400, 409, 423, 440; map (step), 358-363; map procedure, 358-363, 365, 370; Map-Reduce: algorithm, 357-358, 364-368, 377-381, 383-389, 401-402, 405-410, 413-416; framework, 356-362, 366, 368-370, 389, 401, 416; model, 357, 361-365, 367, 370, 377, 382, 415, 422; round, 361-362, 391; Markov's inequality, 32-36, 47-48, 56, 112, 116, 126, 140, 158, 243, 319; massively parallel computation (MPC) model, 368-370, 373-374; matching: maximal matching, 171, 415; maximum cardinality matching, 170-171, 186, 208; maximum weight matching, 169-181, 186, 415; median technique, 61-64, 113, 119, 123-124, 142, 154, 185; Min-Hashing (see also Jaccard distance); minimum cut, 415; minimum weight spanning tree, 51, 272-275, 282-283, 295, 301-302, 401-410, 415; Misra-Gries algorithm, 11; model (data stream): cash register, 134-138, 155-157; plain vanilla, 4-9, 11, 51, 133-136, 166, 206-207, 212, 214-215; sliding window, 197-200, 204; strict turnstile, 135-138, 140-144, 151, 155, 157; turnstile, 134-137, 144-147, 151-152, 155, 164; model (property testing): Boolean function, 331-351; bounded degree graph, 288-295, 306-307; dense graph, 309-330; list model, 240-249, 259-261; pixel model, 249-251; monotone function (testing), 338-346, 349-351; Morris's Algorithm, 51-65

N: non-monotone edge, 339-344, 350-351; norm: ℓ1, 135-138, 140-145, 151-152; ℓ2, 150-152

O: one-sided error algorithm, 244, 248-249, 258; OR-construction, 432-435, 439-440

P: parallel algorithm (see also Map-Reduce: algorithm), 355-358; parallelizable, 356; passes (multiple), 5-14, 63-64, 136-137, 156, 186; pixel model (see also model (property testing): pixel model); probabilistic method, 122, 337; property testing algorithm, 237-265, 288-295, 306-307, 309-351; pseudometric, 231-232, 235

Q: quantile, 77-81; query complexity, 233-236, 240, 244, 248-249, 251, 254-255, 257, 261-262, 264, 324, 338, 346, 351; queue, 217-218

R: random function, 93-94, 97; random variable: Bernoulli, 30-31, 183; binomial (see also binomial distribution); indicator (see also indicator); independence (see also independence: random variables); numerical, 22-24, 26-29; reduce (step), 358, 360-362; reduce procedure, 360-362, 365, 371; reducer, 359-360, 362-363, 371; relative error, 56-57, 60-62, 72, 117-119, 122-123, 184-185, 212-215, 230-231, 272, 298; reservoir sampling (algorithm) (see also sampling)

S: sampling: uniform, 73-74, 85-86, 91, 182, 194-195; uniform with replacement, 74-75, 78, 81, 84; uniform without replacement, 75-76, 86-89; weighted, 81-84, 91-92; semi-streaming algorithm, 169, 171, 175, 186, 189, 192-193, 204, 218; shuffle, 358-362, 366; sketch, 133-154; Count (see also Count sketch); Count-Median (see also Count-Median sketch); Count-Min (see also Count-Min sketch); linear (see also linear sketch); smooth histograms method, 208-216, 220-224; smooth function, 208, 212-215, 220; sorted list testing, 245-249, 259-261; spanning tree (minimum weight) (see also minimum weight spanning tree); stack, 173-181, 190-193; standard deviation, 34; star, 232, 236, 413; streaming algorithm, 9, 51, 77, 84, 109, 111, 115, 135-136, 156, 168-169, 170; sublinear time algorithm, 227-228, 230-234, 237, 267, 272, 276, 285, 288, 295, 297, 309, 332

T: tail bound, 15, 31-38; tie breaking rule, 260-261, 404; token: counting distinct, 109-131; processing time, 10, 13-14, 63, 67-68, 202; total space complexity, 366, 373, 385-388, 390, 400, 408-409, 415, 423-424; trailing zeros, 106, 110-118; triangle: counting, 181-187, 410, 416; listing, 410-416, 421; two-sided error algorithm, 244

U: union bound, 21-22, 50, 80, 91, 113, 117, 121, 126, 130, 158, 229, 243, 271, 275, 287, 304, 306, 312-313, 317, 323, 337, 349, 380, 388, 395, 408; universal family, 93-94, 102; update event, 133-139, 142, 145-146, 155-156

V: variance, 27-34, 48, 57-60, 63, 66-67, 70, 128-129, 148, 183-184; vertex coloring, 281-282, 303-304; vertex cover, 276-288, 295, 415

W: window: active, 198-208, 210-219, 221-222, 224; algorithm, 198-201, 203-208, 213-214; length, 198-199, 204; model (see also model (data stream): sliding window); work, 367-368, 373, 385-389, 391, 399-400, 409-410, 415, 421

Ngày đăng: 14/03/2022, 15:10

Mục lục

    Part I: Data Stream Algorithms

    Chapter 1. Introduction to Data Stream Algorithms

    1.1 The Data Stream Model

    1.2 Evaluating Data Stream Algorithms

    Chapter 2. Basic Probability and Tail Bounds

    2.3 Indicators and the Binomial Distribution

    3.1 Morris’s Algorithm for Estimating the Length of the Stream

    4.2 Approximate Median and Quantiles

    5.1 Pairwise Hash Functions Families

    5.2 Simple Construction of a Pairwise Independent Hash Family
