Data-Centric Systems and Applications

Series editors: M.J. Carey, S. Ceri

Editorial Board: A. Ailamaki, S. Babu, P. Bernstein, J.C. Freytag, A. Halevy, J. Han, D. Kossmann, I. Manolescu, G. Weikum, K.-Y. Whang, J.X. Yu

More information about this series at http://www.springer.com/series/5258

Minos Garofalakis, Johannes Gehrke, Rajeev Rastogi (Editors)
Data Stream Management: Processing High-Speed Data Streams

Editors:
Minos Garofalakis, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece
Johannes Gehrke, Microsoft Corporation, Redmond, WA, USA
Rajeev Rastogi, Amazon India, Bangalore, India

ISSN 2197-9723, ISSN 2197-974X (electronic)
Data-Centric Systems and Applications
ISBN 978-3-540-28607-3, ISBN 978-3-540-28608-0 (eBook)
DOI 10.1007/978-3-540-28608-0
Library of Congress Control Number: 2016946344

Springer Heidelberg New York Dordrecht London
© Springer-Verlag Berlin Heidelberg 2016
The fourth chapter is published in part with kind permission of © 2004 Association for Computing Machinery, Inc. All rights reserved.

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper. Springer is part of Springer Science+Business Media (www.springer.com)

Contents

Data Stream Management: A Brave New World (Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi)

Part I: Foundations and Basic Stream Synopses
Data-Stream Sampling: Basic Techniques and Results (Peter J. Haas) 13
Quantiles and Equi-depth Histograms over Streams (Michael B. Greenwald and Sanjeev Khanna) 45
Join Sizes, Frequency Moments, and Applications (Graham Cormode and Minos Garofalakis) 87
Top-k Frequent Item Maintenance over Streams (Moses Charikar) 103
Distinct-Values Estimation over Data Streams (Phillip B. Gibbons) 121
The Sliding-Window Computation Model and Results (Mayur Datar and Rajeev Motwani) 149

Part II: Mining Data Streams
Clustering Data Streams (Sudipto Guha and Nina Mishra) 169
Mining Decision Trees from Streams (Geoff Hulten and Pedro Domingos) 189
Frequent Itemset Mining over Data Streams (Gurmeet Singh Manku) 209
Temporal Dynamics of On-Line Information Streams (Jon Kleinberg) 221

Part III: Advanced Topics
Sketch-Based Multi-Query Processing over Data Streams (Alin Dobra, Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi) 241
Approximate Histogram and Wavelet Summaries of Streaming Data (S. Muthukrishnan and Martin Strauss) 263
Stable Distributions in Streaming Computations (Graham Cormode and Piotr Indyk) 283
Tracking Queries over Distributed Streams (Minos Garofalakis) 301
Part IV: System Architectures and Languages
STREAM: The Stanford Data Stream Management System (Arvind Arasu, Brian Babcock, Shivnath Babu, John Cieslewicz, Mayur Datar, Keith Ito, Rajeev Motwani, Utkarsh Srivastava, and Jennifer Widom) 317
The Aurora and Borealis Stream Processing Engines (Uğur Çetintemel, Daniel Abadi, Yanif Ahmad, Hari Balakrishnan, Magdalena Balazinska, Mitch Cherniack, Jeong-Hyon Hwang, Samuel Madden, Anurag Maskey, Alexander Rasin, Esther Ryvkina, Mike Stonebraker, Nesime Tatbul, Ying Xing, and Stan Zdonik) 337
Extending Relational Query Languages for Data Streams (N. Laptev, B. Mozafari, H. Mousavi, H. Thakkar, H. Wang, K. Zeng, and Carlo Zaniolo) 361
Hancock: A Language for Analyzing Transactional Data Streams (Corinna Cortes, Kathleen Fisher, Daryl Pregibon, Anne Rogers, and Frederick Smith) 387
Sensor Network Integration with Streaming Database Systems (Daniel Abadi, Samuel Madden, and Wolfgang Lindner) 409

Part V: Applications
Stream Processing Techniques for Network Management (Charles D. Cranor, Theodore Johnson, and Oliver Spatscheck) 431
High-Performance XML Message Brokering (Yanlei Diao and Michael J. Franklin) 451
Fast Methods for Statistical Arbitrage (Eleftherios Soulas and Dennis Shasha) 473
Adaptive, Automatic Stream Mining (Spiros Papadimitriou, Anthony Brockwell, and Christos Faloutsos) 499
Conclusions and Looking Forward (Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi) 529

Data Stream Management: A Brave New World

Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

M. Garofalakis, School of Electrical and Computer Engineering, Technical University of Crete, University Campus—Kounoupidiana, Chania 73100, Greece, e-mail: minos@softnet.tuc.gr
J. Gehrke, Microsoft Corporation, One Microsoft Way, Redmond, WA 98052-6399, USA, e-mail: johannes@microsoft.com
R. Rastogi, Amazon India, Brigade Gateway, Malleshwaram (W), Bangalore 560055, India, e-mail: rastogi@amazon.com

Introduction

Traditional data-management systems software is built on the concept of persistent data sets that are stored reliably in stable storage and queried/updated several times throughout their lifetime. For several emerging application domains, however, data arrives and needs to be processed on a continuous (24 × 7) basis, without the benefit of several passes over a static, persistent data image. Such continuous data streams arise naturally, for example, in the network installations of large Telecom and Internet service providers, where detailed usage information (Call-Detail Records (CDRs), SNMP/RMON packet-flow data, etc.) from different parts of the underlying network needs to be continuously collected and analyzed for interesting trends. Other applications that generate rapid, continuous and large volumes of stream data include transactions in retail chains, ATM and credit card operations in banks, financial tickers, Web server log records, etc. In most such applications, the data stream is actually accumulated and archived in a database-management system of a (perhaps, off-site) data warehouse, often making access to the archived data prohibitively expensive. Further, the ability to make decisions and infer interesting patterns on-line (i.e., as the data stream arrives) is crucial for several mission-critical tasks that can have significant dollar value for a large corporation (e.g., telecom fraud detection). As a result, recent years have witnessed an increasing interest in designing data-processing algorithms that work over continuous data streams, i.e., algorithms that provide results to user queries while looking at the relevant data items only once and in a fixed order (determined by the stream-arrival pattern).

Fig. 1 ISP network monitoring data streams

Example (Application: ISP Network Monitoring) To effectively manage the operation of their IP-network services, large Internet Service Providers (ISPs), like AT&T and Sprint, continuously monitor the operation of their networking infrastructure at dedicated Network Operations Centers (NOCs). This is truly a large-scale monitoring task that relies on continuously collecting streams of usage information from hundreds of routers, thousands of links and interfaces, and blisteringly-fast sets of events at different layers of the network infrastructure (ranging from fiber-cable utilizations to packet forwarding at routers, to VPNs and higher-level transport constructs). These data streams can be generated through a variety of network-monitoring tools (e.g., Cisco's NetFlow [10] or AT&T's GigaScope probe [5] for monitoring IP-packet flows). For instance, Fig. 1 depicts an example ISP monitoring setup, with an NOC tracking NetFlow measurement streams from four edge routers R1–R4 in the network. The figure also depicts a small fragment of the streaming data tables retrieved from routers R1 and R2, containing simple summary information for IP sessions. In real life, such streams are truly massive, comprising hundreds of attributes and billions of records—for instance, AT&T collects over one terabyte of NetFlow measurement data from its production network each day!
Typically, this measurement data is periodically shipped off to a backend data warehouse for off-line analysis (e.g., at the end of the day). Unfortunately, such off-line analyses are painfully inadequate when it comes to critical network-management tasks, where reaction in (near) real-time is absolutely essential. Such tasks include, for instance, detecting malicious/fraudulent users, DDoS attacks, or Service-Level Agreement (SLA) violations, as well as real-time traffic engineering to avoid congestion and improve the utilization of critical network resources.

Adaptive, Automatic Stream Mining

…at the largest scales (from 10 and on, or windows > 1024). In other words, the most interesting activity is at time scales larger than about half an hour, which is indeed the case.

Conclusions

Sensor networks are becoming increasingly popular, thanks to falling prices and increasing storage and processing power. We presented AWSOM, which achieves all of the following goals:

Concise patterns. AWSOM provides linear models with few coefficients; it can detect arbitrary periodic components, it gives information across several frequencies, and it can diagnose self-similarity and long-range dependence.

Streaming framework. We can update patterns in an "any-time" fashion, with one pass over the data, in time independent of stream size and using O(lg N) space (where N is the length of the sequence so far). Furthermore, AWSOM can do forecasting (directly, from the estimated model).

Unsupervised operation. Once we decide the largest AWSOM order, no further intervention is needed; the sensor can be left alone to collect information.

We showed real and synthetic data where our method captures the periodicities and burstiness, while manually selected AR (or even (S)ARIMA generalizations, which are not suitable for streams with limited resources) fails completely. AWSOM is an important first step toward hands-off stream mining, combining simplicity with modeling power. Continuous queries are useful for evidence gathering and hypothesis testing once we know what we are looking for. AWSOM is the first method to deal directly with the problem of unsupervised stream mining and pattern detection and fill this gap.

Acknowledgements We thank Becky Buchheit for her help with the automobile traffic datasets and Mike Bigrigg for the temperature sensor data.

Appendix A: Auto-Regressive Modeling

In their simplest form, auto-regressive models of order p, or AR(p), express Xt as a linear combination of previous values, i.e.,

Xt = φ1 Xt−1 + · · · + φp Xt−p + εt

or, more concisely, φ(L)Xt = εt, where L is the lag operator and φ(L) is a polynomial defined on this operator:

LXt ≡ Xt−1,  φ(L) = 1 − φ1 L − φ2 L^2 − · · · − φp L^p,

and εt is a white noise process, i.e.,

E[εt] = 0  and  Cov[εt, εt−k] = σ^2 if k = 0, and 0 otherwise.

Using least squares, we can estimate σ^2 from the sum of squared residuals (SSR). This is used as a measure of estimation error; when generating "future" points, εt is set to E[εt] ≡ 0.

The next step up is auto-regressive moving average models. An ARMA(p, q) model expresses the values Xt as

φ(L)Xt = θ(L)εt,  where θ(L) = 1 − θ1 L − · · · − θq L^q.

Estimating the moving average coefficients θi is fairly involved. State-of-the-art methods use maximum-likelihood (ML) algorithms, employing iterative methods for nonlinear optimization, whose computational complexity depends exponentially on q.

ARIMA(p, d, q) models are similar to ARMA(p, q) models, but operate on (1 − L)^d Xt, i.e., the dth order backward difference of Xt:

φ(L)(1 − L)^d Xt = θ(L)εt.

Finally, SARIMA(p, d, q)×(P, D, Q)T models are used to deal with seasonalities, where

φ(L) Φ(L^T) (1 − L)^d (1 − L^T)^D Xt = θ(L) Θ(L^T) εt

and where the seasonal difference polynomials,

Φ(L^T) = 1 − Φ1 L^T − Φ2 L^{2T} − · · · − ΦP L^{PT},
Θ(L^T) = 1 − Θ1 L^T − Θ2 L^{2T} − · · · − ΘQ L^{QT},

are similar to φ(L) and θ(L) but operate on lags that are multiples of a fixed period T. The value of T is yet another parameter that either needs to be estimated or set based on prior knowledge about the series Xt.
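The least-squares AR fit described above is straightforward to prototype. The following minimal sketch (not from the chapter; plain NumPy with hypothetical function names) estimates AR(p) coefficients from a sample and produces a one-step forecast with εt replaced by its mean of zero.

```python
import numpy as np

def fit_ar(x, p):
    """Estimate AR(p) coefficients phi_1..phi_p by ordinary least squares.

    Builds the regression X_t ~ X_{t-1}, ..., X_{t-p} over the sample x
    and returns (phi, sigma2), with sigma2 estimated from the SSR.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Design matrix: row for time t holds the p previous values of x.
    X = np.column_stack([x[p - k - 1:n - k - 1] for k in range(p)])
    y = x[p:]
    phi, ssr, _, _ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = (ssr[0] if ssr.size else np.sum((y - X @ phi) ** 2)) / len(y)
    return phi, sigma2

def forecast_one_step(x, phi):
    """One-step-ahead forecast: eps_t is set to its mean E[eps_t] = 0."""
    p = len(phi)
    recent = np.asarray(x[-p:], dtype=float)[::-1]  # X_{t-1}, ..., X_{t-p}
    return float(phi @ recent)

# Example: a noisy AR(2) series.
rng = np.random.default_rng(0)
series = [0.0, 0.0]
for _ in range(500):
    series.append(0.6 * series[-1] - 0.3 * series[-2] + rng.normal(scale=0.1))
phi, sigma2 = fit_ar(series, p=2)
print(phi, sigma2, forecast_one_step(series, phi))
```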
Appendix B: More Wavelet Properties

Frequency Properties

Wavelet filters employed in practice can only approximate an ideal bandpass filter, since they are of finite length L. The practical implication is that wavelet coefficients at level l correspond roughly to the frequency range [1/2^(l+1), 1/2^l], or, equivalently, to periods in [2^l, 2^(l+1)] (see Fig. 12 for the actual correspondence). This has to be taken into account for precise interpretation of AWSOM models by an expert.

Fig. 12 Illustration of Haar and Daubechies-6 cascade gain (levels 3–5). The horizontal axis is frequency and the curves show how much of each frequency is "represented" at each wavelet level. As expected, D-6 filters (used in all experiments) have better band-pass properties.

Wavelet Variance and Self-Similarity

The wavelet variance decomposes the variance of a sequence across scales. Due to space limitations, we mention basic definitions and facts; details can be found in [29].

Definition (Wavelet Variance) If {Wl,t} is the DWT of a series {Xt}, then the wavelet variance Vl is defined as Vl = Var[Wl,t].

Under certain general conditions, V̂l = (2^l/N) Σ_{t=1}^{N/2^l} W_{l,t}^2 is an unbiased estimator of Vl. Note that the sum is precisely the energy of {Xt} at scale l.

Definition (Self-Similar Sequence) A sequence {Xt} is said to be self-similar following a pure power-law process if SX(f) ∝ |f|^α, where −1 < α < 0 and SX(f) is the SDF.10

It can be shown that Vl ≈ ∫_{1/2^(l+1)}^{1/2^l} SX(f) df; thus, if {Xt} is self-similar, then log Vl ∝ l, i.e., the plot of log Vl versus the level l should be linear. In fact, the slope of the log-power versus scale plot should be approximately equal to the exponent α. This fact and how to estimate Vl are what the reader needs to keep in mind.

10 The spectral density function (SDF) is the Fourier transform of the auto-covariance sequence (ACVS) SX,k ≡ Cov[Xt, Xt−k]. Intuitively, it decomposes the variance into frequencies.

References 1 M Akay (ed.), Time Frequency and Wavelets in Biomedical Signal Processing (Wiley, New York, 1997) 2 A Arasu, B Babcock, S Babu, J McAlister, J Widom, Characterizing memory requirements for queries over continuous data streams, in PODS (2002) 3 B Babcock, C Olston, Distributed top-k monitoring, in Proc SIGMOD (2003) 4 J Beran, Statistics for Long-Memory Processes (Chapman & Hall, London, 1994) 5 T Bollerslev, Generalized autoregressive conditional heteroskedasticity J Econom 31, 307–327 (1986) 6 P Bonnet, J.E Gehrke, P Seshadri, Towards sensor database systems, in Proc MDM (2001) 7 P.J Brockwell, R.A Davis, Time Series: Theory and Methods, 2nd edn Springer Series in Statistics (Springer, Berlin, 1991) 8 A Bulut, A.K Singh, SWAT: hierarchical stream summarization in large networks, in Proc 19th ICDE (2003) 9 L.R Carley, G.R Ganger, D Nagle, Mems-based integrated-circuit mass-storage systems Commun ACM 43(11), 72–80 (2000) 10 D Carney, U Cetintemel, M Cherniack, C Convey, S Lee, G Seidman, M Stonebraker, N Tatbul, S.B
Zdonik, Monitoring streams—a new class of data management applications, in Proc VLDB (2002) 11 Y Chen, G Dong, J Han, B.W Wah, J Wang, Multi-dimensional regression analysis of timeseries data streams, in Proc VLDB (2002) 12 J Considine, F Li, G Kollios, J.W Byers, Approximate aggregation techniques for sensor databases, in Proc ICDE (2004) 13 G Das, K.-I Lin, H Mannila, G Renganathan, P Smyth, Rule discovery from time series, in Proc KDD (1998) 14 M Datar, A Gionis, P Indyk, R Motwani, Maintaining stream statistics over sliding windows, in Proc SODA (2002) 15 M.H DeGroot, M.J Schervish, Probability and Statistics, 3rd edn (Addison-Wesley, Reading, 2002) 16 A Dobra, M.N Garofalakis, J Gehrke, R Rastogi, Processing complex aggregate queries over data streams, in Proc SIGMOD (2002) 17 C Faloutsos, Searching Multimedia Databases by Content (Kluwer Academic, Norwell, 1996) 18 M.N Garofalakis, P.B Gibbons, Wavelet synopses with error guarantees, in Proc SIGMOD (2002) 19 J Gehrke, F Korn, D Srivastava, On computing correlated aggregates over continual data streams, in Proc SIGMOD (2001) 20 R Gencay, F Selcuk, B Whitcher, An Introduction to Wavelets and Other Filtering Methods in Finance and Economics (Academic Press, San Diego, 2001) 21 A.C Gilbert, Y Kotidis, S Muthukrishnan, M Strauss, Surfing wavelets on streams: one-pass summaries for approximate aggregate queries, in Proc VLDB (2001) 22 S Guha, N Koudas, Approximating a data stream for querying and estimation: algorithms and performance evaluation, in Proc ICDE (2002) 23 J Hill, R Szewczyk, A Woo, S Hollar, D Culler, K Pister, System architecture directions for networked sensors, in Proc ASPLOS-IX (2000) 24 P Indyk, N Koudas, S Muthukrishnan, Identifying representative trends in massive time series data sets using sketches, in Proc VLDB (2000) 25 W Leland, M Taqqu, W Willinger, D Wilson, On the self-similar nature of Ethernet traffic IEEE Trans Netw 2(1), 1–15 (1994) 26 S.R Madden, M.A Shah, J.M Hellerstein, V Raman, Continuously adaptive continuous queries over streams, in SIGMOD Conf (2002) 27 C Olston, J Jiang, J Widom, Adaptive filters for continuous queries over distributed data streams, in Proc SIGMOD (2003) 28 T Palpanas, M Vlachos, E.J Keogh, D Gunopulos, W Truppel, Online amnesic approximation of streaming time series, in Proc ICDE (2004) 29 D.B Percival, A.T Walden, Wavelet Methods for Time Series Analysis (Cambridge University Press, Cambridge, 2000) 30 E Riedel, C Faloutsos, G.R Ganger, D Nagle, Data mining on an OLTP system (nearly) for free, in SIGMOD Conf (2000) 31 Y Tao, C Faloutsos, D Papadias, B Liu, Prediction and indexing of moving objects with unknown motion patterns, in Proc SIGMOD (2004) 32 A.S Weigend, N.A Gerschenfeld, Time Series Prediction: Forecasting the Future and Understanding the Past (Addison-Wesley, Reading, 1994) 33 B.-K Yi, N Sidiropoulos, T Johnson, H Jagadish, C Faloutsos, A Biliris, Online data mining for co-evolving time sequences, in Proc ICDE (2000) 34 P Young, Recursive Estimation and Time-Series Analysis: An Introduction (Springer, Berlin, 1984) 528 S Papadimitriou et al 35 D Zhang, D Gunopulos, V.J Tsotras, B Seeger, Temporal aggregation over data streams using multiple granularities, in Proc EDBT (2002) 36 Y Zhu, D Shasha, Statstream: statistical monitoring of thousands of data streams in real time, in Proc VLDB (2002) 37 Y Zhu, D Shasha, Efficient elastic burst detection in data streams, in Proc KDD (2003) 38 R Zuidwijk, P de Zeeuw, Fast algorithm for directional time-scale analysis using 
wavelets, in Proc SPIE, Wavelet Applications in Signal and Image Processing VI, vol 3458 (1998)

Conclusions and Looking Forward

Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

Today, data streaming is part of the mainstream, and several data streaming products are now publicly available. Data streaming algorithms are powering complex event processing, predictive analytics, and big data applications in the cloud. In this final chapter, we provide an overview of current data streaming products, and applications of data streaming to cloud computing, anomaly detection and predictive modeling. We also identify future research directions for mining and doing predictive analytics on data streams, especially in a distributed environment.

Data Streaming Products

Given the need for processing data streams in manufacturing, financial trading, logistics, telecom, health monitoring, and web analytics applications, a number of the leading software vendors have launched commercial data streaming products. We describe three prominent products below.

TIBCO StreamBase is a high-performance system for rapidly building applications that analyze and act on real-time streaming data. StreamBase's EventFlow language represents stream processing flows and operators as graphical elements—operators can be placed on a canvas and connected with arrows that represent the flow of data. Furthermore, StreamBase's StreamSQL extends the standard SQL querying model and relational operators to also perform processing on continuous data streams. EventFlow and StreamSQL both extend the semantics of standard SQL by adding rich windowing constructs and stream-specific operators. Windows are definable over time or over the number of messages, and define the scope of a multi-message operator such as an aggregate or a join. EventFlow and StreamSQL operators provide the capability to filter streams; to merge, combine, and correlate multiple streams; and to run time-window-based aggregations and computations on real-time streams. In addition, developers can easily extend the set of operators available to either EventFlow or StreamSQL modules by writing their own operators.
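To make the windowing idea concrete, here is a minimal, generic sketch (plain Python, not StreamBase EventFlow or StreamSQL code; the (timestamp, key, value) record layout is an invented example) of a time-based sliding-window aggregate, the kind of multi-message operator that such windows scope.

```python
from collections import defaultdict, deque

def windowed_sum(stream, window_seconds):
    """Emit, for each arriving record, the per-key sum over the last
    `window_seconds` of the stream.  Records are (timestamp, key, value)
    tuples assumed to arrive in timestamp order (one pass, bounded state)."""
    window = deque()               # records currently inside the window
    sums = defaultdict(float)      # running per-key aggregates
    for ts, key, value in stream:
        window.append((ts, key, value))
        sums[key] += value
        # Expire records that have fallen out of the time window.
        while window and window[0][0] <= ts - window_seconds:
            _, old_key, old_value = window.popleft()
            sums[old_key] -= old_value
        yield ts, dict(sums)

# Example: per-link byte counts over a 60-second sliding window.
records = [(0, "linkA", 500), (10, "linkB", 200), (65, "linkA", 100)]
for ts, totals in windowed_sum(records, window_seconds=60):
    print(ts, totals)
```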
IBM System S provides a programming model and an execution platform for user-developed applications that ingest, filter, analyze, and correlate potentially massive volumes of continuous data streams. In System S, users create applications in the form of dataflow graphs consisting of analytics operators interconnected by streams. System S provides a toolkit of type-generic built-in stream processing operators, which include all basic stream-relational operators, as well as a number of plumbing operators (such as stream splitting, demultiplexing, etc.). It also provides an extensible operator framework, which supports the addition of new type-generic and configurable operators.

Microsoft StreamInsight implements a lightweight streaming architecture that supports highly parallel execution of continuous queries over high-speed event data. Developers can write their applications using Microsoft's .NET languages such as Visual C#, leveraging the advanced language platform LINQ (Language Integrated Query) as an embedded query language. By using LINQ, developers can write SQL-like queries in a declarative fashion that process and correlate data from multiple streams into meaningful results. The optimizer and scheduler of the StreamInsight server in turn ensure optimal query performance. Incoming events are continuously streamed into standing queries in the StreamInsight server, which processes and transforms the data according to the logic defined in each query. The query result at the output can then be used to trigger specific actions. Each event consists of the following parts: (i) an event header that contains metadata defining the event kind (interval vs. point events) and one or more timestamps that define the time interval for the event, and (ii) the payload, a .NET data structure that contains the data associated with the event. StreamInsight provides the following query functionality: (i) a filter operation to express Boolean predicates over the event payload and discard events that do not satisfy the predicates; (ii) a grouping operation that partitions the incoming stream based on event properties such as location or ID and then applies other operations or complete query fragments to each group separately; (iii) hopping and sliding windows to define windows over event streams—a sliding window contains, at each point in time, the events within the last X time units; (iv) built-in aggregations for sum, count, min, max, and average that typically operate on time windows; (v) a TopK operation to identify heavy hitters in an event stream; (vi) a powerful join operation that matches events from two sources if their times overlap and executes the join predicate specified on the payload fields; (vii) a union operation that multiplexes several input streams into a single output stream; and (viii) user-defined functions.

Data Streaming in the Cloud

Many internet applications such as web site statistics monitoring and analytics, intrusion detection systems and spam filters have real-time processing requirements. To provide real-time analytics functionality to such applications, a number of systems for processing data streams in the cloud have been developed recently. We describe a few of these below.

Spark Streaming extends Apache Spark for doing large-scale stream processing. It chops up the live stream into batches of X seconds. Each batch of data is treated as a Resilient Distributed Dataset (RDD) and processed using RDD operations such as map, count, join, etc. Finally, the processed results of the RDD operations are returned in batches and can be persisted in HDFS. Thus, Spark represents a stream of data as a sequence of RDDs (referred to as a DStream) and applies transformations that modify data from one DStream to another using standard RDD operations. Spark Streaming also has support for defining windowed DStreams that gather together data over a sliding window. Each window has two parameters, a window length and a sliding interval. At each time t, the function for time t is applied to the union of all RDDs of the parent DStream between times t and (t − window length). One can use Spark Streaming to get hashtags from a Twitter stream, count the number of hashtags in each window, or join incoming tweets with a file containing spam keywords to filter out bad tweets.
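As a rough illustration of the DStream API just described, the sketch below uses Spark's Python DStream interface to count hashtags over a 60-second window sliding every 10 seconds. A plain socket text stream stands in for a real Twitter source, and the host, port, and checkpoint path are made-up example values; exact parameter names may vary slightly across Spark versions.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="HashtagCounts")
ssc = StreamingContext(sc, batchDuration=1)   # 1-second micro-batches
ssc.checkpoint("/tmp/hashtag-checkpoint")     # needed for windowed state

# Stand-in source: lines of text arriving on a socket (e.g., one tweet per line).
lines = ssc.socketTextStream("localhost", 9999)

# Extract hashtags and count them over a 60s window that slides every 10s.
hashtags = lines.flatMap(lambda line: [w for w in line.split() if w.startswith("#")])
counts = hashtags.map(lambda tag: (tag, 1)).reduceByKeyAndWindow(
    lambda a, b: a + b,      # add counts entering the window
    lambda a, b: a - b,      # subtract counts leaving the window
    windowDuration=60,
    slideDuration=10,
)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```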
Amazon Kinesis enables users to collect and process large streams of data records in real time. A producer puts data records into Amazon Kinesis streams; for example, a web server sending log data to an Amazon Kinesis stream is a producer. Each data record consists of a sequence number, a partition key, and a data blob that is not interpreted by Kinesis. The data records in a stream are distributed into multiple shards, using the partition key associated with each data record to determine which shard a given data record belongs to. Specifically, an MD5 hash function is used to map partition keys to 128-bit integer values and to map associated data records to shards. When a stream is created, the number of shards for the stream is specified by the application. Each consumer reads data records from a particular shard.

Apache Kafka is a high-throughput publish-subscribe messaging system. Kafka maintains feeds of messages in categories called topics. Producers publish messages to a topic, and these are delivered to consumers who have subscribed to the topic. Each topic consists of multiple partitions, and producers can specify which partition each message within a topic is assigned to. Consumers label themselves with a consumer group name, and each message published to a topic is delivered to one consumer instance within each subscribing consumer group. Kafka provides both ordering guarantees and load balancing over a pool of consumer processes by assigning partitions in the topic to consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. Kafka provides a total order over messages only within a partition, not between different partitions in a topic. Use cases for Apache Kafka include messaging, website activity tracking (page views, searches), and stream processing.

Apache Storm makes it easy to reliably process unbounded streams of data. The basic primitives Storm provides for doing stream transformations are "spouts" and "bolts". A spout is a source of streams; for example, a spout may connect to the Twitter API and emit a stream of tweets. A bolt consumes any number of input streams, does some processing, and possibly emits new streams. Bolts can do anything from running functions and filtering tuples to streaming aggregations and streaming joins; for example, a bolt may perform complex stream transformations like computing a stream of trending topics from a stream of tweets. Networks of spouts and bolts are packaged into a "topology" to do real-time computation. A topology is a graph of stream transformations where each node is a spout or bolt. Links between nodes in the topology indicate how tuples should be passed around between nodes; for example, when a spout or bolt emits a tuple, it sends the tuple to every bolt to which it has an outgoing edge. Thus, a Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation as needed.
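Both Kinesis shards and Kafka partitions route each record by hashing its key. The sketch below is not the AWS or Kafka implementation; it only mirrors the MD5-to-128-bit-integer mapping described above for Kinesis, under the assumption that the hash space is split into equal contiguous ranges, one per shard.

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index by hashing it to a 128-bit
    integer (via MD5) and locating that integer in one of `num_shards`
    equal, contiguous ranges of the 128-bit hash space."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = (1 << 128) // num_shards
    return min(h // range_size, num_shards - 1)

# Records with the same key always land in the same shard,
# which preserves per-key ordering within that shard.
for key in ["user-17", "user-42", "user-17"]:
    print(key, "-> shard", shard_for_key(key, num_shards=4))
```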
Complex Event Processing

Complex event processing (CEP) techniques discover complex events by analyzing patterns and correlating other events. For instance, a CEP application may analyze tweet events in a Twitter stream to detect complex events such as an earthquake or a plane accident. Another CEP application in the manufacturing domain may generate an alert when a measurement exceeds a predefined threshold of time, temperature, or other value. CEP applications include algorithmic stock trading, network anomaly detection, credit-card fraud detection, demand spike detection in retail, and security monitoring.

CEP applications may require complex pattern matching over high-speed data streams. For instance, network intrusion detection systems (NIDS) like Snort perform deep packet inspection on an IP traffic stream to check if the packet payload (and header) matches a given set of signatures of well-known security threats (e.g., viruses, worms). The signature patterns are frequently specified using general regular expressions, because this is a more expressive pattern language compared to strings. Thus, matching thousands of regular expressions on data streams is a critical capability in network security applications. Current matching algorithms employ finite automata-based representations to detect regular expression patterns.

Another example in the retail space is detection of spikes in product demand. Such spikes can occur because of different types of events; e.g., a snowstorm can trigger the sale of snow shovels. Similarly, there may be a jump in the sale of textbooks when students go back to school. One strategy to detect sudden demand spikes is to maintain a sliding window and compute the mean demand and standard deviation over the window. If the current observed demand deviates from the mean by more than two standard deviations, then we can generate a demand spike event.

Numenta's Grok application employs an innovative approach to detect anomalies in numeric streaming data. It continuously trains a machine learning model on a sliding window of the most recently observed data and uses the model to predict the next most likely value based on the sliding window of previous values. Specifically, the model returns the probability that the next value in the data stream is the observed value. If multiple contiguous stream values have low occurrence probabilities, then Grok generates an anomaly event. Grok is being used to detect anomalies in streaming metrics (e.g., CPU utilization, disk writes) from server clusters.
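The two-standard-deviations strategy for demand spikes described above is easy to prototype. Here is a minimal sketch (not from the chapter; the window size and threshold are arbitrary example values) using a fixed-size sliding window of recent observations.

```python
import statistics
from collections import deque

def detect_spikes(demands, window_size=48, num_std=2.0):
    """Yield (index, value) for observations that deviate from the
    sliding-window mean by more than `num_std` standard deviations."""
    window = deque(maxlen=window_size)
    for i, value in enumerate(demands):
        if len(window) == window_size:          # test only once the window is full
            mean = statistics.fmean(window)
            std = statistics.pstdev(window)
            if std > 0 and abs(value - mean) > num_std * std:
                yield i, value
        window.append(value)

# Example: steady demand around 100 units with one sudden spike.
series = [100, 98, 103, 101, 99, 102, 100, 250, 101, 99]
print(list(detect_spikes(series, window_size=5)))   # -> [(7, 250)]
```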
Big Data and Predictive Modeling

As businesses increasingly computerize and automate their operations, they have the ability to collect massive amounts of data and leverage the data to automate decision making at every step. Consequently, Big Data analytics and predictive modeling are ubiquitous today across a wide range of domains such as telecom, healthcare, internet, finance, retail, media, transportation and manufacturing, and are leading to: (i) Improved customer experience—online content providers and web search companies analyze historical customer actions (e.g., clicks, searches, browsed pages) to learn about customer preferences and then show relevant content and search results that match a user's interests; (ii) Reduced costs—in transportation and manufacturing, analysis of past equipment operations data enables proactive prediction of failures and servicing of equipment prior to failures; and (iii) Higher revenues—in online advertising and e-commerce, targeting customers with relevant ads and recommending relevant products can lead to higher ad clicks and product purchases.

Data streaming algorithms are critical for Big Data analytics and predictive modeling. Below, we list applications of data streaming algorithms in various predictive modeling steps:

• Data cleaning. Collected data is frequently dirty, with errors, outliers and missing values. For a numeric attribute, values greater than a certain threshold (e.g., the 98th percentile, or the mean plus two times the standard deviation) or smaller than a certain threshold (e.g., the 2nd percentile, or the mean minus two times the standard deviation) can be considered outliers. Similarly, for a categorical attribute, values that occur very frequently or infrequently can be regarded as outliers. Single-pass algorithms for computing a wide variety of data statistics, such as percentiles, mean and standard deviation for numeric attributes, and value frequencies for categorical attributes, can help to detect outliers and prune them, thus ensuring high data quality.

• Data statistics and visualization. Univariate data statistics provide insights into attribute and target data distributions—these include different percentiles, mean and standard deviation for numeric attributes, and cardinality and top frequent values/keywords for categorical and text attributes. Text attributes, in particular, may contain millions of keywords, making it difficult to store counts for individual keywords in main memory. These analyses require scalable algorithms for constructing histograms over numeric attribute values, and for computing attribute value frequencies and cardinality. Bivariate data statistics shed light on the degree of correlation between attribute and target values. Commonly used correlation metrics include Pearson's correlation coefficient for numeric attributes, and information gain or mutual information for categorical and text attributes. While Pearson's coefficient can be computed efficiently in one pass over the data, information gain and mutual information estimation requires calculating attribute-target value pair frequencies using streaming techniques.

• Feature engineering. To obtain models with high predictive accuracy, raw data needs to be transformed into higher-level representations with predictive power. For instance, individual pixels within an image may not have much signal, but higher-level aggregations such as color histograms or shapes in the image may be a lot more predictive. A common transformation is numeric attribute binning, which can be carried out on large data sets using stream quantile computation algorithms. Another option is to construct interaction features over attribute pairs; e.g., in online advertising, an interaction feature involving user gender and ad category has more signal compared to the individual attributes themselves (since women are more likely to click on jewelry or beauty product ads while men are more likely to click on ads related to automotive parts). Here, min-wise hashing techniques can be used to efficiently compute counts for frequent attribute value pairs belonging to interaction features. More recently, deep learning has shown promise for unsupervised learning of higher-level features in speech recognition, computer vision and natural language processing applications. Efficient algorithms for learning deep neural networks on large datasets are an active area of research.

• Feature selection. Noisy and redundant features can cause trained models to overfit the data, thus adversely impacting their accuracy. In order to prune such features, we need scalable algorithms for (i) computing correlations between attributes and the target using metrics such as Pearson's coefficient, information gain or mutual information, and selecting highly correlated features with predictive power, (ii) computing pairwise attribute correlations and dropping redundant features that are highly correlated with other features, and (iii) reducing data dimensionality using techniques such as PCA, SVD and matrix factorization.

• Model training. Online learning algorithms such as stochastic gradient descent (SGD) can be used to train linear and logistic regression models, and to perform tasks such as matrix factorization (a minimal SGD sketch follows this list). SGD updates model parameters with the gradient computed for each example, as opposed to the entire dataset as is done by traditional batch learning algorithms like BFGS. For clustering problems, proposed data streaming algorithms first independently cluster data partitions in an initial pass and subsequently cluster the centroids for each partition in a second pass. Note that a possible strategy for scaling learning algorithms is to train models on a random sample of the data. Random samples can be obtained in a single pass over the data using reservoir sampling. However, the random sampling approach does not consider the entire data and so could hurt model accuracy. Scalable training and inference algorithms for complex ML models such as decision trees, random forests, neural networks and graphical models are an active area of research.
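The following sketch (not from the chapter; the learning rate and feature layout are illustrative assumptions) shows the per-example SGD update mentioned in the Model training step above, for a linear regression model with squared loss, touching each training example exactly once as it streams by.

```python
import numpy as np

def sgd_linear_regression(examples, num_features, lr=0.01):
    """One pass of SGD with squared loss over a stream of (x, y) examples,
    where x is a feature vector of length `num_features` and y is the target.
    Parameters are updated per example instead of per full-dataset batch."""
    w = np.zeros(num_features)
    b = 0.0
    for x, y in examples:
        x = np.asarray(x, dtype=float)
        error = (w @ x + b) - y          # prediction error for this example
        w -= lr * error * x              # gradient step on the weights
        b -= lr * error                  # gradient step on the bias
    return w, b

# Example stream: y = 2*x0 - 3*x1 + 1 plus a little noise.
rng = np.random.default_rng(1)
stream = []
for _ in range(5000):
    x = rng.normal(size=2)
    y = 2.0 * x[0] - 3.0 * x[1] + 1.0 + rng.normal(scale=0.01)
    stream.append((x, y))
w, b = sgd_linear_regression(stream, num_features=2, lr=0.05)
print(w, b)   # should be close to [2, -3] and 1
```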
In addition to data streaming algorithms, scaling predictive modeling to large terabyte datasets also requires parallelizing the algorithms over large machine clusters. There are two main parallelization paradigms proposed in the literature.

• Loosely-coupled paradigm. In this paradigm, data is partitioned across machines, the algorithm is run locally on each machine, and then the data summaries/synopsis structures from the different machines are combined into a single summary/synopsis structure. Thus, the loosely-coupled paradigm fits well with Map-Reduce style computation, with reducers combining the synopses computed by the mappers. Note that multiple Map-Reduce iterations may be performed in this paradigm, with the synopsis computed at the end of each iteration used to initialize the synopsis at the beginning of the next iteration. As an example, consider k-means clustering. The synopsis contains the k cluster centroids. Each mapper assigns points in its data partition to the closest centroid and computes k new centroids for the new assignment. The mappers transmit the new cluster centroids along with the size of each cluster to the reducers, which combine the centroids to compute k new centroids using weighted averaging. The loosely-coupled paradigm is especially well-suited for parallelizing algorithms where synopses satisfy the composability property, that is, sketches computed on individual data partitions can be combined to yield a single synopsis for the entire dataset. Several synopsis structures, such as the Flajolet–Martin (FM) sketch and the Count-Min sketch, satisfy the composability property (see the sketch-merging example after this list). Algorithms that maintain such synopses can be naturally parallelized using Map-Reduce, with mappers computing synopses for each data partition and the reducer composing a single synopsis from the synopses computed by the mappers.

• Tightly-coupled paradigm. In many instances, synopses may not satisfy the composability property, thus making the loosely-coupled paradigm inapplicable for parallelization. For example, in online machine learning algorithms like SGD, parameter values computed for each data partition cannot be easily combined to yield the same parameter values had the algorithm been run on the full dataset. To handle such scenarios, tight coupling between the parallel instances is needed. In the tightly-coupled paradigm, the synopsis structure is replicated across multiple machines, and the synopsis replicas are kept synchronized using distributed protocols while the algorithm is running on each machine. A popular realization of the tightly-coupled paradigm for ML algorithms is through a centralized parameter server. The parameter server synchronizes the values of parameter replicas distributed across the machines using asynchronous messages. Specifically, each time a parameter value is updated on a machine, the change in the parameter value is propagated to all the replicas through the parameter server. Furthermore, in order to reduce communication, changes are only propagated if they exceed a certain threshold value.

Scaling ML algorithms through distributed implementations on machine clusters is currently an active area of research.
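To illustrate the composability property mentioned in the loosely-coupled paradigm above, here is a small, self-contained Count-Min sketch (a generic textbook-style implementation, not tied to any particular system; the width, depth and seed are example choices) in which per-partition sketches are merged simply by adding their counter arrays element-wise.

```python
import numpy as np

class CountMinSketch:
    """d x w array of counters with one hash row per depth level.
    All partitions must share the same width, depth and seeds so that
    their sketches can be merged by element-wise addition."""

    def __init__(self, width=2048, depth=5, seed=42):
        self.width, self.depth = width, depth
        rng = np.random.default_rng(seed)
        self.row_seeds = rng.integers(1, 2**31 - 1, size=depth)
        self.counts = np.zeros((depth, width), dtype=np.int64)

    def _buckets(self, item):
        # Note: Python's built-in hash is process-local; a real distributed
        # implementation would use a stable hash (e.g., MurmurHash).
        for row, s in enumerate(self.row_seeds):
            yield row, hash((int(s), item)) % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.counts[row, col] += count

    def estimate(self, item):
        # Overestimates only; take the minimum across rows.
        return min(self.counts[row, col] for row, col in self._buckets(item))

    def merge(self, other):
        # Composability: summing counters gives the sketch of the union.
        merged = CountMinSketch(self.width, self.depth)
        merged.row_seeds = self.row_seeds
        merged.counts = self.counts + other.counts
        return merged

# Two "mappers" sketch their own partitions; the "reducer" merges them.
part1, part2 = CountMinSketch(), CountMinSketch()
for item in ["a", "b", "a", "c"]:
    part1.add(item)
for item in ["a", "c", "c"]:
    part2.add(item)
combined = part1.merge(part2)
print(combined.estimate("a"), combined.estimate("c"))  # 3 and 3 (may overestimate)
```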
Popular systems for statistical and exploratory data analysis, such as R, load the entire dataset into main memory. Consequently, these systems are incapable of analyzing large terabyte datasets that do not fit in memory. To fill the gap, a number of systems for doing predictive analytics on big data have been developed in recent years. These systems rely on online algorithms and parallelization to different degrees in order to handle large datasets. We describe the salient characteristics of some prominent systems below.

Vowpal Wabbit (VW) supports training of a wide spectrum of ML models: linear models with a variety of loss functions like squared, logistic, hinge and quantile loss, matrix factorization models, and latent Dirichlet allocation (LDA) models. To scale to terabyte-size datasets, it uses the online SGD algorithm to learn model parameters. Furthermore, it employs a parallel implementation of SGD based on the loosely-coupled paradigm that combines parameter values computed on each data partition using simple averaging.

GraphLab provides a high-level programming interface for rapid development of distributed ML algorithms based on the tightly-coupled paradigm. Users specify dependencies among ML model parameters using a graph abstraction, the logic for updating parameter values based on those of neighbors, and rules for propagating parameter updates to neighbors in the graph. The GraphLab distributed infrastructure partitions parameters across machines to minimize communication, appropriately updates parameter values and transparently propagates parameter updates. GraphLab has a large collection of ML methods already implemented, such as clustering, matrix factorization and LDA.

Skytree includes many ML methods such as k-means clustering, linear regression, gradient boosted trees and random forests. Through a combination of new efficient ML algorithms and parallelization, Skytree is able to achieve significant speedups on massive datasets.

Azure ML supports a broad range of ML modeling methods including scalable boosted decision trees, deep neural networks, logistic and linear regression, and clustering using a variant of the standard k-means approach. The linear regression implementation is based on the online SGD algorithm, while the logistic regression implementation relies on the batch BFGS learning algorithm.

Apache Mahout offers a scalable machine learning library with support for k-means clustering, matrix factorization, LDA, logistic regression and random forests. Mahout implements a streaming k-means algorithm that processes data points one by one and makes only one pass through the data. The algorithm maintains a set of cluster centroids in memory, along with a distance threshold parameter. For a new data point, if its distance from the closest cluster centroid is less than the threshold, then the point is simply assigned to the closest cluster (and its centroid is updated). However, if the distance to the closest centroid is greater than the threshold, then the point starts a new cluster. Finally, if the number of clusters exceeds the memory size, then the distance threshold is increased and existing clusters are merged. The Mahout implementation of logistic regression and matrix factorization uses the online SGD algorithm.

Spark MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction. MLlib computes basic statistics such as mean and variance for numeric attributes, and correlations based on Pearson's coefficient and Pearson's chi-squared tests. For classification and regression, it supports linear models with a variety of loss functions such as squared, logistic and hinge with L1 and L2 regularization, decision trees, and naïve Bayes. MLlib supports SGD and L-BFGS optimization methods to compute parameter values for linear models that minimize the loss function value. MLlib also supports collaborative filtering using the alternating least squares (ALS) algorithm to learn latent factors, k-means clustering, and dimensionality reduction methods such as SVD and PCA.
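As a rough illustration of the threshold-based streaming k-means idea described for Mahout above (this is a simplified sketch, not Mahout's actual code; the threshold-growth factor and cluster cap are arbitrary choices), the following one-pass routine assigns each point to its nearest centroid if it is close enough, starts a new cluster otherwise, and grows the threshold and merges nearby clusters when too many accumulate.

```python
import math

def streaming_kmeans(points, threshold=1.0, max_clusters=100, growth=1.5):
    """One pass over `points` (tuples of floats).  Maintains [centroid, count]
    pairs in memory; when the number of clusters exceeds `max_clusters`,
    the distance threshold is increased and nearby clusters are merged."""

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def merge_close(clusters, threshold):
        merged = []
        for centroid, count in clusters:
            for m in merged:
                if dist(centroid, m[0]) < threshold:
                    total = m[1] + count
                    m[0] = tuple((a * m[1] + b * count) / total
                                 for a, b in zip(m[0], centroid))
                    m[1] = total
                    break
            else:
                merged.append([centroid, count])
        return merged

    clusters = []
    for p in points:
        best = min(clusters, key=lambda c: dist(p, c[0])) if clusters else None
        if best is not None and dist(p, best[0]) < threshold:
            # Assign to the nearest cluster and update its centroid incrementally.
            n = best[1] + 1
            best[0] = tuple((a * best[1] + b) / n for a, b in zip(best[0], p))
            best[1] = n
        else:
            clusters.append([tuple(p), 1])      # start a new cluster
            if len(clusters) > max_clusters:
                threshold *= growth             # grow threshold, merge clusters
                clusters = merge_close(clusters, threshold)
    return [(tuple(c), n) for c, n in clusters]

# Example: two well-separated groups of 2-D points.
pts = [(0.1, 0.0), (0.0, 0.2), (5.0, 5.1), (5.2, 4.9), (0.2, 0.1)]
for centroid, count in streaming_kmeans(pts, threshold=1.0, max_clusters=10):
    print(centroid, count)
```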