Data Where You Want It: Geo-Distribution of Big Data and Analytics

by Ted Dunning and Ellen Friedman

Copyright © 2017 Ted Dunning and Ellen Friedman. All rights reserved. Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Shannon Cutt
Copyeditor: Holly Bauer Forsyth
Interior Designer: David Futato
Cover Designer: Randy Comer

February 2017: First Edition

Revision History for the First Edition
2017-02-15: First Release

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Data Where You Want It, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-98354-6

Table of Contents

Why Geo-Distribution Matters
  Goals of Modern Geo-Distributed Systems
  Moving Data: Replication and Mirroring
  Clouds and Geo-distribution
  Global Data Governance
  Containers for Big Data
  Use Case: Geo-Replication in Telecom
  It's Actually Broken Even If It Works Most of the Time
  Use Case: Shared Inventory in Ad Tech
Additional Resources

Why Geo-Distribution Matters

"Data where you want it; compute where you need it."

Thirty years ago, if someone in North America or Europe mentioned they had accepted a position at a Tokyo-based firm, your first question likely would've been, "When are you relocating to Tokyo?" Today the question has become, "Are you planning to relocate?" Remote communication combined with frequent flights has left the question open in global business teams.

Just as we now think differently about how people work together, so too a shift is needed in how we build and use global data infrastructure in order to address modern business challenges. We need systems that allow data to live where it should. We should be able to think of data—on premise or in the cloud—as part of a global system. In short, our data infrastructure should give us data that can be accessed where, when, and by whom we want, and not by anyone else.

The idea of working with data centers that are geographically distant isn't new. There's an emerging need, however, among many big data organizations for globally distributed but still cohesive data that meets the challenges of consistency, low latency, ease of administration, and low cost, even at huge scale.

In the past, many people built organizations around a central headquarters plus regional teams that functioned independently, each with its own regional database, possibly reporting back to HQ monthly or weekly.
Data was copied and backed up at a remote location for disaster recovery, typically daily, particularly if the data was critical. But these hierarchical approaches aren't good enough anymore. With a global view via the Internet, people expect to touch and respond to business from anywhere, at any time. To really get this done, you need data to reside in many places, with low-latency coordination of updates, and still be robust against communication failures on the Internet. Data infrastructure often needs to be shared by many applications. We may need a data object to live in more than one place—that's geo-distributed data. This includes cloud computing, because cloud is really just one more "place," as suggested by Figure 1-1.

Figure 1-1. Emerging technologies address the need for data that can be shared and updated globally, at massive scale, with very low latency and high consistency. There's also a need for fine-grained control over who has access. Here, on-premise data centers are shown as rectangles that share data directly to distant locations or form a hybrid system with public or private cloud.

The challenges posed by the required scale and speed alone are substantial. For example, IoT sensor data systems commonly move data at a rate of tens of gigabits per second, and some exceed terabits per second. The need for truly geo-distributed data—both in terms of storage and computation—requires new approaches, and these new approaches are the topic of this report. These approaches have emerged bit by bit rather than all at once, but the accumulated change is now large enough to warrant a new look, even from experienced practitioners.

Previously, systems that needed data to be available globally would have explicitly copied data from place to place instead of using platform-level capabilities to automatically propagate changes. In practice, however, it pays to think of data as a global system in which the same data objects are shared across different locations. This geo-distribution of data combined with appropriate design patterns can make it much simpler to build applications and can result in more reliable, scalable systems that span the globe. Similarly, the management of computation in global systems has historically been very haphazard. This is improving with containers, which allow precise deployment of a precisely known distribution of code.

In this report, we describe the requirements and challenges of such a system as well as examples of specific technologies designed to meet them. We also include a collection of real-world use cases where low-latency geo-distribution of very large-scale data and computation provide a competitive edge.

Goals of Modern Geo-Distributed Systems

As we examine some of the emerging technologies that address the need for geo-distributed systems, keep these goals in mind. Many modern systems need to take into account the facts that:

• Data comes from many sources, in many forms, and from many locations.
• Sometimes data has to stay in a particular location—for example, to meet government regulations or to optimize performance.
• In many other cases, a global view is required—for example, for reporting or machine learning in which global models are built from much more than just local activity.
• Central data often needs to be distributed in order to be shared, as for inventory control, model deployment, accurate predictive maintenance for large-scale industrial systems, or for disaster recovery.
• Computation (and the associated data) sometimes needs to be near the edge, such as in industrial control systems, and simultaneously in a central analytics facility that has access to data from across the entire enterprise.
• To stay competitive in modern settings, data agility and a microservices approach may be required.

With these demands, how do new technologies meet the challenges?
Global View: Data Storage and Computation

One reason for geo-distribution of data is to put data at a remote site as part of a disaster recovery plan, but movement of data between active data centers is also key for efficient use of data in many situations. It's generally more efficient to access data locally, so storing the same data in multiple places and being able to replicate updates with low latency are valuable capabilities. One key is to be able to specify how and where data should end up without saying exactly which bits to move. We discuss new options for data movement in the next section of this report. Although local access is generally desirable, you should have the choice of accessing data remotely as well, and we discuss some of the ways that can be done more easily in "Global Data Governance".

Data movement is the most important issue in geo-distributed systems, but computation is a factor, too. This is especially true where very high volume and rate of data production occurs, as with IoT sensor data. Edge computing is becoming increasingly important in such situations. In the free report Data: Emerging Trends and Technologies, Alistair Croll refers to "…a renewed interest in computing at the edges—Cisco calls it 'fog computing…'"

An application that has been developed and tested at a central data center needs to be deployed to multiple locations. How can you do this efficiently and with confidence that it will perform as needed at the new locations? We address that topic in "Containers for Big Data".
Moving Data: Replication and Mirroring

Remote mirroring and geo-distributed data replication can be done in a variety of ways. Traditional methods for data movement were not designed to provide the large scale, low latency, and low cost that modern systems demand. We'll examine the capabilities needed to do this efficiently, the challenges in doing so, and how some emerging technologies begin to address those challenges.

Containers for Big Data

One obvious potential solution for handling state in containers is to embed the data in the running containers themselves. This is a problem since the data has to outlive any given container and thus can't be embedded in the container image itself. Discarding the data on exit doesn't work either, because the data in a stateful container typically needs to outlive the container. By symmetry, data that outlives one container should exist until the replacement container is fully running. That, in turn, means that this new container will need to be filled with that data on launch, which makes scalability difficult. Furthermore, even if you load the data into the new container, data local to a container can be hard to share. In short, this doesn't work well.

A second option is to run applications in containers that include some distributed data store as well. This is a substantial problem because the application lifecycle is fundamentally different from the data lifecycle. It takes time, potentially a long time, after the container is launched before it can contribute to data storage tasks, whereas we usually want the application to work immediately, with all participants in the distributed data system working effectively at all times. Restarting many containers with new application code shouldn't impair access to long-lived data, but collocating infrastructural data storage code with stateful applications will result in exactly that impairment.

Likewise, block-level storage (such as a SAN system) provided outside the containers isn't the answer, because individual containers have to form local file systems on top of the block storage. This takes time and is inflexible and difficult to share between containers. Even worse, data bandwidth doesn't scale with the number of containers because all of the data is actually hosted on the SAN rather than in the container. Data locality is also a huge problem because the data isn't local to any containers. This leads to the previous problem of data lifecycle in which data must be loaded and unloaded.

Other conventional data stores like a network attached storage (NAS) system or a conventional database are complete non-starters as large-scale stateful container storage because such systems just can't scale to meet the demands of large numbers of stateful containers.

How to Make Stateful Containers Work for Big Data

The modern response to the problem of supporting stateful containers is to use a scalable and distributed data platform to which all containers have access. It's often desirable to collocate the data platform with the same machines that run the containers. Due to the mass of data, the scalable storage layer is often deployed alongside the base operating system that supports containers instead of inside containers, but deploying the data platform inside special-purpose containers is also possible. This results in a system something like what is shown in Figure 1-4.

Figure 1-4. Containers running stateful and stateless applications can be deployed by a container management system like Kubernetes, but can also access a scalable data platform. Applications running in containers can communicate with each other directly or via the data platform. The application state can reside logically in the data platform, but can often be physically collocated with the applications.
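As a rough illustration of this pattern, the sketch below shows a containerized service that keeps all of its durable state on a shared data platform rather than inside the container. The mount point (/shared) and file layout are assumptions made only for illustration; any distributed file or table store that is visible at the same path from every container would play the same role.

```python
# Sketch: a stateless-by-design containerized service that keeps its durable
# state on a shared data platform mounted into every container at /shared.
# The mount point and file layout here are illustrative assumptions, not a
# specific product's API.
import json
from pathlib import Path

STATE_DIR = Path("/shared/apps/counter-service")   # hypothetical platform path
STATE_FILE = STATE_DIR / "state.json"

def load_state() -> dict:
    """Recover state from the data platform; a freshly launched container starts here."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"events_processed": 0}

def save_state(state: dict) -> None:
    """Persist state back to the platform so a replacement container can resume."""
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps(state))

def handle_event(state: dict, event: dict) -> dict:
    """Application logic; the container itself holds no long-lived data."""
    state["events_processed"] += 1
    return state

if __name__ == "__main__":
    state = load_state()
    for event in [{"id": 1}, {"id": 2}]:        # stand-in for a real event source
        state = handle_event(state, event)
    save_state(state)
    print("processed so far:", state["events_processed"])
```

Because the container image carries no data, a replacement container can be scheduled on any host, reload the same state from the platform, and be productive within seconds, which is the property the discussion above is after.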
Having special-purpose datastores like Apache Cassandra that are individually scalable provides some of the benefits of this design, but that design falls short in many respects. One of the key problems is that a database like Cassandra only provides one level of namespace (table names) and only provides one kind of persistence, namely tables. Similarly, databases can't really provide file access, nor are they really suitable for streaming data.

Similar problems occur with special-purpose file stores like HDFS. Problems also crop up with HDFS specifically in that the short-circuit read used to increase performance with HDFS is disabled for processes in containers. This can severely limit the read performance of container-based applications, even for data that is located on the same machine as the container doing the read.

Crucial to the idea of having a data platform to support stateful applications is that the data platform should directly allow applications access to high-level persistence objects such as files, tables, and message streams and organize these objects using some kind of namespace.

Example: Handling State and Containers on MapR

The MapR Converged Data Platform is an example of a system that can support stateful applications in containers as described here. A conventional MapR cluster is created, and then Docker is installed either on the nodes of the MapR cluster or on other nodes if network I/O is sufficiently high. Docker containers based on an image that contains MapR DB, Streams, and POSIX file access are then used to hold the stateful services. These services can access files, tables, or streams uniformly with the exact same file names from any container without even knowing where the container is located on the network. File location information is available to a coordination system such as Kubernetes or Apache Mesos to allow special containers to be placed close to any desired data resources.

Typically, containers will not be launched directly, but instead will be launched by Kubernetes or Mesos. Whatever framework is used injects any necessary user credentials and cluster address information into the container. Because the containers themselves don't require any state, they can launch quickly and be productive within seconds.

In addition, the MapR file system can be used to store and deploy container images. Again, the universal namespace helps by allowing any legal user ID to access the images by the same path names regardless of the location of the machine storing the Docker images.

Summary

Containers make a perfect match for geo-distributed data, particularly because they help build stable deployment platforms for services. Conversely, some geo-distributed data systems such as MapR work well for running large swarms of containers.

Use Case: Geo-Replication in Telecom

Modern telecommunications systems are very complex and are composed of an enormous number of devices that measure all kinds of things about their own operation, as well as about the calls and data transfers that are being done for users of the system. Some of these measurements are eventually used to create billing records. Others are used to monitor the operation of the system to allow fault detection and diagnosis and to tune performance.
The raw volume of these measurements is surprisingly large; in toto, the volume of raw metrics and monitoring data in a telecommunications system probably exceeds the amount of user data transferred. Much of this data is only kept at the point of measurement and is retained only if a fault is discovered; some of it is summarized near the point of measurement and transmitted in a much smaller summary form, but much of the data is simply discarded.

Most of the telecommunications data is collected as log files that may be summarized locally into aggregate logs. These logs are copied, typically using well-worn systems like sftp, to aggregation points where they are monitored by alerting systems and further aggregated. These alerts and aggregates are stored again in log files and transferred, again by relatively old techniques, to central facilities. The time from when a measurement is taken to when it affects an aggregate in the central facility is often many tens of minutes. Keeping the log files moving properly is a very complex task in and of itself, and the systems that move the files generate even more log files that must be monitored to verify correct operation.

All in all, however, the metrics systems in telecom mostly work, and they already handle data that is definitely geo-distributed. So what could more modern geo-distributed techniques have to offer? If it ain't broke, why fix it?

It's Actually Broken Even If It Works Most of the Time

In fact, the system of data transfer described above is seriously broken. It isn't broken because bits are lost between the measurement and the final reports, but because lots of time and effort are lost in getting those bits to their final destination. It is broken because those bits take a long time to get to where decisions can be made, and the intricate scripting of transfers involved in moving them to their goals is sufficiently difficult that making it work faster is generally considered infeasible.

Figure 1-5 shows an alternative approach that assumes that message streams are available as an infrastructural primitive, as MapR Streams are, and that these message streams can be set up with streaming replication from data center to data center.

Figure 1-5. Schematic showing how data can be moved from the original source via MapR Streams and stream replication to an aggregation program in a central facility. Message streams are represented by horizontal cylinders. Instead of waiting for a log file to fill up, data is pushed into the left-most streams message by message. Due to the way MapR Stream replication works, the destination stream contains the union of messages from each source. The aggregated result can then be pushed to a result stream.

From the point of view of the application developer, all that is necessary is to write messages to a stream at the data source and read these messages from the stream at the aggregation point. The fact that the messages traverse from tower to HQ is an infrastructural detail that isn't pertinent (from the point of view of the application implementor). From the point of view of the administrator, all that matters is that the streams in the towers have the right permissions and are configured to replicate to the HQ stream. The actions of the application aren't pertinent (from the admin's point of view).
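To make the developer's side of this split concrete, here is a minimal sketch using the open source kafka-python client. MapR Streams is accessed through a Kafka-style API, but the topic names, broker address, and the choice of kafka-python rather than a MapR-specific client are assumptions for illustration only; the replication from the tower stream to the HQ stream is configured by the administrator and does not appear in this code at all.

```python
# Sketch of the application-level view: write metrics at the edge, read the
# union of replicated messages at HQ. Topic names and broker address are
# illustrative assumptions.
import json
from kafka import KafkaProducer, KafkaConsumer

# --- At the tower (data source): push each measurement as soon as it is taken.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                 # assumption
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("tower-metrics", {"tower": "t-42", "metric": "dropped_calls", "value": 3})
producer.flush()

# --- At HQ (aggregation point): read whatever has been replicated in.
consumer = KafkaConsumer(
    "hq-metrics",                                       # replica of the tower streams
    bootstrap_servers="localhost:9092",                 # assumption
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    consumer_timeout_ms=5000,                           # stop when no more messages arrive
)
totals = {}
for record in consumer:
    msg = record.value
    totals[msg["tower"]] = totals.get(msg["tower"], 0) + msg["value"]
print(totals)
```

Note that neither half of the sketch refers to the other data center; the replication path lives entirely in the platform configuration.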
Neither the admin nor the application developer needs to worry about how MapR Streams actually replicates messages from data center to data center. The result is that the data moves from source to final report in seconds or less, and there are no transfer programs to write or schedule. The application developer focuses only on the application without regard to geo-replication, and the administrator focuses on the creation and configuration of the streams. This division of attention means that both developer and admin can do a better job, and the resulting product produces results that are more likely to be correct in a much shorter amount of time.

Use Case: Shared Inventory in Ad Tech

Online advertisers want to place their ads in front of as many people who might be receptive to their message as possible, while website publishers want to get as much income as possible from the advertisers while still providing a good enough experience for visitors to keep them coming back. The way this combination is done is that as a visitor to a website loads a web page, a request is made by the publisher for bids on the available ad placements. Advertisers bid on the placements, and the publisher picks whoever makes the highest bid.

That is the idealistic view of the process, anyway. In practice, it doesn't quite work that way. First of all, not all publishers can know about all of the advertisers. This has led to the rise of advertising exchanges. These are companies like DoubleClick or Rubicon Project who make deals with publishers to handle the selection of which ads to show and make deals with advertisers to bid on reasonable placements for them. Second, the selection of a single best bid isn't really possible. Instead, the process is more like a Dutch auction: the price is reduced until a buyer jumps in. In this situation, response time can be almost as important as price in picking a winning bid. That means that deciding how to respond to bids quickly is critically important to the success of an ad exchange. This requirement for speed is another major reason that ad exchanges exist; they can specialize in making accurate bids very fast, whereas only a very few websites can invest enough effort to do so.

One key strategy that is required for fast response is to put the bidding algorithms close to the web content being viewed. It can take 100 milliseconds for a request to reach a distant data center and for a response to return, by which time most ad auctions would be over and done. But distributing the bidding process raises the core issue of geo-distribution of data.

Advertisers don't just give an ad exchange carte blanche to place an ad as many times as possible. Instead, they agree to pay for a limited number of placements over a time period. As such, to bid intelligently on an ad placement, you have to know how many impressions are remaining on a contract and whether you are ahead or behind schedule on satisfying that contract. And you have to know this in many distinct locations at the same time.

The historical way to solve this problem was to ship logs from the remote data centers where bidding is actually done back to a central data center where recent impressions are tallied. New estimates of current inventory are then pushed back to the remote data centers. Log shipping and inventory pushing were typically done using ad hoc methods that have been written and rewritten over time. Typically, the early versions of these systems worked well initially, but they had trouble with increasing load or in handling failure modes.
The modern way to work on these problems is to build on a stronger foundation. For instance, if you have solid multi-master replication of tables, you can build a system such as the one shown in Figure 1-6.

Figure 1-6. Storing current inventory in multiple data centers using a single distributed table that can be updated from any location helps solve the problem of how to respond quickly and accurately to bid requests. MapR multi-master table replication is one example of a technology that provides the needed real-time geo-distribution of data.

The idea is that rather than building ad hoc methods for moving data, having a platform that inherently provides geo-distribution makes things simpler. Of course, real life isn't quite as simple as drawing one diagram. Many distributed databases that continue to operate when the replication links are severed rely on a last-write-wins strategy. This can cause problems if writers in different data centers perform read-modify-write operations (such as incrementing a counter) on the same element in different locations. The common way to deal with this is simply to have several counters in the database for each inventory item, one for each location. This means that read-modify-write operations each happen on a single value in a single location, and thus the effect of strong local consistency for each of these elements will be propagated to replicas of that value. On read, you will need to combine these elements.

Figure 1-7 shows how such a table might look if we have three ads (ad-1, ad-2, and ad-3) being presented from three data centers (dc-1, dc-2, and dc-3).

Figure 1-7. Keeping track of a vector clock as multiple elements in a single row.

The available inventory for ad-1 in this example would be: 220000 – 1250 – 4609 – 1318 = 212823.

This method of using multiple values to represent a single counter is known as a "vector clock," but keep in mind that getting it right depends on having strong consistency in each data center; eventual consistency isn't appropriate, because faults within a single data center that cause data errors can cause serious problems.
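A small sketch may make the bookkeeping clearer. The row and column names below are hypothetical stand-ins for whatever distributed table is used (the text uses MapR DB multi-master replication as its example); the point is only that each data center increments its own element and every reader sums all of the elements, which reproduces the ad-1 arithmetic above.

```python
# Sketch of per-data-center counters combined on read. An in-memory dict stands
# in for a geo-replicated table; in a real system each data center would update
# only its own column (dc-1, dc-2, dc-3) of the row keyed by the ad ID.
inventory = {
    # row key: contracted impressions, plus impressions served per data center
    "ad-1": {"contracted": 220000,
             "served": {"dc-1": 1250, "dc-2": 4609, "dc-3": 1318}},
}

def record_impression(ad_id: str, local_dc: str) -> None:
    """Each data center does read-modify-write only on its own element."""
    inventory[ad_id]["served"][local_dc] += 1

def available(ad_id: str) -> int:
    """Readers combine the elements: contracted minus the sum of all local counters."""
    row = inventory[ad_id]
    return row["contracted"] - sum(row["served"].values())

print(available("ad-1"))        # 220000 - 1250 - 4609 - 1318 = 212823
record_impression("ad-1", "dc-2")
print(available("ad-1"))        # one fewer impression available: 212822
```

Because dc-2 only ever increments its own element, the last-write-wins conflict on a shared counter never arises; replication simply carries each element's latest value to the other sites, and readers everywhere compute the same total.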
For real-time bidding, something like a counter that represents the current state is the right data structure, but for accounting and billing that is subject to audit, you may need a record of every transaction, or at least a summary by region and time. For that, a design something like the one shown in Figure 1-8 is useful.

Figure 1-8. A platform such as MapR that provides multi-master stream replication allows a safe and unified audit trail of impressions for billing purposes. Messages in the streams can contain information from individual impressions or short-term aggregations of multiple impressions.

The similarity to Figure 1-6 is striking and strongly makes the case that tables and streams share a nearly identical need for replication. It therefore makes sense that tables and streams should share a single replication mechanism that is provided at the platform level.

Note also how stream replication in Figure 1-8 is bidirectional, and there are multiple paths between different replicas. This is important because it allows the system to recover from faults as much as possible without manual intervention to change the replication pattern. For instance, if the link between the two remote data centers is lost, data will still replicate between them via the HQ data center. Likewise, if the HQ data center goes offline entirely, the remote data centers will stay in sync. The only likely loss in such a scenario would be that new advertisements could not be entered into the system, and that could easily be handled if there were a disaster recovery facility to take over for the HQ data center.

The lesson to be learned from this use case is that uniform mechanisms for replication of streams and tables can make geo-distributed systems much easier to build. It isn't that these mechanisms solve the hard problems of machine learning that are associated with successful ad targeting, but instead that solving the problem of geo-distribution once at the platform level allows you to spend more time on adding unique value rather than rebuilding low-level replication systems.

Additional Resources

In addition to the O'Reilly publication Streaming Architecture by Ted Dunning and Ellen Friedman, you may find the following resources helpful on topics related to geo-distribution of big data:

• "Persistence in the Age of Microservices: Introducing MapR Converged Data Platform for Docker," an article on containers with direct access to data persistence, by Will Ochandarena
• Whiteboard Walkthrough videos by Ted Dunning:
  — "Big Data in the Cloud"
  — "Converged Advantages in the Cloud"
  — "Anomaly Detection in Telecommunications Using Complex Streaming Data"
• For more on MapR Mirroring and multi-master replication of tables and streams, see the MapR website.

Selected O'Reilly Publications by Ted Dunning and Ellen Friedman

• Streaming Architecture: New Designs Using Apache Kafka and MapR Streams (O'Reilly, 2016)
• Sharing Big Data Safely: Managing Data Security (O'Reilly, 2015)
• Real-World Hadoop (O'Reilly, 2015)
• Time Series Databases: New Ways to Store and Access Data (O'Reilly, 2014)
• Practical Machine Learning: A New Look at Anomaly Detection (O'Reilly, 2014)
• Practical Machine Learning: Innovations in Recommendation (O'Reilly, 2014)

O'Reilly Publication by Ellen Friedman and Kostas Tzoumas

• Introduction to Apache Flink: Stream Processing for Real Time and Beyond (O'Reilly, 2016)

About the Authors

Ted Dunning is Chief Applications Architect at MapR Technologies and active in the open source community. He currently serves as VP for Incubator at the Apache Foundation, as a champion and mentor for a large number of projects, and as committer and PMC member of the Apache ZooKeeper and Drill projects. He developed the t-digest algorithm used to estimate extreme quantiles; t-digest has been adopted by several open source projects. He also developed the open source log-synth project described in the book Sharing Big Data Safely (O'Reilly). Ted was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems, built fraud-detection systems for ID Analytics (LifeLock), and has been issued 24 patents to date. Ted has a PhD in computing science from the University of Sheffield. When he's not doing data science, he plays guitar and mandolin. Ted is on Twitter as @ted_dunning.

Ellen Friedman is a solutions consultant, scientist, author, and a committer on the Apache Drill and Apache Mahout projects. With a PhD in biochemistry and years of writing on a variety of scientific and computing topics, she is an experienced communicator. Her books include Introduction to Apache Flink, Streaming Architecture, the Practical Machine Learning series, and Time Series Databases, all from O'Reilly.
Ellen has been an invited speaker at Strata London, Nike Tech Talks in Portland, Data Day Seattle, Berlin Buzzwords, and the Philly ETE conference, and was a keynote speaker at Big Data London and at NoSQL Matters (Barcelona). Ellen is on Twitter as @Ellen_Friedman.