Fast Data and the New Enterprise Data Architecture

Scott Jarr

Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo

Preface

A structural shift in data management is underway. Unlike previous eras of technological change — mainframe to server, server to PC, PC to mobile and tablet — this shift is not driven solely by growth in processing power (the oft-cited Moore's Law). Today, processing power is cheap at the endpoints. The combination of cheap, ubiquitous CPUs attached to fast mobile networks is creating a network effect of devices, distorting Moore's Law with the force multiplier of near-global wireless network coverage. Thus, today's shift is spurred not only by increases in processing power but also by the growth of data — new data, which is doubling every two years — and by the rate of growth in the perceived value of data.

These macro computing trends are causing swift adoption of new data management technologies. Open source software solutions and innovations such as in-memory databases are enabling organizations to reap the value of realtime interactions and observations. No longer is it necessary to wait for insight until the data has been analyzed deeply in a big data store. This is changing the way in which enterprises manage data, both data in motion — "fast data" streaming in from millions of endpoints — and data at rest, or "big data" stored in Hadoop and data warehouses.

Businesses in the vanguard of this change recognize that they operate in a "data economy." These leaders make an important distinction between the two major ways in which they interact with data. This shift in thinking has led to the creation of a new enterprise data architecture.

This book will discuss what the new enterprise data architecture looks like as well as the benefits it will deliver to organizations. It will also outline the major technology components necessary to build a unified enterprise data architecture, one in which both fast data and big data work together.

Chapter 1. What's Shaping the Environment

Data Is Everywhere

The digitization of the world has fueled unprecedented growth in data, much of it driven by the global explosion of mobile data sources and the Internet of Things (IoT). Each day, more devices — from smartphones to cars to electric grids — are being connected and interconnected. It is safe to predict that within the next 10–15 years, anything powered by electricity will be connected to the Internet.

According to the 2014 EMC/IDC Digital Universe report, data is doubling in size every two years. In 2013, more than 4.4 zettabytes of data had been created; by 2020, the report predicts that number will explode by a factor of 10 to 44 zettabytes — 44 trillion gigabytes. The report also notes that people — consumers and workers — created some two-thirds of 2013's data; in the next decade, more data will be created by things — sensors and embedded devices. In the report, IDC estimates that the IoT had nearly 200 billion connected devices in 2013 and predicts that number will grow 50% by 2020 as more devices are connected to the Internet — smartphones, cars, sensor networks, sports tracking monitors, and more. Data from these connected devices is fueling a data economy, creating huge implications for future business opportunity.
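To make the report's doubling claim concrete, here is a minimal back-of-the-envelope check in Python. The 4.4 ZB starting point and the two-year doubling period are the report's figures; the script itself is only illustrative.

    # Project data volume from the report's 2013 baseline (4.4 ZB),
    # assuming it doubles every two years.
    base_year, base_zb = 2013, 4.4

    for year in range(2013, 2021):
        projected_zb = base_zb * 2 ** ((year - base_year) / 2)
        print(f"{year}: ~{projected_zb:.1f} ZB")

    # 2020 works out to roughly 50 ZB, a factor of about 11 over 2013,
    # in the same range as the report's projection of 44 ZB.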
Additionally, the rate of growth of new data is creating a structural change in the ways enterprises, which are responsible for more than 80% of the world's data, manage and interact with that data. As the data economy evolves, an important distinction between the major ways in which businesses interact with data is emerging. Companies have begun to interact with data that is big — data that has volume and variety. Additionally, as companies embark on ever-more extensive big data initiatives, they have also realized the importance of interacting with data that is fast. The ability to process data immediately — a requirement driven by IoT macro-trends — creates new opportunities to realize value via disruptive business models.

To illustrate this point, consider the devices generating all this data. Some are relatively dumb sensors that generate a one-way flow of information — for example, network sensors that push data to a processing hub but that cannot communicate with one another. More important are two-way sensors embedded in "smart" devices — for example, automotive in-vehicle infotainment and navigation systems, and the smart meters used in smart power grids. These two-way sensors not only collect data but also enable organizations to analyze and make decisions on that data in real time, pushing results (more data) back to the device. These smart sensors create huge streams of fast, smart data; they can act autonomously on "your" inputs as well as act collectively on the group's inputs. The EMC/IDC report states that "embedded systems — the sensors and systems that monitor the physical universe — already account for 2% of the digital universe. By 2020 that will rise to 10%."

Clearly, two-way sensors that generate fast and big data require different modes of interaction if the data is to have any business value. These different modes of interaction require the new capabilities of the enterprise data architecture.

Data Is Fast Before It's Big

It is important to note that the discussion in this book is limited to what are described as "data-driven applications." These applications are pervasive in many organizations and are characterized by the use of data at scales previously unobtainable. This scale can refer to the complexity of the analysis, the sheer amount of data being managed, or the velocity at which data must be acted upon.

Simply stated, data is fast before it is big. With the increase in fast data comes the opportunity to act on fast and big data together in a way that creates the most compelling vision for data-driven applications. Fast data is a new opportunity made possible by emerging technologies and, in many cases, by new approaches to established technologies, e.g., in-memory databases. In the new paradigm — one in which data in motion has equal or greater value than "historical" data (data at rest) — new opportunities to extract value require that enterprises adopt new approaches to data management. Many traditional database architectures and systems are incapable of dealing with fast data's challenges. As a result, the data management industry has been enveloped in confusion, much of it driven by hype surrounding the major forces of big data, cloud, and mobility. Fortunately, many of the available technologies are falling into categories based on the problems they address, bringing the picture into better focus. This is good news for application developers, as advances in cloud computing and in-memory database architectures mean familiar tools can be used to tackle fast data.
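To put the "fast before it's big" distinction in code, the following is a conceptual sketch only, with hypothetical event fields and thresholds rather than any particular product's API: acting on data while it is fast means deciding on each event the moment it arrives, while acting on it once it is big means analyzing the accumulated store afterward.

    # Conceptual sketch: per-event ("fast") handling versus after-the-fact
    # ("big") analysis. Event fields and thresholds are hypothetical.
    from collections import deque

    history = deque(maxlen=1_000_000)   # stand-in for the big data store

    def on_event(event):
        """Fast path: decide on each event as it arrives."""
        history.append(event)           # the event still flows on to the store
        if event["value"] > event["threshold"]:
            return "act-now"            # decision made while the data is fast
        return "ok"

    def periodic_report():
        """Big path: periodic analysis over everything accumulated so far."""
        values = [e["value"] for e in history]
        return sum(values) / len(values) if values else None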
Chapter 2. The Enterprise Data Architecture

Introduction

The enterprise data architecture is a break from the traditional siloed data application, in which data is disconnected from analytics and from other applications and their data. The enterprise data architecture supports fast data created by a multitude of new endpoints, operationalizes the use of that data in applications, and moves data to a "data lake" where services are available for the deep, long-term storage and analytics needs of the enterprise. The enterprise data architecture can be represented as a data pipeline that unifies applications, analytics, and application interaction across multiple functions, products, and disciplines (see Figure 2-1).

Figure 2-1. Fast data represents the velocity aspect of big data

Future Applications Where Data Is the Major Value

Examples: Industrial Internet, smart infrastructure, Internet of Things.

Perhaps the most promising improvements to everyday life will come from areas just now emerging. These industries have not been automated to the extent that we will see them automate in the next five years. Data will be the driving factor in the value these services offer. Unlike the category above, these industries need to build the endpoints that will be controlled by the smart data they generate. The enterprise data architecture will enable the intelligence of these industries. A large measure of the utility of these services will come not from the devices themselves, but from the ancillary services and intelligence derived from the data.

Example: Smart meter deployments initially look like a way to reduce the human labor involved in reading a meter. But that is a small, and likely not even cost-effective, benefit of the smart electrical meter. The meter's ability to communicate bi-directionally and to be considered in the context of its surrounding environment — to warn of imminent disasters, maintenance needs, and efficiency opportunities — is what makes high value-added services possible through data.
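As a rough illustration of that bi-directional, in-context use of meter data, a per-event decision might look like the following sketch. The thresholds and the neighborhood-average check are hypothetical assumptions for illustration, not details taken from any deployment.

    # Hypothetical sketch: decide on each smart-meter reading in the context
    # of its neighborhood, and push a result back to the device.
    from statistics import mean

    recent_by_neighborhood = {}   # neighborhood id -> recent readings (kW)

    def on_meter_reading(neighborhood, meter_id, load_kw):
        readings = recent_by_neighborhood.setdefault(neighborhood, [])
        readings.append(load_kw)
        del readings[:-1000]                         # keep a bounded recent window

        neighborhood_avg = mean(readings)
        if len(readings) > 10 and load_kw > 3 * neighborhood_avg:
            return (meter_id, "maintenance-check")   # outlier vs. its surroundings
        if neighborhood_avg > 5.0:                   # assumed grid-stress threshold
            return (meter_id, "reduce-load")         # collective, two-way response
        return (meter_id, "ok")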
Fast data applications will, of course, move beyond the use cases presented above as more traditional users — in Geoffrey Moore's terminology, the Pragmatists, Conservatives, and Luddites — feel pressure to extract value from business data to remain competitive. Imagine if the taxi industry, for example, had seen the value in data before Uber, Lyft, et al. emerged to disintermediate its business model. While one could argue that the nature of the traditional taxi industry does not lend itself to broad sharing and analysis of data, it is clear that markets can be created — and destroyed — by the ability (or failure) to recognize fast data opportunities.

Chapter 5. How Fast and Big Applications Will Enter the Enterprise

Fast data is already streaming into the enterprise, and more is coming on a daily basis. However, in many cases enterprises are pushing this fast data directly into the data lake, missing the opportunity to extract valuable realtime insights from data streams using in-memory technology. Realizing the benefits of this fast data requires a new enterprise data architecture. Therefore, the way in which systems are designed and built to leverage streams of data will define how quickly and pervasively fast data applications are rolled out within an organization.

To understand how enterprise adoption of fast data technologies will occur, one needs to examine both the data sources and the applications that utilize those data sources. Four broad usage environments will drive enterprise adoption of fast data. The first three are combinations of a specific application and the data source(s) that encompass that application. The fourth category will be defined by corporations that truly understand the value of being data-driven, and are prepared to implement an enterprise data architecture designed to unify all data interaction within the enterprise.

Existing Applications

This category of usage exists when applications that manage data begin to experience increasing volumes of data, exerting pressure on existing systems. Given the normal architecture of these systems, under this load the traditional database component will no longer meet the needs of the application; a change in the application will be required. Adoption will occur because systems are no longer capable of meeting the needs of application users. Application developers will be forced to look at alternative technologies as the rate of inbound events exceeds what can be managed with the more traditional database systems around which these applications were originally designed.

For example, this is what has happened with mobile subscriber data, and more change is coming. As phones and phone service prices continue to drop, more customers are coming online in a given geography; new markets are opening because of the lower-cost model. A subscriber system using a traditional database to manage one million subscribers will break under the load when demand expands to 100 million subscribers. Equally taxing on the system is when a process that has historically had relatively few inputs is enhanced to add more detailed measurements. As an illustration, manufacturing Enterprise Resource Planning (ERP) software does not appear to be a likely candidate for fast data until one realizes that entire manufacturing lines are being retrofitted with sensors on every component in the manufacturing process. These systems are developed to feed realtime manufacturing data back into resource planning software to enable fine-grained adjustments and optimizations. These changes put enormous stress on systems, often forcing evaluation of new database technology.

New Applications, Existing Data Sources

Another way in which existing data sources are driving enterprise adoption of fast data is when those data sources, which have existed for years, are deemed to have newfound value. This newfound value often manifests itself in two ways: looking at data as it is generated in real time, or looking at it differently or in combination with other activities. Occasionally this triggers the displacement of one set of tools in favor of a broader or more customized solution. The change driven in these applications will remain within the confines of the single application, but will allow for more innovative uses by the application developer.

Consider an example: network packets are not a new data source within the enterprise. However, fast data technologies have advanced to the point at which network packet ingestion creates new capabilities from already existing data. These network packets can be the source of fraud detection or Distributed Denial of Service (DDoS) detection by harnessing data already available in the enterprise.
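As a rough sketch of how packet data already present in the enterprise might feed DDoS detection, the window length, threshold, and packet fields below are assumptions for illustration, not details from the text.

    # Hypothetical sketch: flag a possible DDoS source by counting packets
    # per source IP over a short sliding window.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 10
    THRESHOLD = 10_000              # assumed packets-per-window limit per source

    arrivals = defaultdict(deque)   # source ip -> recent arrival timestamps

    def on_packet(src_ip, now=None):
        now = time.time() if now is None else now
        window = arrivals[src_ip]
        window.append(now)
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()        # drop arrivals outside the window
        if len(window) > THRESHOLD:
            return (src_ip, "possible-ddos")
        return (src_ip, "ok")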
New Applications, New Data Sources

New data sources enable companies to launch new products and services, creating disruptive forces in many industries. These applications marry inbound data and user and device interaction with the environment to create new categories of products. In many cases, these systems are being built as a complete package — for example, new smart infrastructure systems that manage power distribution in many cities. But some come from the ability to take sensor information from a phone and build entirely new applications on data that was not previously available. What makes these products notable is that a large portion of the value the user experiences with the product derives from the data that informs the product's interaction.

New Data-Driven Enterprise Integration

While the categories discussed previously are likely to be the initial adoption paths for fast data in the enterprise, they are not the most disruptive. As reviewed in the beginning of this book, there is a 1+1=3 opportunity when all data assets within an enterprise are combined in an enterprise data architecture that can leverage all data — structured and unstructured, realtime and historical, fast and big — across product lines and business units. Companies that see the opportunity and move quickly to adopt an enterprise data architecture will gain the most from the imminent disruption that will come from capturing and utilizing both fast and big data.

Chapter 6. Getting There: Making the Right Fast Data Technology Choices

Application developers and technical managers involved with building fast and big data applications have a number of technology alternatives to evaluate. Clearly, the choices made in all phases of the architecture are important, but special attention must be paid to the choices for the fast data portion of the system.

Architectural Approaches to Delivering Fast Data

Three technology categories can be evaluated as the core components of the fast data portion of the enterprise data architecture: fast OLAP systems, stream processing products, and fast operational database systems. All are highly capable systems, but some are better suited than others to meet the broad requirements of fast data as described in this book. Organizing the alternatives by their core architecture types provides a way to evaluate strengths and weaknesses.

Fast OLAP Systems

New in-memory OLAP systems are able to drastically reduce reporting times and enable near-realtime analysis of fast-arriving data. Many of these systems are column stores, optimized for uses where the only requirement is to improve reporting speeds. Additionally, some of these systems can ingest data quite quickly. OLAP solutions, however, are designed as analytics engines and generally are not useful for making decisions on individual events as they arrive in the system. This inability to provide transactions at the point where data enters the architecture prevents these systems from delivering the primary value of the fast data portion of the architecture.

Stream Processing Systems

Stream processing approaches, including complex event processing (CEP), are available as open source as well as commercial options. Stream processing has been around for decades and has proven valuable in some very specialized uses in specific industries such as capital markets trading, where very specific patterns and timings need to be identified. When used in these environments, it is a well-suited approach. Stream processing systems provide scalable message processing and coordination between systems, and often scale across commodity servers. However, stream processing systems do not maintain data state. As a result, they are severely limited in the ways in which they can interact with an event entering the pipeline. All context from other data — whether static data in data fusion scenarios or changing data from other events passing through the system — is lost. Also, without the concept of state, analytics are performed by hand-coding algorithms and by maintaining and managing the state for the results. Because stream processing wasn't designed to serve the needs of modern fast data applications, it tends to be a poor match. To overcome these shortcomings, additional code is often written to perform continuous computations (realtime analytics), and databases are added to maintain state. This adds complexity and moves the performance bottleneck to another component in the system. The results are often systems that don't meet the requirements of the application and are burdened with complexity.
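To make the state problem concrete, here is a minimal sketch of the kind of bookkeeping an application must hand-code when the stream processor itself keeps no state. The names and thresholds are hypothetical and do not refer to any particular CEP product's API.

    # Sketch of hand-rolled state alongside a stateless stream stage: the
    # running aggregate the stream processor cannot keep must live somewhere
    # the application manages itself.
    running_totals = {}     # account id -> spend so far (application-managed state)

    def stateless_stage(event):
        """What a stateless stream stage can do: look only at the event itself."""
        return {**event, "large": event["amount"] > 1_000}

    def stateful_decision(event):
        """What the application must bolt on: context from earlier events."""
        total = running_totals.get(event["account"], 0) + event["amount"]
        running_totals[event["account"]] = total
        return "review" if total > 10_000 else "approve"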
Operational Database Systems

Operational database systems are, by definition, designed to support per-event decision-making that is informed by other data stored within the system. Operational databases have long been the standard for interactive applications, but historically they were unable to meet the performance required of fast data use cases. In-memory NewSQL systems are now available that are capable of meeting the performance requirements of the operational work as well as delivering full-dataset analytics. Because these systems were designed with fast data applications in mind, integration with the big data portion of the architecture is normally built in.

                                                   Fast OLAP   Stream processing   Fast operational DB
    Ingests data streams                           Some        Yes                 Yes
    Data-driven event decisions                    No          No                  Yes
    Realtime analytics                             Yes         Through add-on      Yes
    Integrates with big data system                No          Yes                 Yes
    Serves analytic results from big data systems  No          Yes                 Yes

Chapter 7. Conclusion

Understanding the promise and value of fast data is an absolute necessity, but it is not sufficient to guarantee success for companies still working to implement big data initiatives. Having the tools, and the skills, to take advantage of fast data is critical for businesses in all industries and geographies.

Fast data is the payoff for big data. While much can be accomplished by mining data to derive insights that enable a business to grow and change, looking into the past provides only hints about the future. Simply collecting vast amounts of data for exploration and analysis will not prepare a business to act in real time as data flows into the organization from millions of endpoints: sensors, mobile devices, connected systems, and the Internet of Things.

Because fast and big data have different requirements, it is necessary to have a component on the front end of the enterprise data architecture that can ingest and interact with data, perform realtime analytics, and make data-driven decisions on each event. Applications can take action, and data can be exported to the data warehouse for historical analytics, reporting, and more. The missing link between fast and big is a unified enterprise data architecture. This approach links high-value historical data from the data lake to fast-moving inbound data from multiple endpoints. It frees application developers to write code that adds value to the organization, rather than burdening them with writing code to persist data as it flows to the data lake. An in-memory operational system that can decide, analyze, and serve results at fast data's speed is key to making big data work at enterprise scale.

Fast data, achieved through adoption of a new enterprise data architecture, gives organizations the tools to process high-volume streams of data while enabling millions of complex decisions in real time. With fast data, things that were not possible before become achievable: instant decisions can be made on realtime data to drive sales, connect with customers, inform business processes, and create value.
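To summarize the front-end responsibilities described above (ingest, per-event decision, realtime aggregates, export toward the big data store), here is a compact, purely illustrative sketch; the function and field names are assumptions, not an actual product API.

    # Illustrative fast-data front end: ingest an event, decide on it,
    # update a realtime aggregate, and pass it on toward the big data store.
    counts_by_type = {}          # realtime aggregate, queryable at any moment
    export_buffer = []           # events headed for the data lake / warehouse

    def handle_event(event):
        # 1. Per-event decision, using whatever context is already in memory.
        decision = "flag" if counts_by_type.get(event["type"], 0) > 100 else "accept"

        # 2. Realtime analytics: keep aggregates current as data arrives.
        counts_by_type[event["type"]] = counts_by_type.get(event["type"], 0) + 1

        # 3. Hand the event off for long-term storage and deep analysis.
        export_buffer.append(event)
        return decision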
About the Author

Scott Jarr is Co-Founder & Chief Strategy Officer of VoltDB. He brings more than 20 years of experience building, launching, and growing technology companies from inception to market leadership in highly competitive environments. Prior to joining VoltDB, Scott was VP of Product Management and Marketing at online backup SaaS leader LiveVault Corporation. While at LiveVault, Scott was key in growing the recurring revenue business to 2,000 customers strong, leading to an acquisition by Iron Mountain. Scott has also served as a board member and advisor to other early-stage companies in the search, mobile, security, storage, and virtualization markets. Scott has an undergraduate degree in mathematical programming from the University of Tampa and an MBA from the University of South Florida.

Colophon

The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag's Ubuntu Mono.

Fast Data and the New Enterprise Data Architecture
Scott Jarr
Editor: Mike Hendrickson
Revision History: 2014-09-24, First release

Copyright © 2014

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Fast Data and the New Enterprise Data Architecture and related trade dress are trademarks of O'Reilly Media, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

O'Reilly Media, 1005 Gravenstein Highway North, Sebastopol, CA 95472

Fast Data and the New Enterprise Data Architecture

Table of Contents

Preface
What's Shaping the Environment
    Data Is Everywhere
    Data Is Fast Before It's Big
The Enterprise Data Architecture
    Introduction
    Data and the Database Universe
    Architecture Matters
    Components of the Enterprise Data Architecture
    Big Data, the Enterprise Data Architecture, and the Data Lake
    Integrating Traditional Enterprise Applications into the Enterprise Data Architecture
    Fast Data in the Enterprise Data Architecture
    An End-to-End Illustration of the Enterprise Data Architecture in Action
Why Is There Fast Data?
    Fast Data Bridges Operational Work and the Data Pipeline
    Fast Data Frontier — The Inevitability of Fast Data
    Make Faster Decisions; Don't Settle Only for Faster Analytics
    Applications and Analytics Merge
    Progression to Realtime Analytics Necessitates Automated Decisions
    Requirements of Fast Data Systems in the Enterprise Data Architecture
    Building an Architecture for Fast Data
        Ingest/interact with the data feed
        Make decisions on each event in the feed
        Provide visibility into fast-moving data with realtime analytics
        Fast data systems must seamlessly integrate into systems designed to store big data
        Fast data systems must have the ability to serve analytic results and knowledge from big data systems quickly to users and applications, closing the data loop
Fast Data Applications (and Most of Them Are)
    Industries That Have Historically Dealt with Fast Data Challenges in a Siloed Way
    Industries Being Transformed by the Changes Data Represents
    Future Applications Where Data Is the Major Value
How Fast and Big Applications Will Enter the Enterprise
    Existing Applications
    New Applications, Existing Data Sources
    New Applications, New Data Sources
    New Data-Driven Enterprise Integration
Getting There: Making the Right Fast Data Technology Choices
    Architectural Approaches to Delivering Fast Data
        Fast OLAP Systems
        Stream Processing Systems
        Operational Database Systems
Conclusion
About the Author
Colophon
Copyright