Big Data Now: 2015 Edition
O’Reilly Media, Inc.

Big Data Now: 2015 Edition
by O’Reilly Media, Inc.

Copyright © 2016 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Nicole Tache
Production Editor: Leia Poritz
Copyeditor: Jasmine Kwityn
Proofreader: Kim Cofer
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

January 2016: First Edition

Revision History for the First Edition
2016-01-12: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Big Data Now: 2015 Edition, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-95057-9
[LSI]

Introduction

Data-driven tools are all around us: they filter our email, they recommend professional connections, they track our music preferences, and they advise us when to tote umbrellas. The more ubiquitous these tools become, the more data we as a culture produce, and the more data there is to parse, store, and analyze for insight. During a keynote talk at Strata + Hadoop World 2015 in New York, Dr. Timothy Howes, chief technology officer at ClearStory Data, said that we can expect to see a 4,300% increase in annual data generated by 2020. But this striking observation isn’t necessarily new.

What is new are the enhancements to data-processing frameworks and tools: enhancements that increase speed, efficiency, and intelligence (in the case of machine learning) to keep pace with the growing volume and variety of data being generated. And companies are increasingly eager to highlight data preparation and business insight capabilities in their products and services.

What is also new is the rapidly growing user base for big data. According to Forbes, 2014 saw a 123.60% increase in demand for information technology project managers with big data expertise, and an 89.8% increase for computer systems analysts. In addition, we anticipate we’ll see more data analysis tools that non-programmers can use. And businesses will maintain their sharp focus on using data to generate insights, inform decisions, and kickstart innovation. Big data analytics is not the domain of a handful of trailblazing companies; it’s a common business practice. Organizations of all sizes, in all corners of the world, are asking the same fundamental questions: How can we collect and use data successfully? Who can help us establish an effective working relationship with data?

Big Data Now recaps the trends, tools, and applications we’ve been talking about over the past year. This collection of O’Reilly blog posts, authored by leading thinkers and professionals in the field, has been grouped according to unique themes that garnered significant attention in 2015:

Data-driven cultures (Chapter 1)
Data science (Chapter 2)
Data pipelines (Chapter 3)
Big data architecture and infrastructure (Chapter 4)
The Internet of Things and real time (Chapter 5)
Applications of big data (Chapter 6)
Security, ethics, and governance (Chapter 7)

Chapter 1. Data-Driven Cultures

What does it mean to be a truly data-driven culture? What tools and skills are needed to adopt such a mindset? DJ Patil and Hilary Mason cover this topic in O’Reilly’s report “Data Driven,” and the collection of posts in this chapter addresses the benefits and challenges that data-driven cultures experience, from generating invaluable insights to grappling with overloaded enterprise data warehouses.

First, Rachel Wolfson offers a solution to address the challenges of data overload, rising costs, and the skills gap. Evangelos Simoudis then discusses how data storage and management providers are becoming key contributors for insight as a service. Q Ethan McCallum traces the trajectory of his career from software developer to team leader, and shares the knowledge he gained along the way. Alice Zheng explores the impostor syndrome, and the byproducts of frequent self-doubt and a perfectionist mentality. Finally, Jerry Overton examines the importance of agility in data science and provides a real-world example of how a short delivery cycle fosters creativity.

How an Enterprise Begins Its Data Journey
by Rachel Wolfson

You can read this post on oreilly.com.

As the amount of data continues to double in size every two years, organizations are struggling more than ever before to manage, ingest, store, process, transform, and analyze massive data sets. It has become clear that getting started on the road to using data successfully can be a difficult task, especially with a growing number of new data sources, demands for fresher data, and the need for increased processing capacity. In order to advance operational efficiencies and drive business growth, however, organizations must address and overcome these challenges.

In recent years, many organizations have heavily invested in the development of enterprise data warehouses (EDW) to serve as the central data system for reporting, extract/transform/load (ETL) processes, and ways to take in data (data ingestion) from diverse databases and other sources both inside and outside the enterprise. Yet, as the volume, velocity, and variety of data continue to increase, already expensive and cumbersome EDWs are becoming overloaded with data. Furthermore, traditional ETL tools are unable to handle all the data being generated, creating bottlenecks in the EDW that result in major processing burdens.

As a result of this overload, organizations are now turning to open source tools like Hadoop as cost-effective solutions for offloading data warehouse processing functions from the EDW. While Hadoop can help organizations lower costs and increase efficiency when used as a complement to data warehouse activities, most businesses still lack the skill sets required to deploy Hadoop.

Where to Begin?

Organizations challenged with overburdened EDWs need solutions that can offload the heavy lifting of ETL processing from the data warehouse to an alternative environment that is capable of managing today’s data sets. The first question is always: How can this be done in a simple, cost-effective manner that doesn’t require specialized skill sets?

Let’s start with Hadoop. As previously mentioned, many organizations deploy Hadoop to offload their data warehouse processing functions. After all, Hadoop is a cost-effective, highly scalable platform that can store volumes of structured, semi-structured, and unstructured data sets. Hadoop can also help accelerate the ETL process, while significantly reducing costs in comparison to running ETL jobs in a traditional data warehouse. However, while the benefits of Hadoop are appealing, the complexity of this platform continues to hinder adoption at many organizations. It has been our goal to find a better solution ...