Harness the power of big data IBM platform
Flash 6X9 / Harness the Power of Big Data: The IBM Big Data Platform / Zikopoulos / 817-5

Harness the Power of Big Data

00-FM.indd 1 04/10/12 12:19 PM

About the Authors

Paul C. Zikopoulos, B.A., M.B.A., is the Director of Technical Professionals for IBM Software Group’s Information Management division and additionally leads the World-Wide Competitive Database and Big Data Technical Sales Acceleration teams. Paul is an award-winning writer and speaker with over 19 years of experience in Information Management. In 2012, Paul was chosen by SAP as one of its Top 50 Big Data Twitter influencers (@BigData_paulz). Paul has written more than 350 magazine articles and 16 books, including Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data; Warp Speed, Time Travel, Big Data, and More: DB2 10 New Features; DB2 pureScale: Risk Free Agile Scaling; DB2 Certification for Dummies; and DB2 for Dummies. In his spare time, Paul enjoys all sorts of sporting activities, including running with his dog Chachi, avoiding punches in his MMA training, and trying to figure out the world according to Chloë—his daughter. You can reach him at: paulz_ibm@msn.com.

Dirk deRoos, B.Sc., B.A., is IBM’s World-Wide Technical Sales Leader for IBM InfoSphere BigInsights. Dirk spent the past two years helping customers with BigInsights and Apache Hadoop, identifying architecture fit, and advising early stage projects in dozens of customer engagements. Dirk recently coauthored a book on this subject area, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data (McGraw-Hill Professional, 2012). Prior to this, Dirk worked in the IBM Toronto Software Development Lab on the DB2 database development team where he was the Information Architect for all of the DB2 product documentation.
Dirk has earned two degrees from the University of New Brunswick in Canada: a Bachelor of Computer Science, and a Bachelor of Arts (Honors English). You can reach him at: dirk.ibm@gmail.com or on Twitter at @Dirk_deRoos.

Krishnan Parasuraman, B.Sc., M.Sc., is part of IBM’s Big Data industry solutions team and serves as the CTO for Digital Media. In his role, Krishnan works very closely with customers in an advisory capacity, driving Big Data solution architectures and best practices for the management of Internet-scale analytics. He is an authority on the use of Big Data technologies, such as Hadoop and MPP data warehousing platforms, for solving analytical problems in the online digital advertising, customer intelligence, and real-time marketing space. He speaks regularly at industry events and writes for trade publications and blogs. Prior to his current role, Krishnan worked in research, product development, consulting, and technology marketing across multiple disciplines within information management. Krishnan has enabled data warehousing and customer analytics solutions for large media and consumer electronics organizations, such as Apple, Microsoft, and Kodak. He holds an M.Sc. degree in computer science from the University of Georgia. You can keep up with his musings on Twitter @kparasuraman.

Thomas Deutsch, B.A., M.B.A., is a Program Director for IBM’s Big Data team. Tom played a formative role in the transition of Hadoop-based technology from IBM Research to IBM Software Group and continues to be involved with IBM Research around Big Data.
Tom has spent several years helping customers, identifying architecture fit, developing business strategies, and managing early stage projects across more than 300 customer engagements with technologies such as Apache Hadoop, InfoSphere BigInsights (IBM’s Hadoop distribution), InfoSphere Streams, Cassandra, and other emerging NoSQL technologies. Tom has coauthored a book and multiple thought papers about Big Data, and is a columnist for IBM Data Management magazine. He’s a frequent speaker at industry conferences and a member of the IBM Academy of Technology. Prior to this, Tom worked in the Information Management CTO’s office, focused on emerging technologies; he came to IBM through the FileNet acquisition, where he was its flagship product’s lead product manager. With more than 20 years in the industry, and as a veteran of two startups, Tom is an expert on the technical, strategic, and business information management issues facing the enterprise today. Tom earned a B.A. degree from Fordham University in New York and an M.B.A. degree from the University of Maryland University College.

David Corrigan, B.A., M.B.A., is currently the Director of Product Marketing for IBM’s InfoSphere portfolio, which is focused on managing trusted information. His primary focus is driving the messaging and strategy for the InfoSphere portfolio of information integration, data quality, master data management (MDM), data lifecycle management, and data privacy and security. Prior to his current role, David led the product management and product marketing teams for IBM’s MDM portfolio, and has worked in the Information Management space for over 12 years. David holds an M.B.A. degree from York University’s Schulich School of Business, and an undergraduate degree from the University of Toronto.

James Giles, BSEE, B.Math, MSEE, Ph.D., is an IBM Distinguished Engineer and currently a Senior Development Manager for the IBM InfoSphere BigInsights and IBM InfoSphere Streams Big Data products. Previously, Jim managed the Advanced Platform Services group at the IBM T. J. Watson Research Center, where Jim and his team developed the technology for the System S stream-processing prototype, which is now the basis for InfoSphere Streams. Jim joined IBM in 2000 as a Research Staff Member and led research and development in content distribution, policy management, autonomic computing, and security. He received his Ph.D. in electrical and computer engineering from the University of Illinois at Urbana-Champaign, where he studied covert communications in data networks. Jim has several patents and is the recipient of an IBM Corporate Award for his work on stream computing.

About the Technical Editor

Roman B. Melnyk, B.A., M.A., Ph.D., is a senior member of the DB2 Information Development team. During more than 18 years at IBM, Roman has written numerous books, articles, and other related materials about DB2. Roman coauthored DB2 Version 8: The Official Guide; DB2: The Complete Reference; DB2 Fundamentals Certification for Dummies; and DB2 for Dummies.

Harness the Power of Big Data
The IBM Big Data Platform

Paul C. Zikopoulos
Dirk deRoos
Krishnan Parasuraman
Thomas Deutsch
David Corrigan
James Giles

New York Chicago San Francisco Lisbon London Madrid Mexico City Milan New Delhi San Juan Seoul Singapore Sydney Toronto

Copyright © 2013 by The McGraw-Hill Companies. All rights reserved.
Except as permitted under the United States Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without the prior written permission of the publisher.

ISBN: 978-0-07-180818-7
MHID: 0-07-180818-3

The material in this eBook also appears in the print version of this title: ISBN: 978-0-07-180817-0, MHID: 0-07-180817-5.

McGraw-Hill eBooks are available at special quantity discounts to use as premiums and sales promotions, or for use in corporate training programs. To contact a representative please e-mail us at bulksales@mcgraw-hill.com.

All trademarks are trademarks of their respective owners. Rather than put a trademark symbol after every occurrence of a trademarked name, we use names in an editorial fashion only, and to the benefit of the trademark owner, with no intention of infringement of the trademark. Where such designations appear in this book, they have been printed with initial caps.

Information has been obtained by McGraw-Hill from sources believed to be reliable. However, because of the possibility of human or mechanical error by our sources, McGraw-Hill, or others, McGraw-Hill does not guarantee the accuracy, adequacy, or completeness of any information and is not responsible for any errors or omissions or the results obtained from the use of such information.

The contents of this book represent those features that may or may not be available in the current release of any products mentioned within this book despite what the book may say. IBM reserves the right to include or exclude any functionality mentioned in this book for the current or subsequent releases of InfoSphere Streams, InfoSphere BigInsights, the family of IBM PureData Systems, or any other IBM products mentioned in this book. Decisions to purchase any IBM software should not be made based on the features said to be available in this book.
In addition, any performance claims made in this book aren’t official communications by IBM; rather, they are the results observed by the authors in unaudited testing. The views expressed in this book are also those of the authors and not necessarily those of IBM Corporation.

TERMS OF USE

This is a copyrighted work and The McGraw-Hill Companies, Inc. (“McGraw-Hill”) and its licensors reserve all rights in and to the work. Use of this work is subject to these terms. Except as permitted under the Copyright Act of 1976 and the right to store and retrieve one copy of the work, you may not decompile, disassemble, reverse engineer, reproduce, modify, create derivative works based upon, transmit, distribute, disseminate, sell, publish or sublicense the work or any part of it without McGraw-Hill’s prior consent. You may use the work for your own noncommercial and personal use; any other use of the work is strictly prohibited. Your right to use the work may be terminated if you fail to comply with these terms.

THE WORK IS PROVIDED “AS IS.” McGRAW-HILL AND ITS LICENSORS MAKE NO GUARANTEES OR WARRANTIES AS TO THE ACCURACY, ADEQUACY OR COMPLETENESS OF OR RESULTS TO BE OBTAINED FROM USING THE WORK, INCLUDING ANY INFORMATION THAT CAN BE ACCESSED THROUGH THE WORK VIA HYPERLINK OR OTHERWISE, AND EXPRESSLY DISCLAIM ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. McGraw-Hill and its licensors do not warrant or guarantee that the functions contained in the work will meet your requirements or that its operation will be uninterrupted or error free. Neither McGraw-Hill nor its licensors shall be liable to you or anyone else for any inaccuracy, error or omission, regardless of cause, in the work or for any damages resulting therefrom. McGraw-Hill has no responsibility for the content of any information accessed through the work.
Under no circumstances shall McGraw-Hill and/or its licensors be liable for any indirect, incidental, special, punitive, consequential or similar damages that result from the use of or inability to use the work, even if any of them has been advised of the possibility of such damages. This limitation of liability shall apply to any claim or cause whatsoever whether such claim or cause arises in contract, tort or otherwise.

My sixteenth book in my nineteenth year at IBM. Looking back, as this collection of books literally occupies an entire shelf, one thing strikes me: the caliber of people I work with. From this authoring team (some of whom are newfound friends), to past ones, I’m luckily surrounded by some of the smartest and most passionate professionals in the world: IBMers—and it’s an honor to learn from you all. To the people who have created an environment in which I never want to slow down (Martin Wildberger, Bob Picciano, Dale Rebhorn, and Alyse Passarelli), thanks for your mentorship and belief in me, but also your patience with some of those 2 a.m. run-on notes with the red markup. It’s interesting the toll that writing a book takes on your life. For example, I found that my golf handicap experienced double-digit percentage growth after I started to write this one, leaving my retirement portfolio green with envy. (I’d be remiss if I didn’t thank Durham Driving Range’s Dave Dupuis for always greeting me with a smile and listening to me complain as he watches the odd—perhaps more than odd—ball shank hard right.) Although that stuff doesn’t matter, the personal impact and trade-offs you have to make to write a book lead me to my most important thank-you I’ve got to give: to my family, Chloë, Kelly, and the spirit of Grace. You gals keep me strong and in overdrive.
—Paul Zikopoulos

To Sandra, Erik, and Anna: the truly wonderful people I have in my life. Thanks for giving me the time to help make this happen and for your patience with me! I would also like to dedicate my work on this book to my beloved Netherlands national football team, who, yet again, broke my heart this year. May the collaboration of the many authors on this book be an example to you of what teamwork looks like! (Mental note: Never work on a book with a fellow Dutchman.)
—Dirk deRoos

I would like to thank the Netezza team for all the fond memories and good times; and to Brad Terrell…for being my Force Field.
—Krishnan Parasuraman

I would like to thank (again) my slightly less patient (from last year when I thanked them) family for their patience during this process. I would also like to thank Paul Zikopoulos; I’ve lost count of the number of drinks I owe him. Finally, thanks to Nagui Halim, John McPherson, Hamid Pirahesh, and Neil Isford for being such good dance partners in emerging compute spaces.
—Thomas Deutsch

I’d like to thank Karen, Kaitlyn, and Alex for all of their love and support. I’d also like to thank all of my integration and governance colleagues for continuing to drive a strategy that makes this a market-leading platform and a very interesting place to work.
—David Corrigan

I would like to dedicate this book to the tireless IBM Big Data development and research teams worldwide. This book would not be possible without the countless innovations and commitment to building great technology for the enterprise. Thank you all!
—James Giles

CONTENTS

Foreword
Preface
Acknowledgments
About This Book

Part I
The Big Deal About Big Data

1 What Is Big Data?
    Why Is Big Data Important?
    Now, the “What Is Big Data?” Part
    Brought to You by the Letter V: How We Define Big Data
    What About My Data Warehouse in a Big Data World?
    Wrapping It Up

2 Applying Big Data to Business Problems: A Sampling of Use Cases
    When to Consider a Big Data Solution
    Before We Start: Big Data, Jigsaw Puzzles, and Insight
    Big Data Use Cases: Patterns for Big Data Deployment
    You Spent the Money to Instrument It—Now Exploit It!
    IT for IT: Data Center, Machine Data, and Log Analytics
    What, Why, and Who? Social Media Analytics
    Understanding Customer Sentiment
    Social Media Techniques Make the World Your Oyster
    Customer State: Or, Don’t Try to Upsell Me When I Am Mad