Building Big Data Applications
Krish Krishnan

Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

Copyright © 2020 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies, and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

ISBN: 978-0-12-815746-6

For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Mara Conner
Acquisition Editor: Mara Conner
Editorial Project Manager: Joanna Collett
Production Project Manager: Punithavathy Govindaradjane
Cover Designer: Mark Rogers
Typeset by TNQ Technologies

Dedicated to all my teachers

Preface

In the world that we live in today, it is very easy to manifest and analyze data at any given instance. Such insightful analytics is worth every executive's time for making decisions that impact the organization today and tomorrow. This analytics is what we have called Big Data analytics since the year 2010, and our teams have been struggling to understand how to integrate data with the right metadata and master data in order to produce a meaningful platform that can be used to produce these insightful analytics. Nor is the commercial space alone in its interest; scientific research and engineering teams very much want to study the data and build applications on top of it.

The efforts taken to produce Big Data applications have been sporadic when measured in terms of success. Why is that? The question is being asked by folks across the industry. In my experience of working in this specific space, what I have realized is
that we are still working with data that is vast in terms of volume and produced very fast, on demand, by any consumer, leading to metadata integration issues. This metadata integration issue can be handled if we make it an enterprise solution, so that vendors in the space need not necessarily worry about their integration with a Big Data platform. This integration is handled through tools that have been built for data integration and transformation. Another interesting perspective is that while the data is voluminous and produced very fast, it can be integrated and harvested like any enterprise data segment.

We require the new data architecture to be flexible and scalable, accommodating new additions, updates, and integrations, in order to be successful in building a foundation platform. This data architecture will differ from the third normal form and star schema models that we built the data warehouse from. The new architecture will require more integration and just-in-time additions, which are better represented by NoSQL database architectures. How do we get to this success factor? And how do we make the enterprise realize that new approaches are needed to ensure success and reach the tipping point of a successful implementation?

Our executives are always known for asking questions about the lineage of data and its traceability. These questions can be handled today in the data architecture and engineering, provided we as an enterprise take a few minutes to step back and analyze why our past journeys were not successful enough, and how we can be impactful in the future journey of delivering the Big Data application. The hidden secret here rests in the form of governance within the enterprise. Governance is not about measuring people; it is about ensuring that all processes have been followed and completed per requirements, and that all specifics are in place for delivering on-demand lineage and traceability.

In writing this book, specific points have been discussed about the architecture and governance required to ensure success in Big Data applications. The goal of the book is to share the secrets that have been leveraged by different segments of people in their Big Data application projects, and the risks that they had to overcome to become successful. The chapters in the book present different types of scenarios that we all encounter, and in this process the goals of reproducibility and repeatability for ensuring experimental success are demonstrated. If you ever wondered what the foundational difference in building a Big Data application is, it is that the datasets can be harvested and an experimental stage can be repeated if all of the steps are documented and implemented as specified in the requirements. Any team that wants to become successful in the new world needs to remember that we have to follow and implement governance in order to become measurable. Measuring process completion is mandatory to become successful; as you read the book, revisit this point and draw the highlights from it.

In developing this book, I have had several discussions with teams from both commercial enterprises and research organizations, and I thank all contributors for their time, insights, and sharing of their endeavors. It did take time to ensure that all the relevant people across these teams were sought out, and tipping points of failure were discussed in order to
understand the risks that could be identified and avoided in the journey. Several reference points have been added to the chapters, and while the book is not all-encompassing by any means, it does provide any team that wants to understand how to build a Big Data application choices of how success can be accomplished, as well as case studies that vendors have shared showcasing how companies have implemented technologies to build the final solution. I thank all vendors who provided material for the book, and in particular IO-Tahoe, Teradata, and Kinetica for access to their teams to discuss the case studies.

I thank my entire editorial and publishing team at Elsevier for their continued support in this journey, and for their patience in ensuring the completion of the book that is in your hands today.

Last but not the least, I thank my wife and our two sons for the continued inspiration and motivation for me to write. Your love and support is a motivation.

Big Data introduction

This chapter is a brief introduction to Big Data, providing readers the history, where we are today, and the future of data. The reader will get a refresher view of the topic.

The world we live in today is flooded with data all around us, produced at rates that we have not experienced before, and analyzed for usage at rates that we had previously only heard of as requirements and can now fulfill. What is the phenomenon called "Big Data," and how has it transformed our lives today? Let us take a look back at history. In 2001, when Doug Laney was working with Meta Group, he forecasted a trend that would create a new wave of innovation, articulating that the trend would be driven by the three V's, namely volume, velocity, and variety of data. Continuing in 2009, he wrote the first premise on how "Big Data," as the term he coined, would impact the lives of all consumers using it. A more radical rush was seen in the industry with the embrace of Hadoop technology, followed by NoSQL technologies of different varieties, ultimately driving the evolution of new data visualization, analytics, storyboarding, and storytelling. In a lighter vein, SAP published a cartoon which read that Big Data brings four words: "Make Me More Money." This is the confusion we need to steer clear of, and we must be ready to understand how to monetize from Big Data. First, to understand how to build applications with Big Data, we need to look at Big Data from both the technology and data perspectives.

Big Data delivers business value

The e-commerce market has shaped businesses around the world into a competitive platform where we can sell and buy what we need based on costs, quality, and preference. The spread of services ranges from personal care, beauty, healthy eating, clothing, perfumes, watches, jewelry, medicine, travel, tours, and investments, and the list goes on. All of this activity has resulted in data of various formats, sizes, languages, symbols, currencies, volumes, and additional metadata, which we collectively call "Big Data" today. The phenomenon has driven unprecedented value to business and can deliver insights like never before. The business value did not and does not stop here; we are seeing the use of the same techniques of Big Data processing across insurance, healthcare, research, physics, cancer treatment, fraud analytics, manufacturing, retail, banking,
mortgage, and more. The biggest question is how to realize the value repeatedly. What formula will bring success and value, and how do we monetize the effort? Take a step back for a moment and assess the same question with investments that have been made into a Salesforce or Unica or Endeca implementation and the business value that you can drive from them. Chances are you will not have an accurate picture of the amount of return on investment, or the percentage of impact in terms of increased revenue, decreased spend, or process optimization from any such prior experiences. It is not that your teams did not measure the impact, but they are unsure of expressing the actual benefit in quantified metrics. In the case of a Big Data implementation, however, there are techniques to establish a quantified measurement strategy and associate the overall program with such cost benefits and process optimizations.

The interesting question to ask is: what are organizations doing with Big Data? Are they collecting it, studying it, and working with it for advanced analytics? How exactly does the puzzle called Big Data fit into an organization's strategy, and how does it enhance corporate decision-making? To understand this picture better, there are some key questions to think about (these are a few; you can add more to this list):

- How many days does it take, on average, to get answers to the question "why"?
- How many cycles of research does the organization go through to understand the market, competition, sales, employee performance, and customer satisfaction?
- Can your organization provide an executive dashboard along the Zachman Framework model to provide insights and business answers on who, what, where, when, and how?
- Can we have a low-code application that is orchestrated with a workflow and can provide metrics and indicators on key processes?
- Do you have volumes of data but have no idea how to use it, or do you not collect it at all?
- Do you have issues with historical analysis?
- Do you experience issues with how to replay events, simple or complex?

The focus of answering these questions through the eyes of data is essential. There is an abundance of data in any organization today, and there is a lot of hidden data or information in these nuggets that has to be harvested. Consider the following data:

- Traditional business systems: ERP, SCM, CRM, SFA
- Content management platforms
- Portals
- Websites
- Third-party agency data
- Data collected from social media
- Statistical data
- Research and competitive analysis data
- Point-of-sale data: retail or web channel
- Legal contracts
- Emails

If you observe a pattern here, there is data about customers, products, services, sentiments, competition, compliance, and much more available. The question is: does the organization leverage all the data that is listed here? And more important, can you access all this data with relative ease and implement decisions?
This is where the platforms and analytics of Big Data come into the picture within the enterprise. Of the data nuggets that we have described, 50% or more come from internal systems and data producers that have been used for gathering data but not for harnessing analytical value (the data here is structured, semistructured, and unstructured); the other 50% or less is the new data that is called Big Data (web data, machine data, and sensor data). Big Data applications are the answer to leveraging the analytics from complex events and getting articulate insights for the enterprise. Consider the following example:

Call center optimization. The worst fear of a customer is to deal with the call center. The fundamental frustration for the customer is the need to explain all the details about their transactions with the company they are calling, the current situation, and what they are expecting for a resolution, not once but many times (in most cases), to many people, and maybe in more than one conversation. All of this frustration can be vented on their Facebook page or Twitter or a social media blog, causing multiple issues:

- They will have an influence in their personal network that will cause potential attrition of prospects and customers.
- Their frustration may be shared by many others and eventually result in class action lawsuits.
- Their frustration will provide an opportunity for the competition to pursue and sway customers and prospects.

All of these actions lead to one factor called "revenue loss." If this company continues to persist with poor quality of service, eventually the losses will be large, even leading to closure of business and loss of brand reputation. It is in situations like this that you can find a lot of knowledge in connecting the dots with data and create a powerful set of analytics to drive business transformation. Business transformation does not mean you need to change your operating model; rather, it provides opportunities to create new service models built on data-driven decisions and analytics.

Let us assume the company we are discussing decides that the current solution needs an overhaul and that the customer needs to be provided the best quality of service. It will then need to have the following types of data ready for analysis and usage:

- Customer profile, lifetime value, transactional history, segmentation models, social profiles (if provided)
- Customer sentiments, survey feedback, call center interactions
- Product analytics
- Competitive research
- Contracts and agreements (customer specific)

We should define a metadata-driven architecture to integrate the data for creating these analytics; there is also the nuance of selecting the right technology and architecture for the physical deployment, as the sketch below illustrates.
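To make "metadata-driven" concrete, here is a minimal sketch assuming a simple registry that records each source's fields, business key, and lineage. All source names, fields, and the registry layout are hypothetical illustrations, not the book's architecture or any vendor's product.

```python
# Hypothetical sketch of metadata-driven integration for the call-center
# example. Source names, fields, and registry layout are illustrative only.
from __future__ import annotations
from dataclasses import dataclass
from typing import Any

# The metadata registry: each source declares its fields, the business key
# used to link records, and lineage information for traceability.
SOURCE_REGISTRY = {
    "crm_profile": {"key": "customer_id",
                    "fields": ["customer_id", "lifetime_value", "segment"],
                    "lineage": "CRM nightly extract"},
    "call_center": {"key": "customer_id",
                    "fields": ["customer_id", "call_ts", "sentiment"],
                    "lineage": "call center event stream"},
}

@dataclass
class IntegratedRecord:
    customer_id: str
    attributes: dict   # merged attributes across sources
    lineage: list      # which sources contributed, for on-demand traceability

def integrate(raw_batches: dict[str, list[dict[str, Any]]]) -> dict[str, IntegratedRecord]:
    """Merge raw records from each registered source into one 360-degree view."""
    merged: dict[str, IntegratedRecord] = {}
    for source, records in raw_batches.items():
        meta = SOURCE_REGISTRY[source]          # unregistered sources fail fast
        for rec in records:
            filtered = {f: rec[f] for f in meta["fields"] if f in rec}
            cid = filtered[meta["key"]]
            entry = merged.setdefault(cid, IntegratedRecord(cid, {}, []))
            entry.attributes.update(filtered)
            entry.lineage.append(meta["lineage"])  # governance: record provenance
    return merged

if __name__ == "__main__":
    batches = {
        "crm_profile": [{"customer_id": "C42", "lifetime_value": 1800.0, "segment": "gold"}],
        "call_center": [{"customer_id": "C42", "call_ts": "2019-07-01T10:02", "sentiment": "negative"}],
    }
    view = integrate(batches)
    print(view["C42"].attributes, view["C42"].lineage)
```

The point of the registry is that adding a new data source becomes a metadata change rather than a code change, and the accumulated lineage list is what supports the on-demand traceability that executives ask for.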
A few days later, the customer calls for support, and the call center agent now has a mash-up showing different types of analytics presented to them. The agent is able to ask the customer guided questions on the current call and apprise them of the solutions and timelines; rather than asking for information, they are providing a knowledge service. In this situation the customer feels more privileged, and even if there are issues with the service or product, the customer will not likely attrite. Furthermore, the same customer can now share positive feedback and report their satisfaction, creating a potential opportunity for more revenue. The agent feels more empowered and can start having conversations on cross-sell and up-sell opportunities. In this situation, there is a likelihood of additional revenue and diminished opportunities for loss of revenue.

This is the type of business opportunity that Big Data analytics (internal and external) will bring to the organization, in addition to improving efficiencies, creating optimizations, and reducing risks and overall costs. There is some initial investment involved in creating this data strategy and architecture and in implementing additional technology solutions. The return on investment will offset these costs and may even save on license costs from technologies that can be retired after the new solution. We see the absolute clarity that can be leveraged from an implementation of the Big Data-driven call center, which will provide the customer with confidence, the call center associate with clarity, and the enterprise with fine details including competition, noise, campaigns, social media presence, the ability to see what customers in the same age group and location are sharing, similar calls, and results. All of this can be easily accomplished if we set the right strategy in motion for implementing Big Data applications. This requires us to understand the underlying infrastructure and how to leverage it for the implementation, which is the next segment of this chapter.

Use cases from industry vendors

RESULTS

Analyzing Breadcrumb Data to Improve Services
Kinetica was used to collect, process, and analyze over 200,000 USPS breadcrumb messages per minute. That data was used to determine actual delivery and collection times and to provide more accurate predictions to mailers and customers. By analyzing this breadcrumb data, the USPS was able to 1) understand where spending would achieve the best results, 2) make faster, better-informed decisions, 3) provide customers with a reliable service, and 4) reduce costs by streamlining deliveries.

Processing Geospatial Data for Real-Time Decision-making
Kinetica was also used to enable visualization of geospatial data (routes, delivery points, and collection point data) so that dispatchers could monitor employee territory assignments and take proper action if needed. Kinetica helped the USPS make the best use of routes and track coverage of assigned areas, uncovered areas, and distribution bottlenecks. They were also able to improve contingency planning if a carrier was unable to deliver to assigned routes.

[Figure 2: Zoomed-in area showing collection box metadata when selected]
[Figure 3: Territory reassignment tool shows two route boundaries and the actual territory sections within them that can be moved between the two routes]

Optimizing Routes
Aggregating point-to-point carrier performance data, Kinetica springs to life the moment mail carriers depart a USPS Origin Facility. By tracking carrier movements in real time, Kinetica provides the USPS with immediate visibility into the status of deliveries anywhere in the country, along with information on how each route is progressing, how many drivers are on the road, how many deliveries each driver is making, where their last stop was, and more. Course corrections are then made so a carrier is within the optimal geographical boundaries at all times, helping reduce unnecessary transport costs. Optimizing routes results in on-time delivery, fewer trucks handling a greater number of deliveries, and narrower delivery windows.
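The case study does not disclose the internals of the USPS pipeline, but the pattern it describes (streaming breadcrumb messages checked against route boundaries) can be sketched roughly as follows. All names, coordinates, and the bounding-box simplification are assumptions for illustration; a production system would use a streaming engine and a GPU-accelerated geospatial database rather than in-process Python.

```python
# Hypothetical sketch of the breadcrumb pattern described above: consume
# GPS "breadcrumb" messages and flag carriers outside their route's
# boundary so dispatchers can make course corrections.
from dataclasses import dataclass

@dataclass(frozen=True)
class Breadcrumb:
    carrier_id: str
    lat: float
    lon: float

# Illustrative route boundaries as (min_lat, min_lon, max_lat, max_lon);
# a real system would model full route polygons, not boxes.
ROUTE_BOUNDS = {"R-101": (38.88, -77.05, 38.92, -76.99)}
CARRIER_ROUTE = {"CAR-7": "R-101"}

def off_route(crumb: Breadcrumb) -> bool:
    """True if the carrier's position falls outside its assigned route box."""
    min_lat, min_lon, max_lat, max_lon = ROUTE_BOUNDS[CARRIER_ROUTE[crumb.carrier_id]]
    return not (min_lat <= crumb.lat <= max_lat and min_lon <= crumb.lon <= max_lon)

if __name__ == "__main__":
    stream = [
        Breadcrumb("CAR-7", 38.90, -77.02),  # inside the assigned boundary
        Breadcrumb("CAR-7", 38.95, -77.02),  # drifted north of the boundary
    ]
    for crumb in stream:
        if off_route(crumb):
            print(f"dispatch alert: {crumb.carrier_id} outside assigned route")
```

At the reported rate of over 200,000 messages per minute, the containment check itself is cheap; the engineering challenge the case study points to is sustaining that ingest rate while keeping the data queryable for dashboards in real time.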
IDC INNOVATION EXCELLENCE AWARD RECIPIENT
Due to the success of the project, USPS was named a 2016 recipient of International Data Corporation (IDC)'s HPC Innovation Excellence Award for its use of Kinetica to track the location of employees and individual pieces of mail in real time.

"The HPC Innovation Excellence Awards recognize organizations that have excelled in applying advanced supercomputing technologies to accelerate science, engineering, and society at large," said Kevin Monroe, senior research analyst at IDC. "USPS' application of Kinetica enhances the quality of service that US citizens receive by giving them a better, more predictable experience sending and receiving mail."

"We're honored to have had the opportunity to partner with the US Postal Service and are humbled by the profound impact our technology is having on their everyday business operations," said Amit Vij, co-founder and CEO, Kinetica. "This IDC Award demonstrates the tremendous value that real-time analytics and visualizations can have when applied to solve supply chains, logistics, and a range of other business challenges."

SUMMARY
The complexities and dynamics of USPS' logistics have reached all-time highs, while consumers have greater demands and more alternative options than ever before. Improving end-to-end business process performance while reducing costs at the same time requires the ability to make fast business decisions based on live data. By implementing Kinetica's GPU-accelerated database, the USPS is expected to save millions of dollars in the years to come, deliver more sophisticated services, achieve more accurate tracking capabilities, ensure safer, on-time deliveries, and increase operational efficiency. For more information on Kinetica and GPU-accelerated databases, visit kinetica.com.

Lippo Group

Getting to Know You: How Lippo Group Leverages a GPU Database to Understand Their Customers and Make Better Business Decisions
Joe Lee, VP, Asia Pacific Japan | January 26, 2018

Lippo Group is a prominent conglomerate with significant investments in digital technologies, education, financial services, healthcare, hospitality, media, IT, telecommunications, real estate, entertainment, and retail. Lippo Group has a large footprint around the world, with properties and services not only in Indonesia but also in Singapore, Hong Kong, Los Angeles, Shanghai, Myanmar, and Malaysia. Lippo Group in Indonesia connects and serves over 120 million consumers, aggressively investing in Big Data and analytics technology through OVO. OVO is Lippo Group Digital's concierge platform, integrating mobile payment, loyalty points, and exclusive priority deals.

So, how does one of Asia's largest conglomerates develop 360-degree views of its customers from multiple industry data sources and interact with these customers in a personalized way, while also opening up new revenue streams from data monetization?
The answer lies in a combination of Kinetica's centralized digital analytics platform and OVO's "Smart Digital Channel," which will help marketers reach out to their audiences and better understand the customer journey.

Challenge: Create a Next-Gen Centralized Analytics Platform
The Lippo Group business ecosystem is vast, with data coming in from multiple lines of business across several industries. The goal was to consolidate fragmented customer data generated by transactional systems from various subsidiaries into a centralized analytics platform. Once consolidated, the data can then be analyzed to generate a 360-degree view of the customer profile and journey.

Solution: Deep and Fast Analytics Powered by Kinetica and NVIDIA
Lippo Group turned to Kinetica and NVIDIA to spur innovation in their Big Data and analytics strategy. The first project was to derive insights on customers' digital lifestyles, preferences, and spending behavior, and to deliver them in the form of an API. Kinetica's distributed, in-memory database combines customer profile, buying behavior, sentiment, and shopping trends based on a variety of cross-industry data sources with millisecond latency, powering analytics at the speed of thought.

The new platform is now playing a key role in their multiple-industry Big Data initiative. They are able to develop a full 360-degree profile of each customer, using multiple dimensions of customer attributes to describe their behavior, such as cross-channel behavior; CRM, demographic, behavioral, and geographic data; brand sentiment; social media behavior; and purchasing behavior via e-commerce, e-wallets, and loyalty programs. Lippo Group can now correlate all customer information and transactions to understand their profile and preferences in order to interact with them in a personalized way. They can also associate household members' activities and preferences as consolidated profiles to deliver family-based personalized offers and improve their service experiences. By having these types of 360-degree profiles, Lippo Group can improve overall customer experience, conversion rates, campaign take-up rates, inventory and product life cycle management, and future purchasing predictions.

The Technology Behind the Curtain
The underlying technology consists of a Hadoop big data cluster and an analytics layer on Kinetica. The technology stack evolved as follows to bring Lippo Group's analytics to the next generation:

Before: Lippo Group's legacy analytics stack consisted of Impala, Hive, and HBase.
After: Lippo Group's next-gen analytics API cluster runs on Kinetica, with the ability to augment with machine learning, powered by NVIDIA GPUs.

Lippo can leverage any existing digital channel within the organization to deliver personalized messages and offers to their customers and embed machine learning for product recommendations. The same API can be reused multiple times across channels. OVO will be able to correlate customer, transaction, and location information, and geospatial analytics will deliver actionable insights for Lippo business units to better understand business performance and help them outperform competitors. The Big Data API and AI platform enables all OVO touchpoint channels to introduce new and personalized experiences to customers in real time.
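The article describes the reusable analytics API only at a high level. As a rough illustration of the reuse pattern (one profile API serving many channels), here is a minimal sketch; the endpoint, query parameters, and response fields are hypothetical assumptions, not Kinetica's or OVO's actual interface.

```python
# Hypothetical sketch of a channel calling a shared "customer 360" API.
# The host, path, parameters, and response fields are illustrative only.
import json
from urllib import request

API_BASE = "https://analytics.example.com/v1"  # placeholder endpoint

def get_customer_profile(customer_id: str) -> dict:
    """Fetch the consolidated 360-degree profile for one customer."""
    url = f"{API_BASE}/profiles/{customer_id}?include=sentiment,geo,loyalty"
    with request.urlopen(url) as resp:  # the platform targets subsecond latency
        return json.load(resp)

def choose_offer(profile: dict) -> str:
    """Toy channel-side logic: any touchpoint can reuse the same profile."""
    positive = profile.get("brand_sentiment", 0.0) > 0.5
    loyal = "loyalty_member" in profile.get("segments", [])
    return "priority-deal" if positive and loyal else "standard-offer"

if __name__ == "__main__":
    # Offline demonstration with a stubbed profile, since the endpoint
    # above is a placeholder.
    sample = {"brand_sentiment": 0.8, "segments": ["loyalty_member", "mall_visitor"]}
    print(choose_offer(sample))  # -> priority-deal
```

The design point the case study emphasizes is that the heavy correlation work happens once, inside the analytics platform, so each channel only needs thin logic like `choose_offer` against the same API response.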
Results
Lippo Group's mantra of "deep and fast analytics" is opening up new opportunities for improved customer engagement within the organization's business ecosystem, as well as opening up new revenue streams from data monetization through their interactive digital channel. By having a Big Data API and AI platform in place, every company within the Lippo Group ecosystem will be able to access the Analytical and Campaign API Marketplace and perform queries through an analytic API over huge volumes of rich data sets with subsecond latency.

With the support of Kinetica and NVIDIA, Lippo Group Digital is the first enterprise in Indonesia to integrate an AI-enabled, in-memory GPU database. By harnessing technology and using it with intensity, precision, and speed, Lippo Group is now able to understand consumers better, meet their needs and expectations, and make informed operational and product decisions. Learn more about this project from the Lippo Group presentation at Strata.

(Used with permission from Kinetica, http://www.kinetica.com)

Teradata

Case study links:
- Pervasive Data Intelligence Creates a Winning Strategy (https://www.teradata.com/Customers/Larry-HMiller-Sports-and-Entertainment)
- Using Data to Enable People, Businesses, and Society to Grow (https://www.teradata.com/Customers/Swedbank)
- Deep Neural Networks Support Personalized Services for 8M Subscribers in the Czech Republic and Slovakia (https://www.teradata.com/Customers/O2-CzechRepublic)
- Operationalizing Insights Supports Multinational Banking Acquisitions (https://www.teradata.com/Customers/ABANCA)
- More Personal, More Secure Banking: How Pervasive Data Intelligence is Building a More Personalized Banking Experience (https://www.teradata.com/Customers/US-Bank)
- 100 Million Passengers Delivered Personalized Customer Experience (https://www.teradata.com/Customers/Air-FranceKLM)
- Siemens Healthineers Invests in Answers (https://www.teradata.com/Customers/SiemensHealthineers)
- Predictive Parcel Delivery, Taking Action in Real Time to Change the Course of Business (https://www.teradata.com/Customers/Yodel)