Managing the Data Lake
Moving to Big Data Analysis
Andy Oram

Managing the Data Lake
by Andy Oram

Copyright © 2015 O’Reilly Media, Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Shannon Cutt
Interior Designer: David Futato
Cover Designer: Karen Montgomery

September 2015: First Edition

Revision History for the First Edition
2015-09-02: First Release
2015-10-20: Second Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Managing the Data Lake and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

Cover photo credit: “55 Flying Fish” by Michal (Flickr)

978-1-491-94168-3
[LSI]

Moving to Big Data Analysis

Can you tell by sailing the surface of a lake whether it has been well maintained? Can local fish and plants survive? Dare you swim? And how about the data maintained in your organization’s data lake? Can you tell whether it’s healthy enough to support your business needs?
An increasing number of organizations maintain fast-growing repositories of data, usually from multiple sources and formatted in multiple ways, that are commonly called “data lakes.” They use a variety of storage and processing tools, especially in the Hadoop family, to extract value quickly and inform key organizational decisions.

This report looks at the common needs that modern organizations have for data management and governance. The MapReduce model, introduced in 2004 in a paper[1] by Jeffrey Dean and Sanjay Ghemawat, completely overturned the way the computing community approached big data analysis. Many other models, such as Spark, have come since then, creating excitement and seeing eager adoption by organizations of all sizes to solve the problems that relational databases were not suited for. But these technologies bring with them new demands for organizing data and keeping track of what you’ve got.

I take it for granted that you understand the value of undertaking a big data initiative, as well as the value of a framework such as Hadoop, and are in the process of transforming the way you manage your organization’s data. I have interviewed a number of experts in data management to find out the common challenges you are about to face, so you can anticipate them and put solutions in place before you find yourself overwhelmed.

Essentially, you’ll need to take care of challenges that never came up with traditional relational databases and data warehouses, or that were handled by the constraints that the relational model placed on data. There is wonderful value in those constraints, and most of us will be entrusting data to relational systems for the foreseeable future. But some data tasks just don’t fit. And once you escape the familiarity and safety of the relational model, you need other tools to manage the inconsistencies, unpredictability, and breakneck pace of the data you’re handling.

The risk of the new tools is having many disparate sources of data, and perhaps
multiple instances of Hadoop or other systems offering analytics operating inefficiently, which in turn causes you to lose track of basic information you need to know about your data. This makes it hard to set up new jobs that could provide input to the questions you urgently need to answer. The fix is to restore some of the controls you had over old data sources through careful planning and coding, while still being flexible and responsive to fast-moving corporate data needs.

The main topics covered in this report are:

Acquisition and ingestion
Data comes nowadays from many different sources: internal business systems, product data from customers, external data providers, public data sets, and more. You can’t force everyone to provide the data in a format that’s convenient for you. Nor can you take the time (as in the old days) to define strict schemas and enter all data into them. The problems of data acquisition and ingestion have to be solved with a degree of automation.

Metadata (cataloguing)
Questions such as who provided the data, when it came in, and how it was formatted (a slew of concerns known as lineage or provenance) are critical to managing your data well. A catalog can keep this metadata and make it available to later stages of processing.

Data preparation and cleaning
Just as you can’t control incoming formats, you can’t control data quality. You will inevitably deal with data that does not conform. Data may be missing, entered in diverse formats, contain errors, and so on. In addition, data might be lost or corrupted because sensors run out of battery power, networks fail, software along the way harbored a bug, or the incoming data had an unrecognized format. Some data users estimate that detecting these anomalies and cleaning the data takes up 90% of their time.

Managing workflows
The actual jobs you run on data need to be linked with the three other stages just described. Users should be able to submit jobs of their own, based on the work done by experts before them, to
handle ingestion, cataloguing, and cleaning. You want staff to quickly get a new visualization or report without waiting weeks for a programmer to code it up.

Access control
Data is the organization’s crown jewels. You can’t give everybody access to all data. In fact, regulations require you to restrict access to sensitive customer data. Security and access controls are therefore critical at all stages of data handling.

Why Companies Move to Hadoop

To set the stage for exploration of data management, it is helpful to remind ourselves of why organizations are moving in the direction of big data tools.

Size
“Volume” is one of the main aspects of big data. Relational databases cannot scale beyond a certain volume due to architecture restrictions. Organizations find that data processing in relational databases takes too long, and as they do more and more analytics, data processing using conventional ETL tools becomes such a big time sink that it holds users back from making full use of the data.

Variety
Typical sources include flat files, RDBMSes, logs from web servers, devices and sensors, and even legacy mainframe data. Sometimes you also want to export data from Hadoop to an RDBMS or other repository.

Free-form data
Some data may be almost completely unstructured, as in the case of product reviews and social media postings. Other data will come to you inconsistently structured. For instance, different data providers may provide the same information in very different formats.

Streaming data
If you don’t keep up with changes in the world around you, it will pass you by, and probably reward a competitor who does adapt to it. Streaming has evolved from a few rare cases, such as stock markets and sensor data, to everyday data such as product usage data and social media.

Fitting the task to the tool
Data maintained in relational databases (let alone cruder storage formats, such as spreadsheets) is structured well for certain analytics. But for new paradigms such as Spark or the MapReduce
model, preparing data can take more time than doing the analytics. Data in normalized relational format resides in many different tables and must be combined to make the format that the analytics engine can efficiently process.

Frequent failures
Modern processing systems such as Hadoop contain redundancy and automatic restart to handle hardware failures or software glitches. Even so, you can expect jobs to be aborted regularly by bad data. You’ll want to get notifications when a job finishes successfully or unsuccessfully. Log files should show you what goes wrong, and you should be able to see how many corrupted rows were discarded and what other errors occurred.

Unless you take management into consideration in advance, you end up unable to make good use of this data. One example comes from a telecom company whose network generated records about the details of phone calls for monthly billing purposes. Their ETL system didn’t ingest data from calls that were dropped or never connected, because no billing was involved. So years later, when they realized they should be looking at which cell towers had low quality, they had no data with which to do so.

A failure to collect or store data may be an extreme example of management problems, but other hindrances, such as storing it in a format that is hard to read or failing to remember when it arrived, will also slow down processing to the point where you give up opportunities for learning insights from your data.

When the telecom company just mentioned realized that they could use information on dropped and incomplete calls, their ETL system required a huge new programming effort and did not have the capacity to store or process the additional data. Modern organizations may frequently get new sources of data from brokers or publicly available repositories, and can’t afford to spend time and resources doing such coding in order to integrate them.

In systems with large, messy data, you have to decide what the system should do when input is
bad. When do you skip a record, when do you run a program to try to fix corrupted data, and when do you abort the whole job? A minor error such as a missing ZIP code probably shouldn’t stop a job, or even prevent that record from being processed. A missing customer ID, though, might prevent you from doing anything useful with the data. (There may be ways to recover from these errors too, as we’ll see.)

Your choice depends of course on your goal. If you’re counting sales of a particular item, you don’t need the customer ID. If you want to update customer records, you probably do.

A more global problem with data ingestion comes when someone changes the order of fields in all the records of an incoming data set. Your program might be able to detect what happened and adjust, or might have to abort.

At some point, old data will pile up and you will have to decide whether to buy more disk space, archive the data (magnetic tape is still in everyday use), or discard it. Archiving or discarding has to be automated to reduce errors. You’ll find old data surprisingly useful if you can manage to hold on to it. And of course, having it readily at hand (instead of on magnetic tape) will permit you to quickly run analytics on that data.

Acquisition and Ingestion

At this point we turn to the steps in data processing. Acquisition comes first. Nowadays it involves much more than moving data from an external source to your own repository. In fact, you may not be storing every source you get data from at all: you might accept streams of fast-changing data from sensors or social media, process them right away, and save only the results.

On the other hand, if you want to keep the incoming data, you may need to convert it to a format understood by Hadoop or other processing tools, such as Avro or Parquet.

The health care field provides a particularly complex data collection case. You may be collecting:

• Electronic health records from hospitals using different formats
• Claims data from health care providers or payers
• Profiles from health plans
• Data from individuals’ fitness devices

Electronic health records illustrate the variety and inconsistency of all these data types. Although there are standards developed by the HL7 standards group, they are implemented differently by each EHR vendor. Furthermore, HL7 exchanges data through several messaging systems that differ from any other kind of data exchange used in the computer field.

In a situation like this, you will probably design several general methods of ingesting data: one to handle the HL7 messages from EHRs, another to handle claims data, and so on. You’ll want to make it easy for a user to choose one of these methods and adjust parameters such as source, destination file, and frequency in order to handle a new data feed.

Successful ingestion requires you to know in detail how the data is coming in. Read the documentation carefully: you may find that the data doesn’t contain what you wanted at all, or needs complex processing to extract just what you need. And the documentation may not be trustworthy, so you have to test your ingestion process on actual input.

As mentioned earlier, you may be able to anticipate how incoming data changes (such as reordered fields) and adapt to it. However, there are risks to doing this. First, your tools become more complicated and harder to maintain. Second, they may make the wrong choice because they think they understand the change and get it wrong.

Another common ingestion task is to create a consolidated record from multiple files of related information that are used frequently together: for example, an Order Header and Details merged into one file.

Hadoop has a particular constraint on incoming data: it was not designed for small files. Input may consist of many small files, but submitting them individually will force a wasteful input process onto Hadoop and perhaps even cause a failure. For this reason, it is recommended that, prior to processing these small files, they be combined into a single
large file to leverage the Hadoop cluster more efficiently.

This example highlights an important principle governing all the processing discussed in this report: use open formats if possible, and leverage everything the open source and free software communities have made available. This will give you more options, because you won’t be locked into one vendor. Open source also makes it easier to hire staff and get them productive quickly.

However, current open source tools don’t do everything you need. You’ll have to fill in the gaps with commercial solutions or hand-crafted scripts. For instance, Sqoop is an excellent tool for importing data from a relational database to Hadoop and supports incremental loads. However, building a complete insert-update-delete solution to keep the Hive table in sync with the RDBMS table would be a pretty complex task. Here you might benefit from Zaloni’s Bedrock product, which offers a Change Data Capture (CDC) action that handles inserts, updates, and deletes and is easy to configure.

Metadata (Cataloguing)

Why do you need to preserve metadata about your data?
Reasons for doing so abound:

• For your analytics, you will want to choose data from the right place and time. For instance, you may want to go back to old data from all your stores in a particular region.
• Data preparation and cleaning require a firm knowledge of which data set you’re working on. Different sets require different types of preparation, based on what you have learned about them historically.
• Analytical methods are often experimental and have some degree of error. To determine whether you can trust results, you may want to check the data that was used to achieve the results, and review how it was processed.
• When something goes wrong in any stage from ingestion through to the processing, you need to quickly pinpoint the data causing the problem. You also must identify the source so you can contact them and make sure the problem doesn’t reoccur in future data sets.
• In addition to cleaning data and preventing errors, you may have other reasons related to quality control to preserve the lineage or provenance of data.
• Access to sensitive data has to be restricted. If users deliberately or inadvertently try to start a job on data they’re not supposed to see, your system should reject the job.
• Regulatory requirements may require the access restrictions mentioned in the previous bullet, as well as imposing other requirements that depend on the data source.
• Licenses may require access restrictions and other special treatment of some data sources.

Ben Sharma, CEO and co-founder of Zaloni, talks about creating “a single source of truth” from the diverse data sets you take in. By creating a data catalog, you can store this metadata for use by downstream programs.

Zaloni divides metadata roughly into three types:

Business metadata
This can include the business names and descriptions that you assign to data fields to make them easier to find and understand. For instance, the technical staff may have a good reason to assign the name loc_outlet to a field that represents a retail
store, but you will want users to be able to find it through common English words. This kind of metadata also covers business rules, such as putting an upper limit (perhaps even a lower limit) on salaries, or determining which data must be removed from some jobs for security and privacy.

Operational metadata
This is generated automatically by the processes described in this report, and includes such things as the source and target locations of data, file size, number of records, how many records were rejected during data preparation or a job run, and the success or failure of that run itself.

Technical metadata
This includes the data’s type and format (text, images, JSON, Avro, etc.) and the structure or schema. This structure includes the names of fields, their data types, their lengths, whether they can be empty, and so on. Structure is commonly provided by a relational database or the headings in a spreadsheet, but may also be added during ingestion and data preparation.

Zaloni’s Bedrock integrates with Apache HCatalog for technical metadata so that other tools in the Hadoop ecosystem can take advantage of the structure definition.

As suggested in the previous list, one can also categorize metadata by the way it is gathered:

• Some metadata is embedded in the data, such as the schema in a relational database.
• Some metadata pertains to the data acquisition process: the source of the data, filename, time of creation, time of acquisition, file size, redundancy checks generated to make sure the transmission was not corrupted, and MD5 hashes generated to uniquely identify a file.
• Some metadata is created during ingestion. For instance, a watermark can be added to a file or to a column within the file. If you take JSON or other relatively unstructured data and create a schema around it, that schema becomes part of the metadata.
• Some metadata is created during a job run, such as the number of records successfully processed, the number of bad fields or bad records, and how long a job
took.

The next question is how to create metadata. Many tools can extract the easy stuff, such as file sizes and timestamps, as the stages of processing proceed. Other metadata requires custom-written programs that do such things as tag particular data fields you’ll want to extract later.

At any stage of processing, you may choose to update the metadata. Each stage can also consult the metadata when applying rules for user access, cleaning, and submitting data to jobs. We’ll see later how, at least in theory, storing feedback in metadata can create an environment of continuous quality improvement.

Currently, one of the huge challenges in data management is communicating metadata to downstream parts of a workflow. A good deal of Zaloni Bedrock’s benefits rest on its ability to do this conveniently. Work is just starting on an open source project named Apache Atlas, which addresses some of these issues as well.

Data Preparation and Cleaning

Assume that your data will come with a certain number of errors, corrupted formats, and duplicates. I’m not using “assume” in a hypothetical sense here: you had better assume the presence of errors or you will be blindsided when they happen. What will be the impacts of such errors? What happens if data transfers don’t complete, for instance? Your workflows should be able to handle the most common problems, and you’ll need to research your data feeds to discover those problems.

A sense of what you can run into comes, like several other examples in this report, from health care. The US government’s Centers for Medicare & Medicaid Services (CMS), which covers a large percentage of health care payments in the country, requires participating health care providers to submit quality data in a format called the Healthcare Effectiveness Data and Information Set (HEDIS). This format is strict, demanding, and absolutely gigantic. Fields that get mixed up or have incorrect coding cost huge amounts of money as providers rush to fix them.

Why is HEDIS hard to fill out?
Because the data is drawn from reports that undergo many processing steps, in paper or electronic forms. You would not want your organs during a surgery to pass through as many hands as HEDIS data does. The doctor’s original note is processed by a business office within the provider, after which it is sent to an outside billing service because payer requirements are so strict and complicated. The forms then go to the insurer, who may question the claim and send it back through the route on which it came. The trek may undergo several iterations, taking months. As the health care provider strives to get payment, lost data and errors in coding are likely to enter the data.

Rest assured, therefore, that your data will need processing and cleaning. There are two types of fixes that require different responses from your organization: fixes that can be done on a single piece of data and fixes that require analytics to be run on large data sets.

Note that even a fix on a single piece of data may be developed by analytics carried out within your organization, or by a vendor. For instance, research can show that the state of California is commonly represented as Ca, CA, Cal, or Cali in data sets. A simple programming check, using fixed strings or regular expressions, can identify the various possible values and harmonize them on a single standard, such as CA.

Similar research can help with the HL7 example I cited earlier, where different vendors implement a standard differently and put data in different places. Once you identify how a particular vendor codes an address, you can write a program to read it into the format of your choice. This program must be updated, of course, if the vendor changes their coding, which probably will happen without notice. That is a good reason for running more analytics.

A missing customer ID probably can’t be fixed by examining a single record, although it is possible you’ll discover the ID entered into a different field of the record. More likely, you’ll run a job to
match customers by name, gender, address, and other characteristics. You can probably find a record in a different data set and be able to trust, with a good deal of confidence, that it’s the customer with the missing ID.

A job can also identify two records that refer to the same customer. This kind of duplication often happens when combining data sets from different sources. It could also happen out in the real world for many reasons: the customer changed his name, moved to a new address, decided to use a different email address, got a misspelled name because someone entered it into the system sloppily, etc.

Another example where a job can help enforce quality is checking city names against ZIP codes in US addresses. If you find two cities with the same ZIP code in your data, at least one is incorrect. Every ZIP code in the U.S. is assigned to only one city (although a city can have many ZIP codes).

To decide what needs to be cleaned up and how, work with the business team and come up with rules for data quality. When you check individual records, typical rules might include:

• Data older than a certain age should be discarded, or marked as less trustworthy because it might have changed.
• Certain fields must not be empty. An empty field may be hard to identify because some people enter meaningless strings such as X or 9999 when they don’t know something. Sometimes you can find the data elsewhere and fill it in, but sometimes you’ll choose to reject the whole record.
• Dates and times must be correct, and must be in a standard format.

Many commercial tools provide built-in functions to do common checks and even make fixes, but many sites write filters of their own at least part of the time.

In addition to checking each field, you usually need some higher-level checks that involve files and metadata. For instance, did incoming data conform to the schema you expected? Are you getting two identical files?
Comparing the MD5 hashes generated on the files is a simple way to determine the answer to the previous question.

The data preparation stage is often where sensitive data, such as financial and health information, is protected. Although terms for this differ, most systems distinguish two types of protection: removing a field completely (often called masking) and changing the field to something innocuous (often called tokenization). As an example of tokenization, test data sets substitute realistic but fake names for real names so that developers can test their code against these sets. Another kind of tokenization is to run the value from a field through a one-way hash (such as MD5), which ensures that the same value is always represented by the same hash, but prevents anyone from deriving the original value. This is a type of pseudonymity.

Often, you need to bring a human back into the loop to clean data. A system may suggest that 1-800REDCROSS is an incorrect phone number because it contains letters (and is one character too long, as well), but a human observer can tell the system to accept it. Over time, the system picks up more and more such information and becomes smarter, even when processing new data sets.

One of the most interesting experiments currently taking place in big data research is a form of continuous quality improvement, according to Ihab Ilyas, professor at the University of Waterloo and co-founder of Tamr. A program analyzes the data to find error patterns and develop some rules to restore consistency. It can then use these rules to catch current and future errors at earlier stages of data processing, and perhaps even fix or suggest fixes to the errors.

Managing Workflows

You have designed your filters and jobs for ingestion, cataloguing metadata, data preparation, and Hadoop itself. Can you make regular, productive use of all these things?
That depends on how easily you can combine the tasks in end-to-end workflows.

First, you should make workflows for each task. How is data from a particular source ingested? Do you have a general workflow to which you can just assign parameters such as the source and type of data?

And how is the workflow triggered? Forcing someone to launch the job manually is a waste of staff time, and prone to errors. You could do something as simple as schedule a job at regular intervals. (Unix and Linux provide cron for that purpose.) YARN is an open source tool that helps with resource allocation and scheduling.

Resource allocation gets particularly complex in the cloud. You want to ensure you can get the number and capacity of systems you need for the turn-around time you need, while avoiding the risk of jobs growing to an enormous, costly scale.

Your workflow processor should also be able to handle triggers, so that when something important happens like the arrival of new data, the job launches on its own. For instance, AWS Data Pipeline lets you specify that a job starts whenever a particular file is uploaded to S3 storage. The open source Oozie project can also start a job based on the availability of data.

Scheduling should also be flexible. One site I talked to sometimes delays a workflow for a few hours when the servers are at capacity.

Having small workflows in place, you should be able to compose larger workflows from subworkflows. In that way you can robustly construct a single workflow covering data acquisition, ingestion (putting it in the right repository), cleansing, format conversion, enrichment, and provisioning of the results.

Most sites have multiple environments: for instance, development, test, and production. It should be possible to run the same workflow in these environments, with different parameters appropriate for each environment. With such a system in place, you can have strong confidence that the programs your developers and testers work on will hold up in production.
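The two ideas just described, composing a larger workflow from subworkflows and running it unchanged in several environments, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Params fields, the environment table, the paths, and the three step functions are all hypothetical, and each step merely records what it would do rather than touching real data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Params:
    """Hypothetical settings that differ from one environment to the next."""
    source: str
    destination: str
    max_records: int


# Illustrative values only; real locations and limits depend on your site.
ENVIRONMENTS: Dict[str, Params] = {
    "development": Params("samples/orders.csv", "dev/orders", 1_000),
    "test": Params("staging/orders.csv", "test/orders", 100_000),
    "production": Params("feeds/orders.csv", "prod/orders", 10_000_000),
}

# A step receives the parameters and the actions recorded so far,
# and returns the extended record of actions.
Step = Callable[[Params, List[str]], List[str]]


def ingest(params: Params, actions: List[str]) -> List[str]:
    return actions + [f"ingest {params.source}"]


def clean(params: Params, actions: List[str]) -> List[str]:
    return actions + [f"clean up to {params.max_records} records"]


def provision(params: Params, actions: List[str]) -> List[str]:
    return actions + [f"provision {params.destination}"]


def compose(*steps: Step) -> Step:
    """Build a larger workflow that runs the given subworkflows in order."""
    def workflow(params: Params, actions: List[str]) -> List[str]:
        for step in steps:
            actions = step(params, actions)
        return actions
    return workflow


# The same composed workflow is used in every environment.
pipeline = compose(ingest, clean, provision)


def run(environment: str) -> List[str]:
    """Run the pipeline with the parameters of the named environment."""
    return pipeline(ENVIRONMENTS[environment], [])
```

Calling run("test") and run("production") exercises exactly the same sequence of steps; only the parameters differ, which is the property that lets results from development and test carry over to production.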
Currently, most sites create workflows through a programming language. Some developers use Java because that’s the basic way of creating jobs for Hadoop and related tools. Most use popular scripting languages such as Python or simply the Unix shell. However, not all formats handled by Hadoop are supported by all languages. Libraries are continually being added to fill the gap, but you are likely to find a need to incorporate a Java program to format data into your workflow. One advantage of using a programming or scripting language is that you can use source control and testing as you would on any program.

Ideally, users without a technical background could construct and launch their own workflows. To enable this, Zaloni provides a graphical user interface where users can drag and drop predefined workflows, connect them by dragging arrows between them, and then schedule the job.

Job failures, as mentioned before, may sometimes be handled by rerunning the job at various levels of your system, but you’ll have to plan what to do if the job can’t recover from an error. Thus, workflows should send notifications on important events, particularly success or failure. They should also embody rules to decide when to skip a record, or when to stop entirely.

For instance, suppose you have two rules during data preparation, one making sure that the input is a number and the other making sure it’s within an allowed range. If the input isn’t a number, it would be meaningless to check it against a range, and there is no point to running the second rule.

After a run, reports can include lots of useful statistics in addition to success or failure. How many records were dropped because they were corrupt? Were input files missing? What were the percentages of such errors, in relation to the whole job?
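As a minimal sketch of the rule ordering and run statistics just described, the following Python applies a number check before a range check, stops at the first failing rule for each record, and reports how many records were rejected. The function names, the report fields, and the abort_above threshold are assumptions made for illustration; they do not come from any particular tool.

```python
from typing import Callable, Dict, List, Optional

# A rule returns None when the value passes, or an error message when it fails.
Rule = Callable[[str], Optional[str]]


def is_number(value: str) -> Optional[str]:
    try:
        float(value)
    except ValueError:
        return "not a number"
    return None


def in_range(low: float, high: float) -> Rule:
    """Build a range rule; it assumes is_number has already passed."""
    def check(value: str) -> Optional[str]:
        if low <= float(value) <= high:
            return None
        return f"outside {low}..{high}"
    return check


def first_failure(value: str, rules: List[Rule]) -> Optional[str]:
    """Apply rules in order and stop at the first failure,
    because later rules would be meaningless for that value."""
    for rule in rules:
        error = rule(value)
        if error is not None:
            return error
    return None


def run(values: List[str], rules: List[Rule], abort_above: float = 0.5) -> Dict:
    """Skip bad records, then flag the run as aborted when the share of
    rejected records exceeds abort_above (a hypothetical threshold)."""
    kept, rejected = [], []
    for value in values:
        error = first_failure(value, rules)
        if error is None:
            kept.append(value)
        else:
            rejected.append((value, error))
    rate = len(rejected) / len(values) if values else 0.0
    return {
        "kept": kept,
        "rejected": rejected,
        "reject_rate": rate,
        "aborted": rate > abort_above,
    }
```

For the input ["42", "abc", "250", "7"] with the rules [is_number, in_range(0, 100)], the record "abc" is rejected by the first rule and never reaches the range check, while "250" passes the first rule and fails the second. The resulting counts and reject rate are the kind of statistics a notification or log entry could carry.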
Such information can be stored in logs and then processed by the operations team to produce web displays and dashboards. You’ll want to track errors on several levels: for a single job, for a collection of jobs, and over time. That way you can tell whether your input data is slipping in quality, and whether your tools are doing as good a job as they did on the data where you first ran them.

Your metadata catalog can prove valuable at the error stage. The operations team should be able to see from a log or other report where the problem occurred (which file, which record) and go back to the original data to diagnose the cause. Tags and watermarks enable this forensic research. The tag you assign to a particular column in a particular source should last throughout the pipeline and appear in the log entry that reports a problem.

Access Control

We have seen that access control is crucial for organizational safety, privacy, and regulatory compliance. Large organizations achieve security by dividing users into groups (research teams, operations teams, etc.) and grouping data into resources with access rights. Then you can grant users or groups access to particular data resources. For instance, one research team may be researching the effectiveness of a website, so you can grant it access to all logs and data about the website without letting it see other things such as sales data.

One site I talked to isolated personally identifiable information (PII) through a hybrid solution. It’s often easy to tell by the column name whether data is personally identifiable, and route such columns to a different repository with different access rights. Sometimes a processor needs to tag data with special identifiers so that it is routed later to the secure repository. Each stage, including the analytics, can be restricted to the repositories that don’t contain PII.

In another sub-optimal type of operating environment, teams keep their data and Hadoop jobs separate within silos or “data puddles.”
With appropriate access controls, your organization should be able to save money, increase security, and leverage your data better by managing it all in a systematic manner.

Conclusion

A recent report[2] found that governments and other organizations are opening up large quantities of data, but many of the companies who could benefit from it don’t know it exists. The same problem can happen within your own organization.

Hadoop, at its core, is a file system and a set of libraries to process large quantities of data. Management of that data (ingestion, data preparation, job scheduling, and access rights) must be addressed by other tools. Tools such as Sqoop and YARN are emerging in the open source community to pick off various pieces of the data management problem. You should use robust open source tools where they are available and keep data in transparent formats so that it can be submitted to these tools, while taking advantage of commercial products aimed at the data lake.

You’re spending a lot of money to accumulate and store data. Therefore, the people who need the data must be able to find it and combine it quickly into analytic jobs that produce useful insights. Recognizing the specific tasks you need for acquisition and ingestion, cataloguing, data cleaning, and analytical jobs can help you prepare for the problems you’ll encounter in these phases and have production-ready solutions at hand. Workflows and access control contribute important management solutions across the entire system. All that shiny data is there for your users to enjoy, so make it a pleasure for them.

[1] http://bit.ly/1hJyJzi (PDF)
[2] http://bit.ly/1WWHAxC (PDF)

About the Author

Andy Oram is an editor at O’Reilly Media. An employee of the company since 1992, Andy currently specializes in programming and health IT. His work for O’Reilly includes the first books ever published commercially in the United States on Linux, and the 2001 title Peer-to-Peer ...