1 Clusters
  Building Solutions
  Single vs Many Clusters
  Multitenancy
  Backup & Disaster Recovery
  Cloud Services
  Provisioning
  Summary
2 Compute & Storage
  Computer architecture for Hadoopers
  Commodity servers
  Non-Uniform Memory Access
  Server CPUs & RAM
  The Linux Storage Stack
  Server Form Factors
  1U
  2U
  4U
  Form Factor Price Comparison
  Workload Profiles
  Other Form Factors
  Cluster Configurations and Node Types
  Master Nodes
  Worker Nodes
  Utility Nodes
  Edge Nodes
  Small Cluster Configurations
  Medium Cluster Configurations
  Large Cluster Configurations
3 High Availability
  Planning for Failure
  What we mean by High Availability?
  Lateral or Service HA
  Vertical or Systemic HA
  Automatic or Manual Failover
  How available does it need to be?
  Service Level Objectives
  Percentages
  Percentiles
  Operating for High Availability
  Monitoring
  Playbooks
  High Availability Building Blocks
  Quorums
  Load Balancing
  Database HA
  Ancillary Services
  High Availability of Hadoop Services
  General considerations
  ZooKeeper
  HDFS
  YARN
  HBase High Availability
  KMS
  Hive
  Impala
  Solr
  Oozie
  Flume
  Hue
  Laying out the Services

Hadoop in the Enterprise: Architecture
A Guide to Successful Integration
Jan Kunigk, Lars George, Paul Wilkinson, Ian Buss

Hadoop in the Enterprise: Architecture
by Jan Kunigk, Lars George, Paul Wilkinson, and Ian Buss

Copyright © 2017 Jan Kunigk, Lars George, Ian Buss, and Paul Wilkinson. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Nicole Tache
Production Editor: FILL IN PRODUCTION EDITOR
Copyeditor: FILL IN COPYEDITOR
Proofreader: FILL IN PROOFREADER
Indexer: FILL IN INDEXER
Interior Designer: David
Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

September 2017: First Edition

Revision History for the First Edition
2017-03-22: First Early Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491969274 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Hadoop in the Enterprise: Architecture, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author(s) have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author(s) disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-96927-4
[FILL IN]

Chapter 1. Clusters

Big Data and Apache Hadoop are by no means trivial in practice: there are many moving parts, and each requires its own set of considerations. In fact, each component in Hadoop, for example HDFS, supplies distributed processes that have their own peculiarities and a long list of configuration parameters, all of which may have an impact on your cluster and use case. Or maybe not. You need to whittle everything down in painstaking trial-and-error experiments, or consult whatever documentation you can find. In addition, new releases of Hadoop, as well as your own data pipelines built on top of it, require careful retesting and verification that everything still holds true and works as expected. We will discuss practical solutions to this and many other issues throughout this book, invoking what the authors have learned (and are
still learning) about implementing Hadoop clusters and Big Data solutions at enterprises, both large and small.

One thing is obvious, though: Hadoop is a global player and the leading software stack when it comes to Big Data storage and processing. No matter where you are in the world, you may struggle with the same basic questions around Hadoop, its setup, and its subsequent operation. By the time you have finished reading this book, you should be much more confident in conceiving a Hadoop-based solution that can be applied to various and exciting new use cases.

In this chapter, we kick things off with a discussion of cluster environments, a topic that is often overlooked because it is assumed that the successful proof-of-concept cluster delivering the promised answers is also the production environment running the new solution at scale: automated, reliable, and maintainable. That is often far from the truth.

Building Solutions

Developing for Hadoop is quite unlike common software development, as you are mostly concerned with building not a single, monolithic application but rather a concerted pipeline of distinct pieces, which in the end deliver the final result. Often this is insight into the data that was collected, on which further products are built, such as recommendation or other real-time decision-making engines. Hadoop itself lacks graphical data representation tools, though there are some ways to visualize information during discovery and data analysis, for example using Apache Zeppelin or similar tools with built-in charting support.

In other words, the main task in building Hadoop-based solutions is to apply Big Data Engineering principles, which comprise the selection (and, optionally, creation) of suitable hardware and software components, data sources and preparation steps, processing algorithms, access to and provisioning of the resulting data, and automation of processes for production.

As outlined in Figure 1-1, the Big Data engineer is building a
data pipeline, which might include more traditional software development, for example writing an Apache Spark job that uses the supplied MLlib to apply a linear regression algorithm to the incoming data. But much more needs to be done to establish the whole chain of events that leads to the final result, or the wanted insight.

Figure 1-1 Big Data Engineering

A data pipeline comprises, in very generic terms, the tasks of ingesting the incoming data and staging it for processing, processing the data itself in an automated fashion, triggered by time or data events, and delivering the final results (as in, new or enriched datasets) to the consuming systems.

These tasks are embedded into an environment, one that defines the boundaries and constraints within which the pipeline is developed (see Figure 1-2). In practice, the structure of this environment is often driven by the choice of Hadoop distribution, placing an emphasis on the included Apache projects that form the platform. In recent times, distribution vendors have more often gone their own way, selecting components that are similar to others but not interchangeable (for example, choosing Apache Ranger versus Apache Sentry for authorization within the cluster). This results in vendor dependency, regardless of whether all the tools are open source.

Figure 1-2 Solutions are part of an environment

The result is that an environment is usually a cluster with a specific Hadoop distribution (see [Link to Come]), running one or more data pipelines on top of it, which represent the solution architecture. Each solution is embedded into further rules and guidelines, for example the broader topic of governance, which includes backup (see [Link to Come]), metadata and data management, lineage, security, auditing, and other related tasks. During development, though, or during rapid prototyping, say for a proof-of-concept project, it is common that only parts of the pipeline are built. For example, it may suffice to
stage the source data in HDFS, but not devise a fully automated ingest setup. Or the final provisioning of the results is covered by integration testing assertions, but not connected to the actual consuming systems.

No matter what the focus of the development is, in the end a fully planned data pipeline is a must to be able to deploy the solution in the production environment. It is common for all of the other environments before that to reflect the same approach, making the deployment process more predictable.

Figure 1-3 summarizes the full Big Data Engineering flow, where a mixture of engineers work on each major stage of the solution, including the automated

See the following blog for an excellent description of Oozie HA functionality and the design documents on OOZIE-615. An Oozie HA setup is shown in Figure [Link to Come]. In the diagram, users interact with Oozie via a load balancer, making self-contained requests using the REST API. Oozie servers, keeping their state in an HA database, kick off workflows via launcher jobs (MapReduce jobs) running in YARN. When complete, these jobs make a request to a callback URL, again via a load balancer. Oozie servers also constantly monitor the outcome of launcher jobs.

Figure 3-9 Oozie High Availability

Deployment Considerations

Oozie HA is straightforward to configure, and there are relatively few considerations for the enterprise architect to be aware of.

Use master nodes
Although not a heavy resource user itself, Oozie will suffer if placed on extremely busy nodes. For this reason, Oozie servers can be placed on master nodes, utility nodes, or edge nodes in the cluster, but not on worker nodes.

Use an HA database
Oozie servers store most of their state in a database, which should itself be HA.

Configure security for HA
Each Oozie server’s Kerberos keytab should contain entries for both the load balancer DNS name and the actual hostname of the server. Similarly, if using SSL, the certificates should contain both the DNS name and
actual hostname as subject alternative names.

Flume

Apache Flume is a framework for ingesting streams of messages into Hadoop and is targeted at the delivery of event streams of small messages (