Moving Hadoop to the Cloud
Harnessing Cloud Features and Flexibility for Hadoop Clusters
Bill Havanki

Moving Hadoop to the Cloud
by Bill Havanki

Copyright © 2017 Bill Havanki Jr. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Marie Beaugureau
Production Editor: Colleen Cole
Copyeditor: Kim Cofer
Proofreader: Christina Edwards
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

July 2017: First Edition

Revision History for the First Edition
2017-07-05: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491959633 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Moving Hadoop to the Cloud, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-95963-3
[LSI]

Foreword

Apache Hadoop as software is a simple framework that allows for distributed processing of data
across many machines. As a technology, Hadoop and the surrounding ecosystem have changed the way we think about data processing at scale. No longer does our data need to fit in the memory of a single machine, nor are we limited by the I/O of a single machine’s disks. These are powerful tenets.

So too has cloud computing changed our way of thinking. While the notion of colocating machines in a faraway data center isn’t new, allowing users to provision machines on demand is, and it’s changed everything. No longer are developers or architects limited by the processing power installed in on-premise data centers, nor do we need to host small web farms under our desks or in that old storage closet. The pay-as-you-go model has been a boon for ad hoc testing and proof-of-concept efforts, eliminating time spent in purchasing, installation, and setup.

Both Hadoop and cloud computing represent major paradigm shifts, not just in enterprise computing, but in many other industries as well. Much has been written about how these technologies have been used to make advances in retail, public sector, manufacturing, energy, and healthcare, just to name a few. Entire businesses have sprung up as a result, dedicated to the care, feeding, integration, and optimization of these new systems.

It was inevitable that Hadoop workloads would be run on cloud computing providers’ infrastructure. The cloud offers incredible flexibility to users, often complementing on-premise solutions, enabling them to use Hadoop in ways simply not possible previously.

Ever the conscientious software engineer, author Bill Havanki has a strong penchant for documenting. He’s able to break down complex concepts and explain them in simple terms, without making you feel foolish. Bill writes the kind of documentation that you actually enjoy, the kind you find yourself reading long after you’ve discovered the solution to your original problem.

Hadoop and cloud computing are powerful and valuable tools, but aren’t simple technologies by any
means. This stuff is hard. Both have a multitude of configuration options and it’s very easy to become overwhelmed. All major cloud providers offer similar services like virtual machines, network attached storage, relational databases, and object storage (all of which can be utilized by Hadoop), but each provider uses different naming conventions and has different capabilities and limitations. For example, some providers require that resource provisioning occurs in a specific order. Some providers create isolated virtual networks for your machines automatically while others require manual creation and assignment. It can be confusing. Whether you’re working with Hadoop for the first time or a veteran installing on a cloud provider you’ve never used before, knowing about the specifics of each environment will save you a lot of time and pain.

Cloud computing appeals to a dizzying array of users running a wide variety of workloads. Most cloud providers’ official documentation isn’t specific to any particular application (such as Hadoop). Using Hadoop on cloud infrastructure introduces additional architectural issues that need to be considered and addressed. It helps to have a guide to demystify the options specific to Hadoop deployments and to ease you through the setup process on a variety of cloud providers, step by step, providing tips and best practices along the way. This book does precisely that, in a way that I wish had been available when I started working in the cloud computing world.

Whether code or expository prose, Bill’s creations are approachable, sensible, and easy to consume. With this book and its author, you’re in capable hands for your first foray into moving Hadoop to the Cloud.

Alex Moundalexis, May 2017

Preface

It’s late 2015, and I’m staring at a page of mine on my employer’s wiki, trying to think of an OKR. An OKR is something like a performance objective, a goal to accomplish paired with a way to measure if it’s been accomplished. While my management chain
defines OKRs for the company as a whole and major organizations in it, individuals define their own. We grade ourselves on them, but they do not determine how well we performed because they are meant to be aspirational, not necessary. If you meet all your OKRs, they weren’t ambitious enough.

My coworkers had already been impressed with writing that I’d done as part of my job, both in product documentation and in internal presentations, so focusing on a writing task made sense. How aspirational could I get? So I set this down: “Begin writing a technical book! On something! That is, begin working on one myself, or assist someone else in writing one.”

Outright ridiculous, I thought, but why not? How’s that for aspirational? Well, I have an excellent manager who is willing to entertain the ridiculous, and so she encouraged me to float the idea to someone else in our company who dealt with things like employees writing books, and he responded: “Here’s an idea: there is no book out there about Running Hadoop in the Cloud. Would you have enough material at this point?”

I work on a product that aims to make the use of Hadoop clusters in the cloud easier, so it was admittedly an extremely good fit. It didn’t take long at all for this ember of an idea to catch, and the end result is the book you are reading right now.

Who This Book Is For

Between the twin subjects of Hadoop and the cloud, there is more than enough to write about. Since there are already plenty of good Hadoop books out there, this book doesn’t try to duplicate them, and so you should already be familiar with running Hadoop. The details of configuring Hadoop clusters are only covered as needed to get clusters up and running. You can apply your prior Hadoop knowledge with great effectiveness to clusters in the cloud, and much of what other Hadoop books cover still applies.

It is not assumed, however, that you are familiar with the cloud. Perhaps you’ve dabbled in it, spun up an instance or two, read some documentation from a
provider. Perhaps you haven’t even tried it at all, or don’t know where to begin. Readers with next to no knowledge of the cloud will find what they need to get rolling with their Hadoop clusters. Often, someone is tasked by their organization with “moving stuff to the cloud,” and neither the tasker nor the tasked truly understands what that means. If this describes you, this book is for you.

DevOps engineers, system administrators, and system architects will get the most out of this book, since it focuses on constructing clusters in a cloud provider and interfacing with the provider’s services. Software developers should also benefit from it; even if they do not build clusters themselves, they should understand how clusters work in the cloud so they know what to ask for and how to design their jobs.

What You Should Already Know

Besides having a good grasp of Hadoop concepts, you should have a working knowledge of the Java programming language and the Bash shell, or similar languages. At least being able to read them should suffice, although the Bash scripts do not shy away from advanced shell features. Code examples are constrained to only those languages.

Before working on your clusters, you will need credentials for a cloud provider. The first two parts of the book do not require a cloud account to follow along, but the later hands-on parts do. Your organization may already have an account with a provider, and if so, you can seek your own account within that to work with. If you are on your own, you can sign up for a free trial with any of the cloud providers this book covers in detail.

Colophon

The animal on the cover of Moving Hadoop to the Cloud is a southern reedbuck (Redunca arundinum). Southern reedbucks are typically found in southern Africa. They inhabit areas of tall grass near a source of water. The grass offers camouflage from predators such as lions, leopards, cheetahs, spotted hyenas, pythons, and crocodiles. Because reedbucks are herbivores, the tall grass also provides sustenance. Southern
reedbucks need to drink water at least every few days, which is not typical for species in this arid region of Africa.

Southern reedbucks are elegant antelopes with distinctive dark lines running down the front of their forelegs and lower hind legs. The color of their coat ranges between light- and greyish-brown and their underparts are white. Only the males bear forward-curving horns, about 35–45 cm (14–18 in) long.

The southern reedbuck is monogamous; a pair inhabits a territory that is defended by the male from other males. A single calf is born after a gestation period of around eight months and remains hidden in the dense grass for the next two months. During this period, the female does not stay with her young but instead visits it for 10 to 30 minutes each day. This antelope has an average lifespan of ten years.

The southern reedbuck makes a number of characteristic noises, including a shrill whistle through the nostrils, a clicking noise to alert others about danger, and a distinctive “popping” sound, caused by the inguinal glands, heard when the southern reedbuck jumps.

Many of the animals on O’Reilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to animals.oreilly.com.

The cover image is from Wood’s Animate Creation. The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag’s Ubuntu Mono.

Foreword
Preface
Who This Book Is For
What You Should Already Know
What This Book Leaves Out
How This Book Works
Which Software Versions This Book Uses
Conventions Used in This Book
IP Addresses
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgments
I Introduction to the Cloud
Why Hadoop in the Cloud?
What Is the Cloud?
What Does Hadoop in the Cloud Mean?
Reasons to Run Hadoop in the Cloud
Reasons to Not Run Hadoop in the Cloud
What About Security?
Hybrid Clouds
Hadoop Solutions from Cloud Providers
Elastic MapReduce
Google Cloud Dataproc
HDInsight
Hadoop-Like Services
A Spectrum of Choices
Getting Started
Overview and Comparison of Cloud Providers
Amazon Web Services
References
Google Cloud Platform
References
Microsoft Azure
References
Which One Should You Use?
II Cloud Primer
Instances
Instance Types
Regions and Availability Zones
Instance Control
Temporary Instances
Spot Instances
Preemptible Instances
Images
No Instance Is an Island
Networking and Security
A Drink of CIDR
Virtual Networks
Private DNS
Public IP Addresses and DNS
Virtual Networks and Regions
Routing
Routing in AWS
Routing in Google Cloud Platform
Routing in Azure
Network Security Rules
Inbound Versus Outbound
Allow Versus Deny
Network Security Rules in AWS
Network Security Rules in Google Cloud Platform
Network Security Rules in Azure
Putting Networking and Security Together
What About the Data?
Storage
Block Storage
Block Storage in AWS
Block Storage in Google Cloud Platform
Block Storage in Azure
Object Storage
Buckets
Data Objects
Object Access
Object Storage in AWS
Object Storage in Google Cloud Platform
Object Storage in Azure
Cloud Relational Databases
Cloud Relational Databases in AWS
Cloud Relational Databases in Google Cloud Platform
Cloud Relational Databases in Azure
Cloud NoSQL Databases
Where to Start?
III A Simple Cluster in the Cloud
Setting Up in AWS
Prerequisites
Allocating Instances
Generating a Key Pair
Launching Instances
Securing the Instances
Next Steps
Setting Up in Google Cloud Platform
Prerequisites
Creating a Project
Allocating Instances
SSH Keys
Creating Instances
Securing the Instances
Next Steps
Setting Up in Azure
Prerequisites
Creating a Resource Group
Creating Resources
SSH Keys
Creating Virtual Machines
The Manager Instance
The Worker Instances
Next Steps
Standing Up a Cluster
The JDK
Hadoop Accounts
Passwordless SSH
Hadoop Installation
HDFS and YARN Configuration
The Environment
XML Configuration Files
Finishing Up Configuration
Startup
SSH Tunneling
Running a Test Job
What If the Job Hangs?
Running Basic Data Loading and Analysis
Wikipedia Exports
Analyzing a Small Export
Go Bigger
IV Enhancing Your Cluster
10 High Availability
Planning HA in the Cloud
HDFS HA
YARN HA
Installing and Configuring ZooKeeper
Adding New HDFS and YARN Daemons
The Second Manager
HDFS HA Configuration
YARN HA Configuration
Testing HA
Improving the HA Configuration
A Bigger Cluster
Complete HA
A Third Availability Zone?
Benchmarking HA
MRBench
Terasort
Grains of Salt
11 Relational Data with Apache Hive
Planning for Hive in the Cloud
Installing and Configuring Hive
Startup
Running Some Test Hive Queries
Switching to a Remote Metastore
The Remote Metastore and Stopped Clusters
Hive Control Scripts
Hive on S3
Configuring the S3 Filesystem
Adding Data to S3
Configuring S3 Authentication
Configuring the S3 Endpoint
External Table in S3
What About Google Cloud Platform and Azure?
A Step Toward Transient Clusters
A Different Means of Computation
12 Streaming in the Cloud with Apache Spark
Planning for Spark in the Cloud
Installing and Configuring Spark
Startup
Running Some Test Jobs
Configuring Hive on Spark
Add Spark Libraries to Hive
Configure Hive for Spark
Switch YARN to the Fair Scheduler
Try Out Hive on Spark on YARN
Spark Streaming from AWS Kinesis
Creating a Kinesis Stream
Populating the Stream with Data
Streaming Kinesis Data into Spark
What About Google Cloud Platform and Azure?
Building Clusters Versus Building Clusters Well
V Care and Feeding of Hadoop in the Cloud
13 Pricing and Performance
Picking Instance Types
The Criteria
General Cluster Instance Roles
Persistent Versus Ephemeral Block Storage
Stopping and Starting Entire Clusters
Using Temporary Instances
Geographic Considerations
Regions
Availability Zones
Performance and Networking
14 Network Topologies
Public and Private Subnets
SSH Tunneling
SOCKS Proxy
VPN Access
Access from Other Subnets
Cluster Topologies
The Public Cluster
The Secured Public Cluster
Gateway Instances
The Private Cluster
Cluster Access to the Internet and Cloud Provider Services
Geographic Considerations
Regions
Availability Zones
Starting Topologies
Higher-Level Planning
15 Patterns for Cluster Usage
Long-Running or Transient?
Single-User or Multitenant?
Self-Service or Managed?
Cloud-Only or Hybrid?
Watching Cost
The Rising Need for Automation
16 Using Images for Cluster Management
The Structure of an Image
EC2 Images
GCE Images
Azure Images
Image Preparation
Wait, I’m Using That!
Image Creation
Image Creation in AWS
Image Creation in Google Cloud Platform
Image Creation in Azure
Image Use
Scripting Hadoop Configuration
Image Maintenance
Image Deletion
Image Deletion in AWS
Image Deletion in Google Cloud Platform
Image Deletion in Azure
Automated Image Creation with Packer
Automated Cloud Cluster Creation
Cloudera Director
Hortonworks Data Cloud
Qubole Data Service
General System Management Tools
Images or Tools?
More Tooling
17 Monitoring and Automation
Monitoring Choices
Cloud Provider Monitoring Services
Rolling Your Own
Cloud Provider Command-Line Interfaces
AWS CLI
Google Cloud Platform CLI
Azure CLI
Data Formatting for CLI Results
What to Monitor
Instance Existence
Instance Reachability
Hadoop Daemon Status
System Load
Putting Scripting to Use
Custom Metrics in CloudWatch
Basic Metrics
Defining a Custom Metric
Feeding Custom Metric Data to CloudWatch
Setting an Alarm on a Custom Metric
Elastic Compute Using a Custom Metric
A Custom Metric for Compute Capacity
Prerequisites for Autoscaling Compute
Triggering Autoscaling with an Alarm Action
What About Shrinking?
Other Things to Watch
Ingesting Logs into CloudWatch
Creating an IAM User for Log Streaming
Installing the CloudWatch Agent
Creating a Metric Filter
Creating an Alarm from a Metric Filter
So Much More to See and Do
18 Backup and Restoration
Patterns to Supplement Backups
Backup via Imaging
HDFS Replication
Cloud Storage Filesystems
HDFS Snapshots
Hive Metastore Replication
Logs
A General Cloud Hadoop Backup Strategy
Not So Different, But Better
To the Cloud
A Hadoop Component Start and Stop Scripts
Apache ZooKeeper
Apache Hive
B Hadoop Cluster Configuration Scripts
SSH Key Creation and Distribution
Configuration Update Script
New Worker Configuration Update Script
C Monitoring Cloud Clusters with Nagios
Where Nagios Should Run
Instance Existence Through Ping
Hosts and Host Groups
Services and Service Groups
Provider CLI Integration
Index