Moving Hadoop to the Cloud: Harnessing Cloud Features and Flexibility for Hadoop Clusters

Bill Havanki

Moving Hadoop to the Cloud
by Bill Havanki

Copyright © 2017 Bill Havanki Jr. All rights reserved.

Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Marie Beaugureau
Production Editor: Colleen Cole
Copyeditor: Kim Cofer
Proofreader: Christina Edwards
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

July 2017: First Edition

Revision History for the First Edition
2017-07-05: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491959633 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Moving Hadoop to the Cloud, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-95963-3

[LSI]

Foreword

Apache Hadoop as software is a simple framework that allows for distributed processing of data across many machines. As a technology, Hadoop and the surrounding ecosystem have changed the way we think about data processing at scale. No longer does our data need to fit in the memory of a single machine, nor are we limited by the I/O of a single machine's disks. These are powerful tenets.

So too has cloud computing changed our way of thinking. While the notion of colocating machines in a faraway data center isn't new, allowing users to provision machines on demand is, and it has changed everything. No longer are developers or architects limited by the processing power installed in on-premise data centers, nor do we need to host small web farms under our desks or in that old storage closet. The pay-as-you-go model has been a boon for ad hoc testing and proof-of-concept efforts, eliminating time spent in purchasing, installation, and setup.

Both Hadoop and cloud computing represent major paradigm shifts, not just in enterprise computing but in many other industries as well. Much has been written about how these technologies have been used to make advances in retail, the public sector, manufacturing, energy, and healthcare, just to name a few. Entire businesses have sprung up as a result, dedicated to the care, feeding, integration, and optimization of these new systems.

It was inevitable that Hadoop workloads would be run on cloud computing providers' infrastructure. The cloud offers incredible flexibility to users, often complementing on-premise solutions and enabling them to use Hadoop in ways simply not possible previously. Ever the conscientious
software engineer, author Bill Havanki has a strong penchant for documenting. He's able to break down complex concepts and explain them in simple terms, without making you feel foolish. Bill writes the kind of documentation that you actually enjoy, the kind you find yourself reading long after you've discovered the solution to your original problem.

Hadoop and cloud computing are powerful and valuable tools, but they aren't simple technologies by any means. This stuff is hard. Both have a multitude of configuration options, and it's very easy to become overwhelmed. All major cloud providers offer similar services like virtual machines, network-attached storage, relational databases, and object storage—all of which can be utilized by Hadoop—but each provider uses different naming conventions and has different capabilities and limitations. For example, some providers require that resource provisioning occur in a specific order. Some providers create isolated virtual networks for your machines automatically, while others require manual creation and assignment. It can be confusing. Whether you're working with Hadoop for the first time or you're a veteran installing on a cloud provider you've never used before, knowing about the specifics of each environment will save you a lot of time and pain.

Cloud computing appeals to a dizzying array of users running a wide variety of workloads. Most cloud providers' official documentation isn't specific to any particular application (such as Hadoop). Using Hadoop on cloud infrastructure introduces additional architectural issues that need to be considered and addressed. It helps to have a guide to demystify the options specific to Hadoop deployments and to ease you through the setup process on a variety of cloud providers, step by step, providing tips and best practices along the way. This book does precisely that, in a way that I wish had been available when I started working in the cloud computing world.

Whether code or expository prose, Bill's creations are approachable, sensible, and easy to consume. With this book and its author, you're in capable hands for your first foray into moving Hadoop to the cloud.

Alex Moundalexis, May 2017

Preface

It's late 2015, and I'm staring at a page of mine on my employer's wiki, trying to think of an OKR. An OKR is something like a performance objective: a goal to accomplish, paired with a way to measure whether it's been accomplished. While my management chain defines OKRs for the company as a whole and for the major organizations in it, individuals define their own. We grade ourselves on them, but they do not determine how well we performed, because they are meant to be aspirational, not necessary. If you meet all your OKRs, they weren't ambitious enough.

My coworkers had already been impressed with writing that I'd done as part of my job, both in product documentation and in internal presentations, so focusing on a writing task made sense. How aspirational could I get? So I set this down: "Begin writing a technical book! On something! That is, begin working on one myself, or assist someone else in writing one." Outright ridiculous, I thought, but why not?
How's that for aspirational? Well, I have an excellent manager who is willing to entertain the ridiculous, and so she encouraged me to float the idea to someone else in our company who dealt with things like employees writing books. He responded, "Here's an idea: there is no book out there about Running Hadoop in the Cloud. Would you have enough material at this point?" I work on a product that aims to make the use of Hadoop clusters in the cloud easier, so it was admittedly an extremely good fit. It didn't take long at all for this ember of an idea to catch, and the end result is the book you are reading right now.

Who This Book Is For

Between the twin subjects of Hadoop and the cloud, there is more than enough to write about. Since there are already plenty of good Hadoop books out there, this book doesn't try to duplicate them, and so you should already be familiar with running Hadoop. The details of configuring Hadoop clusters are only covered as needed to get clusters up and running. You can apply your prior Hadoop knowledge with great effectiveness to clusters in the cloud, and much of what other Hadoop books cover still applies.

It is not assumed, however, that you are familiar with the cloud. Perhaps you've dabbled in it, spun up an instance or two, read some documentation from a provider. Perhaps you haven't even tried it at all, or don't know where to begin. Readers with next to no knowledge of the cloud will find what they need to get rolling with their Hadoop clusters. Often, someone is tasked by their organization with "moving stuff to the cloud," and neither the tasker nor the tasked truly understands what that means. If this describes you, this book is for you.

DevOps engineers, system administrators, and system architects will get the most out of this book, since it focuses on constructing clusters in a cloud provider and interfacing with the provider's services. Software developers should also benefit from it; even if they do not build clusters themselves, they should understand how clusters work in the cloud so they know what to ask for and how to design their jobs.

What You Should Already Know

Besides having a good grasp of Hadoop concepts, you should have a working knowledge of the Java programming language and the Bash shell, or similar languages. At least being able to read them should suffice, although the Bash scripts do not shy away from advanced shell features. Code examples are constrained to only those languages.

Before working on your clusters, you will need credentials for a cloud provider. The first two parts of the book do not require a cloud account to follow along, but the later hands-on parts do. Your organization may already have an account with a provider, and if so, you can seek your own account within that to work with. If you are on your own, you can sign up for a free trial with any of the cloud providers this book covers in detail.

What This Book Leaves Out

As stated previously, this book does not delve into Hadoop details more than necessary. A seasoned Hadoop administrator may notice that configurations are not necessarily optimal, and that clusters are not tuned for maximum efficiency. This information is left out for brevity, so as not to duplicate content in books that focus only on Hadoop. Many of the principles for Hadoop maintenance apply to cloud clusters just as well as to ordinary ones.

The core Hadoop components of HDFS and YARN are covered here, along with other important components such as ZooKeeper, Hive, and Spark. This doesn't imply at all that other components won't work
in the cloud; there are simply so many components that, due to space considerations, not all could be included.

A limited set of popular cloud providers is covered in this book: Amazon Web Services, Google Cloud Platform, and Microsoft Azure. There are other cloud providers, both publicly available and deployed privately, but they are not included. The ones that were chosen are the most popular, and you should find that their concepts transfer over rather directly to those in other providers. Even so, each provider does things a little, or a lot, differently from its peers. When getting you up and running, all of them are covered equally, but beyond that, only Amazon Web Services is fully considered, since it is the dominant choice at this time. Brief summaries of equivalent procedures in the other providers are given to get you started with them.

Overall, between Hadoop and the cloud, there is just so much to write about. What's more, cloud providers introduce new services and revamp older services all the time, and it can be challenging to keep up even when you work in the cloud every day. This book attempts to stick with the most vital, core Hadoop components and cloud services to be as relevant as possible in this fast-changing world. Understanding them will serve you well when integrating new features into your clusters in the future.

How This Book Works

Part I starts off this book by asking why you would host Hadoop clusters in a cloud provider, and briefly introduces the providers this book looks at. Part II describes the common concepts of cloud providers, like instances and virtual networks. If you are already familiar with a cloud provider or two, you might skim or skip these parts.

Part III begins the hands-on portion of this book, where you build out a Hadoop cluster in one of the cloud providers. There is a chapter for the unique steps needed by each provider, and a common chapter for bringing up a cluster and seeing it in action. Later parts of the book use this first cluster as a launching point for more.

If you are interested in making an even more capable cluster, Part IV can help you. It covers adding high availability and installing Hive and Spark. You can try any combination of the enhancements, and learn even more about the ramifications of running in a cloud provider.

Finally, Part V looks at patterns and practices for running cloud clusters well, from designing for price and security to dealing with maintenance. Those first starting out in the cloud may not need the guidance in this part, but as usage ramps up, it becomes much more important.

Which Software Versions This Book Uses

Here are the versions of Hadoop components used in this book. All are distributed through Apache:

Apache Hadoop 2.7.2
Apache ZooKeeper 3.4.8
Apache Hive 2.1.0
Apache Spark 1.6.3 and 2.0.2

Code examples require:

Java
Bash

Cloud providers update their services continually, and so determining the exact "versions" used for them is not possible. Most of the work in the book was performed during 2016 with the services as they existed at that time. Since then, service web interfaces may have changed and workflows may have been altered.
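Installed tool versions drift as distributions and package repositories are updated, so before following along it can be worth confirming that what is on your machine roughly matches the versions listed above. The following is only a minimal sketch of such a check, not one of the book's own scripts; it assumes the hadoop, hive, and spark-submit launch scripts are already installed and on your PATH.

    #!/usr/bin/env bash
    # Print the first line of each tool's version banner.
    # Assumes hadoop, hive, and spark-submit are on the PATH; adjust if yours live elsewhere.
    java -version 2>&1 | head -n 1            # e.g., java version "1.8.0_..."
    bash --version | head -n 1                # e.g., GNU bash, version 4.x
    hadoop version | head -n 1                # e.g., Hadoop 2.7.2
    hive --version 2>/dev/null | head -n 1    # e.g., Hive 2.1.0
    spark-submit --version 2>&1 | grep -m 1 -i version   # Spark prints its banner to stderr

If any of these report something very different from the versions above, expect small deviations from the listings and outputs shown in the book.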
Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
Shows commands or other text that should be typed literally by the user.

Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.

TIP
This element signifies a tip or suggestion.
About the Author

Bill Havanki is a software engineer working for Cloudera, where he has contributed to Hadoop components as well as to systems for deploying Hadoop clusters into public cloud services. Prior to joining Cloudera he worked for 15 years developing software for government contracts, focusing mostly on analytic frameworks and authentication and authorization systems. He earned his B.S. in Electrical Engineering from Rutgers University and his M.S. in Computer Engineering from North Carolina State University. A New Jersey native, he currently lives near Annapolis, Maryland, with his family.

Colophon

The animal on the cover of Moving Hadoop to the Cloud is a southern reedbuck (Redunca arundinum). Southern reedbucks are typically found in southern Africa. They inhabit areas of tall grass near a source of water. The grass offers camouflage from predators such as lions, leopards, cheetahs, spotted hyenas, pythons, and crocodiles, and, since reedbucks are herbivores, it also provides sustenance. Southern reedbucks need to drink water at least every few days, which is not typical for species in this arid region of Africa.

An elegant antelope, southern reedbucks have distinctive dark lines running down the front of their forelegs and lower hind legs. The color of their coat ranges between light brown and greyish brown, and their underparts are white. Only the males bear forward-curving horns, about 35–45 cm (14–18 in) long.

The southern reedbuck is monogamous; a pair inhabits a territory that is defended by the male from other males. A single calf is born after a gestation period of around eight months and remains hidden in the dense grass for the next two months. During this period, the female does not stay with her young but instead visits it for 10 to 30 minutes each day. This antelope has an average lifespan of ten years. The southern reedbuck makes a number of characteristic noises, including a shrill whistle through the nostrils, a clicking noise to alert others about danger, and a distinctive "popping" sound, caused by the inguinal glands, heard
when the southern reedbuck jumps.

Many of the animals on O'Reilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to animals.oreilly.com.

The cover image is from Wood's Animate Creation. The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag's Ubuntu Mono.