Modern Big Data Processing with Hadoop - V. Naresh Kumar, Prashant Shindgikar

Modern Big Data Processing with Hadoop

Expert techniques for architecting end-to-end big data solutions to get valuable insights

V. Naresh Kumar
Prashant Shindgikar

BIRMINGHAM - MUMBAI

Modern Big Data Processing with Hadoop

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Varsha Shetty
Content Development Editor: Cheryl Dsa
Technical Editor: Sagar Sawant
Copy Editors: Vikrant Phadke, Safis Editing
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Graphics: Tania Dutta
Production Coordinator: Arvindkumar Gupta

First published: March 2018
Production reference: 1280318

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK

ISBN 978-1-78712-276-5

www.packtpub.com

mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and videos from over 4,000 industry professionals.
Improve your learning with Skill Plans built especially for you.
Get a free eBook or video every month.
Mapt is fully searchable.
Copy and paste, print, and bookmark content.

PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available?
You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Contributors

A single cluster for the entire business

This is the most straightforward approach, and every business starts with at least one cluster. As the diversity of the business increases, organizations tend to choose one cluster per department or business unit. The following are some of the advantages:

Ease of operability: Since there is only one Hadoop cluster, managing it is very easy and team sizes will also be optimal when administering it.
One-stop shop: Since all the company data is in a single place, it's very easy to come up with innovative ways to use the data and generate analytics on top of it.
Integration cost: Teams and departments within the enterprise can integrate with this single system very easily. They have less complex configurations to deal with when managing their applications.
Cost to serve: Enterprises can have a better understanding of their entire big data usage and can also plan, in a less stringent way, how to scale their system.

Some disadvantages of employing this approach are as follows:

Scale becomes a challenge: Even though Hadoop can run on hundreds or thousands of servers, managing such big clusters becomes a challenge, particularly during upgrades and other changes.
Single point of failure: Hadoop has replication built into HDFS, but when many nodes fail, the chances of data loss rise, and it's hard to recover from that.
Governance is a challenge: As the scale of data, applications, and users increases, it is a challenge to keep track of the data without proper planning and implementation in place.
Security and confidential data management: Enterprises deal with a variety of data, ranging from highly sensitive to transient data. When all sorts of data are put into a single big data solution, we have to employ very strong authentication and authorization rules so that the data is visible only to the right audience.

With these thoughts, let's take a look at the other possibility of having Hadoop clusters in an enterprise.

Multiple Hadoop clusters

Even though a single Hadoop cluster is easier to maintain within an organization, it's sometimes important to have multiple Hadoop clusters to keep the business running smoothly and reduce dependency on a single point of failure. These multiple Hadoop clusters can be used for several reasons:

Redundancy
Cold backup
High availability
Business continuity
Application environments

Redundancy

When we think of redundant Hadoop clusters, we should think about how much redundancy we can keep. As we already know, the Hadoop Distributed File System (HDFS) has internal data redundancy built into it. Given that a Hadoop cluster has a large ecosystem built around it (services such as YARN, Kafka, and so on), we should plan carefully whether to make the entire ecosystem redundant or to make only the data redundant by keeping it in a different cluster. It's easier to make the HDFS portion of Hadoop redundant, as there are tools to copy the data from one HDFS to another (a minimal command-line sketch follows below). Let's take a look at possible ways to achieve this via this diagram:

As we can see here, the main Hadoop cluster runs a full stack of all its applications, and data is supplied to it via multiple sources.
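One such tool is DistCp, which ships with Hadoop and is covered in detail later in this chapter. The following is only a minimal sketch; the NameNode host names and the /data path are assumptions for illustration, not values from the book's setup:

# Default replication factor HDFS uses for new files on this cluster
hdfs getconf -confKey dfs.replication

# Copy one directory tree from the main cluster to a redundant cluster
hadoop distcp hdfs://main-nn:8020/data hdfs://backup-nn:8020/data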
We have defined two types of redundant clusters:

A fully redundant Hadoop cluster

This cluster runs the exact same set of applications as the primary cluster, and the data is copied periodically from the main Hadoop cluster. Since this is a one-way copy from the main cluster to the second cluster, we can be sure that the main cluster isn't impacted when we make any changes to this fully redundant cluster. One important thing to understand is that we run separate instances of all the applications in this cluster. Since every application maintains its state in its own predefined location, the application states are not replicated from the main Hadoop cluster to this cluster, which means that jobs created in the main Hadoop cluster are not visible in this one. The same applies to Kafka topics, ZooKeeper nodes, and so on. This type of cluster is helpful for running different environments such as QA, staging, and so on.

A data redundant Hadoop cluster

In this type of setup, we create a new Hadoop cluster and copy the data from the main cluster, as in the previous case; but here we are not concerned with the other applications that run on this cluster. This type of setup is good for:

Having a data backup for Hadoop in a different geography
Sharing big data with other enterprises/organizations

Cold backup

Cold backup is important for enterprises as data gets older. Even though Hadoop is designed to store unlimited amounts of data, it's not always necessary to keep all the data available for processing. It is sometimes necessary to preserve data for auditing purposes and for historical reasons. In such cases, we can create a dedicated Hadoop cluster with only the HDFS (filesystem) component and periodically sync all the data into this cluster. The design of this system is similar to the data redundant Hadoop cluster.

High availability

Even though Hadoop has multiple components within its architecture, not all of the components are highly available by design. The core component of Hadoop is its distributed, fault-tolerant filesystem, HDFS. HDFS has multiple components, one of which is the NameNode, the registry of where files are located in HDFS. In earlier versions of HDFS, the NameNode was a single point of failure; in recent versions, a Standby NameNode can be added to meet the high availability requirements of a Hadoop cluster. In order to make every component of the Hadoop ecosystem highly available, we need to add multiple redundant nodes (which come with their own cost) that work together as a cluster. One more thing to note is that high availability with Hadoop is achievable within a single geographical region, as the locality of data with applications is one of the key principles of Hadoop. The moment we have multiple data centers in play, we need to think differently about how to achieve high availability across them.

Business continuity

This is part of Business Continuity Planning (BCP), where natural disasters can bring the Hadoop system to an end if not planned for correctly. Here, the strategy would be to use multiple geographical regions to run the big data systems. When we talk about multiple data centers, the obvious challenges are the network and the cost associated with managing both systems. One of the biggest challenges is how to keep multiple regions in sync. One possible solution is to build a fully redundant Hadoop cluster in another geographical region and keep the data in sync periodically (a minimal sync sketch follows this section). In the case of a disaster or breakdown in one region, our business won't come to a halt, as we can smoothly run operations from the other region.
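Below is a minimal sketch of such a periodic sync using DistCp (described in the Hadoop data copy section later in this chapter). The NameNode host names, the directory list, and the schedule are assumptions for illustration only:

#!/bin/bash
# Periodically mirror selected HDFS directories from the primary cluster
# to a redundant cluster in another region.
SRC=hdfs://NameNode1:8020
DST=hdfs://NameNode2:8020

for dir in /projects /users /streams /marketing /sales; do
  # -update copies only new or changed files; -delete removes files on the
  # destination that no longer exist on the source, keeping a true mirror.
  hadoop distcp -update -delete "${SRC}${dir}" "${DST}${dir}"
done

# Example cron entry to run the script nightly at 02:00:
# 0 2 * * * /opt/scripts/hdfs-sync.sh

For a cold backup cluster that must preserve historical data, -delete would typically be omitted so that files removed from the primary cluster are still retained in the backup.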
Application environments

Many businesses internally follow different processes for releasing their software to production. As part of this, they follow several continuous integration methodologies in order to have better control over the stability of the Hadoop environments. It's good to build multiple smaller Hadoop clusters with X% of the data from the main production environment and run all the applications there. Applications can run their integration tests on these dedicated environments (QA, staging, and so on) and release their software to production once everything looks good. One practice I have come across is that organizations tend to ship code directly to production and end up facing outages of their applications because of an untested workflow or bug. It's good practice to have dedicated Hadoop application environments to test the software thoroughly, achieving higher uptime and happier customers.

Hadoop data copy

We have seen in the previous sections that having highly available data is very important for a business to succeed and stay up to date with its competition. In this section, we will explore possible ways to achieve a highly available data setup.

HDFS data copy

Hadoop uses HDFS as its core to store files. HDFS is rack aware and is intelligent enough to reduce network data transfer when applications are run on the data nodes. One of the preferred ways of copying data in an HDFS environment is to use DistCp. The official documentation is available at the following URL: http://hadoop.apache.org/docs/r1.2.1/distcp.html

We will see a few examples of copying data from one Hadoop cluster to another. But before that, let's look at how the data is laid out. In order to copy the data from the production Hadoop cluster to the backup Hadoop cluster, we can use distcp. Let's see how to do it:

hadoop distcp hdfs://NameNode1:8020/projects hdfs://NameNode2:8020/projects
hadoop distcp hdfs://NameNode1:8020/users hdfs://NameNode2:8020/users
hadoop distcp hdfs://NameNode1:8020/streams hdfs://NameNode2:8020/streams
hadoop distcp hdfs://NameNode1:8020/marketing hdfs://NameNode2:8020/marketing
hadoop distcp hdfs://NameNode1:8020/sales hdfs://NameNode2:8020/sales

When we run the distcp command, a MapReduce job is created to automatically find the list of files and then copy them to the destination.

The full command syntax looks like this:

hadoop distcp [OPTIONS] <source path ...> <destination path>

OPTIONS: The options the command takes, which control the behavior of the execution.
source path: A source path can be any valid filesystem URI that's supported by Hadoop. DistCp supports taking multiple source paths in one go.
destination path: This is a single path where all the source paths need to be copied.
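After a copy completes, it is worth sanity-checking that the source and destination trees match. This is a minimal sketch reusing the NameNode host names from the commands above; hdfs dfs -count prints the directory count, file count, and content size for a path:

# Compare object counts and total size on both clusters
hdfs dfs -count hdfs://NameNode1:8020/projects
hdfs dfs -count hdfs://NameNode2:8020/projects

# Human-readable space usage, for a quick visual comparison
hdfs dfs -du -s -h hdfs://NameNode1:8020/projects
hdfs dfs -du -s -h hdfs://NameNode2:8020/projects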
Let's take a closer look at a few of the important options:

-append: Incrementally writes the data to the destination files if they already exist (only an append is performed; no block-level check is done for the incremental copy).
-async: Performs the copy in a non-blocking way.
-atomic: Performs all of the file copies or aborts even if one of them fails.
-tmp: Path to be used for the atomic commit.
-delete: Deletes files from the destination if they are not present in the source tree.
-bandwidth: Limits how much network bandwidth is used during the copy process.
-f: A file containing the list of all paths that need to be copied.
-i: Ignores any errors during the file copy.
-log: Location where the execution log is saved.
-m: Maximum number of concurrent maps to use for copying.
-overwrite: Overwrites the files even if they exist on the destination.
-update: Copies only the missing files and directories.
-skipcrccheck: If passed, CRC checks are skipped during transfer.

Summary

In this chapter, we learned about Apache Ambari and studied its architecture in detail. We then understood how to prepare and create our own Hadoop cluster with Ambari. In order to do this, we also looked into configuring the Ambari server as per our requirements before preparing the cluster. We also learned about single and multiple Hadoop clusters and how they can be used, based on business requirements.

Table of contents

  • Copyright and Credits
    • Modern Big Data Processing with Hadoop
  • Packt is searching for authors like you
  • Preface
    • Who this book is for
    • What this book covers
    • To get the most out of this book
      • Download the example code files
      • Download the color images
  • Enterprise Data Architecture Principles
    • Data architecture principles
      • Volume
    • The importance of metadata
    • Data governance
      • Fundamentals of data governance
    • Data as a Service
    • Evolution of data architecture with Hadoop
      • Hierarchical database architecture
      • Department and employee mapping table
      • Hadoop data architecture
        • Data layer
  • Hadoop Life Cycle Management
    • Data wrangling
      • Data acquisition
    • Data security
      • What is Apache Ranger?
      • Apache Ranger installation using Ambari
        • Ambari admin UI
        • Database creation on master
      • Apache Ranger user guide
        • Login to UI
        • Policy definition and auditing for HDFS
  • Hadoop Design Consideration
    • Understanding data structure principles
