For your convenience Apress has placed some of the front matter material after the index. Please use the Bookmarks and Contents at a Glance links to access them.

Contents at a Glance

About the Authors
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: Motivation for Big Data
Chapter 2: Hadoop Concepts
Chapter 3: Getting Started with the Hadoop Framework
Chapter 4: Hadoop Administration
Chapter 5: Basics of MapReduce Development
Chapter 6: Advanced MapReduce Development
Chapter 7: Hadoop Input/Output
Chapter 8: Testing Hadoop Programs
Chapter 9: Monitoring Hadoop
Chapter 10: Data Warehousing Using Hadoop
Chapter 11: Data Processing Using Pig
Chapter 12: HCatalog and Hadoop in the Enterprise
Chapter 13: Log Analysis Using Hadoop
Chapter 14: Building Real-Time Systems Using HBase
Chapter 15: Data Science with Hadoop
Chapter 16: Hadoop in the Cloud
Chapter 17: Building a YARN Application
Appendix A: Installing Hadoop
Appendix B: Using Maven with Eclipse
Appendix C: Apache Ambari
Index

Introduction

This book is designed to be a concise guide to using the Hadoop software. Despite being around for more than half a decade, Hadoop development is still a very stressful yet very rewarding task. The documentation has come a long way since the early years, and Hadoop is growing rapidly as its adoption increases in the enterprise. Hadoop 2.0 is based on the YARN framework, which is a significant rewrite of the underlying Hadoop platform.
It has been our goal to distill into this book the hard lessons learned while implementing Hadoop for clients. As authors, we like to delve deep into the Hadoop source code to understand why Hadoop does what it does and the motivations behind some of its design decisions, and we have tried to share this insight with you. We hope that you will not only learn Hadoop in depth but also gain fresh insight into the Java language in the process.

This book is about Big Data in general and Hadoop in particular; it is not possible to understand Hadoop without appreciating the overall Big Data landscape. It is written primarily from the point of view of a Hadoop developer and requires an intermediate-level ability to program using Java. It is designed for practicing Hadoop professionals. You will learn several practical tips on how to use the Hadoop software, gleaned from our own experience in implementing Hadoop-based systems. This book provides step-by-step instructions and examples that will take you from just beginning to use Hadoop to running complex applications on large clusters of machines.

Here's a brief rundown of the book's contents:

Chapter 1 introduces you to the motivations behind Big Data software, explaining various Big Data paradigms.

Chapter 2 is a high-level introduction to Hadoop 2.0, or YARN. It introduces the key concepts underlying the Hadoop platform.

Chapter 3 gets you started with Hadoop. In this chapter, you will write your first MapReduce program.

Chapter 4 introduces the key concepts behind the administration of the Hadoop platform.

Chapters 5, 6, and 7, which form the core of this book, take a deep dive into the MapReduce framework. You learn all about the internals of the MapReduce framework. We discuss the MapReduce framework in the context of the most ubiquitous of all languages, SQL, and emulate common SQL functions such as SELECT, WHERE, GROUP BY, and JOIN using MapReduce. One of the most popular applications for Hadoop is ETL offloading; these chapters enable you to appreciate how MapReduce can support common data-processing functions. We discuss not just the API but also the more complicated concepts and internal design of the MapReduce framework.

Chapter 8 describes the testing frameworks that support unit/integration testing of MapReduce programs.

Chapter 9 describes logging and monitoring of the Hadoop framework.

Chapter 10 introduces Hive, the data warehouse framework on top of MapReduce.

Chapter 11 introduces the Pig and Crunch frameworks. These frameworks enable users to create data-processing pipelines in Hadoop.

Chapter 12 describes the HCatalog framework, which enables enterprise users to access data stored in the Hadoop file system using commonly known abstractions such as databases and tables.

Chapter 13 describes how Hadoop can be used for streaming log analysis.

Chapter 14 introduces you to HBase, the NoSQL database on top of Hadoop. You learn about use-cases that motivate the use of HBase.

Chapter 15 is a brief introduction to data science. It describes the main limitations of MapReduce that make it inadequate for data science applications. You are introduced to new frameworks, such as Spark and Hama, that were developed to circumvent MapReduce limitations.

Chapter 16 is a brief introduction to using Hadoop in the cloud. It enables you to work on a true production-grade Hadoop cluster from the comfort of your living room.

Chapter 17 is a whirlwind introduction to the key addition to Hadoop 2.0: the capability to develop your own distributed frameworks, such as MapReduce, on top of Hadoop. We describe how you can develop a simple distributed download service using Hadoop 2.0.
Chapter 1: Motivation for Big Data

The computing revolution that began more than two decades ago has led to large amounts of digital data being amassed by corporations. Advances in digital sensors; the proliferation of communication systems, especially mobile platforms and devices; massive-scale logging of system events; and the rapid movement toward paperless organizations have led to a massive collection of data resources within organizations. And the increasing dependence of businesses on technology ensures that the data will continue to grow at an even faster rate.

Moore's Law, which says that the performance of computers has historically doubled approximately every two years, initially helped computing resources to keep pace with data growth. However, this pace of improvement in computing resources started tapering off around 2005, and the computing industry started looking at other options, namely parallel processing, to provide a more economical solution: if one computer could not get faster, the goal was to use many computing resources to tackle the same problem in parallel. Hadoop is an implementation of this idea, with multiple computers in the network applying MapReduce (a variation of the single instruction, multiple data [SIMD] class of computing technique) to scale data processing. The evolution of cloud-based computing through vendors such as Amazon, Google, and Microsoft provided a boost to this concept because we can now rent computing resources for a fraction of the cost it takes to buy them.

This book is designed to be a practical guide to developing and running software using Hadoop, a project hosted by the Apache Software Foundation and now extended and supported by various vendors such as Cloudera, MapR, and Hortonworks. This chapter discusses the motivation for Big Data in general and Hadoop in particular.

What Is Big Data?
In the context of this book, one useful definition of Big Data is any dataset that cannot be processed or (in some cases) stored using the resources of a single machine while still meeting the required service-level agreements (SLAs). The latter part of this definition is crucial. It is possible to process virtually any scale of data on a single machine; even data that cannot be stored on a single machine can be brought into one machine by reading it from shared storage such as a network-attached storage (NAS) medium. However, the amount of time it would take to process this data would be prohibitively large with respect to the available time to process it.

Consider a simple example. Suppose the average size of the job processed by a business unit is 200 GB, and assume that we can read about 50 MB per second. At 50 MB per second, we need 2 seconds to read 100 MB of data from the disk sequentially, and it would take approximately 1 hour to read the entire 200 GB. Now imagine that this data was required to be processed in under 5 minutes. If the 200 GB required per job could be evenly distributed across 100 nodes, and each node could process its own data (consider a simplified use-case such as simply selecting a subset of data based on a simple criterion: SALES_YEAR>2001), then, discounting the time taken for CPU processing and for assembling the results from the 100 nodes, the total processing can be completed in under 1 minute; the sketch at the end of this section works through this arithmetic. This simplistic example shows that Big Data is context-sensitive and that the context is provided by business need.

Note: Dr. Jeff Dean's keynote discusses parallelism in a paper you can find at www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf. To read 1 MB of data sequentially from a local disk requires 20 million nanoseconds. Reading the same data from a 1 Gbps network requires about 250 million nanoseconds (assuming that 1 KB needs 250,000 nanoseconds, and 500,000 nanoseconds per round-trip for each KB). Although the link is a bit dated, and the numbers have changed since then, we use these numbers in this chapter for illustration; the proportions of the numbers with respect to each other, however, have not changed much.

Key Idea Behind Big Data Techniques

Although we have made many assumptions in the preceding example, the key takeaway is that we can process data very fast, yet there are significant limitations on how fast we can read the data from persistent storage. Compared with reading/writing node-local persistent storage, it is even slower to send data across the network. Some of the common characteristics of all Big Data methods are the following:

• Data is distributed across several nodes (Network I/O speed
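To make the back-of-envelope arithmetic from the 200 GB example concrete, here is a minimal Java sketch under the same assumptions stated above (a 50 MB/s sequential read rate, a 200 GB job, and 100 nodes); the class and constant names are ours for illustration, not part of any Hadoop API:

public class ReadTimeEstimate {

    // Assumptions taken from the example in the text.
    private static final double READ_RATE_MB_PER_SEC = 50.0; // sequential read rate
    private static final double JOB_SIZE_GB = 200.0;         // average job size
    private static final int NODE_COUNT = 100;               // nodes sharing the work

    public static void main(String[] args) {
        double jobSizeMb = JOB_SIZE_GB * 1024;

        // Single machine: read the entire job sequentially from disk.
        double singleNodeSec = jobSizeMb / READ_RATE_MB_PER_SEC;
        System.out.printf("Single node: %.0f s (~%.1f hours)%n",
                singleNodeSec, singleNodeSec / 3600);

        // 100 nodes: each node reads only its 1/100th share, in parallel;
        // CPU time and the cost of assembling results are ignored here,
        // just as they are in the example above.
        double perNodeSec = (jobSizeMb / NODE_COUNT) / READ_RATE_MB_PER_SEC;
        System.out.printf("%d nodes: %.0f s per node%n", NODE_COUNT, perNodeSec);
    }
}

Running it prints roughly 4,096 seconds (about 1.1 hours) for the single-node case versus about 41 seconds per node when the reads happen in parallel, which is where the "approximately 1 hour" and "under 1 minute" figures above come from.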