Pro Hadoop

DOCUMENT INFORMATION

Books for professionals by professionals®

Pro Hadoop (440 pages)

Dear Reader,

Pro Hadoop is a guide to using Hadoop Core, a wonderful tool that allows you to use ordinary hardware to solve extraordinary problems. In the course of my work, I have needed to build applications that would not fit on a single affordable machine, creating custom scaling and distribution tools in the process. With the advent of Hadoop and MapReduce, I have been able to focus on my applications instead of worrying about how to scale them.

It took some time before I had learned enough about Hadoop Core to actually be effective. This book is a distillation of that knowledge, and a book I wish had been available to me when I first started using Hadoop Core.

I begin by showing you how to get started with Hadoop and the Hadoop Core shared file system, HDFS. Then you will see how to write and run functional and effective MapReduce jobs on your clusters, as well as how to tune your jobs and clusters for optimum performance. I provide recipes for unit testing and details on how to debug MapReduce jobs. I also include examples of using advanced features such as map-side joins and chain mapping.

To bring everything together, I take you through the step-by-step development of a nontrivial MapReduce application. This will give you insight into a real-world Hadoop project.

It is my sincere hope that this book provides you with an enjoyable learning experience and the knowledge you need to be the local Hadoop Core wizard.
Jason Venner

US $39.99
Shelve in: Software Engineering/Software Development
User level: Intermediate–Advanced

The Expert's Voice® in Open Source
Build scalable, distributed applications in the cloud
Companion eBook available (see the last page for details on the $10 eBook version)
Source code online: www.apress.com
ISBN 978-1-4302-1942-2

The Apress Roadmap: Beginning Google App Engine; Pro Amazon EC2 and WS; Beginning Scala; Pro Hadoop; The Definitive Guide to Terracotta

Pro Hadoop
Jason Venner

Copyright © 2009 by Jason Venner

All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval system, without the prior written permission of the copyright owner and the publisher.

ISBN-13 (pbk): 978-1-4302-1942-2
ISBN-13 (electronic): 978-1-4302-1943-9

Printed and bound in the United States of America 9 8 7 6 5 4 3 2 1

Trademarked names may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, we use the names only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

Java™ and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc., in the US and other countries. Apress, Inc., is not affiliated with Sun Microsystems, Inc., and this book was written without endorsement from Sun Microsystems, Inc.
Lead Editor: Matthew Moodie
Technical Reviewer: Steve Cyrus
Editorial Board: Clay Andres, Steve Anglin, Mark Beckner, Ewan Buckingham, Tony Campbell, Gary Cornell, Jonathan Gennick, Michelle Lowman, Matthew Moodie, Duncan Parkes, Jeffrey Pepper, Frank Pohlmann, Douglas Pundick, Ben Renow-Clarke, Dominic Shakeshaft, Matt Wade, Tom Welsh
Project Manager: Richard Dal Porto
Copy Editors: Marilyn Smith, Nancy Sixsmith
Associate Production Director: Kari Brooks-Copony
Production Editor: Laura Cheu
Compositor: Linda Weidemann, Wolf Creek Publishing Services
Proofreader: Linda Seifert
Indexer: Becky Hornyak
Artist: Kinetic Publishing Services
Cover Designer: Kurt Krames
Manufacturing Director: Tom Debolski

Distributed to the book trade worldwide by Springer-Verlag New York, Inc., 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax 201-348-4505, e-mail orders-ny@springer-sbm.com, or visit http://www.springeronline.com.

For information on translations, please contact Apress directly at 2855 Telegraph Avenue, Suite 600, Berkeley, CA 94705. Phone 510-549-5930, fax 510-549-5939, e-mail info@apress.com, or visit http://www.apress.com.

Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales–eBook Licensing web page at http://www.apress.com/info/bulksales.

The information in this book is distributed on an “as is” basis, without warranty. Although every precaution has been taken in the preparation of this work, neither the author(s) nor Apress shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information contained in this work.

The source code for this book is available to readers at http://www.apress.com.
You may need to answer questions pertaining to this book in order to successfully download the code.

This book is dedicated to Joohn Choe. He had the idea, walked me through much of the process, trusted me to write the book, and helped me through the rough spots.

Contents at a Glance

About the Author
About the Technical Reviewer
Acknowledgments
Introduction
CHAPTER 1 Getting Started with Hadoop Core
CHAPTER 2 The Basics of a MapReduce Job
CHAPTER 3 The Basics of Multimachine Clusters
CHAPTER 4 HDFS Details for Multimachine Clusters
CHAPTER 5 MapReduce Details for Multimachine Clusters
CHAPTER 6 Tuning Your MapReduce Jobs
CHAPTER 7 Unit Testing and Debugging
CHAPTER 8 Advanced and Alternate MapReduce Techniques
CHAPTER 9 Solving Problems with Hadoop
CHAPTER 10 Projects Based On Hadoop and Future Directions
APPENDIX A The JobConf Object in Detail
Index

Contents

About the Author
About the Technical Reviewer
Acknowledgments
Introduction

CHAPTER 1 Getting Started with Hadoop Core
  Introducing the MapReduce Model
  Introducing Hadoop
  Hadoop Core MapReduce
  The Hadoop Distributed File System
  Installing Hadoop
  The Prerequisites
  Getting Hadoop Running
  Checking Your Environment
  Running Hadoop Examples and Tests
  Hadoop Examples
  Hadoop Tests
  Troubleshooting
  Summary

CHAPTER 2 The Basics of a MapReduce Job
  The Parts of a Hadoop MapReduce Job
  Input Splitting
  A Simple Map Function: IdentityMapper
  A Simple Reduce Function: IdentityReducer
  Configuring a Job
  Specifying Input Formats
  Setting the Output Parameters
  Configuring the Reduce Phase
  Running a Job
  Creating a Custom Mapper and Reducer
  Setting Up a Custom Mapper
  After the Job Finishes
  Creating a Custom Reducer
  Why Do the Mapper and Reducer Extend MapReduceBase?
  Using a Custom Partitioner
  Summary

CHAPTER 3 The Basics of Multimachine Clusters
  The Makeup of a Cluster
  Cluster Administration Tools
  Cluster Configuration
  Hadoop Configuration Files
  Hadoop Core Server Configuration
  A Sample Cluster Configuration
  Configuration Requirements
  Configuration Files for the Sample Cluster
  Distributing the Configuration
  Verifying the Cluster Configuration
  Formatting HDFS
  Starting HDFS
  Correcting Errors
  The Web Interface to HDFS
  Starting MapReduce
  Running a Test Job on the Cluster
  Summary

CHAPTER 4 HDFS Details for Multimachine Clusters
  Configuration Trade-Offs
  HDFS Installation for Multimachine Clusters
  Building the HDFS Configuration
  Distributing Your Installation Data
  Formatting Your HDFS
  Starting Your HDFS Installation
  Verifying HDFS Is Running
  [...]
... Introducing Hadoop

Hadoop is the Apache Software Foundation top-level project that holds the various Hadoop subprojects that graduated from the Apache Incubator. The Hadoop project provides and supports the development of open source software that supplies a framework for the development of highly scalable distributed computing applications. The Hadoop framework handles the processing details, ...

Note: The Hadoop logo is a stuffed yellow elephant, and Hadoop happened to be the name of a stuffed yellow elephant owned by the child of the principal architect.

The introduction on the Hadoop project web page (http://hadoop.apache.org/) states: The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing, including: Hadoop Core, our flagship sub-project, provides ...

... external programs to be used to provide the MapReduce functionality.

Chapter 9, Solving Problems with Hadoop: This chapter describes step-by-step development of a nontrivial MapReduce job, including the whys of the design decisions. The sample MapReduce job performs range joins, and uses custom comparator and partitioner classes.

Chapter 10, Projects Based on Hadoop and Future Directions: This chapter provides ...

... framework. While Hadoop Core provides HDFS, HDFS is not required. In Hadoop JIRA (the issue-tracking system), item 4686 is a tracking ticket to separate HDFS into its own Hadoop project. In addition to HDFS, Hadoop Core supports the CloudStore (formerly Kosmos) file system (http://kosmosfs.sourceforge.net/) and the Amazon Simple Storage Service (S3) file system (http://aws.amazon.com/s3/). The Hadoop Core framework ...

Contents (continued excerpt):
  ... Summary
CHAPTER 10 Projects Based On Hadoop and Future Directions
  Hadoop Core–Related Projects
  HBase: HDFS-Based Column-Oriented Table
  Hive: The Data ...
... JobConf object is the heart of the application developer’s interaction with Hadoop. This book’s appendix goes through each method in detail.

Prerequisites

For those of you who are new to Hadoop, I strongly urge you to try Cloudera’s open source Distribution for Hadoop (http://www.cloudera.com/hadoop). It provides the stable base of Hadoop 0.18.3 with bug fixes and some new features back-ported in and added-in ...

... guide to developing and running software using Hadoop Core, a project hosted by the Apache Software Foundation. This chapter introduces Hadoop Core and details how to get a basic Hadoop Core installation up and running.

Introducing the MapReduce Model

Hadoop supports the MapReduce model, which was introduced by Google as a method of solving a class of petascale problems with large clusters of inexpensive ...

...
public boolean getProfileEnabled()
public void setProfileEnabled(boolean newValue)
public String getProfileParams()
public void setProfileParams(String value)
public Configuration.IntegerRanges getProfileTaskRange(boolean isMap)
public void setProfileTaskRange(boolean ...

... concise guide to getting started with Hadoop and getting the most out of your Hadoop clusters. My early experiences with Hadoop were wonderful and stressful. While Hadoop supplied the tools to scale applications, it lacked documentation on how to use the framework effectively. This book provides that information. It enables you to rapidly and painlessly get up to speed with Hadoop. This is the book I wish was ...

... on Hadoop Core that provides data summarization, ad hoc querying, and analysis of datasets.

The Hadoop Core project provides the basic services for building a cloud computing environment with commodity hardware, and the APIs for developing software that will run on that cloud. The two fundamental pieces of Hadoop Core are the MapReduce framework, the cloud computing environment, and the Hadoop Distributed ...
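The MapReduce model named in the excerpts above splits a computation into a map step that emits key/value pairs and a reduce step that combines all values sharing a key. As a minimal sketch of that idea, assuming a single process and plain Java collections rather than the Hadoop API, a word count might look like the following. The class and method names (`MapReduceSketch`, `map`, `shuffle`, `reduce`, `run`) are hypothetical and chosen only for illustration; they are not taken from the book or from Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: an in-memory imitation of the MapReduce phases.
// None of these names belong to Hadoop; they are assumptions for this sketch.
public class MapReduceSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word.toLowerCase(), 1));
            }
        }
        return out;
    }

    // Shuffle phase: group every emitted value under its key, sorted by key.
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse all values for one key into a single sum.
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: run map over every line, shuffle, then reduce each key group.
    public static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(mapped).forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }

    public static void main(String[] args) {
        // Prints {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
        System.out.println(run(List.of("the quick brown fox", "the lazy dog")));
    }
}
```

In a real cluster the framework, not the application, would distribute the map calls across machines and perform the grouping between the two phases; the sketch keeps everything in one JVM so that the data flow is easy to follow.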

Date posted: 28/04/2014, 16:45
