Big Data and Hadoop: Learn by Example (Bhushan, Mayank)

DOCUMENT INFORMATION

Basic information

Format
Pages	721
File size	10.63 MB

Content

BIG DATA & HADOOP
Learn by Example
by Mayank Bhushan

FIRST EDITION 2020
Copyright © BPB Publications, INDIA
ISBN: 978-93-8655-199-3

All Rights Reserved. No part of this publication can be stored in a retrieval system or reproduced in any form or by any means without the prior written permission of the publishers.

LIMITS OF LIABILITY AND DISCLAIMER OF WARRANTY
The Author and Publisher of this book have tried their best to ensure that the programmes, procedures and functions described in the book are correct. However, the author and the publishers make no warranty of any kind, expressed or implied, with regard to these programmes or the documentation contained in the book. The author and publisher shall not be liable in any event of any damages, incidental or consequential, in connection with, or arising out of the furnishing, performance or use of these programmes, procedures and functions. Product names mentioned are used for identification purposes only and may be trademarks of their respective companies. All trademarks referred to in the book are acknowledged as properties of their respective owners.

Distributors:
BPB PUBLICATIONS, 20, Ansari Road, Darya Ganj, New Delhi-110002. Ph: 23254990/23254991
BPB BOOK CENTRE, 376 Old Lajpat Rai Market, Delhi-110006. Ph: 23861747
MICRO MEDIA, Shop No 5, Mahendra Chambers, 150 DN Rd, Next to Capital Cinema, V.T (C.S.T.) Station, MUMBAI-400 001. Ph: 22078296/22078297
DECCAN AGENCIES, 4-3-329, Bank Street, Hyderabad-500195. Ph: 24756967/24756400

Published by Manish Jain for BPB Publications, 20, Ansari Road, Darya Ganj, New Delhi-110002 and Printed by Repro India Pvt Ltd, Mumbai

Dedicated To
My beloved Family
Mrs Neelam Sharma/Mr Gopal Krishna Sharma
Mrs Aarti/Mr Shashank
Mrs Apoorva
Most loving-Anjika

Preface
I am very confident that the present work will come as a relief to students looking for a comprehensive work that explains difficult concepts in layman's language, offers a variety of practical approaches and conceptual problems along with their systematically worked out solutions, and covers the syllabus prescribed at various levels in universities. This book promises to be a very good starting point for beginners and an asset to advanced users too.

This book is written to follow the syllabi of various universities, and its aim is to keep the course approach as “learning with example”. Difficult concepts of Big Data-Hadoop are presented in an easy and practical way so that students can understand them efficiently. The book provides screenshots of practical approaches, which can be helpful for students.

It is said, “To err is human, to forgive divine”. In this light I hope that the shortcomings of the book will be forgiven. At the same time I am open to any kind of constructive criticism and suggestions for further improvement. All intelligent suggestions are welcome, and I will try my best to incorporate such valuable suggestions in subsequent editions of this book.

23rd March 2018
Mayank Bhushan

Acknowledgement
I would like to express my gratitude to all those who provided support, talked things over, read, wrote, offered comments, allowed me to quote their remarks and assisted in the editing, proofreading and design. I have relied on many people to guide me directly and indirectly in writing this book. I am very thankful to the Hadoop community,
from whom I have learned through continuous effort, and I also owe a debt of gratitude to ABES College for providing me with all the facilities for the Big Data-Hadoop lab. There is always a sense of gratitude that everyone expresses to others for the helpful and needed services they render during difficult phases of life and in achieving the goals already set. It is impossible to thank everyone individually, but I make a humble effort here to thank some of them. At the outset I am thankful to the Almighty, who is constantly and invisibly guiding everybody and has also helped me to work on the right path.

I am very thankful to Prof (Dr.) Shailesh Tiwari, H.O.D (CSE), ABES Engineering College, Ghaziabad (U.P.), for guiding and supporting me. He is the main source of inspiration for me. I would also like to thank Dr Munesh Chandra Trivedi, Dean (REC-Azamgarh), Dr Pratibha Singh (Prof., ABES Engineering College) and Dr Shaswati Banerjea, Asst Prof (MNNIT Allahabad), who always support me everywhere. Without their help this book would not have been possible. I am indebted for technical help to my dearest friend and colleague Mr Omesh Kumar, who guided me technically through every problem. I extend my thanks to all my gurus, friends and colleagues who helped and kept me motivated in writing this text.

Special thanks to:
Dr K.K Mishra, MNNIT Allahabad
Dr Mayank Pandey, MNNIT Allahabad
Dr Shashank Srivastava, MNNIT Allahabad
Mr Nitin Shukla, MNNIT Allahabad
Mr Suraj Deb Barma, Govt Polytechnic College, Agartala
Dr A.L.N Rao, GL Bajaj, Greater Noida
Mr Ankit Yadav, Mr Desh Deepak Pathak, ABES EC Ghaziabad
Dr Sumit Yadav, IP University
Mr Aatif Jamshed, Galgotia College, Greater Noida

Write the working procedure of HDFS and also explain its features. (2015-16) 7.5 marks
How many types of data format are used in HDFS in Hadoop? Explain all.
What is the replication policy of Hadoop? Explain it with reference to blocks.
How is fault tolerance useful in Hadoop?
Write the anatomy of file read and write in HDFS with a proper diagram.
What is the role of Hadoop achieved?
Write short notes on: Compression input splits Avro file-based data structure
How does Hadoop decide the number of splits that will be required in reducers? Explain the procedure.

Chapter: Hadoop Installation
Define all methods for installation of Hadoop.
Create node clusters practically and transfer files between them.
Practice all commands of HDFS in a standalone/distributed system.
Understand all types of configuration files that are used for Hadoop installation.

Chapter: MapReduce Applications
Explain in detail about Map-reduce Workflows. (2015-16) marks
What is MapReduce? Explain the steps involved in its processing. (2015-16) marks
Write the steps for a word count program which uses MapReduce for processing. (2016-17) 7.5 marks
Explain the various stages of MapReduce.
How is MapReduce different from other processing techniques? Explain.
Explain the difference between MapReduce processing and the traditional approach to processing large amounts of data.
Explain map side and reduce side join operations.
Explain YARN.
What are the various input and output formats of MapReduce? Explain all.
Write the anatomy of file read and write for processing with MapReduce. (2016-17) 7.5 marks
What is the role of mapreduce1?
How are replica management techniques used in Hadoop? Explain all the schemes involved and also explain their advantages.
What do you understand by lazy output?
Explain the role of key input text format and text input format.
What is the significance of YARN and how does it differ from mapreduce1? (2016-17) 7.5 marks

Chapter: Hadoop Related Tools-I (Hbase & Cassandra)
What is the role of a NoSQL database in a system?
How is a NoSQL database useful in processing files? Explain its difference from existing databases.
Enumerate the rules followed while data modelling in Cassandra. How are the relationships handled in Cassandra?
(2016-17) 7.5 marks
Explain HBase and its data model techniques. (2015-16) marks
Explain the Cassandra data model with an example. (2015-16) marks
Write and explain all steps involved in the installation of HBase.
What is a region, a region server and locking?
Write 10 commands and their usage in HBase.
How is a column-oriented database different from a row-oriented database?
How to create a table and a database in HBase?
How can the CAP theorem be applied in a Cassandra database? (2015-16) marks
Explain the characteristics of Cassandra.
What are the roles of column families in a NoSQL database? Explain their advantages.
Write short notes on: super columns, key spaces, clusters, column families. (2015-16) marks
Define delete table, truncate table and alter table in NoSQL with examples. (2016-17) 7.5 marks

Chapter: Hadoop Related Tools-II (PigLatin & HiveQL)
How are Date and Time data types used in Hive? (2016-17) marks
Why is Hive preferred instead of PigLatin? (2016-17) marks
Mention the usage of Grunt. (2016-17) marks
Write down the Hive queries for natural join and outer join. Give examples. (2016-17) marks
Consider the student data file (st.txt), with data in the following format: Name, District, age, gender. (2016-17) 7.5 marks
Write a PIG script to display the names of all female students.
Write a PIG script to find the number of students from XXXX District.
Write a PIG script to display the district-wise count of all male students.
Explain the operators supported by Pig with respect to data access, transformations and debugging operations. (2016-17) marks
Explain in detail about Hive data manipulation language, queries, data definition and data types. (2015-16) marks
Explain all types of execution modes used by PigLatin.
How many types of platform are used in PigLatin?
How many different types of commands are used by PigLatin? Explain any five of them.
What do you understand by the Pig data model?
Explain.
Following is the sample data:
en google.com 50 100
en yahoo.com 60 100
us google.com 70 100
en google.com 68 100
Explain how to load it into a database and how to retrieve the total hits mentioned in the last column.
What is meant by UDFs? Explain their use.
How can testing be performed in PigLatin? Explain all scripts used in it.
Define the Hive architecture. Explain the files involved in it.

Chapter: Practical & Research based Topics
For project and research related queries there will be a need for a bulk amount of data; for that, mail me at mayankbhushan2006@gmail.com with your requirements.
Create any project which uses data analysis as a step in it. Data analysis can be done using: traffic analysis, speed analysis, marks analysis, weather analysis, water consumption analysis etc.
What is a bloom filter? Explain its use and requirement. (2015-16) marks
What are the methods to retrieve data from Twitter?
What is the role of Flume? Explain its use.
Explain all applications involving bloom filters. (2016-17) marks

B.Tech Theory Examination (Semester-VI) 2015-16
BIG DATA
Time : Hours    Max Marks : 100

Section-A
Attempt all parts. All parts carry equal marks. Write the answer to each part in short. (2 × 10 = 20)
How can you define the term big data?
What comes under big data?
Explain the major challenges of big data.
What are the different types of big data technologies?
Explain the difference between operational and analytical systems.
What is Hadoop architecture?
How does Hadoop work?
Explain the advantages of Hadoop.
Explain the Hadoop distributed file system.
What do you mean by Hadoop operation modes?
Explain the concept of a multi node cluster for a distributed environment.

Section-B
Attempt any five parts. All parts carry equal marks: (10 × 5 = 50)
Write the working procedure of HDFS and also explain the features of HDFS.
What is Map Reduce?
Explain the stages of Map Reduce program execution.
Explain big data and algorithmic trading.
Discuss crowd sourcing analytics and inter, trans firewall analytics.
Explain big data and Hadoop open source technology.
Write about data model relationships and database types.
Explain the aggregate data models with an example.
Write a brief note on composing map-reduce calculations.

Section-C
Note: Attempt any two questions from this section (15 × 2 = 30)
Explain HBase with its data model and implementations, and the Cassandra data model with an example.
Explain in detail about the Hive data manipulation, queries, data definition and data types.
Elaborate on graph Mapping schemes. What do you mean by lower bounds replication rate?

B.TECH Theory Examination (Semester-VI) 2016-17
BIG DATA
Time : Hours    Max Marks : 100
Note : Be precise in your answer. In case of a numerical problem assume data wherever not provided.

Section-A
Explain the following: (10 x 2 = 20)
List the characteristics of big data.
How to calculate risk in marketing?
Why would you use inferential statistics in big data?
What do you mean by sharding?
State the usage of Hadoop pipes.
Compare Master-Slave and peer to peer architecture in NoSQL.
What is the purpose of a bloom filter?
Compare the classic Map Reduce with YARN.
Mention the usage of Grunt.
How are Date and Time data types used in Hive?
Why is Hive preferred instead of PigLatin?

Section-B
Attempt any five of the following questions: 5 x 10 = 50
Relate crowd sourcing and big data. Justify the relationship with an example.
Write down the aggregate data model in detail with an example.
Differentiate “Scale up” and “Scale out”. Explain with an example. How does Hadoop use the scale out feature to improve performance?
Discuss in detail the basic building blocks of Hadoop with a neat sketch.
Explain in detail about Map-reduce Workflows.
Provide an overview of the HBase data model.
Enumerate the rules followed while data modelling in Cassandra. How are the relationships handled in Cassandra?
Write down the Hive queries for natural join and outer join. Give examples.

Section-C
Note: Attempt any two of the following questions: (2 x 15 = 30)
Explain with a neat sketch the processing of a job in Hadoop.
List the various operational modes of Hadoop cluster configuration and explain in detail about configuring/installing Hadoop in local/standalone mode.
Consider the student data file (st.txt), with data in the following format: Name, District, age, gender.
Write a PIG script to display the names of all female students.
Write a PIG script to find the number of students from XXXX District.
Write a PIG script to display the district-wise count of all male students.
Explain the operators supported by Pig with respect to data access, transformations and debugging operations.
Discuss the different ways of constructing version stamps. What are their pros and cons?
Write in detail about the three dimensions of big data.

Date posted: 16/09/2022, 22:17