Aytas y building a modern data platform big data systems 2021

Designing Big Data Platforms Designing Big Data Platforms How to Use, Deploy, and Maintain Big Data Systems Yusuf Aytas Dublin, Ireland This edition first published 2021 © 2021 John Wiley and Sons, Inc All rights reserved No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions The right of Yusuf Aytas to be identified as the author of this work has been asserted in accordance with law Registered Office John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA Editorial Office 111 River Street, Hoboken, NJ 07030, USA For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com Wiley also publishes its books in a variety of electronic formats and by print-on-demand Some content that appears in standard print versions of this book may not be available in other formats Limit of Liability/Disclaimer of Warranty The contents of this work are intended to further general scientific research, understanding, and discussion only and are not intended and should not be relied upon as recommending or promoting scientific method, diagnosis, or treatment by physicians for any particular patient In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of medicines, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each medicine, equipment, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make This work is sold with the understanding that the publisher is not engaged in rendering professional services The advice and strategies contained herein may not be suitable for your situation You should consult with a specialist where appropriate Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages Library of Congress Cataloging-in-Publication Data Applied for: ISBN: 9781119690924 Cover design by Wiley Cover image: © monsitj/iStock/Getty Images Set in 9.5/12.5pt STIXTwoText by Straive, Chennai, India 10 v Contents List of Contributors xvii Preface xix Acknowledgments xxi Acronyms xxiii Introduction xxv 1.1 1.2 1.2.1 1.2.1.1 1.2.1.2 1.2.1.3 1.2.1.4 1.2.2 1.2.2.1 1.2.2.2 1.2.2.3 1.2.2.4 1.2.2.5 1.2.2.6 1.2.2.7 1.2.2.8 An Introduction: What’s a Modern Big Data Platform Defining Modern Big Data Platform Fundamentals of a Modern Big Data Platform Expectations from Data Ease of Access Security Quality Extensibility Expectations from Platform Storage Layer Resource Management ETL Discovery Reporting Monitoring Testing Lifecycle Management 2.1 2.1.1 2.1.2 A Bird’s Eye View on Big Data 11 A Bit of History 11 Early Uses of Big Data Term 11 A New Era 12 vi Contents 2.1.2.1 2.1.2.2 2.1.3 2.1.3.1 2.1.3.2 2.2 2.2.1 2.2.2 2.2.3 2.2.4 2.3 2.3.1 2.3.2 2.3.3 2.3.4 2.4 2.4.1 2.4.2 2.4.3 2.4.4 2.4.5 2.4.6 2.4.7 Word Count Problem 12 Execution Steps 13 An Open-Source Alternative 15 Hadoop Distributed File System 15 HadoopMapReduce 17 What Makes Big Data 20 Volume 20 Velocity 21 Variety 21 Complexity 21 Components of Big Data Architecture Ingestion 22 Storage 23 Computation 23 Presentation 24 Making Use of Big Data 24 Querying 24 Reporting 25 Alerting 25 Searching 25 Exploring 25 Mining 25 Modeling 26 3.1 3.1.1 3.1.2 3.2 3.2.1 3.2.2 3.2.3 3.2.4 3.2.5 3.2.6 3.2.7 3.2.8 3.3 3.3.1 3.3.2 3.3.3 A Minimal Data Processing and Management System Problem Definition 27 Online Book Store 27 User Flow Optimization 28 Processing Large Data with Linux Commands 28 Understand the Data 28 Sample the Data 28 Building the Shell Command 29 Executing the Shell Command 30 Analyzing the Results 31 Reporting the Findings 32 Automating the Process 33 A Brief Review 33 Processing Large Data with PostgreSQL 34 Data Modeling 34 Copying Data 35 Sharding in PostgreSQL 37 22 27 Contents 3.3.3.1 3.3.3.2 3.4 Setting up Foreign Data Wrapper 37 Sharding Data over Multiple Nodes 38 Cost of Big Data 39 4.1 4.1.1 4.1.2 4.1.3 4.1.4 4.2 4.2.1 4.2.1.1 4.2.1.2 4.2.1.3 4.2.1.4 4.2.2 4.2.2.1 4.2.2.2 4.2.2.3 4.2.3 4.2.3.1 4.2.3.2 4.2.3.3 4.2.3.4 4.2.3.5 4.3 4.3.1 4.3.2 4.3.2.1 4.3.2.2 4.3.2.3 4.3.2.4 4.3.3 4.4 4.4.1 4.4.1.1 4.4.1.2 4.4.1.3 4.4.2 4.4.2.1 Big Data Storage 41 Big Data Storage Patterns 41 Data Lakes 41 Data Warehouses 42 Data Marts 43 Comparison of Storage Patterns 43 On-Premise Storage Solutions 44 Choosing Hardware 44 DataNodes 44 NameNodes 45 Resource Managers 45 Network Equipment 45 Capacity Planning 46 Overall Cluster 46 Resource Sharing 47 Doing the Math 47 Deploying Hadoop Cluster 48 Networking 48 Operating System 48 Management Tools 49 Hadoop Ecosystem 49 A Humble Deployment 49 Cloud Storage Solutions 53 Object Storage 54 Data Warehouses 55 Columnar Storage 55 Provisioned Data Warehouses 56 Serverless Data Warehouses 56 Virtual Data Warehouses 57 Archiving 58 Hybrid Storage Solutions 59 Making Use of Object Store 59 Additional Capacity 59 Batch Processing 59 Hot Backup 60 Making Use of Data Warehouse 60 Primary Data Warehouse 60 vii viii Contents 4.4.2.2 4.4.3 Shared Data Mart 61 Making Use of Archiving 61 5.1 5.2 5.2.1 5.2.1.1 5.2.1.2 5.2.2 5.2.2.1 5.2.2.2 5.3 5.3.1 5.3.2 5.3.2.1 5.3.2.2 5.3.2.3 5.3.2.4 5.3.3 5.3.3.1 5.3.3.2 5.3.4 5.3.4.1 5.3.4.2 5.4 5.5 5.5.1 5.5.2 5.5.2.1 5.5.2.2 5.5.2.3 5.5.2.4 Ofﬂine Big Data Processing 63 Defining Offline Data Processing 63 MapReduce Technologies 65 Apache Pig 65 Pig Latin Overview 66 Compilation To MapReduce 66 Apache Hive 67 Hive Database 68 Hive Architecture 69 Apache Spark 70 What’s Spark 71 Spark Constructs and Components 71 Resilient Distributed Datasets 71 Distributed Shared Variables 73 Datasets and DataFrames 74 Spark Libraries and Connectors 75 Execution Plan 76 The Logical Plan 77 The Physical Plan 77 Spark Architecture 77 Inside of Spark Application 78 Outside of Spark Application 79 Apache Flink 81 Presto 83 Presto Architecture 83 Presto System Design 84 Execution Plan 84 Scheduling 86 Resource Management 86 Fault Tolerance 87 6.1 6.2 6.3 6.3.1 6.3.1.1 6.3.1.2 Stream Big Data Processing 89 The Need for Stream Processing 89 Defining Stream Data Processing 90 Streams via Message Brokers 92 Apache Kafka 92 Apache Samza 93 Kafka Streams 98 Contents 6.3.2 6.3.2.1 6.3.3 6.4 6.4.1 6.4.1.1 6.4.1.2 6.4.2 6.4.2.1 6.4.2.2 6.4.3 6.4.3.1 6.4.3.2 6.4.4 6.4.4.1 6.4.4.2 Apache Pulsar 100 Pulsar Functions 102 AMQP Based Brokers 105 Streams via Stream Engines 106 Apache Flink 106 Flink Architecture 107 System Design 109 Apache Storm 111 Storm Architecture 114 System Design 115 Apache Heron 116 Storm Limitations 116 Heron Architecture 117 Spark Streaming 118 Discretized Streams 119 Fault-tolerance 120 7.1 7.1.1 7.1.2 7.1.2.1 7.1.2.2 7.1.2.3 7.2 7.2.1 7.2.2 7.2.3 7.2.4 7.2.5 7.2.5.1 7.2.5.2 7.2.5.3 7.2.5.4 7.3 7.3.1 7.3.2 7.3.2.1 7.3.2.2 7.3.3 7.3.4 Data Analytics 121 Log Collection 121 Apache Flume 122 Fluentd 122 Data Pipeline 123 Fluent Bit 124 Fluentd Deployment 124 Transferring Big Data Sets 125 Reloading 126 Partition Loading 126 Streaming 127 Timestamping 127 Tools 128 Sqoop 128 Embulk 128 Spark 129 Apache Gobblin 130 Aggregating Big Data Sets 132 Data Cleansing 132 Data Transformation 134 Transformation Functions 134 Transformation Stages 135 Data Retention 135 Data Reconciliation 136 ix x Contents 7.4 7.4.1 7.4.2 7.4.2.1 7.4.2.2 7.4.3 7.4.3.1 7.4.3.2 7.4.3.3 7.4.3.4 7.4.4 7.5 7.5.1 7.5.1.1 7.5.1.2 7.5.1.3 7.5.2 7.5.2.1 7.5.2.2 7.5.2.3 7.5.3 7.5.3.1 7.5.3.2 7.5.3.3 7.5.3.4 7.5.4 7.5.4.1 7.5.4.2 7.6 7.6.1 7.6.2 Data Pipeline Scheduler 136 Jenkins 137 Azkaban 138 Projects 139 Execution Modes 139 Airflow 139 Task Execution 140 Scheduling 141 Executor 141 Security and Monitoring 142 Cloud 143 Patterns and Practices 143 Patterns 143 Data Centralization 143 Singe Source of Truth 144 Domain Driven Data Sets 145 Anti-Patterns 146 Data Monolith 146 Data Swamp 147 Technology Pollution 147 Best Practices 148 Business-Driven Approach 148 Cost of Maintenance 148 Avoiding Modeling Mistakes 149 Choosing Right Tool for The Job 150 Detecting Anomalies 150 Manual Anomaly Detection 151 Automated Anomaly Detection 151 Exploring Data Visually 152 Metabase 152 Apache Superset 153 8.1 8.1.1 8.1.2 8.1.3 8.2 8.2.1 8.2.2 8.2.3 Data Science 155 Data Science Applications 155 Recommendation 156 Predictive Analytics 156 Pattern Discovery 157 Data Science Life Cycle 158 Business Objective 158 Data Understanding 159 Data Ingestion 159 296 Bibliography Apache Pulsar Apache pulsar is an open-source distributed pub-sub messaging system, 2020 http://pulsar.apache.org/docs/en/2.5.2/standalone/ Apache Storm Apache storm is a free and open source distributed realtime computation system, 2019 https://storm.apache.org/releases/2.1.0/index.html/ Michael Armbrust, Reynold S Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K Bradley, Xiangrui Meng, Tomer Kaftan, Michael J Franklin, Ali Ghodsi, et al Spark SQL: relational data processing in spark In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1383–1394, 2015 Atlas Apache atlas – data governance and metadata framework for Hadoop, 2020 https://atlas.apache.org Beam Apache Beam - an advanced unified programming model, 2020 https://beam apache.org/ Eric A Brewer Towards robust distributed systems In PODC, volume 7, pages 343477–343502 Portland, OR, 2000 Paulo Caroli Lean inception, 2017 https://martinfowler.com/articles/leaninception/ Cassandra Apache Cassandra, 2020 https://cassandra.apache.org/ Michael Cox and David Ellsworth Managing big data for scientific visualization, 1997 Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, et al The snowflake elastic data warehouse In Proceedings of the 2016 International Conference on Management of Data, pages 215–226, 2016 DataEthics Data ethics principles, 2017 https://dataethics.eu/data-ethicsprinciples/ Jeffrey Dean and Sanjay Ghemawat Mapreduce: simplified data processing on large clusters Communications of the ACM, 51 (1): 107–113, January 2004 ISSN 0001-0782 10.1145/1327452.1327492 http://doi.acm.org/10.1145/1327452 1327492 Zhamak Dehghani How to move beyond a monolithic data lake to a distributed data mesh, 2019 https://martinfowler.com/articles/data-monolith-to-mesh.html James Dixon Pentaho, Hadoop, and data lakes Blog, October, 2010 Bradley Efron Missing data, imputation, and the bootstrap Journal of the American Statistical Association, 89 (426): 463–475, 1994 Eric Evans Domain-Driven Design: Tackling Complexity in the Heart of Software Addison-Wesley Professional, 2004 Wei Fang, Xue Zhi Wen, Yu Zheng, and Ming Zhou A survey of big data security and privacy preserving IETE Technical Review, 34 (5): 544–560, 2017 Apache Flink Apache Flink is an open source platform for distributed stream and batch data processing, 2020 https://ci.apache.org/projects/flink/flink-docsrelease-1.10/ Bibliography FluentD Fluentd is an open source data collector for unified logging layer, 2020 https://docs.fluentd.org/ Martin Fowler Event sourcing, 2005 https://www.martinfowler.com/eaaDev/ EventSourcing.html Nir Friedman, Michal Linial, Iftach Nachman, and Dana Pe’er Using Bayesian networks to analyze expression data In Proceedings of the 4th Annual International Conference on Computational Molecular Biology, RECOMB ’00, pages 127–135, New York, NY, USA, 2000 ACM ISBN 1-58113-186-0 10.1145/332306.332355 http://doi.acm.org/10.1145/332306.332355 Ajit Gaddam Securing your big data environment Black Hat USA, 2015, 2015 Alan F Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, and Utkarsh Srivastava Building a high-level dataflow system on top of map-reduce: the pig experience Proceedings of the VLDB Endowment, (2): 1414–1425, 2009 Mark Grover Amundsen - Lyft’s data discovery & metadata engine, 2019 https://eng lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9 Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul Pathak, Stefano Stefani, and Vidhya Srinivasan Amazon redshift and the case for simpler data warehouses In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1917–1923, 2015 Adam Jacobs The pathologies of big data Communications of the ACM, 52 (8): 36–44, 2009 David Karger, Alex Sherman, Andy Berkheimer, Bill Bogstad, Rizwan Dhanidina, Ken Iwamoto, Brian Kim, Luke Matkins, and Yoav Yerushalmi Web caching with consistent hashing Computer Networks, 31 (11–16): 1203–1213, 1999 Shachar Kaufman, Saharon Rosset, Claudia Perlich, and Ori Stitelman Leakage in data mining: formulation, detection, and avoidance ACM Transactions on Knowledge Discovery from Data (TKDD), (4): 1–21, 2012 Zehra Kavasoglu Airbnb Istanbul data playbook, 2019 https://github.com/ kavasoglu/airbnb_istanbul Vijay Khatri and Carol V Brown Designing data governance Communications of the ACM, 53 (1): 148–152, 2010 Martin Kleppmann Making Sense of Stream Processing: The Philosophy Behind Apache Kafka and Scalable Stream Data Platforms O’Reilly Media, Inc., 2016 Martin Kleppmann Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems O’Reilly Media, Inc., 2017 Knox Knox gateway, 2020 https://knox.apache.org/ Ron Kohavi et al A study of cross-validation and bootstrap for accuracy estimation and model selection In Ijcai, volume 14, pages 1137–1145 Montreal, Canada, 1995 Jay Kreps Questioning the lambda architecture Online article, July, page 205, 2014 https://www.martinfowler.com/eaaDev/EventSourcing.html 297 298 Bibliography Patrick Kua, N Ford, and R Parsons Building Evolutionary Architectures O’Reilly Media, Inc., Sebastopol, CA, 2017 Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M Patel, Karthik Ramasamy, and Siddarth Taneja Twitter Heron: stream processing at scale In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 239–250, 2015 C Charles Law, William J Schroeder, Kenneth M Martin, and Joshua Temkin A multi-threaded streaming pipeline architecture for large structured data sets In Proceedings of the Conference on Visualization ’99: Celebrating Ten Years, VIS ’99, pages 225–232, Los Alamitos, CA, USA, 1999 IEEE Computer Society Press ISBN 0-7803-5897-X http://dl.acm.org/citation.cfm?id=319351.319378 Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian t-closeness: privacy beyond k-anonymity and l-diversity In 2007 IEEE 23rd International Conference on Data Engineering, pages 106–115 IEEE, 2007 Steve Lohr The age of big data New York Times, 11 2012 Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam l-diversity: privacy beyond k-anonymity ACM Transactions on Knowledge Discovery from Data (TKDD), (1): 3, 2007 Ajoy Majumdar and Zhen Li Metacat: making big data discoverable and meaningful at netflix, 2018 https://netflixtechblog.com/metacat-making-big-datadiscoverable-and-meaningful-at-netflix-56fb36a53520 Nathan Marz How to beat the cap theorem, 2011 http://nathanmarz.com/blog/howto-beat-the-cap-theorem.html J R Mashey Big data … and the next wave of infrastress 04 1998 Viktor Mayer-Schönberger and Kenneth Cukier Big Data: A Revolution That Will Transform How We Live, Work, and Think Houghton Mifflin Harcourt, 2013 Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis Dremel: interactive analysis of web-scale datasets Proceedings of the VLDB Endowment, (1–2): 330–339, 2010 Mike Mesnier, Gregory R Ganger, and Erik Riedel Object-based storage IEEE Communications Magazine, 41 (8): 84–90, 2003 Arun C Murthy, Vinod Kumar Vavilapalli, Doug Eadline, Joseph Niemiec, and Jeff Markham Apache Hadoop YARN: Moving Beyond MapReduce and Batch Processing with Apache Hadoop Addison-Wesley Professional, 1st edition, 2014 ISBN 0321934504, 9780321934505 Shadi A Noghabi, Kartik Paramasivam, Yi Pan, Navina Ramesh, Jon Bringhurst, Indranil Gupta, and Roy H Campbell Samza: stateful scalable stream processing at linkedin Proceedings of the VLDB Endowment, 10 (12): 1634–1645, 2017 Mike Olson HADOOP: scalable, flexible data storage and analysis IQT Quart, 1: 14–18, 2010 Bibliography Zeljko Panian Some practical experiences in data governance World Academy of Science, Engineering and Technology, 62 (1): 939–946, 2010 Eugenia Politou, Efthimios Alepis, and Constantinos Patsakis Forgetting personal data and revoking consent under the GDPR: challenges and proposed solutions Journal of Cybersecurity, (1): tyy001, 2018 Catherine Pope, Susan Halford, Ramine Tinati, and Mark Weal What’s the big fuss about ’big data’? Journal of Health Services Research & Policy, 19: 67–68, 2014 10.1177/1355819614521181 David Martin Powers Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation, 2011 Foster Provost and Tom Fawcett Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking O’Reilly Media, Inc., 2013 R What is R? 2019 https://www.r-project.org/about.html Ranger Apache ranger - introduction, 2020 https://ranger.apache.org/ Philip Russom et al Big data analytics TDWI Best Practices Report, Fourth Quarter, 19 (4): 1–34, 2011 Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl Item-based collaborative filtering recommendation algorithms In Proceedings of the 10th International Conference on World Wide Web, pages 285–295, 2001 Sentry Apache sentry, 2018 https://sentry.apache.org/ Raghav Sethi, Martin Traverso, Dain Sundstrom, David Phillips, Wenlei Xie, Yutian Sun, Nezih Yegitbasi, Haozhun Jin, Eric Hwang, Nileema Shingte, et al Presto: SQL on everything In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 1802–1813 IEEE, 2019 Victoria Stodden The data science life cycle: a disciplined approach to advancing data science as a science Communications of the ACM, 63 (7): 58–66, 2020 Latanya Sweeney Achieving k-anonymity privacy protection using generalization and suppression International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10 (05): 571–588, 2002 Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy Hive: a warehousing solution over a map-reduce framework Proceedings of the VLDB Endowment, (2): 1626–1629, 2009 Robin Van Meteren and Maarten Van Someren Using content-based filtering for recommendation In Proceedings of the Machine Learning in the New Information Age: MLnet/ECML2000 Workshop, volume 30, pages 47–56, 2000 Ham Vocke The practical test pyramid, 2018 https://martinfowler.com/articles/ practical-test-pyramid.html Sholom M Weiss and Nitin Indurkhya Predictive Data Mining: A Practical Guide Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1998 ISBN 1-55860-403-0 299 300 Bibliography Rick L Wilson and Peter A Rosen Protecting data through perturbation techniques: the impact on knowledge discovery in databases Journal of Database Management (JDM), 14 (2): 14–26, 2003 Longzhi Yang, Jie Li, Noe Elisa, Tom Prickett, and Fei Chao Towards big data governance in cybersecurity Data-Enabled Discovery and Applications, (1): 10, 2019 Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, Ion Stoica, et al Spark: cluster computing with working sets HotCloud, 10 (10–10): 95, 2010 Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J Franklin, Scott Shenker, and Ion Stoica Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing In Presented as part of the 9th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 12), pages 15–28, 2012 Qing Zheng, Haopeng Chen, Yaguang Wang, Jiangang Duan, and Zhiteng Huang COSBench: a benchmark tool for cloud object storage services In 2012 IEEE 5th International Conference on Cloud Computing, pages 998–999 IEEE, 2012 301 Index a access control list (ACL) 200, 221 access permission 6, 16, 206 activeAPI 105 advanced message queuing protocol (AMQP) 92, 105 alerting 25, 234, 259 amundsen 191–193 ansible 49, 53, 221 Apache airflow 139–143 airflow backfill 140 airflow dag run 140–141 airflow pool 140 airflow queue 140 airflow scheduler 141–142 airflow sensor 140 airflow worker 142 Apache Ambari 48, 233 Apache Atlas 193–197 atlas bridge 196 atlas hook 196 atlas term 194 atlas term category 194 atlas term hierarchy 194 atlas type system 193 Apache Avro 122, 129 Apache beam 228 Apache bookkeeper 101–102 bookkeeper bookie 102 bookkeeper ledger 102 Apache Cassandra 106, 112, 119, 225 Apache Flink 81–82, 106–111 Flink alignment 109–110 Flink barrier 109 Flink DataSet 81 Flink DataStream 81 Flink Gelly 108 Flink ML 108 Flink sink 107 Flink source 107 Flink table API 82 Flink TableEnvironment 82 Flink task manager 110 Apache Flume 122 Flume agent 122 Flume channel 122 Flume even 122 Flume sink 122 Apache Gobblin 130–132 Gobblin converter 131–132 Gobblin extractor 131–132 Gobblin publisher 131–132 Gobblin quality checker 131–132 Gobblin runtime 132 Gobblin source 131–132 Gobblin writer 131–132 Apache Hadoop 15, 44–53, 229, 244 Designing Big Data Platforms: How to Use, Deploy, and Maintain Big Data Systems, First Edition Yusuf Aytas © 2021 John Wiley & Sons, Inc Published 2021 by John Wiley & Sons, Inc 302 Index Apache HBase 174, 196, 225 Apache Heron 116–118 Heron instance 118 Heron metrics manager 118 Heron stream manager 118 Heron topology master 117–118 Heron tracker 118 Heron UI 118 Heron VIZ 118 Apache hive 45 hive metastore 45, 70 hive query language (HQL) 68 HiveServer2 217 Apache Impala 212 Apache Kafka 92–93 Kafka consumer group 93 Kafka offset 93 Kafka partition 92–93 Kafka streams 98–100 Kafka topic 92–93, 106 Apache Knox 214–218 Knox gateway 215 Knox gateway provider 217 Knox gateway service 217 Knox policy 214–215 KnoxSSO 216 Apache Mesos 78, 108 Apache NiFi 209 Apache Nutch 15 Apache Pig 65–67 Pig Latin 66 Apache PredictionIO 173–174 PredictionIO event server 174 PredictionIO prediction server 174 Apache Pulsar 100–105 Pulsar functions 102–105 Pulsar namespace 100–101 Pulsar partition 100–101 Pulsar subscription mode 101 Pulsar tenant 100–101 Pulsar topic 100–101 nonpersistent topic 101 persistent topic 101 Apache QPid 105 Apache Ranger 196, 208–212 Ranger admin server 211 Ranger audit server 212 Ranger policy server 211–212 Ranger tag sync 212 Ranger user/group sync 212 resource-based access policy 208–209 tag-based access policy 208–209 Apache Samza 93–97 Apache Sentry 212–214 Sentry plugin 213 Sentry server 213 Apache Spark 70–79, 129–130 discretized stream (DStream) 119 distributed shared variable 73–74 accumulator 74 broadcast variable 73 resilient distributed dataset (RDD) 71 partitioner 71 Spark architecture 77–80 Spark client mode 78 Spark cluster manager 78 Spark cluster mode 78 Spark context 78–79 Spark driver 77–80 Spark executor 77–80 Spark local mode 78 Spark session 78–79 Spark stage 79 Spark DataFrame 74–75 Spark DataSet 74–75 Spark execution plan 76–77 Spark GraphFrame 75 Spark ML 75, 171–173 Spark estimator 172 Spark ML pipelines 172 Spark transformer 172 Index Spark narrow transformation 72–73 Spark streaming 118–120 Spark wide transformation 72–73 Apache Solr 196–197, 212 Apache Sqoop 128 Sqoop import 128 Sqoop export 128 Apache Storm 111–116 Storm bolts 112 Storm Nimbus 114–115, 117 Storm spouts 112 Storm stream 112 Storm supervisor 114–115 Storm Trident 115–116 Storm tuple 112 Apache Superset 153, 192 Superset chart 153 Superset dashboard 153 Apache Tez 70 Apache Thrift 70 Apache ZooKeeper 45, 101, 115, 118, 218 anomaly detection 8, 134, 150–152 automated anomaly detection 151–152 manual anomaly detection 151 auditing 200, 218 Azkaban 138–139 Azkaban busy waiting 139 Azkaban execution modes 139 Azkaban JMX metrics 139 Azkaban projects 139 b backfilling 141, 147, 160 batch processing 18, 59, 226–227 big data collections 11–12 big data governance 188 big data object 11–12 big data revolution 12 business intelligence 1, 37, 54 business decision 1–2 cross-functional insights 145 c CAP theorem 259 capacity planning 48 buffer space 48 compression ratio 48 growth rate 48 intermediate space facto 48 replication factor 48 Cartesian product 22 cascading style sheet (CSS) 32 celery 141–142, 153 checkpoint 16, 109, 135 Cloudera Manager 49, 233 clustering 24, 157, 166 k-means clustering 166 collaborative filtering 25 colocation 56, 170 column-oriented database 23, 55 computational facility 20 compute node 56 container 18, 95, 108, 117, 121–125 continuous integration (CI) 235 cookie 28, 207 correlation 24, 121, 204, 288 cost structure coupling 145, 245–246 d data complexity 21–22 extensibility reprocessing reproducibility quality accuracy completeness 134 consistency 3, 159 303 304 Index data (contd.) reliability 3, 159 visibility 3, variety 21 semi-structured data 21, 42, 147 structured data 21, 42, 147 unstructured data 21, 42, 147 velocity 4, 21 volume 4, 20–21 data abstraction 190 data access 2, 187–188 data acquisition 158, 235 data analytics 121–153 choosing tools 150 maintenance 148–149 pipelines 149 workload data anonymization 204–205 differential privacy 204 k-anonymity 204 l-diversity 204 t-closeness 204 data archiving 58 61 data backup 4, 60 data catalog 181, 250–252 data center 44, 59, 93 data centralization 143–145 data classification 184, 193 data cleansing 42, 132, 187 data cleanup 4, 158 data clustering 184 data collection 207, 248 data compression 46, 57, 129 LZ4 129 LZO 129 zip 129 data definition language (DDL) 68, 231 data dictionary 180, 185 data discovery 179–197 crawling 6–7, 181 query sampling 6–7 data encryption 202–203 application layer encryption 203 database layer encryption 203 file system layer encryption 203 transport layer encryption 203 data ethics 260 data exploration 25, 152–153, 160–161 data extraction 23, 179 data feedback loop 251–252 data glossary 185, 194 data governance 147, 186–188 data grouping 183–184 data growth 229–230 data guild 256–257 data ingestion 22–23, 47, 159–160, 192–193 data lake 41–42, 144, 179 data life cycle 147, 194 data lineage 6, 177, 182, 193 data locality 45, 71 data mart 43, 61, 181 data matrix 185 data mesh 245–247 data metrics 183 data mining 25 pattern discovery 156–157 pattern mining 25 recommendation 156 data modeling 26, 149–150 data monitoring 254 data monolith 146–147, 245 data notification 185, 190, 196 data organization 17, 248–249 data outlier 134 data ownership 7, 145, 182–183 data partitioning 7, 13, 35–38, 135 partition key sub-partitioning 38 data perturbation 204–205 data pipeline 136–137 dependency management 137 Index ease of use 137 ownership 137 scheduling 137 visualization 137 workflow 137 workflow orchestration 182 data policy 249 data pollution 248 data preparation 160 data presentation 24, 186 data privacy 202 data processing 254 data product 247 data publishing 253 data quality 186–187 accuracy 187 completeness 187 timeliness 187 trustworthiness 187 data querying 24, 43, 68, 83, 108, 152 data reconciliation 136 data replication 16, 47, 102, 225 data reservoir 248–250 data retention 4, 46, 135–136, 183, 240 data sampling 6, 28–29, 258 data sharding 37–39 data shuffling 13, 72 data science 155–178 data science model deployment 163 data science model performance 162–163 accuracy 162 confusion matrix 162 F1 score 162 PR curve 162 precision, sensitivity 162 receiver operating characteristics (ROC) curve 162 specificity 162 data science model validation 162–163 bootstrap validation 162 cross validation 162 hold-out validation 162 k-fold cross-validation 162 leave one out validation 162 data science modeling 161–162 data science life cycle 158–164 data science operationalizing 163–164 data security 2, 199–218, 235 authentication 201 authorization 201–202 data breach 2, 202, 235 data leak 2, 235 data protection 2, 207, 260 data splits 12, 88, 129 data standard 249 data storage 6, 59, 221–222 cloud storage 53, 221–223 hybrid storage 59, 223–224 on-premise storage 44, 221 data swamp 147, 248 data synthesis 252 data transfer 125–132, 202, 222, 255 streaming transfer 127 table partition loading 126–127 table reloading 126 timestamping 127–128, 135 data transformation 134–136 data aggregation 135 data filtering 134 data joining 134 data mapping 134 data masking 135, 206 data projection 134 data validation 135 data understanding 159, 291 data versioning 183, 231–232 data visualization 24, 152–153 305 306 Index data warehouse 42–43, 55, 60–61, 179 columnar storage 55 provisioned data warehouses 56 serverless data warehouse 56–57 virtual data warehouse 57 intermediate node 57 leaf node 57 deployment pipelines 9, 163 directed acyclic graph (DAG) 74, 110, 112, 116, 139–143, 182 disaster recovery divide and conquer 34 divide et impera 34 docker 124, 176–177 domain boundary 246 domain-driven design (DDD) 245–246 domain driven data set (DDDS) 145–146 domain driven pipeline (DDP) 145 e ElasticSearch 128, 174, 189 Embulk 128–129 Embulk decoder 129 Embulk encoder 129 Embulk executor 129 Embulk filter 129 Embulk formatter 129 Embulk input 128 Embulk output 128 Embulk parser 129 ephemeral cluster 60 event sourcing 240–242 eventual consistency 23 extract, transform, load (ETL) 5, 12, 22–23, 60, 181 ETL engine ETL job 60, 83 ETL pipeline 150 f Facebook 65 feature engineering 161 file system 49, 203, 225, 242 Fluentd 122–125, 227 Fluentd aggregator 125 Fluentd bit 124 Fluentd data pipeline 123–124 Fluentd input 123 Fluentd parser 123–124 Fluentd filter 124 Fluentd buffer 124 Fluentd router 124 fluentdouput 124 Fluentd record 123 fully qualified domain name (FQDN) 48 g graph processing 71, 108 Google 15–17 h heartbeat 16, 48, 234 Hadoop distributed file system (HDFS) 15–16 DataNode 15–17, 44–48 EditLog 16, 45 FsImage 16, 45 NameNode 15–17, 45, 201 HDFS daemon 48 HadoopMapReduce 17–20, 65–70, 129 holistic decision 21 http referrer 28 status code 28 user agent 28 hyperparameter tuning 178 Index i idempotency 135 imputation 160 information leak 160 infrastructure as a service (IAAS) 53 j JanusGraph 196–197 Java 51, 212, 267 Java database connectivity (JDBC) 70 Jenkins 137–138 Jenkins node 137 Jenkins pipeline 137 Jenkins stage 137 Jenkins step 137 job submission 18 Jupyter notebook 177 just a bunch of disk (JBOD) 44 k Kappa architecture 242–244 keep it simple stupid (kiss) 27 Kerberos 48, 209 key-value 23, 95, 263 key performance indicator (KPI) 152, 271 Kubernetes 78, 99, 105, 108, 124, 141–142, 224 Kubeflow 177–178 Kubeflow katip 178 Kubeflow metadata 178 Kubeflow pipelines 177 l lambda architecture 94–97, 227 leader node 56 lifecycle management deprecation designing developing maintenance 9, 149 planning lightweight directory access protocol (LDAP) 192, 201, 209, 212–215 link prediction 156 collaborative filtering 156 content-based filtering 156 local development 6, 247, 251 m magnetic tapes mandatory access control (MAC) 200 MapReduce 12 map function 12 reduce function 12 message delivery guarantees 97, 110, 115–116 at-least once 97, 115–116, 125 at-most once 115, 125 exactly once 110, 115–116, 125 message-oriented middleware 91, 241–242, 250 metabase 152 metabase dashboard 152 metabase pulse 152 metabase x-ray 152 metacat 189–191 metadata crawling 6, 252 metadata indexing 190–192 metadata search 7, 181, 190–193 metadata store 187, 190–192 micro-batch 63, 118–119, 126, 226 minimum viable product (MVP) 222 MLflow 175–177 MLflow model registry 177 MLflow models 176–177 MLflow projects 176 MLflow run 176 MLflow tracking 175–176 307 308 Index MongoDB 225 monitoring 7–8, 131–132, 139 multitenancy 18, 100 MySQL 86, 139 n Nagios 232 NEO4J 192–193, 247 network file system 46 NoSQL 83, 225 o object storage 55–59, 223 offline processing 24, 63–65 on-premise storage 44, 220, 250 open database connectivity (ODBC) p pattern discovery 157–158 clustering 158 co-occurrence grouping 158 similarity matching 158 personally identifiable information 184, 206 platform as a service (PAAS) 53 PostgreSQL 34–39, 129 common table expressions 34 foreign data wrapper 37 indexing 35 predictive analytics 156–157 decision trees 157 linear regression 157 logistic regression 157 random forest 178 Presto 83–87 Presto coordinator 83 Presto data location API 83 Presto data sink API 83 Presto data source API 83 Presto metadata API 83 Presto system design 84–87 70 ANSISQL 84 cooperative multitasking model 86 overcommitting 87 phased scheduling 86 Presto fault-tolerance 87 Presto physical execution plan 84–85 Presto logical execution plan 84–85 privacy regulations/acts 207 California consumer privacy act (CCPA) 207 general data protection regulation (GDPR) 207 Prometheus 175, 234 publish/subscribe 91–92, 100 python 165–167 flask 28–29, 193 matplotlib 166 NumPy 166 pandas 166 pytorch 177 requests 166 scikit-learn 166, 228 SciPy 166 SQLAlchemy 176 tabulate 166 q query optimization 58, 70 multi-way join 70 predicate pushdown 70, 85 projection pruning 70 r R 164–165 RabbitMQ 105, 142 rack failure 16 real-time analytics 21 real-time processing 24 Redis 128–129, 142, 153 Index redundant array of inexpensive disk (RAID) 44 regression analysis 24 relational database management system (RDMS) 83 replica 16, 263 reporting 7, 25, 32, 44 resiliency 16, 44, 224 resource management 4–5, 137, 141 prioritization 4, 18 queuing 5, 47 resource allocation 18, 116 resource sharing 4–5 RocksDB 95–97 round-robin 93, 101 row-oriented database 23 s scheduler 18, 86, 115, 136 schema 181, 231 schema registry 231–232 single source of truth (SSOT) 43–44, 144–145 searching 25, 181, 190, 194 security assertion markup language (SAML) 201 Security-Enhanced Linux (SELinux) 200 Seldon 175 Sentry 143 service level agreement (SLA) 5, 47, 230–231 service level objective (SLO) 240, 246 statsD 143 storage layer 4, storage reclaiming StormMQ 105 stream processing 89–108, 227 bounded stream 108 unbounded stream 108 t technology pollution 147–148 TensorFlow 167–171 dataflow executor 170–171 kernel 170 operation 170 tensor 170 TensorFlow Keras 169 TensorFlow layer 169 TensorFlow runtime 170 tensor processing unit (TPU) 170–171 testing 8–9, 137, 235–236 a/b testing 235–236 integration test kernel panic load test packet lost performance test pipeline testing 137 split-brain stubbing test suit unit test transport layer security (TLS) 48 u unix command awk 29–32 cron 33 cut 29 git 33 gzip 31 mail 32 sort 30, 32 uniq 30 zgrep 29–31 user defined aggregation function (UDAF) 68 user defined function (UDF) 65 user flow 28 309 310 Index v y vagrant 49–50 vendor lock-in 222–223 yahoo 15, 65 yet another resource negotiator (YARN) 18–20, 78, 96–97, 108 ApplicationsManager 18–19 ApplicationMaster 18–19 node manager 18–19 resource manager 18–19, 45 resource model 18–19 resource request 18–19 yarn daemon 49 w watermarking 90, 108, 131 windowing 90–91, 104 sliding window 104 tumbling window 95 word count problem 12 write ahead log (WAL) 102 ... Data Update Notification 185 Data Presentation 186 Data Governance 186 Data Governance Overview 186 Data Quality 186 Metadata 187 Data Access 187 Data Life Cycle 188 Big Data Governance 188 Data. .. Data Data Data Deploy Designing big data platforms How to use Deploy Maintain Big data systems How to use deploy maintain How to Use Deploy Maintain Designing How Maintain Big data systems Big. .. 1.2.2.8 An Introduction: What’s a Modern Big Data Platform Defining Modern Big Data Platform Fundamentals of a Modern Big Data Platform Expectations from Data Ease of Access Security Quality Extensibility

Định dạng
Số trang	327
Dung lượng	7,25 MB