Aytas, Y. (2021). Designing Big Data Platforms: How to Use, Deploy and Maintain Big Data Systems.

Table of Contents

Cover
Title Page
Copyright
List of Contributors
Preface
Acknowledgments
Acronyms
Introduction

1 An Introduction: What's a Modern Big Data Platform
  1.1 Defining Modern Big Data Platform
  1.2 Fundamentals of a Modern Big Data Platform
2 A Bird's Eye View on Big Data
  2.1 A Bit of History
  2.2 What Makes Big Data
  2.3 Components of Big Data Architecture
  2.4 Making Use of Big Data
3 A Minimal Data Processing and Management System
  3.1 Problem Definition
  3.2 Processing Large Data with Linux Commands
  3.3 Processing Large Data with PostgreSQL
  3.4 Cost of Big Data
4 Big Data Storage
  4.1 Big Data Storage Patterns
  4.2 On‐Premise Storage Solutions
  4.3 Cloud Storage Solutions
  4.4 Hybrid Storage Solutions
5 Offline Big Data Processing
  5.1 Defining Offline Data Processing
  5.2 MapReduce Technologies
  5.3 Apache Spark
  5.4 Apache Flink
  5.5 Presto
6 Stream Big Data Processing
  6.1 The Need for Stream Processing
  6.2 Defining Stream Data Processing
  6.3 Streams via Message Brokers
  6.4 Streams via Stream Engines
7 Data Analytics
  7.1 Log Collection
  7.2 Transferring Big Data Sets
  7.3 Aggregating Big Data Sets
  7.4 Data Pipeline Scheduler
  7.5 Patterns and Practices
  7.6 Exploring Data Visually
8 Data Science
  8.1 Data Science Applications
  8.2 Data Science Life Cycle
  8.3 Data Science Toolbox
  8.4 Productionalizing Data Science
9 Data Discovery
  9.1 Need for Data Discovery
  9.2 Data Governance
  9.3 Data Discovery Tools
10 Data Security
  10.1 Infrastructure Security
  10.2 Data Privacy
  10.3 Law Enforcement
  10.4 Data Security Tools
11 Putting All Together
  11.1 Platforms
  11.2 Big Data Systems and Tools
  11.3 Challenges
12 An Ideal Platform
  12.1 Event Sourcing
  12.2 Kappa Architecture
  12.3 Data Mesh
  12.4 Data Reservoirs
  12.5 Data Catalog
  12.6 Self‐service Platform
  12.7 Abstraction
  12.8 Data Guild
  12.9 Trade‐offs
  12.10 Data Ethics
Appendix A: Further Systems and Patterns
  A.1 Lambda Architecture
  A.2 Apache Cassandra
  A.3 Apache Beam
Appendix B: Recipes
  B.1 Activity Tracking Recipe
  B.2 Data Quality Assurance
  B.3 Estimating Time to Delivery
  B.4 Incident Response Recipe
  B.5 Leveraging Spark SQL Metrics
  B.6 Airbnb Price Prediction
Bibliography
Index
End User License Agreement

List of Tables

Chapter 4
  Table 4.1 Comparison of big data storage patterns
Chapter 10
  Table 10.1 Gateway to component mapping

List of Illustrations

Chapter 2
  Figure 2.1 MapReduce execution steps
  Figure 2.2 HDFS architecture
  Figure 2.3 YARN architecture
  Figure 2.4 Components of Big Data architecture
Chapter 4
  Figure 4.1 Provisioned data warehouse architecture
  Figure 4.2 Tree data warehouse architecture
  Figure 4.3 Virtual warehouse architecture
Chapter 5
  Figure 5.1 Offline Big Data processing overview
  Figure 5.2 Pig compilation and execution steps
  Figure 5.3 Pig Latin to MapReduce
  Figure 5.4 Hive architecture overview
  Figure 5.5 Spark RDD flow
  Figure 5.6 Narrow vs wide transformations
  Figure 5.7 Spark layers
  Figure 5.8 Spark execution plan
  Figure 5.9 Spark high‐level architecture
  Figure 5.10 Spark stages
  Figure 5.11 Spark cluster in detail
  Figure 5.12 Presto architecture
  Figure 5.13 Presto logical plan
  Figure 5.14 Presto stages
Chapter 6
  Figure 6.1 Average page views by five minutes intervals
  Figure 6.2 A message broker
  Figure 6.3 Kafka topic
  Figure 6.4 Kafka offset
  Figure 6.5 Kafka producer/consumer
  Figure 6.6 Samza job structure
  Figure 6.7 Samza architecture
  Figure 6.8 Anatomy of Kafka Streams application
  Figure 6.9 Pulsar topic subscription modes
  Figure 6.10 Pulsar architecture
  Figure 6.11 Pulsar functions programming model
  Figure 6.12 Pulsar functions worker
  Figure 6.13 Flink architecture
  Figure 6.14 Flink barriers
  Figure 6.15 Flink task scheduling
  Figure 6.16 Execution graph
  Figure 6.17 Storm layers
  Figure 6.18 Storm spouts and bolts
  Figure 6.19 Storm architecture
  Figure 6.20 Heron architecture
  Figure 6.21 Spark streaming micro‐batches
Chapter 7
  Figure 7.1 Flume agent deployment
  Figure 7.2 Fluentd data pipeline
  Figure 7.3 Fluentd deployment
  Figure 7.4 Gobblin architecture
  Figure 7.5 Data aggregation stages
  Figure 7.6 Celery executor architecture
  Figure 7.7 Domain‐driven data sets
Chapter 8
  Figure 8.1 Data science life cycle
  Figure 8.2 A sample data science model deployment
  Figure 8.3 k‐means usage segments
  Figure 8.4 TensorFlow architecture
  Figure 8.5 Apache PredictionIO architecture
Chapter 9
  Figure 9.1 Metacat architecture
  Figure 9.2 Amundsen architecture
  Figure 9.3 Atlas types
  Figure 9.4 Atlas architecture
Chapter 10
  Figure 10.1 Ranger access conditions
  Figure 10.2 Ranger architecture
  Figure 10.3 Sentry architecture
  Figure 10.4 Knox services
  Figure 10.5 Knox architecture
Chapter 11
  Figure 11.1 Big Data platform verticals
Chapter 12
  Figure 12.1 Shopping cart events
  Figure 12.2 Kappa architecture
  Figure 12.3 Subdomain data sets
  Figure 12.4 Multiple data reservoirs
  Figure 12.5 Data catalog feedback loop
  Figure 12.6 Self‐service Big Data platform overview
Appendix A
  Figure A.1 Lambda architecture
  Figure A.2 Cassandra architecture
  Figure A.3 Apache Beam programming model
  Figure A.4 Apache Beam pipeline
  Figure A.5 Apache Beam ParDo processing
Appendix B
  Figure B.1 Data ingestion components
  Figure B.2 Computation components
  Figure B.3 Streaming reference architecture
  Figure B.4 Streaming feature vector Given a trained ML model and feature ve
  Figure B.5 The time‐to‐delivery estimation data flow
  Figure B.6 Spark SQL metrics
  Figure B.7 Spark SQL metrics pipeline
  Figure B.8 Airbnb histogram

Designing Big Data Platforms
How to Use, Deploy and Maintain Big Data Systems

Yusuf Aytas
Dublin, Ireland

This edition first published 2021
© 2021 John Wiley and Sons, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Yusuf Aytas to be identified as the author of this work has been
asserted in accordance with law.

Registered Office: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office: 111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.

Library of Congress Cataloging‐in‐Publication Data Applied for
... two ideas: Big Data collections and Big Data objects. Big Data collections are streamed by remote sensors as well as satellites. The challenge is pretty similar to today's Big Data, where data is unstructured