Handbook of big data analytics methodologies

IET COMPUTING SERIES 37 Handbook of Big Data Analytics IET Book Series on Big Data–Call for Authors Editor-in-Chief: Professor Albert Y Zomaya, University of Sydney, Australia The topic of big data has emerged as a revolutionary theme that cuts across many technologies and application domains This new book series brings together topics within the myriad research activities in many areas that analyse, compute, store, manage, and transport massive amounts of data, such as algorithm design, data mining and search, processor architectures, databases, infrastructure development, service and data discovery, networking and mobile computing, cloud computing, high-performance computing, privacy and security, storage, and visualization Topics considered include (but not restricted to) IoT and Internet computing; cloud computing; peer-to-peer computing; autonomic computing; data centre computing; multi-core and many core computing; parallel, distributed, and high-performance computing; scalable databases; mobile computing and sensor networking; Green computing; service computing; networking infrastructures; cyber infrastructures; e-Science; smart cities; analytics and data mining; big data applications, and more Proposals for coherently integrated International co-edited or co-authored handbooks and research monographs will be considered for this book series Each proposal will be reviewed by the Editor-in-Chief and some board members, with additional external reviews from independent reviewers Please email your book proposal for the IET Book Series on Big Data to Professor Albert Y Zomaya at albert.zomaya@sydney.edu.au or to the IET at author_support@theiet.org Handbook of Big Data Analytics Volume 1: Methodologies Edited by Vadlamani Ravi and Aswani Kumar Cherukuri The Institution of Engineering and Technology Published by The Institution of Engineering and Technology, London, United Kingdom The Institution of Engineering and Technology is registered as a Charity in England & Wales (no 211014) and Scotland (no SC038698) † The Institution of Engineering and Technology 2021 First published 2021 This publication is copyright under the Berne Convention and the Universal Copyright Convention All rights reserved Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may be reproduced, stored or transmitted, in any form or by any means, only with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licences issued by the Copyright Licensing Agency Enquiries concerning reproduction outside those terms should be sent to the publisher at the undermentioned address: The Institution of Engineering and Technology Michael Faraday House Six Hills Way, Stevenage Herts, SG1 2AY, United Kingdom www.theiet.org While the authors and publisher believe that the information and guidance given in this work are correct, all parties must rely upon their own skill and judgement when making use of them Neither the authors nor publisher assumes any liability to anyone for any loss or damage caused by any error or omission in the work, whether such an error or omission is the result of negligence or any other cause Any and all such liability is disclaimed The moral rights of the authors to be identified as authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988 British Library Cataloguing in Publication Data A catalogue record for this product is available from the British Library ISBN 978-1-83953-064-7 (hardback Volume 1) ISBN 978-1-83953-058-6 (PDF Volume 1) ISBN 978-1-83953-059-3 (hardback Volume 2) ISBN 978-1-83953-060-9 (PDF Volume 2) ISBN 978-1-83953-061-6 (2 volume set) Typeset in India by MPS Limited Printed in the UK by CPI Group (UK) Ltd, Croydon Contents About the editors About the contributors Foreword Foreword Preface Acknowledgements Introduction The impact of Big Data on databases Antonio Sarasa Cabezuelo 1.1 The Big Data phenomenon 1.1.1 Big Data Operational and Big Data Analytical 1.1.2 The impact of Big Data on databases 1.2 Scalability in relational databases 1.2.1 Relational databases 1.2.2 The limitations of relational databases 1.3 NoSQL databases 1.3.1 Disadvantages of NoSQL databases 1.3.2 Aggregate-oriented NoSQL databases 1.3.3 MongoDB: an example of documentary database 1.3.4 Cassandra: an example of columnar-oriented database 1.4 Data distribution models 1.4.1 Sharding 1.4.2 Replication 1.4.3 Combining sharding and replication 1.5 Design examples using NoSQL databases 1.6 Design examples using NoSQL databases 1.6.1 Example 1.6.2 Example 1.6.3 Example 1.6.4 Example 1.7 Conclusions References xiii xv xxi xxiii xxv xxvii xxix 6 11 12 13 14 15 15 17 18 19 19 19 21 26 29 30 31 vi Handbook of big data analytics, volume 1: methodologies Big data processing frameworks and architectures: a survey Raghavendra Kumar Chunduri and Aswani Kumar Cherukuri 37 2.1 2.2 37 39 39 39 Introduction Apache Hadoop framework and Hadoop Ecosystem 2.2.1 Architecture of Hadoop framework 2.2.2 Architecture of MapReduce 2.2.3 Application implemented using MapReduce: concept generation in formal concept analysis (FCA) 2.3 HaLoop framework 2.3.1 Programming model of HaLoop 2.3.2 Task scheduling in HaLoop 2.3.3 Caching in HaLoop 2.3.4 Fault tolerance 2.3.5 Concept generation in FCA using HaLoop 2.4 Twister framework 2.4.1 Architecture of Twister framework 2.4.2 Fault tolerance in Twister 2.5 Apache Pig 2.5.1 Characteristics of Apache Pig 2.5.2 Components of Apache Pig 2.5.3 Pig data model 2.5.4 Word count application using Apache Pig 2.6 Apache Mahout 2.6.1 Apache Mahout features 2.6.2 Applications of Mahout 2.7 Apache Sqoop 2.7.1 Sqoop import 2.7.2 Export from Sqoop 2.8 Apache Flume 2.8.1 Advantages of Flume 2.8.2 Features of Flume 2.8.3 Components of Flume 2.9 Apache Oozie 2.10 Hadoop 2.11 Apache Spark 2.11.1 Spark Core 2.11.2 Driver program 2.11.3 Spark Context 2.11.4 Spark cluster manager 2.11.5 Spark worker node 2.11.6 Spark resilient distributed datasets (RDDs) 2.11.7 Caching RDDs 2.11.8 Broadcast variables in spark 2.11.9 Spark Datasets 46 53 54 54 55 55 56 56 57 58 59 59 59 60 60 61 61 61 62 62 63 63 63 64 64 66 66 67 69 69 69 70 70 70 71 71 72 Contents vii 2.11.10 Spark System optimization 2.11.11 Memory optimization 2.11.12 I/O optimization 2.11.13 Fault tolerance optimization 2.11.14 Data processing in Spark 2.11.15 Spark machine learning support 2.11.16 Spark deep learning support 2.11.17 Programming layer in Spark 2.11.18 Concept generation in formal concept analysis using Spark 2.12 Big data storage systems 2.12.1 Hadoop distributed file system 2.12.2 Alluxio 2.12.3 Amazon Simple Storage Services—S3 2.12.4 Microsoft Azure Blob Storage-WASB 2.12.5 HBase 2.12.6 Amazon Dynamo 2.12.7 Cassandra 2.12.8 Hive 2.13 Distributed stream processing engines 2.13.1 Apache Storm 2.13.2 Apache Flink 2.14 Apache Zookeeper 2.14.1 The Zookeeper data model 2.14.2 ZDM—access control list 2.15 Open issues and challenges 2.15.1 Memory management 2.15.2 Failure recovery 2.16 Conclusion References 73 73 74 74 75 78 79 80 82 85 86 88 88 88 88 88 89 89 90 90 92 94 95 96 97 97 97 98 98 The role of data lake in big data analytics: recent developments and challenges T Ramalingeswara Rao, Pabitra Mitra and Adrijit Goswami 3.1 Introduction 3.1.1 Differences between data warehouses and data lakes 3.1.2 Data lakes pitfalls 3.2 Taxonomy of data lakes 3.2.1 Data silos 3.2.2 Data swamps 3.2.3 Data reservoirs 3.2.4 Big data fabric 3.3 Architecture of a data lake 3.3.1 Raw data layer 3.3.2 Data ingestion layer 105 105 107 108 108 108 109 109 109 110 110 111 viii Handbook of big data analytics, volume 1: methodologies 3.3.3 Process layer 3.3.4 Ingress layer 3.3.5 Responsibilities of data scientists in data lakes 3.3.6 Metadata management 3.3.7 Data lake governance 3.3.8 Data cataloging 3.4 Commercial-based data lakes 3.4.1 Azure data lake environment 3.4.2 Developing a data lake with IBM (IBM DL) 3.4.3 Amazon Web Services (AWS) Galaxy data lake (GDL) 3.5 Open source-based data lakes 3.5.1 Delta lake 3.5.2 BIGCONNECT data lake 3.5.3 Best practices for data lakes 3.6 Case studies 3.6.1 Machine learning in data lakes 3.6.2 Data lake challenges 3.7 Conclusion References 111 112 112 113 113 114 114 114 115 115 116 116 116 117 118 120 120 120 121 Query optimization strategies for big data Nagesh Bhattu Sristy, Prashanth Kadari and Harini Yadamreddy 125 4.1 Introduction 4.1.1 MapReduce preliminaries 4.1.2 Organization of the chapter 4.2 Multi-way joins using MapReduce 4.2.1 Sequential join 4.2.2 Shares approach 4.2.3 SharesSkew 4.2.4 Q-Join 4.3 Graph queries using MapReduce 4.3.1 Counting triangles 4.3.2 Subgraph enumeration 4.4 Multi-way spatial join 4.5 Conclusion and future work References 126 127 127 127 129 130 132 135 138 138 140 147 153 153 Toward real-time data processing: an advanced approach in big data analytics Shafqat Ul Ahsaan, Harleen Kaur and Sameena Naaz 157 5.1 5.2 Introduction Real-time data processing topology 5.2.1 Choosing the platform 5.2.2 Entry points 5.2.3 Data processing infrastructure 157 159 159 159 159 Contents 5.3 5.4 Streaming processing Stream mining 5.4.1 Clustering 5.4.2 Classification 5.4.3 Frequent 5.4.4 Outlier and anomaly detection 5.5 Lambda architecture 5.6 Stream processing approach for big data 5.6.1 Apache Spark 5.6.2 Apache Flink 5.6.3 Apache Samza 5.6.4 Apache Storm 5.6.5 Apache Flume 5.6.6 Apache Kafka 5.7 Evaluation of data streaming processing approaches 5.8 Conclusion Acknowledgment References A survey on data stream analytics Sumit Misra, Sanjoy Kumar Saha and Chandan Mazumdar 6.1 6.2 6.3 ix 160 161 161 161 162 162 162 163 163 164 167 167 168 169 172 172 172 173 175 Introduction Scope and approach Prediction and forecasting 6.3.1 Future direction for prediction and forecasting 6.4 Outlier detection 6.4.1 Future direction for outlier detection 6.5 Concept drift detection 6.5.1 Future direction for concept drift detection 6.6 Mining frequent item sets in data stream 6.6.1 Future direction for frequent item-set mining 6.7 Computational paradigm 6.7.1 Future direction for computational paradigm 6.8 Conclusion References 175 177 178 179 180 182 183 187 187 190 191 196 197 198 Architectures of big data analytics: scaling out data mining algorithms using Hadoop–MapReduce and Spark Sheikh Kamaruddin and Vadlamani Ravi 209 7.1 7.2 7.3 7.4 Introduction Previous related reviews Review methodology Review of articles in the present work 7.4.1 Association rule mining/pattern mining 209 211 214 217 217 Overall conclusions Vadlamani Ravi and Aswani Kumar Cherukuri This volume discussed, at length, various aspects of big data, including all the dimensions, viz five Vs These aspects encompass data ingestion, metamorphosis in the database design, data storage, cloud/fog environment, various frameworks, parallelization, and scaling out various analytical/machine learning models and human–computer interaction in the big data analytics era, and finally streaming data analytics This is one compendium that brought out all those important concepts under one roof Future directions include performing edge intelligence in IoT at scale, large-scale time series data mining, anomaly detection in big datasets involving volume, velocity and variety aspects, hybridizing horizontal and vertical parallelism, i.e a cluster of GPU-based servers, developing parallel and distributed versions of several evolutionary computing algorithms and finally developing fully automated algorithms that detect and eliminate noise in big data and thereby churning out the signal, i.e important part of the big data, which will have several ramifications in many domains Concerted efforts are also needed in the direction of increasing convergence speeds of deep learning architectures using the hybrid parallelization and distributed paradigm of big data Even though this is achieved now using a cluster of GPU-based servers, a lot is left to be desired because they typically consume huge training times in domains like climate modelling, molecular modelling, etc Index accessibility, rationale for 336–7 accessible data visualization 339 Accidents 221 AdaBoost 182 Adaptive Distributed Extreme Learning Machine (A-ELM*) 228 aggregate-oriented NoSQL databases 12–13 Airline on-time dataset 226 Alluxio 88 Amazon Dynamo 88–9 Amazon kinesis 111 Amazon Simple Storage Services, S3 88 Amazon Web Services (AWS) 115 ANOVA 228 ant colony optimization (ACO) algorithm 227 Apache Flink 92–3, 164–7, 177 Apache Kafka 93 Architecture of Kafka 93–4 Apache Flume 63, 168–9 advantages of 63–4 architecture of 65 components of 64 data flow model in 65 features of 64 Apache Hadoop 319, 321–3 Apache Hadoop framework and Hadoop Ecosystem 39 architecture of 39 MapReduce, application implemented using 46–53 MapReduce, architecture of 39–42 executing map phase 42–3 reduce phase execution 45–6 shuffling and sorting 43–5 Apache Hadoop MapReduce 212 Apache Kafka 111, 169–72, 192 Apache Mahout 61 applications of 61–2 features 61 Apache Oozie 66 Apache Pig 59 characteristics of 59 components of 59 pig data model 60 word count application using 60–1 Apache Samza 167 Apache Spark 67–8, 163–4, 177, 220, 319, 322 broadcast variables in spark 71–2 caching RDDs 71 concept generation in formal concept analysis using Spark 82–5 data processing in Spark 75 example of GraphX 77–8 features of Spark GraphX 76–7 Spark GraphX 76 Spark streaming 75–6 driver program 69 fault tolerance optimization 74 I/O optimization 74 data compression and sharing 74 data shuffling 74 memory optimization 73–4 programming layer in Spark 80 DataFrames in Spark SQL 81–2 PySpark 80 SparkR 80–1 348 Handbook of big data analytics, volume 1: methodologies Spark SQL 81 Spark cluster manager 70 Spark Context 69 Spark Core 69 Spark Datasets 72 Spark deep learning support 79–80 CaffeOnSpark 80 Deeplearning4j (DL4j) 80 Spark machine learning support 78 example of MLlib 79 keystone ML 79 Spark MLlib 78–9 Spark resilient distributed datasets (RDDs) 70–1 Spark System optimization 73 decentralized task scheduling 73 scheduler optimization 73 Spark worker node 70 Apache-Spark-distributed processing environment 319 Apache Spark framework 212 Apache Sqoop 62 export from Sqoop 63 Sqoop import 62–3 Apache Storm 90, 167–8, 177 architecture of 90–2 Apache Zookeeper 94 ZDM—access control list 96–7 Zookeeper data model 95–6 approximate kernel fuzzy c-means (akFCM) 238 approximate kernel possibilistic c-means (akPCM) 238 AprioriPMR 220 artificial immune system (AIS) 246 artificial intelligence (AI) 335 for accessibility 337 accessible data visualization 339 AI-based exoskeletons 339 assisting deaf and hard of hearing 338–9 enabling smart environment through IoT for persons with disabilities 339–40 perception porting 337–8 artificial neural network (ANN) 182 ARtool 221 association rule mining/pattern mining Hadoop MapReduce-based conference papers 220–3 Hadoop MapReduce-based journal papers 217–19 Spark-based conference papers 223–4 Spark-based journal papers 219–20 Auto-Associative Extreme Learning Machine (AAELM) 225 auto-associative neural network (AANN) 236 autoregressive integrated moving average (ARIMA) 178–9 Azure data lake environment 114–15 backpropagation neural network (BPNN) 225 batched online sequential ELM (BPOS-ELM) training 229 batch processing 166 Bayesian network (BN) classifiers 230 Bayesian predictive methods 177 best practices for data lakes 117–18 BigAnt 221 BIGCONNECT data lake 116–17 BigCross dataset 250 big data 298–9 in cloud computing 299 merits and demerits 299 Big Data Analytical 3–5 big data fabric 109–10 Big Data Operational 3–5 Big Data phenomenon 2–5 big data processing platforms 210 Big data storage systems 85 Alluxio 88 Amazon Dynamo 88–9 Amazon Simple Storage Services, S3 88 Cassandra 89 Hadoop distributed file system 86–7 HBase 88 Index Hive 89–90 Microsoft Azure Blob StorageWASB 88 binary Q-join 135–7 BirdVis 194 BMSWebView2 224 bounded data streams 166 Brinkhoff 239 broadcast variables in spark 71–2 BSON 13 budgeted mini-batch parallel gradient descent algorithm (BMBPGD) 235 CaffeOnSpark 80 CAIDA UCSD dataset 227 canopy-FCM 241 Car Evaluation 249 Cassandra 14–15, 89 categorical data 183 CDMC2012 246, 252 Celestial Spectral Dataset 217 CFI-Stream 189 CFS 251 change point detection (CPD) 186 Chebyshev function link ANN (CFANN) 182 Chest-clinic network 250 Chi-FRBCS-BigData 233 Chi-FRBCS-BigData-Ave 233 Chi-FRBCSBigData-Max 233 classification Hadoop MapReduce-based conference papers 231–5 Hadoop MapReduce-based journal papers 227–30 Spark-based conference papers 235–7 Spark-based journal papers 230–1 Cleveland Heart Disease 252 Client document 29 Closed Enumerated Tree (CET) 189 closed frequent item sets (CFIs) 189 CloStream 189 349 cloud, fog and edge computing with the IoT application architectural view of 312–14 cloud computing with IoT applications 299 applications of IoT 303–5 cloud computing importance 302–3 cloud offloading strategies 303 merits and demerits 305 ClueWeb dataset 250 cluster computing 318 distributed computing frameworks Apache Hadoop 321–2 Apache Spark 322 MapReduce 321 Microsoft’s DMTK and CNTK 322–3 gaps identified in the existing research work 323 peer-to-peer computing 321 using commodity hardware (Raspberry Pi) load balancing algorithms 320–1 utility computing 321 clustering Hadoop MapReduce-based conference papers 239–44 Hadoop MapReduce-based journal papers 237–9 Spark-based conference papers 244–6 Spark-based journal papers 239 clustering toolkit (CLUTO) 238 CNTK (Cognitive Toolkit) 319, 323 commercial-based data lakes 114 Amazon Web Services (AWS) 115 Azure data lake environment 114–15 developing data lake with IBM 115 Galaxy data lake (GDL) 115–16 computational approach to models 176 computational paradigm 191–6 concept adapting VFDT (CVFDT) 186 concept drift detection 177–9, 183–7 350 Handbook of big data analytics, volume 1: methodologies concept generation in formal concept analysis using Spark 82–5 Connect-4 221, 223, 249 connected car 304 connected health 304 Connectionist 247 connectivity-based marketing 304 Connector API 172 Consumer API 172 cooperative PSO 250 counting triangles 138–40 Covertype 238, 249 Cubes document 28 cyber security, challenges of 313–14 DARPA 246 databases, impact of Big Data on Big Data phenomenon 2–5 data distribution models 15 combining sharding and replication 18–19 replication 17–18 sharding 15–16 design examples using NoSQL databases 19–30 NoSQL databases aggregate-oriented 12–13 Cassandra 14–15 disadvantages of 11–12 MongoDB 13–14 scalability in relational databases limitations of relational databases 7–9 relational databases 6–7 data cataloging 114 data distribution models 15 combining sharding and replication 18–19 replication 17–18 sharding 15–16 data lake, architecture of 110 data cataloging 114 data ingestion layer 111 data lake governance 113–14 data scientists responsibilities in data lakes 112–13 ingress layer 112 metadata management 113 process layer 111–12 raw data layer 110–11 data lake challenges 120 data lake in big data analytics 105 architecture of data lake 110 data cataloging 114 data ingestion layer 111 data lake governance 113–14 data scientists responsibilities in data lakes 112–13 ingress layer 112 metadata management 113 process layer 111–12 raw data layer 110–11 commercial-based data lakes 114 Amazon Web Services (AWS) 115 Azure data lake environment 114–15 developing data lake with IBM 115 Galaxy data lake (GDL) 115–16 data lakes pitfalls 108 data warehouses and data lakes, differences between 107–8 open source-based data lakes 116 best practices for data lakes 117–18 BIGCONNECT data lake 116–17 Delta lake 116 taxonomy of data lakes 108 big data fabric 109–10 data reservoirs 109 data silos 108–9 data swamps 109 data lakes (DLs) Big data fabric 109–10 data reservoirs 109 data silos 108–9 data swamps 109 pitfalls 108 Index taxonomy of 108 data processing infrastructure 159 analytics layer 159 filtering layer 159 formality layer 159 storing layer 160 data processing in Spark 75 example of GraphX 77–8 features of Spark GraphX 76–7 Spark GraphX 76 Spark streaming 75–6 data reservoirs 109 data scientist 112–13 data scientists responsibilities in data lakes 112–13 data silos 108–9 data stream analytics (DSA) 175 computational paradigm 191–6 concept drift detection 183–7 mining frequent item sets in data stream 187–91 outlier detection 180–3 prediction and forecasting 178–80 scope and approach 177–8 data stream ingestion (DSI) 192 data streaming processing approaches, evaluation of 172 data stream models 176 data stream processing (DSP) 192–3 data stream processing systems (DSPSs) 177, 191–2 data stream resource management (DSRM) 193–4 Data Stream Resource Manager 193 data stream storage (DSS) 194 data swamps 109 data transport (DT) 192 data visualization tools data warehouses and data lakes, differences between 107–8 deaf and hard of hearing, assisting 338–9 deep learning (DL) 178, 182 Deeplearning4j (DL4j) 80 Delta lake 116 351 density-based forecasting 177 Design examples using NoSQL databases 19–30 directed acyclic graph (DAG) 191–2 DIrect Update (DIU) tree 189 disaster recovery data centre (DRDC) 194 distributed computing frameworks 319, 321–3 distributed data processing 319 distributed density-based clustering (DDC) algorithm 243 distributed density peaks clustering algorithm with locality-sensitive hashing (LSH-DDP) 238 distributed ELM (DELM) 228 distributed grid-based clustering (DGC) algorithm 243 distributed keyword search (DKS) 252 distributed processing 183 distributed stream processing engines 90 Apache Flink 92–3 Apache Kafka 93 Architecture of Kafka 93–4 Apache Storm 90–2 distribution of papers 269 DMTK (Distributed Machine Learning Toolkit) 319, 322–3 drift adaptation 185 drift detection 183, 185 in rule-based systems 187 drift understanding 185, 187 Driver program 69 dynamic neighborhood selection (DNS) clustering algorithm 242 ECBDL’14 235 e-Commerce customer dataset 326 ECUESpam 235 edge computing 297 benefits of 310–11 future of 312 in IoT applications 311–12 352 Handbook of big data analytics, volume 1: methodologies EdgeJoin 141–2 efficient parallel ELM (PELM) 224 ego-Facebook dataset 222 ego-Twitter dataset 222 Elapsed-Time-based Dynamic Passes Combined-counting 219 Electrical 240 EMSPOS 223 Enron dataset 238 Ensemble KF 179 Epsilon 235, 251 Erdo˜s–Re´nyi algorithm 231 estDec 188 Euclidean distance 181 extended Fiduccia–Mattheyses algorithm 238 extended Kalman filter (EKF) 178 extreme learning machine (ELM) 226 failure recovery 97–8 fast-mRMR 251 Fault Tolerance and Recovery 194 fault tolerance optimization 74 feature extraction 176 FedCSIS AAIA’14 DM competition dataset 234 FiDoop-HD 217 field grouping 168 field-programmable gate array (FPGA) 215 filtering procedure for the selection of articles to review 216 “5Vs” Flink 194 Flink model 166 Flink zkNN (F-zkNN) algorithm 233 Flixster dataset 248 FLOating Rough Approximation (FLORA) 186 Flume 111 fog computing 297, 305 definition 307 layered architecture gateway 310 monitoring layer 309 physical and virtualisation layer 308–9 preprocessing layer 309 security layer 309 transport layer 310 network architecture 308 fog computing framework for Big Data processing 317 future work 332 implementation details Spark fog cluster evaluation 326–9 using resource constraint device (Raspberry Pi) 324–6 literature survey 319 cluster computing 320–1 distributed computing frameworks 321–3 gaps identified in the existing research work 323 peer-to-peer computing 321 utility computing 321 results and discussion 329–32 system description 323–4 fog–IoT–cloud architecture 313 fog nodes 297 forecasting 177–80 formal concept analysis (FCA) concept generation in 46–53 Fourier Inspired Windows for Concept Drift (FIWCD) 186 frequent item-set mining (FIM) 182, 187–8, 190–1, 219 frequent item sets (FIS) mining 187–91 frequent patterns (FPs) 217 Frequent Patterns Mining (FPM) 221 Friedman test 228 FRULER 225 FSM-H 217, 221 future directions 260–8 fuzzy-based clustering algorithms 245 fuzzy clustering 161 fuzzy c-means clustering algorithm implemented with MapReduce framework (MR-FCM) 238 Index fuzzy rule-based classification system (FRBCS) 228 Galaxy data lake (GDL) 115–16 GA optimized DT algorithm (MR-GAOT) 234 Gauss Distribution Set 238 general-purpose graphics processing unit (GPGPU) 215 genetic algorithms (GAs) 222 genome-sequence-encoding schema 236 Gisette 235 global grouping 168 glowworm swarm optimization (GSO) 241 Google 240 Google MillWheel 194 Google Search 299 GraphGen 217, 221 graph queries using MapReduce 138 counting triangles 138–40 subgraph enumeration 140–1 EdgeJoin 141–2 SEED Join 145–6 Star Join 143–4 TwinTwig Join 144–5 graph visualization tools 194 GraphX 164 Hadoop 66–7 Hadoop-based implementation of the k-NN (H-zkNN) 233 Hadoop-based KFCM (H-KFCM) 245 Hadoop distributed file system 86–7 Hadoop MapReduce-based conference papers 251–2 Hadoop MapReduce-based journal papers 249–50 HaLoop framework 53 caching in HaLoop 55 concept generation in FCA using HaLoop 56 fault tolerance 55 programming model of HaLoop 54 353 task scheduling in HaLoop 54–5 Handwritten dataset 251 hard clustering 161 HBase 88 Heart-Statlog 251 Hepatitis 252 Higgs 251 high-availability clusters 318 high-performance clusters 318 high utility itemset (HUI)-Miner algorithm 219 Hill Valley 247 Hitachi LDL 118 Hive 89–90 HK-medoids 241 Hoeffding-based decision trees 186 horizontal scaling 210 human–computer interaction (HCI) 336 Hypercube 240 Hypothesis testing (HT) 186 i2MapReduce 250 IaaS (infrastructure as a service) model 300 IBM Quest Dataset Generator 221 IBM WebSphere MQ (IBM MQ) 192 IDS-MRCPSO 247 imbalanced classes 177 IM-K-means algorithm 241 Improved MapReduce Apriori (IMRApriori) algorithm 220 Improved Parallel Association Rule Based on Combination 223 incremental algorithms 159 incremental learning algorithm for ESVM (IESVM) 227 industrial Internet 304 InfoGain 251 InputSplit 43 Internet of Things (IoT) 297 devices 210 intrusion detection (ID) 227 intrusion detection system (IDS) 246 Ionosphere 247 354 Handbook of big data analytics, volume 1: methodologies I/O optimization 74 data compression and sharing 74 data shuffling 74 Isolet 251 JAWBONE 303 join tree algorithm 238 Kalman filter (KF) 178 Karp–Papadimitriou–Shenker (KPS) algorithm 189 KDD99 240, 251 KDDB 251 KDD Cup 1999 dataset 228 KDDCUP99 246 Kent Ridge Breast Cancer 223 K-means clustering algorithm 243–4 K-medoids clustering 240 K-modes 243 k-nearest neighbor (k-NN) 226 k-NN algorithm 233 Kosarak dataset 224 Kruskal–Wallis test 228 Kyoto 246, 252 lambda architecture 162–3 Lambda Architecture (LA) 191 LC-CloStream algorithm 190 Letter Recognition dataset 234 linear regression algorithm 326 literal FCM (LFCM) 245 live captions 338–9 load balancing algorithms 320–1 load balancing clusters 318 logistic regression machine learning algorithm 326 Lookout 338 LOSSY COUNTING 188, 190 Lumada data lake (LDL) 105 machine cluster machine learning (ML) 209 in data lakes 120 models 176 MapReduce 211–12, 321 application implemented using 46–53 architecture of 39 executing map phase 42–3 reduce phase execution 45–6 shuffling and sorting 43–5 preliminaries 127 MapReduce, and Voting-based instance selection (MRVIS) 229 MapReduce-based Apriori algorithms 219 mass processing systems master–slave model 17–18 MATLAB 7.0 246 maximal frequent item sets (MFI) 189 Medical Expenditure Panel Survey 226 medium window (MW) 186 memory management 97 memory optimization 73–4 Mesos 194 metadata management 113 MicroBlog-Latent Dirichlet Allocation (MB-LDA) 251 Microsoft Azure Blob Storage-WASB 88 Microsoft Message Queuing 192 Microsoft’s DMTK and CNTK 322–3 mining frequent item sets 177–8 MixSim R Package 241 MOMENT 189 MongoDB 13–14 Movielens dataset 248 MrAdam 221 MRK-means 239 MR-PFP 222 MRQAR 220 MR-Rocchio algorithm 234 MRSMRS 222 multi-label k-NN (ML-k-NN) 231 multi-objective ABC (MOABC) algorithm 252 multiple linear regression (MLR) 225–6 Index multi-way joins using MapReduce 127–8 sequential join 129–30 shares approach 130–2 sharesSkew 132–5 Q-Join 135 binary Q-join 135–7 multi-way Q-join 137–8 multi-way spatial join 147–52 multi-way Q-join 137–8 Mushroom 221, 223 Mushrooms 249 Musk 247 mutual information feature selection method based on spark framework (sf-MIFS) 236 Netflix competition dataset 251 neural network (NN) 179, 182 NewMOMENT 189 NewSQL databases 1998 DARPA Intrusion Detection Evaluation Set 219 NoSQL databases aggregate-oriented 12–13 Cassandra 14–15 disadvantages of 11–12 MongoDB 13–14 NSL-KDD 246, 252 one-hot-encoding 183 OpenFlow protocol 320 OpenPlanet 225 Open source-based data lakes 116 best practices for data lakes 117–18 BIGCONNECT data lake 116–17 Delta lake 116 outlier and anomaly detection 162 outlier detection 180–3 outlier detection/intrusion detection system Hadoop MapReduce-based conference paper 247 Hadoop MapReduce-based journal paper 246 355 Spark-based conference papers 247–8 Spark-based journal papers 246 Oxford Buildings Dataset 238 PaaS (platform as a service) model 300 PaMPa-HD 223 Pampas 223 parallel backpropagation neural network (PBPNN) 230 Parallel Binary Bat Algorithm 246 Parallel Block FP-Growth (PBFP-Growth) 222 parallel ensemble of online sequential ELM (PEOS-ELM) algorithm 229 parallel feature reduction algorithm 250 parallel genetic algorithm based on Spark (PGAS) 250 Parallel Highly Informative k-ItemSet (PHIKS) 217, 222 parallel high utility itemset (PHUI)Miner algorithm 219 parallelized version of extreme SVM (PESVM) 227 parallel methods of FIM (PARMA) 219 parallel mining algorithm for the constrained FP (PACFP) 219 parallel processing 318 parallel PSO (PPSO) 243 parallel random forest (PRF) algorithm 230 Parallel Weighted Itemset (PaWI) Mining algorithm 223 particle swarm optimization (PSO) 236 partition key 15 PatchWork 245 Patient Treatment Time Prediction algorithm 225 peer-to-peer (P2P) computing 319, 321 356 Handbook of big data analytics, volume 1: methodologies PELT (pruned exact linear time) algorithm 186 perception porting 337–8 Phrase Nets 194 Pig data model 60 Pima Indian Diabetes 251 Pi Stack 320 Poker 251 PokerHand 228, 233, 249 PostgreSQL 194 prediction and forecasting 178–80 primary key 15 processing high-dimensional data 182 Producer API 170 production data centre (PR-DC) 194 Programming layer in Spark 80 DataFrames in Spark SQL 81–2 PySpark 80 SparkR 80–1 Spark SQL 81 proximal SVM 228 PSOAANN 236 PubChem 217, 221 PubChem website 220 public cloud 299 PySpark 80 quantitative analysis of drift 184 query optimization strategies for big data 125 graph queries using MapReduce 138 counting triangles 138–40 subgraph enumeration 140–6 MapReduce preliminaries 127 multi-way joins using MapReduce 127–8 sequential join 129–30 shares approach 130–2 sharesSkew 132–5 Q-Join 135–138 multi-way spatial join 147–52 RabbitMQ 192 Raspberry Pi 320, 324–6 reading quorum 19 real-time data processing 157 data streaming processing approaches, evaluation of 172 lambda architecture 162–3 real-time data processing topology 159 choosing platform 159 data processing infrastructure 159 entry points 159 streaming processing 160–1 stream mining 161 classification 161 clustering 161 frequent 162 outlier and anomaly detection 162 stream processing approach for Big Data 163 Apache Flink 164–7 Apache Flume 168–9 Apache Kafka 169–72 Apache Samza 167 Apache Spark 163–4 Apache Storm 167–8 real-time data processing topology 159 choosing platform 159 data processing infrastructure 159 analytics layer 159 filtering layer 159 formality layer 159 storing layer 160 entry points 159 recommendation Hadoop MapReduce-based conference papers 248–9 Hadoop MapReduce-based journal paper 248 Spark-based conference papers 249 Record Linkage Comparison Patterns (RLCPs) dataset 228 RecordReader 43 recurrence of drift 187 Index Redis 194 Reduced-Apriori (R-Apriori) 224 regression-based prediction and forecasting systems 179 regression/prediction/forecasting Hadoop MapReduce-based conference papers 225–6 Hadoop MapReduce-based journal papers 224–5 Spark-based conference papers 226–7 Spark-based journal papers 225 relational databases 6–9 vs NoSQL databases 11 RELIEF-F 251 replication 17–18 combining sharding and 18–19 resilient distributed datasets (RDDs) 70–1, 164, 322 caching 71 restricted Boltzmann machines (RBMs) 182 Retail dataset 224 Reuters dataset 238 review methodology 214–17 Round Robin load balancing algorithm 320 R programming language RuleMR 233 SaaS (software as a service) model 300 scalability in relational databases limitations of relational databases 7–9 relational databases 6–7 scalable fast evolutionary algorithm for clustering (SF-EAC) 243–4 Scalable Random Sampling with iterative optimization FCM algorithm (SRSIO-FCM) 245 Scala programming language 326 scale up 210 scaling 210 SDSS star spectrum datasets 219 357 Secure Shell (SSH) connection 324 SEED Join 145–6 Seeing AI 338 self-organizing map (SOM) clustering 244 semantic-driven subtractive clustering method 238 semi-supervised learning 180–1 sequential join 129–30 sequential minimal optimization (SMO) 233 S-FRULER 225 sharding 15–16 combining replication and 18–19 shares approach 130–2 sharesSkew 132–5 shuffle grouping 168 Shuttle dataset 251 SignWriting 339 SIMPLifying and Ensembling (SIMPLE) framework 239 Sina Weibo dataset 250 singular value decomposition with stochastic gradient descent (SVD-SGD) 248 16S rRNA dataset 237 SKIP-LCSS 190 small window (SW) 186 smart city 303 smart environment, enabling through IoT for persons with disabilities 339–40 smart farming 304–5 smart gateway 310 smart grids 303 smart homes 303 smart retail 304 smart supply chain 304 SMOTE algorithm 235 SMRF algorithm 231 SMR-Kmeans 242 Snort logs 219 social media sites 298 soft clustering 161 software-defined networking (SDN) 320 358 Handbook of big data analytics, volume 1: methodologies SPAB-DKMC algorithm 243 SpamBase 227, 246, 252 SPARK 191 Spark-based conference paper 252 Spark-based journal papers 250–1 Spark-based KFCM (S-KFCM) 244–5 Spark cluster manager 70 Spark Context 69 Spark Core 69 Spark Datasets 72 Spark deep learning support 79–80 CaffeOnSpark 80 Deeplearning4j (DL4j) 80 Spark fog cluster evaluation 326–9 Spark-gram 224 Spark GraphX 76 Spark machine learning support 78 example of MLlib 79 keystone ML 79 Spark MLlib 78–9 Spark ML 322 Spark MLlib 78–9, 164, 322 SparkR 80–1 Spark SQL 81, 164 Spark Streaming 164 Spark System optimization 73 decentralized task scheduling 73 scheduler optimization 73 Spark worker node 70 specific programming languages splitting-based data balancing method (SplitBal) 230 Sqoop 111 Stanford Network Analysis Project (SNAP) 222 Star Join 143–4 static algorithms 318 statistical models 176, 180 STORM 191 StoryFlow 194 stream grouping 168 streaming processing 160–1 stream mining 161 classification 161 clustering 161 frequent 162 outlier and anomaly detection 162 STREAM MINING 189 stream processing approach for Big Data 163 Apache Flink 164–7 Apache Flume 168–9 Apache Kafka 169–72 Apache Samza 167 Apache Spark 163–4 Apache Storm 167–8 Streams API 172 subgraph enumeration 140–1 EdgeJoin 141–2 SEED Join 145–6 Star Join 143–4 TwinTwig Join 144–5 summarization 176 summingbird framework 163 supervised learning 180 support vector machine (SVM) 182 SVM-RFE 251 T10I4D100K 224 T25I10D10K 224 TDL for digital banking 118–19 Temenos data lake (TDL) 105 text visualization 194 Q-Join 135 binary Q-join 135–7 multi-way Q-join 137–8 time-sensitive computing 307 time-series data 178 top-k CFIs 189–90 TreeNetViz 194 20NewsGroup 235 TwinTwig Join 144–5 Twister framework 56 architecture of 57–8 fault tolerance 58–9 Twitter dataset 231, 248 Twitter Search API 236 TWMINSWAP-IS 190 Index UML diagram 22, 26 unbounded data streams 165 unsupervised learning 180–1 URL-Reputation 235 user BN (UBN) 250 USPS 251 utility computing 321 value Variable Size-based Fixed Passes Combined-counting 219 variety velocity veracity Vertical-Apriori MapReduce (VAMR) 222 vertical scaling 210 very fast decision tree (VFDT) 186 VoltDB 194 volume Washington State Inpatient Database 226 359 WBDC 247 wearable robotics 339 wearables 303 weighted ensemble classifier based on DELM (WE-DELM) 228 Weighted Label Propagation Algorithm with Probability Threshold (P-WLPA) algorithm 235 Wikipedia articles collection dataset 224 WikiTalk dataset 250 windowing technique 186 Wine 247 Wisconsin Breast Cancer 251 Word Clouds 194 writing quorum 19 Yet Another Resource Negotiator (YARN) 194 YouTube Faces dataset 245 ZooKeeper 194 ... Introduction The impact of Big Data on databases Antonio Sarasa Cabezuelo 1.1 The Big Data phenomenon 1.1.1 Big Data Operational and Big Data Analytical 1.1.2 The impact of Big Data on databases 1.2 Scalability... COMPUTING SERIES 37 Handbook of Big Data Analytics IET Book Series on Big Data? ??Call for Authors Editor-in-Chief: Professor Albert Y Zomaya, University of Sydney, Australia The topic of big data has emerged... 3.1.2 Data lakes pitfalls 3.2 Taxonomy of data lakes 3.2.1 Data silos 3.2.2 Data swamps 3.2.3 Data reservoirs 3.2.4 Big data fabric 3.3 Architecture of a data lake 3.3.1 Raw data layer 3.3.2 Data

Tiêu đề	Handbook of Big Data Analytics
Tác giả	Vadlamani Ravi, Aswani Kumar Cherukuri
Người hướng dẫn	Professor Albert Y. Zomaya
Trường học	University of Sydney
Thể loại	book
Năm xuất bản	2021
Thành phố	London

Định dạng
Số trang	390
Dung lượng	11,81 MB