Big Data Computing
A Guide for Business and Technology Managers

Chapman & Hall/CRC Big Data Series

SERIES EDITOR
Sanjay Ranka

AIMS AND SCOPE
This series aims to present new research and applications in Big Data, along with the computational tools and techniques currently in development. The inclusion of concrete examples and applications is highly encouraged. The scope of the series includes, but is not limited to, titles in the areas of social networks, sensor networks, data-centric computing, astronomy, genomics, medical data analytics, large-scale e-commerce, and other relevant topics that may be proposed by potential contributors.

PUBLISHED TITLES
BIG DATA COMPUTING: A GUIDE FOR BUSINESS AND TECHNOLOGY MANAGERS
Vivek Kale
BIG DATA OF COMPLEX NETWORKS
Matthias Dehmer, Frank Emmert-Streib, Stefan Pickl, and Andreas Holzinger
BIG DATA: ALGORITHMS, ANALYTICS, AND APPLICATIONS
Kuan-Ching Li, Hai Jiang, Laurence T. Yang, and Alfredo Cuzzocrea
NETWORKING FOR BIG DATA
Shui Yu, Xiaodong Lin, Jelena Mišić, and Xuemin (Sherman) Shen

Big Data Computing
A Guide for Business and Technology Managers
Vivek Kale

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2017 by Vivek Kale
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed on acid-free paper
Version Date: 20160426
International Standard Book Number-13: 978-1-4987-1533-1 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not
been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
Names: Kale, Vivek, author.
Title: Big data computing : a guide for business and technology managers / author, Vivek Kale.
Description: Boca Raton : Taylor & Francis, CRC Press, 2016. | Series: Chapman & Hall/CRC big data series | Includes bibliographical references and index.
Identifiers: LCCN 2016005989 | ISBN 9781498715331
Subjects: LCSH: Big data.
Classification: LCC QA76.9.B45 K35 2016 | DDC 005.7--dc23
LC record available at https://lccn.loc.gov/2016005989

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com

To Nilesh Acharya and family for unstinted support on references and research for my numerous book projects

Contents

List of Figures
List of Tables
Preface
Acknowledgments
Author

1. Computing Beyond the Moore’s Law Barrier While Being More Tolerant of Faults and Failures
    1.1 Moore’s Law Barrier
    1.2 Types of Computer Systems
        1.2.1 Microcomputers
        1.2.2 Midrange Computers
        1.2.3 Mainframe Computers
        1.2.4 Supercomputers
    1.3 Parallel Computing
        1.3.1 Von Neumann Architectures
        1.3.2 Non-Neumann Architectures
    1.4 Parallel Processing
        1.4.1 Multiprogramming
        1.4.2 Vector Processing
        1.4.3 Symmetric Multiprocessing Systems
        1.4.4 Massively Parallel Processing
    1.5 Fault Tolerance
    1.6 Reliability Conundrum
    1.7 Brewer’s CAP Theorem
    1.8 Summary

Section I Genesis of Big Data Computing

2. Database Basics
    2.1 Database Management System
        2.1.1 DBMS Benefits
        2.1.2 Defining a Database Management System
            2.1.2.1 Data Models alias Database Models
    2.2 Database Models
        2.2.1 Relational Database Model
        2.2.2 Hierarchical Database Model
        2.2.3 Network Database Model
        2.2.4 Object-Oriented Database Models
        2.2.5 Comparison of Models
            2.2.5.1 Similarities
            2.2.5.2 Dissimilarities
    2.3 Database Components
        2.3.1 External Level
        2.3.2 Conceptual Level
        2.3.3 Physical Level
        2.3.4 The Three-Schema Architecture
            2.3.4.1 Data Independence
    2.4 Database Languages and Interfaces
    2.5 Categories of Database Management Systems
    2.6 Other Databases
        2.6.1 Text Databases
        2.6.2 Multimedia Databases
        2.6.3 Temporal Databases
        2.6.4 Spatial Databases
        2.6.5 Multiple or Heterogeneous Databases
        2.6.6 Stream Databases
        2.6.7 Web Databases
    2.7 Evolution of Database Technology
        2.7.1 Distribution
        2.7.2 Performance
            2.7.2.1 Database Design for Multicore Processors
        2.7.3 Functionality
    2.8 Summary

Section II Road to Big Data Computing

3. Analytics Basics
    3.1 Intelligent Analysis
        3.1.1 Intelligence Maturity Model
            3.1.1.1 Data
            3.1.1.2 Communication
            3.1.1.3 Information
            3.1.1.4 Concept
            3.1.1.5 Knowledge
            3.1.1.6 Intelligence
            3.1.1.7 Wisdom
    3.2 Decisions
        3.2.1 Types of Decisions
        3.2.2 Scope of Decisions
    3.3 Decision-Making Process
    3.4 Decision-Making Techniques
        3.4.1 Mathematical Programming
        3.4.2 Multicriteria Decision Making
        3.4.3 Case-Based Reasoning
        3.4.4 Data Warehouse and Data Mining
        3.4.5 Decision Tree
        3.4.6 Fuzzy Sets and Systems
    3.5 Analytics
        3.5.1 Descriptive Analytics
        3.5.2 Predictive Analytics
        3.5.3 Prescriptive Analytics
    3.6 Data Science Techniques
        3.6.1 Database Systems
        3.6.2 Statistical Inference
        3.6.3 Regression and Classification
        3.6.4 Data Mining and Machine Learning
        3.6.5 Data Visualization
        3.6.6 Text Analytics
        3.6.7 Time Series and Market Research Models
    3.7 Snapshot of Data Analysis Techniques and Tasks
    3.8 Summary

4. Data Warehousing Basics
    4.1 Relevant Database Concepts
        4.1.1 Physical Database Design
    4.2 Data Warehouse
        4.2.1 Multidimensional Model
            4.2.1.1 Data Cube
            4.2.1.2 Online Analytical Processing
            4.2.1.3 Relational Schemas
            4.2.1.4 Multidimensional Cube
    4.3 Data Warehouse Architecture
        4.3.1 Architecture Tiers
            4.3.1.1 Back-End Tier
            4.3.1.2 Data Warehouse Tier
            4.3.1.3 OLAP Tier
            4.3.1.4 Front-End Tier
    4.4 Data Warehouse 1.0
        4.4.1 Inmon’s Information Factory
        4.4.2 Kimball’s Bus Architecture
    4.5 Data Warehouse 2.0
        4.5.1 Inmon’s DW 2.0
        4.5.2 Claudia Imhoff and Colin White’s DSS 2.0
    4.6 Data Warehouse Architecture Challenges
        4.6.1 Performance
        4.6.2 Scalability
    4.7 Summary

5. Data Mining Basics
    5.1 Data Mining
        5.1.1 Benefits
    5.2 Data Mining Applications
    5.3 Data Mining Analysis
        5.3.1 Supervised Analysis
            5.3.1.1 Exploratory Analysis
            5.3.1.2 Classification

Index

Descriptive analytics, 66
Design autonomy, 141
Directed acyclic graph (DAG), 305
    scheduler, 305
Directed/supervised model, 102
Direct marketing campaigns, 104
Discriminant analysis, 75
Distributed applications, 281, 297
Distributed common object model (DCOM), 163
Distributed computing,
Distributed database management system (DDBMS), 44, 138–139, 141
Distributed databases, 138–152
    characteristics, 140–141
        autonomy, 141
        availability and reliability, 140–141
        scalability and partition tolerance, 141
        transparency, 140
    goal, 151
Distributed memory
    abstraction, 304
    systems, 127, 127f
Distributed Shared Memory (DSM), 304
Distributed systems, 123–138, 233, 380
    advantages, 142
    architectural styles, 129
    characteristic global features, 123–124
    concurrency control and recovery in, 146–148
    data replication and allocation, 146
    defined, 123, 128
    disadvantages, 142–145
    distributed computing, 128–138
        software architectural styles, 130–135
        system architectural styles, 129–130
        technologies for, 135–138
    parallel computing, 125–128
    query processing and optimization in, 148–149
    rules for, 151–152
    stack layers, 128–129
        application, 129
        hardware, 128
        middleware, 129
        operating system, 128
    transaction management in, 149–151
Distribution transparency, 140
Document
    CouchDB, 268
    databases, 253, 266–274
    data model, 43
    level sentiment analysis, 399
    MongoDB, 268–274
        CRUD operations, 272
        data model, 270–272
        distributed systems characteristics, 272–274
        features, 269–270
    text analysis, 393
        features, 393–395
Document-centric databases, 249
Document-oriented databases, 247–248
Drill, 302
    optimizer, 302
Drill-down operation, 86
Drill-up operation, 86
Durability transaction property, 17, 26, 296
Durkheim, E., 382–383
Dynamic connectivity, ESB, 171–172
Dynamic, discoverable, metadata driven, 157
Dynamism, 342
Dynamo system, 265

E

Ego/Egocentric network, 390–391
Eigenvector centrality, 385
EJB, see Enterprise JavaBeans (EJB)
Elasticity, 180
Electronic Frontier Foundation’s TRUSTe, 339
Elementary set, 121
End-to-end security capabilities, 173
Enhanced Data Rate for GSM Evolution (EDGE), 405
Enhanced observed time difference (E-OTD), 436, 437t
Enterprise application integration (EAI), 160, 168
Enterprise applications
    architecture, 358f
    elements, 354–355
    in J2EE, 355f
Enterprise data warehouse (EDW), 91, 97
Enterprise JavaBeans (EJB)
    container, 356
    Entity Beans, 357
    Session Beans, 357
Enterprise service bus (ESB), 167–175, 169f
    characteristics, 170–175
    event-driven nature, 174–175
    key capabilities, 171–174
    scalability, 174
Entity Beans EJB, 357
Entity–relationship (ER) model, 27, 32, 80, 82
Environmental sensors, 462
ESB, see Enterprise service bus (ESB)
Estimation predictive model, 102
Ethical challenge, LBS, 441
ETL, see Extraction-transformation-loading (ETL)
Euler, Leonhard, 378
Event-driven
    architectures, 171
    nature of ESB, 174–175
    SOA, 175
Eventual consistency, 252
Evolutionary algorithms (EA), 120–121
Execution autonomy, 141
Executive information systems (EIS), 50
Existential queries, 34–35
Explicit knowledge, 57
Exploratory analysis, 106
Exploratory techniques, 327
Extensible Markup Language (XML), 156, 160, 163, 256
    vocabulary, 165
Extraction-transformation-loading (ETL), 79, 91–93, 97

F

Facebook, 386–387
Fast data, 279
Fault tolerant system, 12–13
Feature extractor, 117
Fielding’s REST, see REpresentational state transfer (REST)
File, 231
Financial decision pattern, 455–457
    assets, 456
    cash flow planning, 455–456
    profitability, 456
    ratios, 455–456
Firewalls, 373
Fitness function, 121
Flickr, 361, 389
Flip-flop circuit,
Floating-point operations,
Flume, 300
Flynn, Michael J., 125
Flynn’s taxonomy, 126
Folksonomy, 361
Forecasting, 77
Formal scientific method, 68
Forms-based interfaces, 41–42
Fragmentation transparency, 140
Framework, 279
Fraud detection, 106
Free format, 393
Friendster, 383, 389
Front-end tiers, 92f, 93
Functionality migration, 49
Functional programming, 308–311
    advantages, 310–311
    characteristics, 309–310
    disadvantages, 311
    paradigm, 227–229
        data parallelism vs task parallelism, 228
        parallel architectures and computing models, 228
Futon, web interface, 268
Fuzzy decision trees, 65
Fuzzy sets/systems, 65, 120, 120f

G

Generally accepted accounting principles (GAAPs), 457
General Packet Radio Service (GPRS), 405
General purpose computing on graphics processing units (GPGPU), 128
Generations of communication systems, 402–406
    1G: analog, 402
    2.5G: GPRS, EDGE, and CDMA 2000, 405
    2G: CDMA, TDMA, and GSM, 402–405
    3G: wCDMA, UMTS, and iMode, 406
    4G, 406
Generative grammar, 71
Genetic algorithms (GAs), 119
Gensim, 314–315
Geocoding, 436
Geographic databases, 45
Geographic information systems (GISs), 438
Geography markup language (GML), 424
Global deadlock, 143
Global-level density, 384
Global pointers, 161
Global Positioning System (GPS), 426, 445
Global System for Mobile Communications (GSM), 404–405
Global transaction manager, 149
GNU S language, 319
Goal programming, 60
Google, 14–15
    Bigtable, 232–233, 233t, 260
    MapReduce, 229–233, 230f
Google+, 388
Google Analytics, 375
Google Android, 407–408
Google File System (GFS), 256
Governance, 333, 336
    resultant functions, 336
    services, 336–341
        privacy, 338–339
        security, 337–338
        security risks, 340–341
        trust, 339–340
Gradient-descent algorithm, 304
Granovetter, Mark, 391
Graph
    databases, 254, 274–277
        Neo4j, 254, 275–277
        OrientDB, 274–275
    data model, 43
    stores/databases, 248
    theory, 378
Graphical model, context, 451
Graphical user interface (GUI), 41–42, 321
GraphX, 308
Greenplum, data store, 243
GSM (Global System for Mobile Communications), 404–405
GUI (graphical user interface), 41–42, 321

H

Hadoop, 227, 234, 326
    business problem resolving, 235
    cluster, 227, 279
    data analytics, 236–237
    distribution, 240–243, 279
        CDH, 243
        criteria, 240–242
        HDP, 243
        MapR, 243
        Pivotal HD, 243
    ecosystem, 281–282, 282f, 283t
        frameworks, 282
    framework platform, 279
    MapReduce, 284–291, 285f
        changes, 287
        enhancements and extensions, 286
        iterative process, termination, 288
        key problems, 286–287
        processing, 284–286
    processes, types, 280–281
Hadoop alias YARN, 238–240
    HDFS storage, 239
    MapReduce processing, 239–240, 240f
Hadoop Distributed File System (HDFS), 242, 257, 293–295
    block placement policy, 290
    characteristics, 293–295, 295f
    file system, characteristics, 238
    and MapReduce, common architectural principles, 237–238
HadoopHBase, 246
Hadoop++ system, 289
Handset-based approach, 436
Hashing, 27
Hash partitioning method, 218
HAWQ technology, 243
HBase, 281, 295–297
    architecture, 296–297
        columns and column groups, 296–297
        rows, 296
        tables, 296
    CRUD operations, 262–263
    data model and versioning, 260–262
    storage and distributed system concepts, 263
HDFS, see Hadoop Distributed File System (HDFS)
Heartbeats method, 194
hi5, 389
Hierarchical cell structures (HCSs), 403
Hierarchical model/database model, 28, 30–32, 30f
Hierarchical schema, 34
High-level data models, 27
High-level defense system, 116–117
High-level DMLs, 40
High-performance commercial computing,
High-performance computing (HPC), 1, 240
High-performance data access, 255
Hive, 281, 297–298
    designing goals, 297
    metastore, 301
Hive Query Language (HiveQL), 298
Horizontal partitioning method, 100, 255
Horizontal scalability, 141
Hortonworks Data Platform (HDP), 243
Host-based defense systems, 116
Hotspots, 447
HousingMaps.com, 361
HTTP (hypertext transfer protocol), 165–166, 381
Hub-and-spoke approach, 168, 170
Human intuition, 60
Hybrid clouds, 186
Hybrid OLAP (HOLAP), 86
Hyperic HQ, 334
Hyperlink mechanism, 47
Hypertext transfer protocol (HTTP), 165–166, 381

I

IaaS (Infrastructure as a Service), 182, 183f
Idealized network,
Illustrative CAA, 452–453
    adaptable, 453
    device-aware, 453
    location-aware, 452
    personalized, 453
    time-aware, 452
Image
    analysis, 106
    HDFS, 294
    recognition, 106
iMapReduce framework, 287
IMM (Intelligence Maturity Model), 55–59, 56t
Immune-system-based approach, 115
Impact assessments, 54
Impala, 301
    Parquet format, 301
Impala daemons (impalad), 301
Imperative programming, 309
Improved repartition join operation, 288
Incident management, cloud service provider, 329–330
Indexing techniques, 27, 81
Index-organized table (IOT), 249
Indoor localization technique, 436
Information
    IMM, 56
    overload, 363
    science, 55
    sources, 116
Information systems (IS), 49–50
Infrastructure as a Service (IaaS), 182, 183f
Ingestion process, 283
In-memory computing, 221
In-memory databases, 48
    systems, 244
Inmon, Bill, 93
Inmon’s DW 2.0, 95
Inmon’s information factory, 93–94
Input–output mapping function, 102
Input variables, data mining, 102
Instruction-level parallelism, 126
Integrated circuits (ICs), 13
Integrated Development Environment (IDE), 318
Integrated information systems, 153
Integration broker, 170, 174
Integration capabilities, ESB, 173
Intelligence, IMM, 58
Intelligence Maturity Model (IMM), 55–59, 56t
Intelligent analysis, 53–59
    IMM, 55–59, 56t
        communication, 55–56
        concept, 56
        data, 55
        information, 56
        intelligence, 58
        knowledge, 57–58
        wisdom, 58–59
Interactive analytics, 304
Interactive Python command shell, 314
Internet, 381
Internet of Things (IoT), 469–471
    applications, 470
    categories based on technological artifacts, 471
    goal, 470
Internet protocol (IP), 15, 116
Interoperability, 333–334
Interpreter architecture style, 134
Intrarack communication, 260
Intrusion detection systems (IDSs), 117
Inventory turnover period, 456
IoT, see Internet of Things (IoT)
IPython, 314
Isolation property, 17, 26, 296
Iterative algorithms, 305
Iterative jobs, 303

J

J2EE, see Java Enterprise Edition (J2EE) platform
Java Connector Architecture, 357
Java database connectivity (JDBC), 91
Java Enterprise Edition (J2EE) platform, 353
    application development advantages in, 353–354
    reference architecture in, 356–357
        access to EIS tier, 357
        distributed Java components, 357
        Entity Bean EJBs as business object components, 357
        JSP and Java Servlets as user interaction components, 356
        MVC mapping to, 357–358, 358f
        Session Bean EJBs as service-based components, 356–357
    tier in, 356
JavaMail API, 357
Java Message Service (JMS), 357
Java Naming and Directory Interface (JNDI), 357
JavaScript Object Notation (JSON), 253
JavaServer Pages (JSP), 356
Java Servlets, 356
Java Virtual Machine (JVM), 315
Jclouds library, 302
JDBC API, 357
JobTracker, 284–286
    responsibilities, 286
Journal, HDFS, 294
JSON (JavaScript Object Notation), 43

K

Kafka, 299–300
Kappa architecture, 347
Kerberos cryptography system, 116
Keyspace, column database, 257–258
Key structure model, context, 451
Key-value databases, 253, 263–266
    Amazon Dynamo, 265–266
    DynamoDB data model, 266
    Riak, data model, 264–265
Key-value data model, 43
Key-Value Stores (K-V Stores)/databases, 246, 349
Kimball’s bus architecture, 94–95
Knowledge discovery, 101
Knowledge, IMM, 57–58
Knowledge-intensive process, 392
Konigsberg Bridge problem, 378

L

Lambda architecture, 348–349, 348f
Lambda calculus, 308
Latent Dirichlet Allocation (LDA), 315
Latent Semantic Analysis (LSA), 315
Lateration principle, 427f
Layered architectural style, 133
Lazy record construction technique, 290
LBS, see Location-based services (LBS)
LBSN (location-based social networks), 441–443
Least-squares regression, 69
LIF (Location Interoperability Forums), 424
Linear predictor, 69
Linear programming, 63
Linear regression, 69–70
LinkedIn, 386
Lisp, functional programming language, 228
Load balancing, 333
Local deadlock, 143
Location
    enablement technologies, 437t
    transparency, 140
Location-based context, 449
Location-based services (LBS)
    characteristics, 435
    classification, 423–424, 424t
    overview, 432–434
    positioning technologies, 436, 437t, 438t
    QoS requirements, 434t
    system architecture, 437–438
        challenges, 439–441
        components, 439
    types, 433
Location-based social networks (LBSN), 441–443
Location-based systems
    satellite-based, 423
    sources
        cellular, 424–425
        classification, 424
        mobility data, 429–432
        multireference point, 426
        overview, 423–424
        tagging, 427–429
Location Interoperability Forums (LIF), 424
Log file analysis, 371
    veracity of, 374–375
        unique visitors, 374
        visit duration, 375
        visitor count, 374–375
Logging/audit trails, CSP, 330
Logical context, 449
Logical data independence, 36, 39–40
Logical design, database system, 80
Logic-based model, context, 451
Longitudinal/panel data analysis, 73
Long-running process and transaction capabilities, ESB, 172–173
Lower-level linkage mechanism, 32
Low-level models, 27
Low-level/procedural DMLs, 40–41

M

Machine learning, 70
    models, 297
    system, 114–117, 115f
        categories, 115–116
        cybersecurity systems, 116–117
        goal, 115
Machine-to-Machine (M2M) system, 471
Mainframe computers, 3,
Management and monitoring capabilities, ESB, 173
Mappings, 39
MapReduce, 15, 303, 347–348
    Google, 229–233
    Hadoop, 284–291, 285f
        changes, 287
        enhancements and extensions, 286
        iterative process, termination, 288
        key problems, 286–287
        processing, 284–286
    programming model, 222
    system, 245
Map reduce job, 284
Map task, 230
Market basket analysis, 104–105
Marketing mix models, 73–74
Markup scheme model, context, 451
Marz, Nathan, 348
Mashups, 361
Massively parallel processing (MPP), 5, 11–12, 216, 301
    database systems, 244
Master node, 239
Master server, 263
Master–slave
    architecture, 294
    model, 10–11
    replication, 218
    technique, 99
Master–worker model, 228
Mathematical programming, 63
Matplotlib, 314
Mauchly–Eckert–von Neumann concept,
Medical diagnosis, risk factors, 105–106
Memtable, 258
Menu-based interfaces, 41
Message-driven architectures, 171
Message exchange patterns (MEPs), 156
Message oriented middleware (MOM), 137
Message-oriented model, 162
Message Passing Interface (MPI), 240
Metadata, 24
    repository, 92
Metcalfe, Robert, 379
Metcalfe’s law, 379
Micro batches, 283
Microcomputers,
Microprocessor pipeline, 132
Microsoft Windows operating system, 116
Middleware, 135–136, 380
    message communication modes, 136–137
    technology, 138, 167
Midrange computers, 4–5
Minicomputers,
Mission-critical applications, 13
MLP (Mobile Location Protocol), 424
Mobile
    analytics
        classification, 420–421
        clustering, 418–419
        site, 418
        streaming, 421
        text, 419–420
    context-aware, 414–416
        context support for UI, 415–416
        ontology-based context model, 415
        overview, 414–415
    devices, apps, 42
    field cloud services, 412–414
    generations of communication systems, 402–406
    MWS, 408–412
        mobile field cloud services, 412–414
    OS, 406–408
        Apple iOS, 408
        BlackBerry OS, 407
        comparison, 407t
        Google Android, 407–408
        overview, 406
        Symbian, 406–407
        Windows Phone, 408
    web, 363
Mobile Location Protocol (MLP), 424
Mobile Web 2.0, 416–418
Mobile web services (MWS), 408–412
    federated context, 411
    identity, 410–411
    policy, 410
    requirements, 410–411
Mobility data
    mining, 430–432
        reconstruction, 430–431
        trajectory mapping, 431
    raw log entries, 430
    reconstruction, 430–431
    visualization, 431–432
Model deployment phase, data mining, 113–114
Model evaluation phase, data mining, 113
Modeling phase, data mining, 112–113
Model–View–Controller (MVC) architecture, 357–358, 358f
MongoDB, 268–274
    CRUD operations, 272
    data model, 270–272
    distributed systems characteristics, 272–274
    features, 269–270
Mongod, database process, 269
Moore’s law, 48
    barrier, 2–4, 2f, 3f
Motion/location sensors, 462
MPP, see Massively parallel processing (MPP)
Multiattribute decision making (MADM), 64
Multichannel access application, 160
Multicriteria decision making (MCDM), 64
Multidatabase system (MDBS), 139
Multidimensional cube, 84f, 88–90, 90f
Multidimensional model, DW, 83–90
Multidimensional OLAP (MOLAP), 86
Multilevel decision-making (MLDM) problems, 60
Multimaster replication technique, 99
Multimedia databases, 44
Multimedia Markup Language (MML), 438
Multiple/heterogeneous databases, 45
Multiple instruction, multiple data stream (MIMD) architecture, 125
Multiple instruction, single data stream (MISD) architecture, 125
Multiple invocation styles, 158
Multiple QoS capabilities, endpoint discovery, 172
Multiprogramming system, 10
Multireference point systems, 426
Multitenancy, 192–193
Multiuser DBMS, 25
Multivariate analysis, 106
Mutation process, 121
MWS, see Mobile web services (MWS)

N

NameNode, 239, 294
Name server, 231
Namespace, 262
National Institute of Standards and Technology (NIST), 177
Native connectors, 300
Natural language interfaces, 42
Natural language processing (NLP), 71, 314, 399
Natural Language Toolkit (NLTK), 314
Nelson, Ted, 381
Neo4j, 254, 275–277
    data model, 276–277
        indexing and node identifier, 277
        optional schema, 277
    features, 275–276
Netflix, 297
Netscape, 360
Network(s), 378–380
    analytics and mediation, 235
    cellular ID, 447
    complete/whole vs ego, 390
    computer, 380–382
        internet, 381
        WWW, 381–382
    one mode vs two mode, 390
    principles, 379–380
        Metcalfe’s law, 379
        power law, 379
        small worlds networks, 379–380
    schema, 34
    servers, transparency, 140
Network-based defense systems, 116
Network-based event, 116
Network/device-based context, 449
Network model/database model, 28, 30f, 32
NetworkX, 314
Neural networks (NN), 75
Neurons, 119
New information and communication technologies (NICTSs), 439
NodeManager, 280, 293
Node/vertex, graph database, 274
Nondatabase context, 23
non Neumann architectures,
Nonredundant allocation, 146
Nonrelational database, 251
Nonuniform error recovery methods, 22
not only SQL (NoSQL), 26, 48, 217, 220, 251
    databases, 245–249
        column-oriented stores/databases, 246
        comparison, 248–249
        document-oriented databases, 247–248
        graph stores/databases, 248
        K-V Stores/databases, 246
        subcategories, 245–246
    systems, characteristics, 254–256
        categories, 252–254
        data models and query languages, 256
        distributed systems and distributed databases, 254–256
NumPy, 313
Nutch, 233–234

O

Object-oriented
    architectural style, 133
    database, 24–25, 33
        models, 32–34
    model, 80
        context, 451
    system, 162
Object query language (OQL), 33
Object-relational databases, 24–25
Observational unit, 72
Observed time difference of arrival (OTDOA), 436
Odersky, Martin, 315
OLAP, see Online analytical processing (OLAP)
Oliphant, Travis, 313
OLTP, see Online transaction processing (OLTP)
One mode vs two mode networks, 390
One-to-many relationships, 34–35
Online analytical processing (OLAP), 79, 81, 346–347
    multidimensional model, 84–86, 85f
    tiers, 92f, 93
    vs OLTP, 82t
Online transaction processing (OLTP), 25, 81, 82t, 98, 346
Ontology, 56, 364, 415
Ontology-based context model, 415
Open database connectivity (ODBC), 91
Open Geospatial Consortium (OGC), 424
Open linking and embedding for databases (OLEDB), 91
Open Mobile Alliance (OMA), 411–412, 424
Operating systems (OS), 406–408
    Apple iOS, 408
    BlackBerry OS, 407
    comparison, 407t
    Google Android, 407–408
    layers, 406
    overview, 406
    security, 344
    Symbian, 406–407
    Windows Phone, 408
Operational database systems, 81–83, 83t
Operational decisions, 54–55
Operational Intelligence Analysis, 54–55
Operations management, big data systems, 328–346
    big data and cloud operations characteristics, 332
    cloud governance, risk, and compliance, 341–346
    core portfolio of functionalities, 328–332; see also Cloud service provider (CSP)
    core services, 332–334
        data governance, 333–334
        discovery and replication, 332–333
        load balancing, 333
        resource management, 333
    governance services, 336–341
    management services, 334–336
        authorization and authentication, 335
        deployment and configuration, 334
        fault tolerance, 335–336
        metering and billing, 335
        monitoring and reporting, 334
        SLA management, 334–335
Opinion mining, 72; see also Sentiment, analysis
Opportunity and threat (O&T) assessments, 54
Oracle forms, 42
Orchestration, cloud service provider, 330
Organizational decision making, 60
OrientDB, 274–275
Original equipment manufacturers (OEMs), 164
Orkut, 389
OS, see Operating systems (OS)
Outdoor localization technique, 436
Output Delivery System (ODS), 321
Output variables, data mining, 102
Overfitting, 76

P

PaaS (Platform as a Service), 182–183, 184f
Pandas package, 313
Parallel
    architectures, 215–216, 215f
    computing, 6–9, 125–128, 127f
        non Neumann architectures,
        von Neumann architectures, 8–9, 9f
    execution threads, 228
    processing, 9–12
        MPP, 11–12
        multiprogramming, 10
        SMP systems, 11
        vector processing, 10–11
Parallelism, 125
Parallelizing query processing,
Parquet format, 301
Partitioning techniques, 99–100
    and types, 268
Partition tolerance, 141, 251
Patches, 116
Pattern/trend analyses, 54
Pay-as-you-consume model billing, 335
Pay-as-you-go subscription billing, 335
Per-application ApplicationMaster, 292
Perez, Fernando, 314
Personal computer (PC),
Personal networks, 390
Per-split semijoin operation, 289
Pervasive computing system, 470
Physical context, 449
Physical data
    independence, 36, 40
    models, 27
Physical database design, 80–81
Physical layer, database, 36
Physical parent–child relationship, 32
Physical security, 343
Physiological sensors, 462
Pig, 281, 298–299
    complex types, 299
    designing goals, 299
    Latin, 299
Pipe-and-filter style, 132
Pipelining, 126–127, 132
Pivot operation, 86
Platform as a Service (PaaS), 182–183, 184f
Points of interest (POIs) discovery, 417–418
Poisson regression, 69
Policies, trust, 340
Policy management, cloud service provider, 329
Positioning determination technology (PDT), 423
    component, 439
Power law, 379
Predatabase information processing, 22
Predictable SLAs, 159
Predictive analytics, 66–67
Predictors/predictive model, 102
Preference list, 265–266
Prescriptive analytics, 67
Primary key/partition key, table, 258
    types, 266
Primitive linking schemes, 34
Privacy-preserving Data Mining (PPDM), 117
Privacy seal programs, 339
Private cloud computing, 185
Probes method, 194
Procedure-oriented language, 156
Processor-intensive program, 9–10
Program-data independence, 24
Programming
    flexibility, 299
    paradigms, types, 309
Program-operation independence, 24–25
Program-to-program protocols, 163
Proxy server, 373
Public cloud computing, 185–186
Pub-sub (publish, subscribe) framework, 299
Pulling, context acquisition, 450
Pure functions, 227
Pushed, context acquisition, 450
P-value, 68–69
PyPy, 315
Python, 313–315
    Beautiful Soup, 314
    Gensim, 314
    IPython, 314
    matplotlib, 314
    NetworkX, 314
    NLTK, 314
    NumPy, 313
    pandas package, 313
    PyPy, 315
    Scikit-learn, 313
    SciPy, 313
    statsmodels, 314

Q
Quality-assurance process, 14
Query, 369–370
    language, 41
    optimization techniques, 81
    router, sharding, 273
Query–update trade-off, 80
Quick ratio, 455

R

Rack switches, 260
Radio-frequency identification (RFID), 429, 471
Randomized testing, 67
Range partitioning, 218
Ranking systems, 362
Rapid application integration, 160
Rating systems, 362
Ratios of financial decision pattern, 455–456
    creditors to purchases, 456
    current, 455
    debtors to sales, 456
    inventory turnover period, 456
    quick, 455
Rattle GUI, 320
Reactive security solutions, 117
Read after write (RAW), 127
Read committed level, 296
Read repair, 258–259
Really simple syndication (RSS) technologies, 360
Real time analytics, 349
Real-time processing, 283
Record-at-a-time DMLs, 41
Record-based data models, 27
Redis, key/value data store, 253
Redundancy-based design technique, 14
Redundant arrays of independent disks (RAID), 14
Redundant hardware, 14
Redundant software components, 14
Řehůřek, Radim, 314
Reference architecture, 354–356
    business object, 356
    in J2EE, 356–357
        access to EIS tier, 357
        distributed Java components, 357
        Entity Bean EJBs as business object components, 357
        JSP and Java Servlets as user interaction components, 356
        Session Bean EJBs as service-based components, 356–357
    service-based, 355
    user interaction, 355
Referring URLs, 371–372
Regression, 70
    analysis, 75, 77, 107–108
    and classification, 69–70
Regulatory-criteria-driven requirements, 331
Relational database management system (RDBMS) system, 244
Relational marketing, 105
Relational model, 28–29, 80
Relational OLAP (ROLAP), 86
Relational schema, 34, 87–88, 87f, 88f, 89f
Relations, graph databases, 254
Relationship, graph database, 274
Reliability, 140
    conundrum, 14–15
Reliable messaging, ESB, 172
Remote Method Invocations (RMIs), 165
Remote procedure call (RPC), 163
Replica set concept, 272
Replication, 255
    factor, 295
    schema, 146
    transparency, 140
Repository architectural style, 131
Representational/implementation data models, 27
Representational model of document, 393–394
REpresentational state transfer (REST), 166–167
Reputation
    system, 362
    trust, 340
Research In Motion® (RIM), 407
Resilient Distributed Datasets (RDDs), 303, 306
Resource Description Framework (RDF), 364
Resource management, cloud, 33
Response time, 80
Restart paradigm, see Checkpoint
REST-oriented architecture (ROA), 167
Riak, data model, 264–265
Richardson, Leonard, 314
Rich Internet applications (RIA), 364
RightScale Cloud Management Platform, 334
Roll-down operation, 86
Roll-up operation, 86
Rotation operation, 86
Rough sets, soft computing, 119, 121–122, 122f
Round-robin partitioning, 218
Round-robin scheduling (RR scheduling), 10
Row partitioning, 218–219
Row vs column-oriented data layouts
R, programming language, 319–321
    analytical features, 319–321
        business analytics, 320–321
        business dashboard and reporting, 320
        data mining, 320
        general, 319
Rule-based architecture style, 134
Rule Interchange Format (RIF), 364
Rules management, cloud service provider, 330

S

SaaS (Software as a Service), 183–184
Safety-critical applications, 13
Sales forecasts, 73
Sarbanes–Oxley Act (SOX), 457
SAS, see Statistical Analysis System (SAS)
Scala, see Scalable Language (Scala)
Scalability, 141, 181, 190–192, 213
    capabilities of ESB, 173–174
    DW, 98–100
Scalable Language (Scala), 315–318
    advantages, 316–317
        functional programs, 317
        immutability, 310, 316–317
        interoperability with Java, 316
        null pointer uncertainty, 317
        parallelism, 316
        static typing and type inference, 316
    benefits, 318
        better fit, 318
        increased productivity, 318
        natural evolution from Java, 318
    characteristics, 315–316
    compiler, 318
Scalable Vector Graphics (SVG), 166
Scale of data, 210t
Scale-out,
Scaling-out of distributed system, 273
Scheduler, 293
Schema evolution, 39
Scikit-learn, 313
SciPy, 313
Search, create, read, update, and delete (SCRUD) operations, 256
Search Log Analysis (SLA), 368
    data analysis, 369–371
        query, 369–370
        session, 369
        term, 370–371
process, 368–371 data analysis, 369–371 data collection, 368 data preparation, 368–369 Search logs, 367 Sector/competitor assessments, 54 Secure Assure, 339 Security capabilities of ESB, 173 deperimeterization, 342–343 Segmentation process, 76, 103 Self-service, cloud service providers, 329 Semantic analysis problem, 399 Semantic engine, cloud service provider, 330 Semantic web, 363–364 Semijoin operation, 148, 288 Semistructured data processing, 347 Semistructured decision problems, 60 Sentence level sentiment analysis, 399 Sentiment analysis, 72, 397–400 applications, 400 classes types, 400 initial work in, 400 levels, 399 and NLP, 398–400 opinions types, 399 lexicon, 400 Service contracts, 157 well-defined, 158 discovery, 332 enablement capabilities, ESB, 172 hijacking, 338 implementation, 155 and service contracts, granularity, 158–159 Service-based architecture, 355 Service-level agreements (SLA), 335, 340–341 management, 334–335 cloud service provider, 329 Service-Oriented Applications, 359 Service-oriented architecture (SOA), 153–156 applications, 159–161 BPM, 160–161 multichannel access, 160 rapid application integration, 160 benefits, 156–157 characteristics, 157–159 design services with performance, 159 dynamic, discoverable, metadata driven, 157 loosely coupled, 158 multiple invocation styles, 158 predictable SLAs, 159 services and service contracts, granularity, 158–159 standard based, 158 stateless, 159 well-defined service contracts, 158 defining, 155–156 ingredients, 161–167 objects, 161 resources, 162–163 services, 161–162 layers, 202–203 business processes, 202 business services, 202–203 domains, 202 infrastructure services, 203 operational systems, 203 service realizations, 203 and RESTful services, 166–167 vendor implementations, 165 and web services, 163–165, 164f Service-oriented cloud computing, 200–203 Servlet API, 356 Session Beans EJB, 357 Set-at-a-time/set-oriented DMLs, 40 Set-containment approach, 35 Sharding, 140 of files, 255, 273 
Shard key, 273 Shared-everything architecture, 98–99 Shared memory, 215–216 system, 127, 127f Shared nothing, 215–216 architecture, 301 distributed processing paradigm, 236 Shopping pattern analysis, 235 Shuffle and sort phase, 230 Simple object access protocol (SOAP), 163, 165 Single instruction, multiple data (SIMD) mode, 228 Single instruction, multiple data stream (SIMD) architecture, 125 Single instruction, single data stream (SISD) architecture, 125 Single Point of Failures (SPOF), 242 Single program, multiple data (SPMD) model, 125, 228 SLA, see Search Log Analysis (SLA); Service-level agreements (SLA) Slave nodes, 239 Slave servers, 280 Slice and dice operation, 86 Small and Medium Enterprises (SMEs), 188–190, 189 vs Large Enterprises, 189t–190t Small worlds networks, 379–380 SMEs, see Small and Medium Enterprises (SMEs) SN, see Social Networks (SN) SNA (Social Networks Analysis), 389–391, 397 Snowflake schemas, 87, 88f SOA, see Service-oriented architecture (SOA) Social computing, 377 Social media, 398 marketing analysis, 235 Social Networks (SN), 377, 382–389 analysis, metrics of, 384–385 bridge, 384 Classmates, 389 defined, 384 Facebook, 386–387 Flickr, 389 Friendster, 389 Google+, 388 hi5, 389 LinkedIn, 386 Orkut, 389 Twitter, 387–388 YouTube, 389 Social Networks Analysis (SNA), 389–391, 397 Social tagging, 361 Soft computing, 118–122 ANNs, 119 EA, 120–121 fuzzy systems, 120, 120f rough sets, 121–122, 122f vs traditional hard computing, 118t Software architectural styles, distributed system, 130–135 call and return, 132–133 data-centered, 131 data-flow, 131–132 independent components, 134–135 virtual, 133–134 framework, 279 Software as a Service (SaaS), 183–184 Software product line (SPL), 464 Sorted String Tables (SSTables), 258, 260 Space redundancy, 13 Space–time trade-off factor, 80 Spark, 303–308 benefits, 307–308 components, 305 concepts, 306–307 action, 307 resilient distributed datasets, 303, 306 shared variables, 306 SparkContext, 
306 transformations, 306–307 streaming, 308 Spatial databases, 45 Spatiotemporal database, 45 Spurious relationships, 72 SQL, see Structured query language (SQL) Sqoop, 282, 300 Standard based protocol, 158 Standard repartition join operation, 288 Star schemas, 87, 87f Statistical Analysis System (SAS), 321–323 DATA step, 321 other software products, 322–323 procedures, 322 Statistical inference technique, 68–69 Stats models, 314 Stimulus-response analysis, 54 Stock market prediction, 400 Storage and processing strategies, 244–245 big data processing methods, characteristics, 244–245 big data storage methods, characteristics, 244 Storage definition language (SDL), 40 Strategic decisions, 61 Strategic intelligence, 53 Strategic Intelligence Analysis (SIA), 54 Stream databases, 45–46 Streaming, 283 Stream processing (SP) technology, 349 Structured decision problem, 60 Structured query language (SQL), 28–29, 35, 68, 80 dialect, 301 Subject-oriented DW, 82 Supercomputers, Supervised analysis, 106–108 classification, 107 exploratory, 106 regression, 107–108 time series, 108 Supervised learning method, 116 Supervisory Control and Data Acquisition (SCADA), 471 Supporting iterative processing, 286–288 Support system (DSS), 63 Support vector machine (SVM), 115 Symbian, 406–407 Symbol-based machine learning, 115 Symmetric multiprocessing/multiprocessor (SMP), 127, 216 systems, 11 Synapses, 119 System architectural, distributed system, 129–130 N-tier, 129–130 peer-to-peer, 130 System architecture, big data, 216–217 BASE, 217–218 functional decomposition, 217–218 master–slave replication, 218 T Tablet, Bigtable, 232 Tacit knowledge, 57 Tactical decisions, 54, 61 Tactical Intelligence Analysis (TIA), 54 Tagging system, 427–429 bar codes, 428 Bluetooth, 428 RFID, 429 Tag soups, 314 Task-level parallelism, 126 TaskTracker, 286 Temporal databases, 44–45 Term co-occurrence, 370 Terms-by-documents matrix, 71 Test statistic, 68 Text analysis, 391–397 document, 392–395 
domain knowledge, 395–396 functions, 396–397 patterns and trends search, 396 analytics, 71–72 databases, 44 mining, 105 Third normal form (3NF), 93–94 Thread, 48 Three-level ANSI architecture, 38 Three-phase commit (3PC) protocol, 150–151 Three-schema architecture, 37f, 38–40 Time-based context, 449 Time Division Multiple Access (TDMA), 403–404 Timeline, 387 Time redundancy, 13 Time series and market research models, 72–74 Time-series regression, 72 Time slice, 10 Tools and techniques, big data applications, developing, 222–223 in-memory computing, 221 NoSQL data management, 220–221 processing approach, 215–216 row partitioning or sharding, 218–219 row vs column-oriented data layouts, 219–220 system architecture, 216–217 Top-down architectural style, 132–133 Topic-based routing, 172 Top layer, database, 36 Traditional file processing, 24 Traffic pattern recognition, 235 Transaction logs, 367 management in distributed databases, 149 2PC protocol, 149–150 3PC protocol, 150–151 throughput, 80 Transformation capabilities, ESB, 172 Transmission control protocol (TCP), 15 network packets, 116 Trojan index, 289–290 Trojan join, 290 Trust boundary, 342 Trusted applications, 344 Twitter, 387–388 Two mode network, 390 Two-phase commit (2PC) protocol, 144, 149–150 U Ubiquitous computing system, 470 Undirected data mining model, 103 Uniform resource identifier (URI), 162, 166 Unique Identification Authority of India (UIDAI), 223 Unique query, 370 Univariate analysis, 106 Universal description, discovery, and integration (UDDI), 163, 165 Universal Mobile Telecommunication System (UMTS), 406 Universal queries, 35 Universal resource locator (URL), 381 Unix shell pipes, 132 Unknown risk profile, 338 Unstructured data processing, 347 Unstructured decision problem, 60 Un-supervised analysis, 108–109 association rules, 108 clustering, 108–109 description and visualization, 109 Unsupervised data mining model, 103 Unsupervised learning method, 116 User-defined functions (UDFs), 289 
User interaction architecture, 355 User state-based context, 449 V Value constellation analysis, 54 Vertical partitioning method, 100 Vertical scalability, 141 View definition language (VDL), 41 Vilfredo Pareto’s principle, 379 Virtual architectures, distributed system, 133–134 Virtual chains, 31 Virtualization, 341, 345 cloud computing leverages, 342 system, 153 technology, 196–200 advantages, 198–199 challenges, 199–200 characteristics, 197–198 components, 197 types, 196 Virtual machine (VM), 133, 196 architectural styles, 134 security, 345 Virtual machine monitor (VMM), 196, 345 Visualization technique, 75 VM, see Virtual machine (VM) VMM (virtual machine monitor), 196, 345 von Neumann architectures, 8–9, 9f W Warehouse, data, 64, 81–90, 82t, 83t, 325 architecture, 91–93 challenges, 96–100 tiers, 91–93 basics, 79–100 database concepts, 79–81 physical database design, 80–81 features, 82 multidimensional model, 83–90 data cube, 84, 84f multidimensional cube, 84f, 88–90, 90f OLAP, 84–86, 85f relational schemas, 87–88 Weakly structured documents, 393 Web analysis, 371–376 internet technologies, 373 tools, 375–376 veracity of log files data, 374–375 applications, 364–367 characteristics, 364–365 dimensions, 365–367 community analysis, 46 databases, 46 document, 369 evolution, 359–364 mobile, 363 RIA, 364 semantic, 363–364 Web 1.0, 359 Web 2.0, see Web 2.0 Web 3.0, 362–363 mining, 105 services, SOA, 163–165, 164f roles, 155 usage mining, 46 Web 1.0, 359 vs Web 2.0, 362 Web 2.0, 359–362, 376 mashups, 361 RSS technologies, 360 social tagging, 361 user contributed content, 361–362 vs Web 1.0, 362 weblogs or blogs, 359–360 Wikis, 360 Web 3.0, 362–363 Weblogs, 359–360 Web ontology language (OWL), 364 Web services description language (WSDL), 163, 165 Whirr, 302 wideband-CDMA (wCDMA), 406 Wide-column systems, 256 Wiki, 360 Williams, Graham, 320 Windows Phone, 408 Wireless application protocol (WAP), 402 Wireless mobile application, 401–402 classifications, 
402 input, 401 memory, 401 processing power, 401 screen, 402 Wireless networks evolution, 404t Wireless Sensor Networks (WSN), 471 Wisdom, IMM, 58–59 Word2vec, 315 Word-level representation of document, 394 Worker nodes server, 280 Workload-driven architecture, 96 World Wide Web (WWW), 46, 381–382 WriteConcern parameter, 269 Write-once-read-many access model, 293 WWW, see World Wide Web (WWW) X XML, see Extensible Markup Language (XML) XML-based language, 165 Y Yalamanchi, Ramu, 389 Yet Another Resource Negotiator (YARN), 242, 291–293, 292f responsibilities, 291 YouTube, 389 Z ZooKeeper, 297 distributed applications, 281, 297 Zuckerberg, Mark, 386