C S R Prabhu · Aneesh Sreevallabh Chivukula · Aditya Mogadala · Rohit Ghosh · L M Jenila Livingston

Big Data Analytics: Systems, Algorithms, Applications

C S R Prabhu
National Informatics Centre
New Delhi, Delhi, India

Aneesh Sreevallabh Chivukula
Advanced Analytics Institute
University of Technology, Sydney
Ultimo, NSW, Australia

Aditya Mogadala
Saarland University
Saarbrücken, Saarland, Germany

Rohit Ghosh
Qure.ai
Goregaon East, Mumbai, Maharashtra, India

L M Jenila Livingston
School of Computing Science and Engineering
Vellore Institute of Technology
Chennai, Tamil Nadu, India

ISBN 978-981-15-0093-0
ISBN 978-981-15-0094-7 (eBook)
https://doi.org/10.1007/978-981-15-0094-7

© Springer Nature Singapore Pte Ltd 2019

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Foreword

The Big Data phenomenon has emerged globally as the next wave of technology, one that will have a major influence on all aspects of life and contribute to a better quality of life. The advent of the Internet of things (IoT) and its associated Fog Computing paradigm is only accentuating and amplifying the Big Data phenomenon.

This book by C S R Prabhu and his co-authors comes at the right time. It fills the need for a comprehensive text covering all dimensions of Big Data Analytics: systems, algorithms, applications and case studies, along with emerging research horizons. In each of these dimensions, this book presents a comprehensive picture to the reader in a lucid and appealing manner.

This book can be used effectively by undergraduate and postgraduate students in IT, computer science and management disciplines, as well as by research scholars in these areas. It also helps IT professionals and practitioners who need to learn and understand the subject of Big Data Analytics.

I wish this book every success with the global student community as well as with professionals.

Dr Rajkumar Buyya
Redmond Barry Distinguished Professor, Director, Cloud Computing and Distributed Systems (CLOUDS) Lab, School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia

Preface

The present-day Information Age has produced an overwhelming deluge of digital data arriving from unstructured sources such as online transactions, mobile phones, social networks and emails, popularly known as Big Data. In addition, with the advent of Internet of things (IoT) devices
and sensors, the volume of data flowing into the Big Data scenario has multiplied manyfold. This Internet-scale computing has also necessitated the ability to analyze and make sense of the accompanying data deluge, so that intelligent decisions and real-time actions can be taken on the basis of real-time analytics techniques. The Big Data phenomenon has been impacting all sectors of business and industry, resulting in a new information ecosystem. The term ‘Big Data’ refers not only to the massive volume and variety of the data itself, but also to the set of technologies surrounding it, which perform the capture, storage, retrieval, management, processing and analysis of the data for the purpose of solving complex problems in life and in society by unlocking the value of that data more economically.

In this book, we provide a comprehensive survey of the origin, nature, scope, structure and composition of big data and of its ecosystem, with reference to technologies such as Hadoop, Spark and R and their applications. Other essential big data concepts, including NoSQL databases for storage, machine learning paradigms for computing and analytics models connecting the algorithms, are also covered. This book also surveys emerging research trends in large-scale pattern recognition, programming processes for data mining and ubiquitous computing, and application domains for commercial products and services. Further, this book gives a detailed description of applications of Big Data Analytics in the technological domains of Internet of things (IoT), Fog Computing and Social Semantic Web mining, and then in the business domains of banking and finance, insurance and capital markets, before delving into the issues of security and privacy associated with Big Data Analytics. At the end of each chapter, pedagogical questions on the comprehension of the chapter contents are added.

This book also describes the data engineering and data mining life cycles
involved in the context of machine learning paradigms for unstructured and structured data. The relevant developments in big data stacks are discussed with a focus on open-source technologies. We also discuss the algorithms and models used in data mining tasks such as search, filtering, association, clustering, classification, regression, forecasting, optimization, validation and visualization. These techniques are applicable to various categories of content generated in data streams, sequences, graphs and multimedia in transactional, in-memory and analytic databases. Big Data Analytics techniques comprising descriptive and predictive analytics, with an emphasis on feature engineering and model fitting, are covered. For feature engineering, we cover feature construction, selection and extraction along with preprocessing and post-processing techniques. For model fitting, we discuss model evaluation techniques such as statistical significance tests, cross-validation curves, learning curves, sufficient statistics and sensitivity analyses. Finally, we present the latest developments and innovations in generative learning and discriminative learning for large-scale pattern recognition. These techniques comprise incremental and online learning for linear/nonlinear and convex/multi-objective optimization models, feature learning or deep learning, and evolutionary learning for scalability and optimization meta-heuristics.

Machine learning algorithms for big data cover broad areas of learning such as supervised, unsupervised, semi-supervised and reinforcement techniques. In particular, the supervised learning subsection details several classification and regression techniques to classify and forecast, while the unsupervised learning techniques cover clustering approaches that are based on linear algebra fundamentals. Similarly, the semi-supervised methods presented in the chapter cover approaches that help to scale to big data by learning from largely un-annotated information.
We also present reinforcement learning approaches that aim to perform collective learning and support distributed scenarios.

An additional unique feature of this book is a set of about 15 real-life case studies drawn from the above-mentioned application domains. The case studies describe, in brief, experiences of deploying and applying Big Data Analytics techniques in the diverse contexts of private and public sector enterprises. They span product companies such as Google, Facebook and Microsoft, consultancy companies such as Kaggle, and application domains at power utility companies such as Opower and banking and finance companies such as Deutsche Bank. They help the readers to understand the successful deployment of analytical techniques that maximize a company's functional effectiveness, diversity in business and customer relationship management, in addition to improving the financial benefits. All these companies handle real-life Big Data ecosystems in their respective businesses to achieve tangible results and benefits.

For example, Google not only harnesses, for profit, the big data ecosystem arising from its huge number of users, with billions of web searches and emails, by offering customized advertisement services, but also offers other companies the means to store and analyze big datasets on its cloud platforms. Google has also developed an IoT sensor-based autonomous Google car with real-time analytics for driverless navigation. Facebook, the largest social network in the world, has deployed big data techniques for personalized search and advertisement. LinkedIn also deploys big data techniques for effective service delivery. Microsoft also aspires to enter the big data business by offering Big Data Analytics services to business enterprises on its Azure cloud services. Nokia deploys its Big Data Analytics services on the huge buyer and subscriber base of its
mobile phones, including the mobility of its buyers and subscribers. Opower, a power utility company, has deployed Big Data Analytics techniques on its customer data to achieve substantial power savings. Deutsche Bank has deployed big data techniques to achieve substantial savings and better customer relationship management (CRM). Delta Airlines improved its revenues and CRM by deploying Big Data Analytics techniques. Traffic management in a Chinese city was achieved successfully by adopting big data methods.

Thus, this book provides a complete survey of techniques and technologies in Big Data Analytics. It will serve as a basic textbook introducing niche technologies to undergraduate and postgraduate computer science students. It can also serve as a reference book for professionals interested in pursuing leadership-level career opportunities in data and decision sciences, by focusing on concepts for problem solving and solutions for competitive intelligence. Big data applications are discussed in a plethora of books but, to the best of our knowledge, there is no textbook covering a similar mix of technical topics. For further clarification, we provide references to white papers and research papers on specific topics.

New Delhi, India            C S R Prabhu
Ultimo, Australia           Aneesh Sreevallabh Chivukula
Saarbrücken, Germany        Aditya Mogadala
Mumbai, India               Rohit Ghosh
Chennai, India              L M Jenila Livingston

Acknowledgements

The authors humbly acknowledge the contributions of the following individuals toward the successful completion of this book: Mr P V N Balaram Murthy, Ms J Jyothi, Mr B Rajgopal, Dr G Rekha, Dr V G Prasuna, Dr P S Geetha and Dr J V Srinivasa Murthy, all from KMIT, Hyderabad; Dr Charles Savage of Munich, Germany; Ms Rachna Sehgal of New Delhi; Dr P Radhakrishna of NIT, Warangal; Mr Madhu Reddy, Hyderabad; Mr Rajesh Thomas, New Delhi; and Mr S Balakrishna, Pondicherry, for their support and assistance in various stages and phases
involved in the development of the manuscript of this book.

The authors thank the managements of the following institutions for supporting the authors:

KMIT, Hyderabad
KL University, Guntur
VIT, Chennai
Advanced Analytics Institute, University of Technology, Sydney, (475), Sydney, Australia

About This Book

Big Data Analytics is an Internet-scale commercial high-performance parallel computing paradigm for data analytics. This book is a comprehensive textbook on all the multifarious dimensions and perspectives of Big Data Analytics: the platforms, systems, algorithms and applications, including case studies. It presents data-derived technologies, systems and algorithmics in the areas of machine learning, as applied to Big Data Analytics. As case studies, this book briefly covers the analytical techniques useful for processing data-driven workflows in various industries such as health care, travel and transportation, manufacturing, energy, utilities, telecom, banking and insurance, in addition to the IT sector itself. The Big Data-driven computational systems described in this book have carved out, as discussed in various chapters, the applications of Big Data Analytics in various industry application areas such as IoT, social networks, banking and financial services, insurance, capital markets, bioinformatics, advertising and recommender systems. Future research directions are also indicated. This book will be useful for both undergraduate and graduate courses in computer science in the area of Big Data Analytics.

Appendices

Lower critical values of the chi-square distribution with ν degrees of freedom

        Probability of exceeding the critical value
  ν     0.90     0.95     0.975    0.99     0.999
  1     0.016    0.004    0.001    0.000    0.000
  2     0.211    0.103    0.051    0.020    0.002
  3     0.584    0.352    0.216    0.115    0.024
  4     1.064    0.711    0.484    0.297    0.091
  5     1.610    1.145    0.831    0.554    0.210
  6     2.204    1.635    1.237    0.872    0.381
  7     2.833    2.167    1.690    1.239    0.598
  8     3.490    2.733    2.180    1.646    0.857
  9     4.168    3.325    2.700    2.088    1.152
 10     4.865    3.940    3.247    2.558    1.479
 11     5.578    4.575    3.816    3.053    1.834

In the table, we can see values of $\chi^2_\alpha$ for various values of $\nu$ (degrees of freedom), where $\chi^2_\alpha$ is such that the area under the chi-square distribution to its right is equal to $\alpha$. The chi-square distribution is not symmetrical.

A random variable having the F-distribution

Theorem 6.5 If $S_1^2$ and $S_2^2$ are the variances of independent random samples of size $n_1$ and $n_2$, respectively, taken from two normal populations having the same variance, then

$$F = \frac{S_1^2}{S_2^2}$$

is a random variable having the F-distribution with the parameters $\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$.

The F-distribution is used to determine whether the ratio of two sample variances $S_1^2$ and $S_2^2$ is too small or too large. The F-distribution is related to the beta distribution, and its two parameters $\nu_1$ and $\nu_2$ are called the numerator and denominator degrees of freedom. $F_{0.05}$ and $F_{0.01}$ for the various combinations of values of $\nu_1$ and $\nu_2$ are given in the F-distribution table.

F values for α = 0.05 (columns: numerator degrees of freedom d1; rows: denominator degrees of freedom d2)

 d2 \ d1     1       2       3       4       5       6       7       8       9
    1      161.4   199.5   215.7   224.6   230.2   234.0   236.8   238.9   240.5
    2      18.51   19.00   19.16   19.25   19.30   19.33   19.35   19.37   19.38
    3      10.13    9.55    9.28    9.12    9.01    8.94    8.89    8.85    8.81
    4       7.71    6.94    6.59    6.39    6.26    6.16    6.09    6.04    6.00
    5       6.61    5.79    5.41    5.19    5.05    4.95    4.88    4.82    4.77
    6       5.99    5.14    4.76    4.53    4.39    4.28    4.21    4.15    4.10
    7       5.59    4.74    4.35    4.12    3.97    3.87    3.79    3.73    3.68
    8       5.32    4.46    4.07    3.84    3.69    3.58    3.50    3.44    3.39

(1) $F_\alpha(\nu_1, \nu_2)$ is the value of F with $\nu_1$ and $\nu_2$ degrees of freedom such that the area under the F-distribution curve to the right of $F_\alpha$ is $\alpha$.
(2) $F_{1-\alpha}(\nu_1, \nu_2) = \dfrac{1}{F_\alpha(\nu_2, \nu_1)}$
(3) The F statistic is always positive.
(4) The F-distribution curve lies entirely in the first quadrant, and it is unimodal.

Testing for the equality of variances of two normal populations

The F-test is used to determine whether two independent estimates of the population variance differ significantly, or whether the two samples may be regarded as drawn from normal populations having the same variance:

$$\sigma_A^2 = \sigma_B^2 = \sigma^2 \tag{5}$$

$$F = \frac{S_A^2}{S_B^2} \tag{6}$$

where

$$S_A^2 = \frac{\sum (x_i - \bar{x})^2}{n_1 - 1}, \qquad S_B^2 = \frac{\sum (y_i - \bar{y})^2}{n_2 - 1}$$

The degrees of freedom are $\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$. The numerator variance must always be greater than the denominator variance, that is, $S_A^2 > S_B^2$.

Hypothesis concerning the variance of a normal population

Suppose we want to test whether a random sample $x_i$ (i = 1, 2, 3, …, n) has been drawn from a normal population with a specified variance $\sigma^2$. The test statistic

$$\chi^2 = \frac{n s^2}{\sigma^2}$$

follows a chi-square distribution with (n − 1) degrees of freedom. Here $s^2$ denotes the sample variance computed with divisor $n$, so that $n s^2 = \sum (x_i - \bar{x})^2$.

R Language

There are many in-built functions for statistical analysis in R. Most of them are part of the base R distribution. These functions take an R vector, along with other arguments, as input and return the result. The in-built functions that we will discuss now are mean, median and mode.

Mean

The mean is calculated by summing all the values and dividing by the total number of values in a data series.

Syntax: mean(x, trim = 0, na.rm = FALSE, ...)

Following is the description of the parameters used:

• x is the input vector
• trim is the fraction (0 to 0.5) of observations dropped from each end of the sorted vector before the mean is computed
• na.rm is a logical value indicating whether missing values (NA) are removed from the input vector before the computation

Example

    > x <- c(1, 2, 3, 4, 5, 6, NA)   # illustrative vector with one missing value
    > result.mean <- mean(x)
    > print(result.mean)
    [1] NA

Example

    > x <- c(1, 2, 3, 4, 5, 6, NA)   # same illustrative vector, missing value removed
    > result.mean <- mean(x, na.rm = TRUE)
    > print(result.mean)
    [1] 3.5

Median

The ‘median’ is the ‘middle’ value in a set of numbers. To find the median, the numbers have to be sorted first; the median is then the middle number. With an even count of numbers there is no single middle number, so we take the middle pair of numbers, add them together and divide by two.

Syntax: median(x, na.rm = FALSE)

Following is the description of the parameters used:

• x is the input vector
• na.rm is a logical value indicating whether missing values (NA) are removed from the input vector before the computation

Example

    > x <- c(1, 2, 3, 4, 5)   # illustrative vector with an odd count of values
    > median(x)
    [1] 3

Example

    > x <- c(1, 2, 3, 4, 5, 6)   # illustrative vector with an even count of values
    > median(x)
    [1] 3.5

Mode

The mode is the value that has the maximum number of occurrences in a set of data. Unlike the mean and median, the mode can be computed for both character and numeric data. R does not have a standard in-built function to calculate
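mode. (R's own mode() function reports the storage mode of an object, not its most frequent value.) So we create a user-defined function to calculate the mode of a dataset in R.

Example

getmode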