Big data technologies and applications


Borko Furht · Flavio Villanustre

Big Data Technologies and Applications

Borko Furht
Department of Computer and Electrical Engineering and Computer Science
Florida Atlantic University
Boca Raton, FL, USA

Flavio Villanustre
LexisNexis Risk Solutions
Alpharetta, GA, USA

ISBN 978-3-319-44548-9
ISBN 978-3-319-44550-2 (eBook)
DOI 10.1007/978-3-319-44550-2
Library of Congress Control Number: 2016948809

© Springer International Publishing Switzerland 2016. This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper. This Springer imprint is published by Springer Nature. The registered company is Springer International Publishing AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

The scope of this book covers the leading edge in big data systems, architectures, and applications. Big data computing refers to capturing, managing, analyzing, and understanding data at volumes and rates that push the frontiers of current technologies. The challenge of big data computing is to provide the hardware architectures and related software systems and techniques that are capable of transforming ultra-large data into valuable knowledge. Big data and data-intensive computing demand a fundamentally different set of principles than mainstream computing. Big data applications are typically well suited to large-scale parallelism over the data and also require an extremely high degree of fault tolerance, reliability, and availability. In addition, most big data applications require relatively fast response.

The objective of this book is to introduce the basic concepts of big data computing and then to describe the total solution to big data problems developed by LexisNexis Risk Solutions. The book comprises three parts, which consist of 15 chapters. Part I, Big Data Technologies, includes chapters dealing with an introduction to big data concepts and techniques, big data analytics and related platforms, and visualization and deep learning techniques for big data. Part II, LexisNexis Risk Solution to Big Data, focuses on specific technologies and techniques developed at LexisNexis to solve critical problems that use big data analytics. It covers the open-source high-performance computing cluster (HPCC Systems®) platform and its architecture, as well as the parallel data languages ECL and KEL, developed to effectively solve big data problems. Part III, Big Data Applications, describes various data-intensive applications solved on HPCC Systems. It includes applications such as cyber security and social network analytics, including insurance fraud, fraud in prescription drugs, fraud in Medicaid, and others. Other HPCC Systems applications described include Ebola spread modeling using big data analytics, and unsupervised learning and image classification.

With the dramatic growth of data-intensive computing and systems and of big data analytics, this book can be the definitive resource for persons working in this field as researchers, scientists, programmers, engineers, and users. This book is intended for a wide variety of people, including academicians, designers, developers, educators, engineers, practitioners, researchers, and graduate students. This book can also be beneficial for business managers, entrepreneurs, and investors.

The main features of this book can be summarized as follows:
• It describes and evaluates the current state of the art in the field of big data and data-intensive computing.
• It focuses on the LexisNexis platform and its solutions to big data.
• It describes real-life solutions to big data analytics.

Boca Raton, FL, USA  Borko Furht
Alpharetta, GA, USA  Flavio Villanustre
2016

Acknowledgments

We would like to thank a number of contributors to this book. The LexisNexis contributors include David Bayliss, Gavin Halliday, Anthony M. Middleton, Edin Muharemagic, Jesse Shaw, Bob Foreman, Arjuna Chala, and Flavio Villanustre. The Florida Atlantic University contributors include Ankur Agarwal, Taghi Khoshgoftaar, DingDing Wang, Maryam M. Najafabadi, Abhishek Jain, Karl Weiss, Naeem Seliya, Randall Wald, and Borko Furht. The other contributors include I. Itauma, M.S. Aslan, and X.W. Chen from Wayne State University; Chun-Wei Tsai, Chin-Feng Lai, Han-Chieh Chao, and Athanasios V. Vasilakos from Lulea University of Technology in Sweden; and Ekaterina Olshannikova, Aleksandr Ometov, Yevgeni Koucheryavy, and Thomas Olsson from Tampere University of Technology in Finland. Without their expertise and effort, this book would never have come to fruition. Springer editors and staff also deserve our sincere recognition for their support throughout the project.

Contents

Part I: Big Data Technologies

1 Introduction to Big Data (Borko Furht and Flavio Villanustre)
   Concept of Big Data
   Big Data Workflow
   Big Data Technologies
   Big Data Layered Architecture
   Big Data Software
   Hadoop (Apache Foundation)
   Splunk
   LexisNexis' High-Performance Computer Cluster (HPCC)
   Big Data Analytics Techniques
   Clustering Algorithms for Big Data
   Big Data Growth
   Big Data Industries
   Challenges and Opportunities with Big Data
   References

2 Big Data Analytics (Chun-Wei Tsai, Chin-Feng Lai, Han-Chieh Chao and Athanasios V. Vasilakos)
   Introduction
   Data Analytics
   Data Input
   Data Analysis
   Output the Result
   Summary
   Big Data Analytics
   Big Data Input
   Big Data Analysis Frameworks and Platforms
   Researches in Frameworks and Platforms
   Comparison Between the Frameworks/Platforms of Big Data
   Big Data Analysis Algorithms
   Mining Algorithms for Specific Problem
   Machine Learning for Big Data Mining
   Output the Result of Big Data Analysis
   Summary of Process of Big Data Analytics
   The Open Issues
   Platform and Framework Perspective
   Input and Output Ratio of Platform
   Communication Between Systems
   Bottlenecks on Data Analytics System
   Security Issues
   Data Mining Perspective
   Data Mining Algorithm for Map-Reduce Solution
   Noise, Outliers, Incomplete and Inconsistent Data
   Bottlenecks on Data Mining Algorithm
   Privacy Issues
   Conclusions
   References

3 Transfer Learning Techniques (Karl Weiss, Taghi M. Khoshgoftaar and DingDing Wang)
   Introduction
   Definitions of Transfer Learning
   Homogeneous Transfer Learning
   Instance-Based Transfer Learning
   Asymmetric Feature-Based Transfer Learning
   Symmetric Feature-Based Transfer Learning
   Parameter-Based Transfer Learning
   Relational-Based Transfer Learning
   Hybrid-Based (Instance and Parameter) Transfer Learning
   Discussion of Homogeneous Transfer Learning
   Heterogeneous Transfer Learning
   Symmetric Feature-Based Transfer Learning
   Asymmetric Feature-Based Transfer Learning
   Improvements to Heterogeneous Solutions
   Experiment Results
   Discussion of Heterogeneous Solutions
   Negative Transfer
   Transfer Learning Applications
   Conclusion and Discussion
   Appendix
   References

4 Visualizing Big Data (Ekaterina Olshannikova, Aleksandr Ometov, Yevgeni Koucheryavy and Thomas Olsson)
   Introduction
   Big Data: An Overview
   Big Data Processing Methods
   Big Data Challenges
   Visualization Methods
   Integration with Augmented and Virtual Reality
   Future Research Agenda and Data Visualization Challenges
   Conclusion
   References

5 Deep Learning Techniques in Big Data Analytics (Maryam M. Najafabadi, Flavio Villanustre, Taghi M. Khoshgoftaar, Naeem Seliya, Randall Wald and Edin Muharemagic)
   Introduction
   Deep Learning in Data Mining and Machine Learning
   Big Data Analytics
   Applications of Deep Learning in Big Data Analytics
   Semantic Indexing
   Discriminative Tasks and Semantic Tagging
   Deep Learning Challenges in Big Data Analytics
   Incremental Learning for Non-stationary Data
   High-Dimensional Data
   Large-Scale Models
   Future Work on Deep Learning in Big Data Analytics
   Conclusion
   References

Part II: LexisNexis Risk Solution to Big Data

6 The HPCC/ECL Platform for Big Data (Anthony M. Middleton, David Alan Bayliss, Gavin Halliday, Arjuna Chala and Borko Furht)
   Introduction
   Data-Intensive Computing Applications
   Data-Parallelism
   The "Big Data" Problem
   Data-Intensive Computing Platforms
   Cluster Configurations
   Common Platform Characteristics
   HPCC Platform
   HPCC System Architecture
   HPCC Thor System Cluster
   HPCC Roxie System Cluster

[...]

Innovative Mobile Application for Ebola Spread

The locations of interest, in our case the Ebola-affected sites, are specified using longitude and latitude, and the proximity of these locations is adjusted by adding a radius of 500 m, as shown in Fig. 14.17.

Fig. 14.17 Some features of the innovative mobile application
Fig. 14.18 Sample query for returning data of users having the last name VED

First, to use the geofence feature, our mobile application requested access to fine location by adding this permission in the manifest file of the application. To use the intent service for listening to the geofence transitions, the IntentService element was added. The geofences are created using the API's class builder, which also helps in setting the desired radius, duration, and transition types for the geofence. We set two triggers, one for entering the geofence and the other for exiting. These triggers tell the location services that the respective trigger should be fired if the device is within the geofence. Stopping geofence monitoring when it is no longer needed or desired can help save battery power and CPU cycles on the device. Geofence monitoring can be stopped by removing the geofence from the location service. Figure 14.18 below shows the geofence created around a particular location with a radius of 500 m. As can be seen from the image, the user is currently present in the geofence area and hence will receive a notification alerting him/her that they have entered an Ebola-infected area; another notification will be pushed as soon as the user exits the geofence, alerting him/her that they have left the Ebola-infected area. This feature will help users by keeping them aware of the infected areas, so that they can take the necessary precautions when entering these areas.
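The geofence behavior described above reduces to a point-in-circle test on geographic coordinates plus enter/exit transition detection. The following minimal Python sketch illustrates that logic only; it is not the Android geofencing API used by the application, and the site coordinates are placeholder values.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def transition(prev_inside, lat, lon, center, radius_m=500.0):
    """Return ('ENTER'|'EXIT'|None, inside) for a new position fix."""
    inside = haversine_m(lat, lon, center[0], center[1]) <= radius_m
    if inside and not prev_inside:
        return "ENTER", inside
    if not inside and prev_inside:
        return "EXIT", inside
    return None, inside

# Hypothetical Ebola-affected site (placeholder coordinates) with a 500 m radius.
site = (6.3156, -10.8074)
event, state = transition(False, 6.3159, -10.8071, site)
print(event)  # -> ENTER: the app would push an alert notification at this point
```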
Web Service Through ECL

A SOAP-enabled service is created using ECL, which is a declarative, data-centric programming language designed in 2000 by LexisNexis Risk Solutions to allow developers to process big data across a high-performance computing cluster. For the current purpose we use dummy data with 1,000,000 records, with attributes such as PersonID, FirstName, LastName, MiddleName, Gender, Street, City, State, and Zip. We have designed a Roxie query; Roxie is an HPCC Systems cluster specifically designed to service standard queries, providing a throughput rate of a thousand-plus responses per second. This service has been published on hThor, which is an R&D platform designed for iterative, interactive development and testing of Roxie queries.

Fig. 14.19 Sample output of the service

The image below shows a sample XML query created to return the details of all users having a particular LastName (Fig. 14.18). This query returned 1000 users with the last name VED in milliseconds. A sample of the result is shown in Fig. 14.19. This service can be parsed by the mobile application and will help in handling big data with a high throughput rate.
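Published Roxie/hThor queries are normally reachable over HTTP through the platform's web-services layer, so a client such as the mobile application's backend can submit the LastName parameter and parse the returned rows. The sketch below is a hypothetical illustration only: the host, port, query name, and response layout are assumptions, not values given in the book.

```python
import requests
import xml.etree.ElementTree as ET

# Assumed WsECL-style endpoint for a published query; host, port, target
# cluster, and query name are placeholders, not values from the book.
URL = "http://hpcc-esp.example.org:8002/WsEcl/submit/query/hthor/person_by_lastname/xml"

def find_by_last_name(last_name):
    """Submit the published query and return a list of matching person records."""
    resp = requests.post(URL, data={"LastName": last_name}, timeout=10)
    resp.raise_for_status()
    root = ET.fromstring(resp.text)
    # The result rows are assumed to come back as <Row> elements with one
    # child element per field (PersonID, FirstName, LastName, ...).
    return [{child.tag: child.text for child in row} for row in root.iter("Row")]

if __name__ == "__main__":
    for person in find_by_last_name("VED")[:5]:
        print(person.get("PersonID"), person.get("FirstName"), person.get("LastName"))
```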
Conclusion

The ongoing epidemic in West Africa offers a unique opportunity to improve our current understanding of the transmission characteristics of the Ebola virus disease in humans, including the duration of immunity among Ebola survivors and the case fatality ratio, as well as the effectiveness of various control interventions. Ending the epidemic requires approximately 70 % of the persons with Ebola to be treated either in an ETU, at home, or in a community setting such that there is a reduced risk of disease transmission. There are many public health challenges in predicting the number of future cases, and if the preventive measures are not scaled up, cases will continue to double approximately every 20 days. However, this epidemic can be controlled using the various models described in the paper. As many consumers now receive news from real-time social media platforms, it is important to have quantitative methods such as SIR, ISIR, SIS, and SEIZ to distinguish news from rumors, as misinformation on social platforms can sometimes resemble genuine news. There was an argument that digital health technology could have played a better role in stopping the Ebola outbreak had there been a quick ground response. With this outbreak, many developers and manufacturers were able to test their apps in the field and had very positive results. Now the developers have to analyze the data and find better and more innovative ideas to bring these technologies together in the fight against Ebola.

Innovative mobile applications use large amounts of data from various sources, which are fed into the decision support system that models the spread pattern of the Ebola virus and creates dynamic graphs and predictive diffusion models of the impact of the virus on either a specific person or a specific community. LexisNexis has provided the big data needed to develop and model this program with the help of their expertise in big data analytics. The data provided by LexisNexis is completely secured and in compliance with the Centers for Disease Control and Prevention and the National Institute of Standards and Technology requirements for the transmission of public health information. The model created leads to more precise predictions of disease propagation; it also helps in identifying the individuals who are possibly infected by the virus and in performing a trace-back analysis to locate the possible source of infection in a particular social group. All the data is presented in the form of reports to the responsible government agencies.

Acknowledgments

This work has been funded by NSF Award No. CNS 1512932, RAPID: Modelling Ebola Spread and Developing Decision Support System Using Big Data Analytics, 2015–2016.

References

1. Browne C, Huo X, Magal P, Seydi M, Seydi O, Webb G. Model of 2014 Ebola Epidemic in West Africa with contact tracing.
2. Kouadio KI, Clement P, Bolongei J, Tamba A, Gasasira AN, Warsame A, Okeibunor JC, Ota MO, Tamba B, Gumede N, Shaba K, Poy A, Salla M, Mihigo R, Nshimirimana D. Epidemiological and surveillance response to Ebola virus disease outbreak in Lofa County, Liberia.
3. Lutwama JJ, Kamugisha J, Opio A, Nambooze J, Ndayimirije N, Okware S. Containing Hemorrhagic Fever Epidemic: The Ebola experience in Uganda.
4. https://en.wikipedia.org/wiki/Ebola_virus_disease#Onset
5. http://www.imedicalapps.com/2015/04/lessons-ebola-outbreak-using-apps-fight-infectiousdiseases/
6. https://www.whitehouse.gov/ebola-response
7. https://en.wikipedia.org/wiki/Basic_reproduction_number
8. http://www.healthmap.org/site/diseasedaily/article/estimating-fatality-2014-west-africanebola-outbreak-91014
9. http://epidemic.bio.ed.ac.uk/ebolavirus_fatality_rate
10. Jin F, Dougherty E, Saraf P, Cao Y, Ramakrishnan N. Epidemiological modeling of news and rumors on Twitter.
11. Anastassopoulou SC, Russo L, Grigoras C, Mylonakis E. Modelling the 2014 Ebola virus epidemic: agent based simulations, temporal analysis and future predictions for Liberia and Sierra Leone.
12. Merler S, Ajeli M, Fumanelli L, Gomes MFC, Pastore y Piontti A, Rossi L, Chao DL, Longini IM, Halloran ME, Vespignani A. Spatiotemporal spread of the 2014 outbreak of Ebola virus disease in Liberia and the effectiveness of non-pharmaceutical interventions: a computational modelling analysis.
13. http://ebolaresources.org/ebola-mobile-apps.htm
14. Meltzer MI, Atkins CY, Santibanez S, Knust B, Petersen BW, Ervin ED, Nichol ST, Damon IK, Washington ML. Estimating the future number of cases in the Ebola Epidemic: Liberia and Sierra Leone, 2014–2015.
15. Zhang Z, Wang H, Wang C, Fang H. Modeling epidemics spreading on social contact networks.
16. http://www.who.int/csr/disease/ebola/one-year-report/factors/en/
17. Haas CN. On the quarantine period for Ebola virus. PLoS Curr. 2014. doi:10.1371/currents.outbreaks.2ab4b76ba7263ff0f084766e43abbd89
18. https://data.hdx.rwlabs.org/dataset/number-of-health-care-workers-infected-with-edv
19. http://jmsc.hku.hk/ebola/2015/03/23/fighting-ebola-does-the-mobile-application-help/
20. http://who.int/hrh/news/2015/ebola_report_hw_infections/en/
21. http://apps.who.int/ebola/ebola-situation-reports
22. http://www.who.int/mediacentre/factsheets/fs104/en/
23. http://www.who.int/features/factfiles/malaria/en/
24. http://www.gleamviz.org/model/
25. http://www.math.washington.edu/~morrow/mcm/mcm15/38725paper.pdf
26. Chowell G, Nishiura H. Transmission dynamics and control of Ebola virus disease (EVD): a review.
27. http://www.cdc.gov/vhf/ebola/transmission/index.html
28. http://www.who.int/mediacentre/factsheets/fs103/en/
29. Centola D. The social origins of networks and diffusion. Am J Sociol. 2015;120(5):1295–1338. http://doi.org/10.1086/681275

Chapter 15
Unsupervised Learning and Image Classification in High Performance Computing Cluster
I. Itauma, M.S. Aslan, X.W. Chen and Flavio Villanustre

Introduction

Representing objects using lower dimensional, representative, and discriminative features is an ongoing research topic that has many important ramifications. This concept leads to important questions that need to be answered, for example:

• How to design an optimal and fast optimization method that can avoid local minima and converge?
• How to speed up the computational process using hardware systems?
• How to choose system/network parameters?
• How to use unlabeled data more efficiently?
• How to normalize the learned features before classification?
• How to avoid over-fitting?
• How to extract features from labeled data?

In this chapter, some of these questions are answered. Hand-engineering approaches have been proposed to extract good features from data to be used in the classification stages. In addition to these labor-intensive techniques, which do not scale well to new problems, many methods have been proposed (such as sparse coding [2] and sparse auto-encoders [3]) that can automatically learn better feature representations compared to the hand-engineered ones. Although those unsupervised methods achieve good performance if the required settings are satisfied, one of their major drawbacks is their complexity. Many of those methods also require careful selection of multiple hyper-parameters, like learning rates, momentum, sparsity penalties, weight decay, and many other parameters that must be chosen through cross-validation, thus increasing running times dramatically. Coates et al. [1] compared sparse auto-encoders, sparse restricted Boltzmann machines, Gaussian mixture models, and K-means learning methods. Surprisingly, the best results were achieved using the K-means method, which has been used in image processing but has not been widely practiced for deep unsupervised feature learning. To obtain the best results with the K-means method, a selection of the best number of centroids from the data is needed. In this study, we extend the use of the K-means algorithm to a multimodal learning and recognition framework in the High Performance Computing Cluster environment for data of any dimensionality.

There is a high demand for new ideas to deal with the feature learning and classification stages on high-dimensional data. The high dimensionality of unlabeled data requires new developments in learning methods. In spite of recent advances in representation learning, most of the current methods are limited when dealing with large-scale unlabeled data.
Complex deep architectures and expensive training times are mostly responsible for the lack of good feature representations for large-scale data. As a solution to dealing with high-dimensional data, researchers in the machine learning community have adopted the use of GPUs and parallel programming techniques to speed up computationally intensive algorithms. Furthermore, important studies have been carried out to propose more efficient optimization methods that speed up convergence (such as [4]). In response to these various ideas and platforms, we investigate the High Performance Computing Cluster (HPCC Systems®) as a new environment to assess our framework's effectiveness in terms of computation time and classification accuracy.

Background and Advantages of HPCC Systems®

HPCC Systems® is a massively parallel processing computing platform used for solving Big Data problems. A multi-node system leverages the full power of massively parallel processing (MPP). While a single-node system is fully functional, it does not take advantage of the true power of an HPCC Systems® platform, which has the ability to perform operations using MPP. Algorithms are implemented in HPCC Systems® with a language called Enterprise Control Language (ECL). The ECL compiler generates highly optimized C++ for execution. It is open source and easy to set up. Figure 15.1 shows an HPCC Systems® multi-cluster setup. The figure shows a THOR processing cluster, which is similar to the Google and Hadoop MapReduce platforms with respect to its function, filesystem, execution, and capabilities, but offers higher performance [5]. In [6], Payne et al. discussed the challenges of academic data in heterogeneous formats and diverse data sources. They assessed HPCC Systems® in the analysis of academic big data. Based on their evaluation, HPCC Systems® provides mechanisms for ingesting and managing simple data such as CSV data as well as complex data.

Fig. 15.1 HPCC Systems THOR cluster

We chose HPCC Systems® because of its scalability with respect to code reuse, irrespective of the size of the dataset and the number of clusters. It provides a programming abstraction and parallel runtime that hide the complexities of fault tolerance and data parallelism. One of the goals of this study is to show that researchers are able to run their proposed methods on HPCC Systems® even using a single-core computer. We expect a faster training time if the algorithms are tested on a multi-node HPCC Systems® platform. We leave the use of a system combining multiple computers for our future studies.

Contributions

The use of HPCC Systems® is adopted in the implementation of the feature learning and object classification tasks. We show that (i) HPCC Systems® enables researchers to leverage a multi-cluster environment to speed up the running time of any computationally intensive algorithm; (ii) it lowers budget costs by using existing computers instead of designing an expensive system with GPUs; and (iii) it is scalable with respect to code reuse, irrespective of the size of the dataset and the number of clusters. We implement a new feature learning and recognition framework using a multimodal strategy. Our novel idea is to use the HPCC Systems® platform to handle identity recognition with high recognition accuracy rates. For instance, by dividing a face image into several subunits, we can extract intra-region information more precisely. We will discuss this in the next sections.

Methods

In this section, we describe the learning of object representations as well as the recognition framework.
Our framework consists of image reading in the HPCC Systems® platform, feature learning from unlabeled data, feature extraction from labeled data using the learned bases, and classification stages. Our framework is shown in Fig. 15.2. This figure shows the specific framework that we follow for face databases such as Caltech-101, the AR database, and a subset of the wild PubFig83 data with multimedia content. For the Caltech-101 data, we use patches instead of facial regions. We give details of each stage in the following sections.

Image Reading in HPCC Systems Platform

This work is, to the best of our knowledge, the first study on image classification using HPCC Systems®. First, we explain how we integrate databases into HPCC Systems®, in which images are represented as Binary Large OBjects (BLOBs). BLOB support in ECL begins with the DATA value type, which makes it perfect for housing BLOB data. There are essentially three issues around working with this type of data:

(i) How to get the data into the HPCC Systems THOR Cluster (Spraying)
(ii) How to work with the data, once it is in HPCC
(iii) How to get the data back out of the HPCC Systems® THOR Cluster (Despraying)

Fig. 15.2 The framework that we follow for the classification of AR data

The BLOB spray is described in [7, 8]. The image dataset should be sprayed in BLOB format. There are different formats for spraying data, such as delimited for Comma-Separated Values (CSV), fixed for texts, and blob for images. We explored the BLOB spray option, which results in a dataset on the cluster where each record is one of the images in the dataset. Typically, we use a prefix of both the name and the length to define the record structure of the image dataset. Since we use grayscale images, we convert all images to CSV. The following steps are followed to use the image database (a small illustrative sketch of steps (i)-(iii) is given below):

(i) Extract patches or regions from the images
(ii) Normalize the patches
(iii) Convert these patches to CSV
(iv) Spray the CSV to the HPCC Systems® platform

The next section describes how we learn the features from the data.
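As an illustration of steps (i)-(iii) above, the following Python sketch extracts patches from a grayscale image, normalizes them, and writes them to a CSV file ready to be sprayed. It stands in for the actual tooling used with HPCC Systems; the patch size, stride, and file names are assumptions.

```python
import csv
import numpy as np
from PIL import Image  # assumes Pillow is available

PATCH = 16  # 16 x 16 patches, as used for the Caltech-101 experiments

def extract_patches(path, patch=PATCH, stride=PATCH):
    """Cut a grayscale image into flattened patch vectors."""
    img = np.asarray(Image.open(path).convert("L"), dtype=np.float64)
    patches = []
    for r in range(0, img.shape[0] - patch + 1, stride):
        for c in range(0, img.shape[1] - patch + 1, stride):
            patches.append(img[r:r + patch, c:c + patch].ravel())
    return np.array(patches)

def normalize(patches, eps=1e-8):
    """Per-patch normalization to zero mean and unit variance (local Gaussian normalization)."""
    mean = patches.mean(axis=1, keepdims=True)
    std = patches.std(axis=1, keepdims=True)
    return (patches - mean) / (std + eps)

def write_csv(patches, out_path):
    """One patch per row; this CSV is what would be sprayed onto the THOR cluster."""
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(patches.tolist())

# Placeholder file names; spraying the resulting CSV is done with the HPCC tooling.
write_csv(normalize(extract_patches("face_0001.png")), "patches_0001.csv")
```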
Feature Learning

Most applications in image processing involve the use of high-dimensional data. The goal of unsupervised learning is to find a lower dimensional projection of the unlabeled data that preserves the information in the data while reducing redundant dimensions. The problem in unsupervised learning is to find hidden structures in unlabeled data. We implement a multimodal feature learning framework that runs the K-means learning method for each region of the data. Using multimodal learning, we are able to extract representations that capture intra-region changes more precisely. The K-means clustering method obtains specialized bases for the corresponding region of the data. Instead of estimating a single set of centroids for the whole image, feature learning for each divided region deepens the representation and learns more representative information, as we assess in our experimental results. Coates et al. [1] showed that the K-means method can achieve comparable or better results than other possible unsupervised learning methods. In view of this, one objective was to extend the K-means method to a multimodal learning and classification framework in HPCC Systems. The algorithm takes the dataset X and outputs a function f: R^n -> R^k that maps an input vector x(i) to a new feature vector of k features. To extract high-quality features in order to obtain a high classification accuracy, we ran the methods with attention to the key points of this stage, namely (i) a good number of samples, (ii) the choice of parameters, and (iii) the number of bases.

K-means is a partitioning algorithm in which we construct various bases, or centroids, and evaluate them based on specific criteria. It is an unsupervised clustering and partitioning method in which data points are assigned to clusters defined by their centroids, based on their features and their distance from the centroids. The goal is to minimize the sum of squared errors (SSE), as can be seen in Eq. (15.1). The SSE is used to form the partitions; it is the sum of squared differences between each observation and the centroid of its cluster, summed over all k clusters. The SSE strictly decreases after re-computing the new centers in the K-means algorithm. The new center of a cluster is the average of all data points in that cluster, which minimizes the SSE [9]:

    SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \| x - m_i \|^2            (15.1)

where m_i is the mean of the points in C_i and x is a data point in cluster C_i. Given two partitions, we choose the one with the smallest error. Each cluster is represented by its center, the centroid. Points are assigned to the cluster with the nearest centroid. The distance between clusters is based on their centroids:

    dis(K_i, K_j) = dis(C_i, C_j)                                    (15.2)

where K_i and K_j are two groups of points, and C_i and C_j are the corresponding centroids. Given k, the number of clusters, the K-means clustering algorithm is outlined as:

• Select k points as initial centroids
• repeat
    - Form k clusters by assigning each point to its closest centroid
    - Re-compute the centroid of each cluster
• until the convergence criterion is satisfied

In order to specify the best k, we run over a range of values. The computational complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. It is an efficient method since usually k, t << n. The work [10] summarizes recent results and technical points that are needed to make effective use of K-means clustering for learning large-scale representations of images. Figure 15.3 shows, as an example, the centroids learned by K-means implemented in ECL from the AR dataset without whitening.

Fig. 15.3 Selected bases (or centroids) trained on AR images using K-means in HPCC Systems (AR left-eye bases and AR right-eye bases)

Feature Extraction

For each region, we train one K-means algorithm. The learned, specialized bases are able to capture the nonlinear structure of the corresponding image regions. We use these bases for the feature extraction and dimensionality reduction of the labeled data. The new projected data is calculated using the correlation information between the labeled data and the estimated bases, or centroids. Let X_i be any image region and C_i the corresponding bases learned using the K-means method. The feature of the labeled data corresponding to image region i is calculated as Y_i = X_i C_i^T. Then, these extracted features are fused together by concatenating them one by one to get the multimodal representation

    Y = [Y_1, Y_2, ..., Y_M]                                         (15.3)

where M equals the number of image regions (and sometimes the number of image regions plus multimedia data, such as speech, in addition to the image information). The multimodal learning and classification idea improves the recognition rates, as seen in the results. A reason for this is that intra-region information is estimated more precisely when the learning method focuses on specific image regions separately.
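To make the per-region learning and the projection Y_i = X_i C_i^T of Eqs. (15.1)-(15.3) concrete, here is a small NumPy/scikit-learn sketch of the multimodal pipeline. It is an illustrative stand-in for the ECL implementation described in this chapter; the region names, number of centroids, and random data are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans  # assumed available; the chapter's version is written in ECL

K = 32  # number of bases (centroids) per region, as in the Caltech-101 experiments

def learn_bases(unlabeled_by_region, k=K, seed=0):
    """Run one K-means per region; return {region: (k, dim) centroid matrix C_i}."""
    return {
        name: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).cluster_centers_
        for name, X in unlabeled_by_region.items()
    }

def multimodal_features(labeled_by_region, bases):
    """Project each region X_i onto its bases (Y_i = X_i C_i^T) and concatenate: Y = [Y_1, ..., Y_M]."""
    parts = [labeled_by_region[name] @ bases[name].T for name in sorted(bases)]
    return np.hstack(parts)

# Toy example with random data standing in for normalized eye/mouth/nose regions.
rng = np.random.default_rng(0)
unlabeled = {r: rng.normal(size=(500, 64)) for r in ("left_eye", "right_eye", "mouth", "nose")}
labeled = {r: rng.normal(size=(100, 64)) for r in unlabeled}
C = learn_bases(unlabeled)
Y = multimodal_features(labeled, C)
print(Y.shape)  # (100, 4 * K): one concatenated multimodal feature vector per labeled sample
```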
Classification

Classification is a supervised learning process that aims at accurately predicting some value or attribute of an object based on known facts about the object. It involves deriving a rule or model from a training set, which is then used to predict a test set. In machine learning, all classification algorithms follow three logical steps: learning the model from a training set; testing, to obtain measures of how well the classifier fits; and classifying, which involves testing the model on new data in order to compute a classification accuracy. We apply our multimodal object representation learning method to the object classification task. To do this, we train classification methods and configurations in our experiments. Once the framework is trained, it can be used to identify a testing object. The testing data should undergo the same procedure that the training data goes through.

Experiments and Results

In this study, we assess our design on a subset of Caltech-101 [11], on AR [12], and on a subset of the PubFig83 database [13], to which we add speech content in addition to face images. Our goal is to assess our feature learning and classification framework on the HPCC Systems platform. Note that all data in our experiments are locally normalized to have a Gaussian distribution.

Evaluation on Caltech-101

The Caltech-101 database consists of 102 categories. As a subset of Caltech-101, we use 10 classes (which have more than 60 images per class) for both the unsupervised and supervised learning steps. We randomly select up to 60 images per class and pre-process them as in [14]: the images are converted to grayscale, then down-sampled and zero-padded to 143 × 143 pixels. Finally, we normalize the images to have the standard Gaussian distribution. In our study, we assess the performance of our method in HPCC Systems and compare directly with [1]. We run the methods using 3,000 randomly selected patches of 16 × 16 pixels. In the unsupervised learning part, we train on the entire unlabeled training set of images before the classification step. We learn 32 bases in the unsupervised learning of all methods on the two platforms. For the supervised learning, we use 30 training and 30 testing images for each category. To extract features from the labeled training samples, we follow the convolutional extraction process of Coates et al. [1]. We use a stride with 16 × 16 patches to obtain a dense feature extraction. The non-linear mapping transforms the input patches into a new representation with 32 features using the learned bases. Then we use pooling for dimensionality reduction; 132 pooled features are used to train the classifiers. We show the visualizations of the bases (or centroids) learned by K-means in Fig. 15.4. We achieve 83.5 % identification accuracy using the C4.5 decision tree classification method, whereas Coates et al. [1] achieve only an 80.7 % rate using the linear SVM.
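For the supervised stage just described, a sketch of training and testing a decision tree on the pooled features looks roughly as follows. Note that scikit-learn's tree is CART rather than the C4.5 algorithm used in the chapter, and the feature arrays here are random placeholders, so the printed accuracy will be near chance rather than the 83.5 % reported above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Placeholder arrays standing in for the pooled K-means features described above:
# 10 classes, 30 training and 30 testing images per class, 132 pooled features each.
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(300, 132)), rng.normal(size=(300, 132))
y_train = np.repeat(np.arange(10), 30)
y_test = np.repeat(np.arange(10), 30)

# Fit the tree on the training features and report accuracy on the held-out set.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```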
Evaluation on AR Face Database

We also test our proposed idea on the AR [12] face database. The aligned AR database contains 100 subjects (50 men and 50 women), with 26 different images per subject, which totals 2,600 images taken in two sessions. In this database, there are facial expression (neutral, smile, anger, scream), illumination, and occlusion (sunglasses, scarf) challenges. In our study, we use the images without the occlusion challenges, which totals 1,400 images for both the unsupervised learning and classification steps. Figure 15.5 shows some example images from one subject.

First, we learn the bases for each facial region separately. We segment four essential facial regions with sizes of 39 × 51 (left eye and right eye), 30 × 60 (mouth), and 45 × 42 (nose). We believe that better representations are obtained by running unsupervised learning for each region. We also obtain the features of the labeled facial regions using the corresponding learned bases separately. To do this, we calculate the correlation between each labeled sample and each center vector (base) to get a vector of features. We combine the features extracted from the four facial regions (and other possible modalities) and train the classifiers. For the AR database, we follow a scenario described in [15], which reported one of the state-of-the-art recognition rates. Each subject has 14 images with facial expression and illumination changes. Various train-test image partitions are tested.

Fig. 15.4 Selected bases (or centroids) trained on Caltech-101 images using K-means in HPCC Systems (panels show 12 centroids, 12 centroids from another run, and 16 centroids)
Fig. 15.5 Example images from one subject in the AR database with various facial expressions and illumination

We conduct 10 runs of the train-test procedure to get the average recognition rate for each partition. Table 15.1 shows the face classification results of our proposed framework (using the K-means and C4.5 decision tree methods) in HPCC Systems and of [1]. We improve the classification results of [1] by 4.6 %.

Table 15.1 Comparison of face recognition rates on the AR database

Method             Acc (%)
Coates et al. [1]  74.3
Ours               78.9

From this result, we can infer that we learn more representative features and that our classification method is better than [1]. This shows that our framework developed in HPCC Systems achieves at least comparable or better results than its alternative. Our preliminary results show that more research studies on HPCC Systems would be beneficial to the machine learning community.

Identity Recognition on the Wild and Multimedia Database

In recent years, several unconstrained databases have emerged in the literature for face identification or verification. Unlike the traditional face databases, which are composed of images taken in controlled environments, face images in unconstrained databases are generally collected from Internet sources. In particular, these images contain unrestricted varieties of expression, pose, lighting, occlusion, resolution, etc. Therefore, unconstrained face recognition is a very challenging task. We prepare a data set from the aligned version of the wild PubFig83 database [13]. We select 10 subjects, which totals 1,000 face images. Some example images are shown in Fig. 15.6. For the images, we randomly select 50 images per subject as the training set, and the rest of the images are used as the testing set in the supervised learning step. Four essential facial regions are used for facial representation learning. We segment four essential facial regions with sizes of 32 × 52 (left eye and right eye), 48 × 76 (mouth), and 60 × 48 (nose), which are further reduced by half using bicubic interpolation.

Fig. 15.6 Example images of 10 celebrities with various real-world changes in facial expression, pose, illumination, occlusion, resolution, etc.

[...]

... enterprise-class availability.

Fig. 1.1 Big data workflow

Big Data Technologies

Big Data technologies are a new generation of technologies and architectures designed to economically ...
Chapter 1
Introduction to Big Data
Borko Furht and Flavio Villanustre

Concept of Big Data

In this chapter we present the basic terms and concepts in Big Data computing. Big data is a large and complex collection of data ...

Table 1.1 Comparison between traditional and big data
