Big data concepts, theories, and applications

DOCUMENT INFORMATION

Basic information

Format
Pages	440
File size	9.81 MB

Contents

Shui Yu · Song Guo, Editors
Big Data Concepts, Theories, and Applications

Editors
Shui Yu, School of Information Technology, Deakin University, Burwood, VIC, Australia
Song Guo, School of Computer Science and Engineering, The University of Aizu, Aizu-Wakamatsu City, Fukushima, Japan

ISBN 978-3-319-27761-5
ISBN 978-3-319-27763-9 (eBook)
DOI 10.1007/978-3-319-27763-9
Library of Congress Control Number: 2015958772
Springer Cham Heidelberg New York Dordrecht London
© Springer International Publishing Switzerland 2016

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper
Springer International Publishing AG Switzerland is part of Springer Science+Business Media (www.springer.com)

Preface

Big data is one of the hottest research topics in science and technology communities, and it possesses a great
potential in every sector of our society, such as climate, economy, health, social science, and so on. Big data is currently treated as data sets with sizes beyond the ability of commonly used software tools to capture, curate, and manage. We have tasted the power of big data in various applications, such as finance, business, health, and so on. However, big data is still in its infancy, which is evidenced by its vague definition, limited application, unsolved security and privacy barriers for pervasive implementation, and so forth. It is certain that we will face many unprecedented problems and challenges along the way of this unfolding revolutionary chapter of human history.

Big data is driven by applications and aims to obtain knowledge or conclusions directly from big data sets. As an application-oriented field, it inevitably needs to integrate domain knowledge into information systems, much like traditional database systems, which possess a rigorous mathematical foundation, a set of design rules, and implementation mechanisms. We imagine that we may have similar counterparts in big data.

We have witnessed significant development in big data from various communities, such as mining and learning algorithms from the artificial intelligence community, networking facilities from the networking community, and software platforms from the software engineering community. However, big data applications introduce unprecedented challenges to us, and existing theories and techniques have to be extended and upgraded to serve the forthcoming real big data applications. With a high probability, we need to invent new tools for big data applications. With the increasing volume and complexity of big data, theoretical insights have to be employed to achieve the original goal of big data applications. As the foundation of theoretical exploration, constant refinements or adjustments of big data definitions and measurements are necessary and demanded. Ideally, theoretical
calculation and inference will replace the current brute force strategy. We have seen efforts from different communities in this direction, such as big data modeling, big task scheduling, privacy frameworks, and so on. Once again, these theoretical attempts are still insufficient for most of the incoming big data applications.

Motivated by these problems and challenges, we proposed this book, aiming to collect the latest research output in big data from various perspectives. We hope our effort will pave a solid starting ground for researchers and engineers who are going to start their exploration in this almost uncharted land of big data. As a result, the book emphasizes three parts: concepts, theories, and applications.

We received many submissions and finally accepted twelve chapters after a strict selection and revision process. It is regrettable that many good submissions have been excluded due to our theme and space limitations. From our limited statistics, we notice that there is a great interest in the security and application aspects of big data, which reflects the current reality of the domain: big data applications are valuable and expected, and security and privacy issues have to be appropriately handled before the pervasive practice of big data in our society. On the other hand, interest in the theoretical part of big data is not as high as we expected. We fully believe the theoretical effort in big data is essential and highly demanded in problem solving in the big data age, and it is worthwhile to invest our energy and passion in this direction without any reservation.

Finally, we thank all the authors and reviewers of this book for their great effort and cooperation. Many people helped us in this book project; we appreciate their guidance and support. In particular, we would like to take this opportunity to express our sincere appreciation and cherished memory of the late Professor Ivan Stojmenovic, a great mentor and friend. At Springer, we would like to thank Susan
Lagerstrom-Fife and Jennifer Malat for their professional support.

Melbourne, VIC, Australia: Shui Yu
Fukushima, Japan: Song Guo

Contents

1 Big Continuous Data: Dealing with Velocity by Composing Event Streams
  Genoveva Vargas-Solar, Javier A. Espinosa-Oviedo, and José Luis Zechinelli-Martini
2 Big Data Tools and Platforms
  Sourav Mazumder
3 Traffic Identification in Big Internet Data
  Binfeng Wang, Jun Zhang, Zili Zhang, Wei Luo, and Dawen Xia
4 Security Theories and Practices for Big Data
  Lei Xu and Weidong Shi
5 Rapid Screening of Big Data Against Inadvertent Leaks
  Xiaokui Shu, Fang Liu, and Danfeng (Daphne) Yao
6 Big Data Storage Security
  Mi Wen, Shui Yu, Jinguo Li, Hongwei Li, and Kejie Lu
7 Cyber Attacks on MapReduce Computation Time in a Hadoop Cluster
  William Glenn and Wei Yu
8 Security and Privacy for Big Data
  Shuyu Li and Jerry Gao
9 Big Data Applications in Engineering and Science
  Kok-Leong Ong, Daswin De Silva, Yee Ling Boo, Ee Hui Lim, Frank Bodi, Damminda Alahakoon, and Simone Leao
10 Geospatial Big Data for Environmental and Agricultural Applications
  Athanasios Karmas, Angelos Tzotsos, and Konstantinos Karantzalos
11 Big Data in Finance
  Bin Fang and Peng Zhang
12 Big Data Applications in Business Analysis
  Sien Chen, Yinghua Huang, and Wenqiang Huang

Chapter 1
Big Continuous Data: Dealing with Velocity by Composing Event Streams

Genoveva Vargas-Solar, Javier A. Espinosa-Oviedo, and José Luis Zechinelli-Martini

Abstract The rate at which we produce data is growing steadily, thus creating even larger streams of continuously evolving data. Online news, micro-blogs, and search queries are just a few examples of these continuous streams of user activities. The value of these streams lies in their freshness and relatedness to on-going events. Modern applications consuming these streams need to extract behaviour patterns that can be obtained by aggregating and mining statically and dynamically huge event histories. An event is the
notification that a happening of interest has occurred. Event streams must be combined or aggregated to produce more meaningful information. By combining and aggregating them either from multiple producers, or from a single one during a given period of time, a limited set of events describing meaningful situations may be notified to consumers. Event streams, with their volume and continuous production, mainly concern two of the characteristics given to Big Data by the 5V’s model: volume and velocity. Techniques such as complex pattern detection, event correlation, event aggregation, event mining and stream processing have been used for composing events. Nevertheless, to the best of our knowledge, few approaches integrate different composition techniques (online and post-mortem) for dealing with Big Data velocity. This chapter gives an analytical overview of event stream processing and composition approaches: complex event languages, services and event querying systems on distributed logs. Our analysis underlines the challenges introduced by Big Data velocity and volume and uses them as a reference for identifying the scope and limitations of results stemming from different disciplines: networks, distributed systems, stream databases, event composition services, and data mining on traces.

G. Vargas-Solar • J.A. Espinosa-Oviedo
CNRS-LIG-LAFMIA, 681 rue de la Passerelle BP 72, Saint Martin d’Hères, 38402 Grenoble, France
e-mail: genoveva.vargas@imag.fr; javier.espinosa@imag.fr

J.L. Zechinelli-Martini
UDLAP-LAFMIA, Exhacienda Sta. Catarina Mártir s/n, San Andrés Cholula, 72810 Cholula, Mexico
e-mail: joseluis.zechinelli@udlap.mx

© Springer International Publishing Switzerland 2016
S. Yu, S. Guo (eds.), Big Data Concepts, Theories, and Applications, DOI 10.1007/978-3-319-27763-9_1

12 Big Data Applications in Business Analysis

• Passengers Information (Submitted information of passengers ...)
• Orders (Order information and the information of the finalized flight ...)
12.3.3.3 Statistical Analysis

The following rules are made for every single passenger, and all the specifics will be decided according to the practical application context:

• How many times does the passenger log in before the ticket purchase on average (in a certain time)?
• How many pages does the passenger browse before the purchase on average (taking every page view as a unit)?
• How many times did the passenger click advertisements before the ticket purchase?
• How many advertisement hits were there after the ticket purchase?
• Distribution of access sources (Baidu, Google, direct access, etc.)
• Distribution of passenger flight inquiries (in a certain time)

12.3.4 Display and Application of WebTrends Event Flow

The access event flow on WebTrends is also included in the passengers’ sequential analysis, so that all customers’ event behaviors can be integrated and displayed. The event sequence diagram is based on the different event types of each passenger, such as login, inquiry, order and their detailed information. The building methods are as follows:

• Count the customer events, and show them and their frequency in time sequence.
• After a click on “recent events” or “all events”, the detailed information for a specific period can be seen, such as flight number, departure time, arrival city, and so on.

Figures 12.3 and 12.4 show examples of customer events.

[Fig. 12.3 Passenger event overview: purchases and page views per recent time window (days, 15 days, months) and for all events; current time 2014-10-27]

Fig. 12.4 Passenger event details
Occurring Time | Event Details
2013-03-10 10:12:31 | WeChat check-in
2013-01-14 09:33:23 | Log in member card 692212812028 through B2C website
2013-01-11 09:33:17 | Transfer service
2013-01-11 09:33:44 | Buffet service
2013-01-08 09:33:41 | Self-help luggage service
2013-01-07 09:33:24 | Counter check-in service
2013-01-04 09:33:18 | iPad service
2012-12-25 09:33:50 | Book flight 20130117 TV5510 ZSSS-ZGGG through B2C website
2012-12-25 09:33:36 | Check flight 20130117 TV5510 ZSSS-ZGGG through B2C website
2012-11-27 09:33:41 | Click flight 20130117 TV5510 ZSSS-ZGGG through B2C website

12.3.5 Customer Activity Analysis Using Pareto/NBD Model

The Pareto/NBD model was originally proposed by Schmittlein et al. [7]. This model calculates the customer activity level and predicts future transactions based on purchasing behavior. The model assumes that customers make purchases at any time with a steady rate for a period of time, and then they may drop out. The mathematical assumptions of this model are listed below [7]:

• While active, the number of repeat purchases made by a customer follows a Poisson distribution.
• The transaction rate across customers follows a gamma distribution $(r, \alpha)$; $r$ is the shape parameter and $\alpha$ denotes the scale parameter.
• The time until a customer drops out follows an exponential distribution, governed by a customer-specific dropout rate.
• Heterogeneity in dropout rates across customers follows a gamma distribution $(s, \beta)$; $s$ is the shape parameter and $\beta$ denotes the scale parameter.
• The transaction rate and the dropout rate are independent across customers.

The Pareto/NBD model requires only each customer’s past purchasing history: “recency” (last transaction time) and “frequency” (how many transactions in a specified time period). The information can be described as $(X = x, t, T)$, where $x$ is the number of transactions observed in the time period $(0, T]$ and $t$ is the time of the last transaction. With these two key summary statistics, the Pareto/NBD model can derive P(active), the probability of observing $x$ transactions in a time period of length $t$, and $E[Y(t) \mid X = x, t, T]$, the expected
number of transactions in the period $(T, T + t]$ for an individual with observed $(X = x, t, T)$ [8].

With passenger activity and other conditions, airlines could analyze the factors that influence the degree of activity, which could be used to improve passenger activity. Three pieces of information were selected from the database in which the large volume of passenger records is stored:

• ORD_FAR.Far_Idnum: customer ID
• ORD.Ord_Bok_Time: booking time
• ORD_CAS.CASH_TOTAL_TICKETPRICE: ticket price

From the database provided by China Southern Airlines, we put the passenger data from 2013-01-01 to 2013-06-30 into the Pareto/NBD model and forecast the number of purchases of each passenger in July and August 2013. The Pareto/NBD model was implemented in the R language. Figure 12.5 shows the density distribution of the passengers’ activity. We find that the activity of most passengers lies between 0.1 and 0.2. Table 12.3 lists the ranges of the passengers’ activity. The total number of passengers is 202,370.

Based on the passengers’ activity and the predicted number of flights, airlines could make more effective marketing strategies for their customers. This customer activity analysis has several important implications:

• Customer segmentation could be done based on the passengers’ activity degree. For example, customers could be divided into highly active, active and inactive. Then, airlines can carry out targeted management.
• With the average amount spent by a passenger and the predicted number of flights, airlines could calculate the revenue this passenger would bring to them and predict future returns.
• Combining passenger activity degree with life cycle length, airlines can calculate and estimate the customer lifetime value to allocate marketing resources and provide the basis for effective customer management.

[Fig. 12.5 Density distribution of the passengers’ activity: kernel density of P(active) on the interval 0.0 to 1.0, N = 202370, bandwidth = 0.01493]

Table 12.3 The scope of activity and corresponding number of passengers
The scope of activity P(Active) | Number of passengers
[0,0.1] | 8004
(0.1,0.2] | 96,269
(0.2,0.3] | 31,634
(0.3,0.4] | 19,538
(0.4,0.5] | 10,788
(0.5,0.6] | 8337
(0.6,0.7] | 5990
(0.7,0.8] | 5722
(0.8,0.9] | 6562
(0.9,1] | 9526

12.3.6 Customer Segmentation by Clustering Analysis

Clustering analysis can be applied to segment airline passengers and explore their purchase behavior [9]. In this case of China Southern Airlines, the company website has around 6.2 million effective booking records, with passengers involved reaching the number of nearly million in the past years. A sample data set is retrieved from the company website, which includes 2.5 million booking records.

First, 12 variables of airline passenger behavior are selected for principal component analysis (PCA). Table 12.4 shows the 12 variables involved. Using the PCA method, the 12 analysis variables were integrated and transformed into nine principal component factors. The accumulative contribution of the nine extracted factors is over 0.98, indicating that these factors carry more than 98 % of the information provided by the original 12 variables. Among the original variables, total mileage is affected by two indicators: frequency and average flight distance. Average ticket price is affected by average discount and average flight distance, and sum of consumption is affected by frequency, average discount and average flight distance, so the PCA reveals that they can’t act as separate factors. The result of the PCA is shown in Table 12.5. Therefore, only nine factors were selected for further cluster analysis.

Then, K-means analysis was conducted to explore passenger groups. Iterative experiments were used to select the combination of cluster number and random seed that yields the best grouping result. The iterative experiments generate the best grouping number, and eight passenger groups with typical purchase characteristics are
identified. The groups are described and labeled in Table 12.6.

Table 12.4 Airline passenger lifetime value and purchase behavior
Characteristics | Indicators | Descriptions
Passenger lifetime value characteristics | Number of booking legs | Number of take-off and landing city pairs for client bookings
Passenger lifetime value characteristics | Sum of consumption | Gross purchase sum
Passenger lifetime value characteristics | Average ticket price paid | Quotient of purchase sum and number of flights
Passenger lifetime value characteristics | Average discount | Price published for each city pair
Passenger lifetime value characteristics | Number of days as of the last booking up to today | Difference between the last booking date and analysis date
Passenger lifetime value characteristics | Total flight mileage | Sum of flight mileage of each city pair
Passenger lifetime value characteristics | Average flight mileage | Quotient of sum of flight mileage and number of flights
Behavior characteristics | Average number of days for booking upfront | Average of difference between purchase date and flight date
Behavior characteristics | Average booking time | Average of each purchase time point
Behavior characteristics | Rate of weekend flight | Quotient of number of weekend flights and total number of flights
Behavior characteristics | Rate of holiday flight | Inclusive of certain days before and after the holiday
Behavior characteristics | Rate of flight for an international expo | Destination being Guangzhou in period of outward voyage, while departure from Guangzhou in the period of back trip

Table 12.5 The result of PCA analysis
Factors | Eigenvalue | Contribution | Accumulative contribution
F1: Frequency | 2.89 | 0.241 | 0.241
F2: Average flight distance | 1.89 | 0.157 | 0.398
F3: Average discount | 1.46 | 0.122 | 0.520
F4: Advance booking | 1.20 | 0.099 | 0.619
F5: Last flight trip | 1.00 | 0.083 | 0.702
F6: Holiday flight trips | 0.99 | 0.082 | 0.784
F7: Booking period | 0.98 | 0.082 | 0.866
F8: Weekend flight trips | 0.80 | 0.067 | 0.933
F9: Flight trips to an international expo | 0.57 | 0.048 | 0.981
F10 | 0.10 | 0.008 | 0.989
F11 | 0.07 | 0.006 | 0.995
F12 | 0.03 | 0.005 | 1.000

Table 12.6 Passenger clusters
Clustering | Label | Advantage characteristics | Disadvantage characteristics
Group 1 | Ordinary business client | None | Few number of booking, lower rate of flight trips on weekends
Group 2 | Happy flight - not lost | Big number of days for advance booking | Few number of booking, the lowest discount
Group 3 | Expo event | Higher rate of flight for an expo event | The smallest group
Group 4 | Occasional high-end flight | High average ticket price, high average price | Few number of booking
Group 5 | The masses - flight trips on weekends | Higher rate of flight on weekends | None
Group 6 | Happy flight - already lost | None | Few number of booking, the lowest discount, the longest time interval since the last booking up to now
Group 7 | The masses - flight trips on holiday | Higher rate of flight on holiday | None
Group 8 | High-end, flight often | Big number of booking, high average price, big sum of consumption | None

Next, by comparing the mean values of group characteristics, we can identify the advantages and disadvantages of targeting different passenger groups. The following five findings would be useful for the airline company’s business decisions:

• Group 1, Group 5 and Group 7 have fewer bookings, with a middle level of ticket price, and are supposed to be the ordinary mass groups; the difference among the three groups lies in the rate of flights on weekends, holidays and workdays.
• Group 2 and Group 6 are groups purchasing discounted tickets; the difference is that Group 2 is still active, while Group 6 is basically lost already. Group bears similar characteristics with A-type group supposedly.
• Group 3 are travelers who flow in the same direction as those attending an expo, so we infer that quite many of them are participants of the event.
• Group 4 have fewer bookings, but at a higher price and with a small number of days for advance booking, suggesting this is a group with occasional travel needs that seldom pays attention to prices, so it could be a high-end group.
• Group 8 have a large number of bookings, with a high average price, and are indeed a high-end group who fly often. This high-end group are in pursuit of trends, and enjoy new technology while traveling. We can see this high-end group tend to handle
special check-in, showing an obviously higher rate than other groups, at 40 %, especially online check-in and SMS check-in.

12.3.7 Recency-Frequency-Monetary (RFM) Analysis

The Recency-Frequency-Monetary method is considered one of the most powerful and useful models for implementing consumer relationship management. Bult and Wansbeek [10] defined the variables as: (1) R (Recency): the period since the last purchase; a lower value corresponds to a higher probability of the customer’s making a repeat purchase; (2) F (Frequency): the number of purchases made within a certain period; higher frequency indicates greater loyalty; (3) M (Monetary): the money spent during a certain period; a higher value means that the company should focus more on that customer [9].

This case study adopted an extended RFM model to analyze airline passenger behavior. The extended RFM model incorporates the average discount factor as an additional variable, because it is an important indicator of the price level of a passenger’s airline purchases. The average discount factor defined here is the ratio of the purchase price to the published price of the airplane seat. Therefore, the extended RFM model involves four variables: number of days from the last order date to modeling (R), number of flight trips (F), sum of consumption (M), and average discount factor (D). In this way, the data are consolidated per traveler ID. Principal component analysis was used to score individual travelers based on the RFMD variables, and 16 consumer groups were identified. The findings could help marketers to recognize the most valuable consumers and establish profitable consumer relationships. The procedure of the RFM analysis is described below.

12.3.7.1 Exploratory Data Analysis

This step involves taking a closer look at the data available for investigation. Exploratory data analysis consists of data description and verifying the quality of data from the
airline company’s databases. Tables 12.7 and 12.8 provide a general understanding of the passenger data set. Table 12.7 reveals that the difference between the maximum and the minimum is huge for two variables: number of flight trips and sum of consumption. The data distribution plot also indicates that the original data is heavily right-skewed. Therefore, using the original data directly in our modeling would be problematic. To fix this, a logarithmic transformation is applied to number of flight trips, sum of consumption and average discount factor. We also take the opposite number (negation) of the number of days from the last order date to the modeling date, and then standardize the data to remove the influence of the variables’ different scales.

Table 12.7 Descriptive data analysis of RFMD variables
Modeling variables | N | Mean | SD | Minimum | Maximum
R: Number of days from the last order date to modeling | 1,624,293 | 188.34 | 175.07 | | 730
F: Number of flight trips | 1,624,293 | 2.25 | 2.11 | | 128
M: Sum of consumption | 1,624,293 | 1729 | 2062 | 18 | 173,190
D: Average discount factor | 1,624,293 | 0.62 | 0.22 | 0.02 | 3.4

Table 12.8 Correlation matrix of RFMD variables
R: Number of days from the last order date to modeling; F: Number of flight trips
R F 0.006 0.006
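
The preprocessing described in Sect. 12.3.7.1 (logarithm of F, M and D, negation of R, then standardization) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors’ implementation: the function name preprocess_rfmd, the tuple layout, the choice of the population standard deviation, and the sample passengers are all hypothetical and were chosen only for the example.

```python
import math
import statistics


def preprocess_rfmd(records):
    """Prepare (R, F, M, D) tuples for modeling.

    Steps, following the text: log-transform F, M and D; negate R so that
    a more recent purchase gives a larger value; z-score every column.
    Assumes F, M and D are strictly positive (needed for the logarithm).
    """
    transformed = [
        (-float(r), math.log(f), math.log(m), math.log(d))
        for r, f, m, d in records
    ]
    columns = list(zip(*transformed))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation is an assumption; the chapter does not
    # say which estimator was used.
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple(
            (value - mu) / sd if sd > 0 else 0.0  # constant column -> 0.0
            for value, mu, sd in zip(row, means, stdevs)
        )
        for row in transformed
    ]


# Hypothetical passengers:
# (days since last order, flight trips, sum of consumption, discount factor)
sample = [
    (30, 1, 800.0, 0.45),
    (200, 2, 1500.0, 0.60),
    (400, 12, 20000.0, 0.90),
    (700, 1, 500.0, 0.30),
]
scaled = preprocess_rfmd(sample)
for row in scaled:
    print(tuple(round(v, 3) for v in row))
```

After this step every column has mean 0 and unit standard deviation, so the heavy right skew of F and M no longer dominates a later scoring or clustering step, and a larger standardized R consistently means a more recent purchase.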

Date posted: 04/03/2019, 10:05