
6462 getting started with UDK




DOCUMENT INFORMATION

Basic information

Format:
Number of pages: 208
File size: 2.81 MB

Content

Advances in Computer Vision and Pattern Recognition

T. Ravindra Babu, M. Narasimha Murty, S.V. Subrahmanya

Compression Schemes for Mining Large Datasets: A Machine Learning Perspective

Advances in Computer Vision and Pattern Recognition. For further volumes: www.springer.com/series/4205

T. Ravindra Babu, Infosys Technologies Ltd, Bangalore, India
S.V. Subrahmanya, Infosys Technologies Ltd, Bangalore, India
M. Narasimha Murty, Indian Institute of Science, Bangalore, India

Series Editors: Prof. Sameer Singh, Rail Vision Europe Ltd, Castle Donington, Leicestershire, UK; Dr. Sing Bing Kang, Interactive Visual Media Group, Microsoft Research, Redmond, WA, USA

ISSN 2191-6586, ISSN 2191-6594 (electronic)
Advances in Computer Vision and Pattern Recognition
ISBN 978-1-4471-5606-2, ISBN 978-1-4471-5607-9 (eBook)
DOI 10.1007/978-1-4471-5607-9
Springer London Heidelberg New York Dordrecht
Library of Congress Control Number: 2013954523

© Springer-Verlag London 2013. This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Printed on acid-free paper. Springer is part of Springer Science+Business Media (www.springer.com).

Preface

We come across a number of celebrated textbooks on Data Mining covering multiple aspects of the topic since its early development, such as those on databases, pattern recognition, soft computing, etc. We did not find any consolidated work on data mining in the compression domain, and the book took shape from this realization. Our work relates to this area of data mining with a focus on compaction. We present schemes that work in the compression domain and demonstrate their working on one or more practical datasets in each case. In this process, we cover important data mining paradigms. This is intended to provide a practitioner's view of compression schemes in data mining.
The work presented is based on the authors' work on related areas over the last few years. We organized each chapter to contain context setting, background work as part of the discussion, the proposed algorithm and scheme, implementation intricacies, experimentation by implementing the scheme on a large dataset, and a discussion of results. At the end of each chapter, as part of the bibliographic notes, we discuss relevant literature and directions for further study.

Data Mining focuses on efficient algorithms to generate abstraction from large datasets. The objective of these algorithms is to find interesting patterns for further use with the least number of visits of the entire dataset, the ideal being a single visit. Similarly, since the data sizes are large, effort is made in arriving at a much smaller subset of the original dataset that is representative of the entire data and contains attributes characterizing the data. The ability to generate an abstraction from a small representative set of patterns and features that is as accurate as that obtained with the entire dataset leads to efficiency in terms of both space and time. Important data mining paradigms include clustering, classification, association rule mining, etc. We present a discussion on data mining paradigms in Chap. 2.

In our present work, in addition to the data mining paradigms discussed in Chap. 2, we also focus on another paradigm, viz., the ability to generate abstraction in the compressed domain without having to decompress. Such a compression would lead to less storage and improve the computation cost. In the book, we consider both lossy and nonlossy compression schemes. In Chap. 3, we present a nonlossy compression scheme based on run-length encoding of patterns with binary-valued features. The scheme is also applicable to floating-point-valued features that are suitably quantized to binary values. The chapter presents an algorithm that computes the dissimilarity in the compressed domain directly. Theoretical notes are provided for the work. We present applications of the scheme in multiple domains.

It is interesting to explore whether, when one is prepared to lose some part of the pattern representation, we obtain better generalization and compaction. We examine this aspect in Chap. 4. The work in the chapter exploits the concept of minimum feature or item support. The concept of support relates to the conventional association rule framework. We consider patterns as sequences, form subsequences of short length, and identify and eliminate repeating subsequences. We represent the pattern by those unique subsequences, leading to significant compaction. Such unique subsequences are further reduced by replacing less frequent unique subsequences by more frequent ones, thereby achieving further compaction. We demonstrate the working of the scheme on large handwritten digit data.

Pattern clustering can be construed as compaction of data. Feature selection also reduces dimensionality, thereby resulting in pattern compression. It is interesting to explore whether they can be achieved simultaneously. We examine this in Chap. 5. We consider an efficient clustering scheme that requires a single database visit to generate prototypes, and a lossy compression scheme for feature reduction. We also examine whether there is a preference in sequencing prototype selection and feature selection in achieving compaction, as well as good classification accuracy on unseen patterns, and we examine multiple combinations of such sequencing. We demonstrate the working of the scheme on handwritten digit data and intrusion detection data.
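Chapter 3's central idea, as summarized above, is to compute dissimilarity between run-length-encoded binary patterns without decompressing them. The chapter's own algorithm is not part of this preview, so the following Python sketch only illustrates the general technique under our own conventions (the function names and the convention that the first run counts 0s are assumptions, not the book's): for binary patterns, the count of differing positions (Hamming distance) equals the Manhattan distance, and it can be accumulated by walking the two run lists in parallel.

```python
# A minimal sketch (not the authors' exact algorithm): run-length encoding of a
# binary vector and Hamming/Manhattan distance computed directly on the runs,
# i.e., without reconstructing the original vectors.

def rle_encode(bits):
    """Encode a binary sequence as alternating run lengths.

    By convention the first run counts 0s, so a sequence starting with 1
    gets a leading run of length 0.
    """
    runs, current, count = [], 0, 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append(count)
            current, count = b, 1
    runs.append(count)
    return runs

def rle_hamming(runs_a, runs_b):
    """Hamming distance between two equal-length binary vectors given only
    their run-length encodings; for binary data this equals the Manhattan
    distance."""
    ia = ib = 0                       # index of the current run in each encoding
    rem_a, rem_b = runs_a[0], runs_b[0]
    bit_a = bit_b = 0                 # first run encodes 0s by convention
    dist = 0
    while ia < len(runs_a) and ib < len(runs_b):
        chunk = min(rem_a, rem_b)     # overlap of the two current runs
        if bit_a != bit_b:
            dist += chunk
        rem_a -= chunk
        rem_b -= chunk
        if rem_a == 0:                # move to the next run of a
            ia += 1
            if ia < len(runs_a):
                rem_a, bit_a = runs_a[ia], 1 - bit_a
        if rem_b == 0:                # move to the next run of b
            ib += 1
            if ib < len(runs_b):
                rem_b, bit_b = runs_b[ib], 1 - bit_b
    return dist

if __name__ == "__main__":
    x = [0, 0, 1, 1, 1, 0, 1, 0]
    y = [0, 1, 1, 0, 1, 0, 1, 1]
    assert rle_hamming(rle_encode(x), rle_encode(y)) == sum(a != b for a, b in zip(x, y))
```

The walk touches each run once, so the cost depends on the number of runs rather than on the uncompressed length, which is where the saving over decompress-then-compare comes from.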
Domain knowledge forms an important input for efficient compaction. Such knowledge could either be provided by a human expert or generated through an appropriate preliminary statistical analysis. In Chap. 6, we exploit domain knowledge obtained both by expert inference and through statistical analysis and classify 10-class data through a proposed decision tree. We make use of 2-class classifiers, AdaBoost and Support Vector Machine, to demonstrate the working of such a scheme.

Dimensionality reduction leads to compaction. With algorithms such as run-length-encoded compression, it is educative to study whether one can achieve efficiency in obtaining an optimal feature set that provides high classification accuracy. In Chap. 7, we discuss concepts and methods of feature selection and extraction. We propose an efficient implementation of simple genetic algorithms by integrating compressed-data classification and frequent features. We provide an insightful discussion on the sensitivity of various genetic operators and of frequent-item support on the final selection of the optimal feature set.
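Chapter 7, as summarized in the paragraph above, couples simple genetic algorithms with compressed-data classification to search for an optimal feature subset. The chapter itself is not part of this preview, so the sketch below shows only the generic SGA machinery (fitness-proportional selection, single-point crossover, bit-flip mutation) over a binary feature-subset chromosome; the fitness function is a synthetic stand-in for the book's compressed-domain classification accuracy, and every name and parameter value here is an assumption.

```python
# A compact simple-genetic-algorithm (SGA) sketch for feature-subset selection.
# Illustration of the general technique only; the fitness below is a toy
# stand-in, not the book's compressed-data classifier.

import random

NUM_FEATURES = 20
POP_SIZE     = 30
GENERATIONS  = 40
P_CROSSOVER  = 0.8
P_MUTATION   = 0.02

# Pretend features 0, 3, 7, 11 and 15 are the truly useful ones.
RELEVANT = {0, 3, 7, 11, 15}

def fitness(chromosome):
    """Toy fitness: reward selecting relevant features, penalize subset size."""
    selected = {i for i, bit in enumerate(chromosome) if bit}
    return len(selected & RELEVANT) - 0.1 * len(selected)

def roulette_select(population, scores):
    """Fitness-proportional selection (scores shifted to be positive)."""
    low = min(scores)
    weights = [s - low + 1e-6 for s in scores]
    return random.choices(population, weights=weights, k=1)[0]

def crossover(a, b):
    """Single-point crossover with probability P_CROSSOVER."""
    if random.random() < P_CROSSOVER:
        point = random.randint(1, NUM_FEATURES - 1)
        return a[:point] + b[point:], b[:point] + a[point:]
    return a[:], b[:]

def mutate(chromosome):
    """Flip each bit independently with probability P_MUTATION."""
    return [1 - bit if random.random() < P_MUTATION else bit for bit in chromosome]

def run_sga():
    population = [[random.randint(0, 1) for _ in range(NUM_FEATURES)]
                  for _ in range(POP_SIZE)]
    best = max(population, key=fitness)
    for _ in range(GENERATIONS):
        scores = [fitness(c) for c in population]
        nxt = []
        while len(nxt) < POP_SIZE:
            p1 = roulette_select(population, scores)
            p2 = roulette_select(population, scores)
            c1, c2 = crossover(p1, p2)
            nxt.extend([mutate(c1), mutate(c2)])
        population = nxt[:POP_SIZE]
        best = max(population + [best], key=fitness)
    return [i for i, bit in enumerate(best) if bit]

if __name__ == "__main__":
    print("selected features:", run_sga())
```

In a wrapper setting, the fitness function would instead train and validate a classifier on the selected features, with crossover and mutation probabilities tuned along the lines of the sensitivity discussion the preface mentions.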
Divide-and-conquer has been one important direction to deal with large datasets. With the reducing cost and increasing ability to collect and store enormous amounts of data, we have massive databases at our disposal for making sense out of them and generating abstraction that could be of potential business exploitation. The term Big Data has become synonymous with streaming multisource data such as numerical data, messages, and audio and video data. There is an increasing need to process such data in real or near-real time and generate business value in this process. In Chap. 8, we propose schemes that exploit multiagent systems to solve these problems. We discuss concepts of big data, MapReduce, PageRank, agents, and multiagent systems before proposing multiagent systems to solve big data problems.

The authors would like to express their sincere gratitude to their respective families for their cooperation. T. Ravindra Babu and S.V. Subrahmanya are grateful to Infosys Limited for providing an excellent research environment in the Education and Research Unit (E&R) that enabled them to carry out academic and applied research resulting in articles and books. T. Ravindra Babu would like to express his sincere thanks to his family members Padma, Ramya, Kishore, and Rahul for their encouragement and support; he dedicates his contribution to the work to the fond memory of his parents Butchiramaiah and Ramasitamma. M. Narasimha Murty would like to acknowledge the support of his parents. S.V. Subrahmanya would like to thank his wife D.R. Sudha for her patient support. The authors would like to record their sincere appreciation for the Springer team, Wayne Wheeler and Simon Rees, for their support and encouragement.

Bangalore, India
T. Ravindra Babu
M. Narasimha Murty
S.V. Subrahmanya

Contents

1 Introduction
  1.1 Data Mining and Data Compression
    1.1.1 Data Mining Tasks
    1.1.2 Data Compression
    1.1.3 Compression Using Data Mining Tasks
  1.2 Organization
    1.2.1 Data Mining Tasks
    1.2.2 Abstraction in Nonlossy Compression Domain
    1.2.3 Lossy Compression Scheme and Dimensionality Reduction
    1.2.4 Compaction Through Simultaneous Prototype and Feature Selection
    1.2.5 Use of Domain Knowledge in Data Compaction
    1.2.6 Compression Through Dimensionality Reduction
    1.2.7 Big Data, Multiagent Systems, and Abstraction
  1.3 Summary
  1.4 Bibliographical Notes
  References

2 Data Mining Paradigms
  2.1 Introduction
  2.2 Clustering
    2.2.1 Clustering Algorithms
    2.2.2 Single-Link Algorithm
    2.2.3 k-Means Algorithm
  2.3 Classification
  2.4 Association Rule Mining
    2.4.1 Frequent Itemsets
    2.4.2 Association Rules
  2.5 Mining Large Datasets
    2.5.1 Possible Solutions
    2.5.2 Clustering
    2.5.3 Classification
    2.5.4 Frequent Itemset Mining
  2.6 Summary
  2.7 Bibliographic Notes
  References

3 Run-Length-Encoded Compression Scheme
  3.1 Introduction
  3.2 Compression Domain for Large Datasets
  3.3 Run-Length-Encoded Compression Scheme
    3.3.1 Discussion on Relevant Terms
    3.3.2 Important Properties and Algorithm
  3.4 Experimental Results
    3.4.1 Application to Handwritten Digit Data
    3.4.2 Application to Genetic Algorithms
    3.4.3 Some Applicable Scenarios in Data Mining
  3.5 Invariance of VC Dimension in the Original and the Compressed Forms
  3.6 Minimum Description Length
  3.7 Summary
  3.8 Bibliographic Notes
  References

4 Dimensionality Reduction by Subsequence Pruning
  4.1 Introduction
  4.2 Lossy Data Compression for Clustering and Classification
  4.3 Background and Terminology
  4.4 Preliminary Data Analysis
    4.4.1 Huffman Coding and Lossy Compression
    4.4.2 Analysis of Subsequences and Their Frequency in a Class
  4.5 Proposed Scheme
    4.5.1 Initialization
    4.5.2 Frequent Item Generation
    4.5.3 Generation of Coded Training Data
    4.5.4 Subsequence Identification and Frequency Computation
    4.5.5 Pruning of Subsequences
    4.5.6 Generation of Encoded Test Data
    4.5.7 Classification Using Dissimilarity Based on Rough Set Concept
    4.5.8 Classification Using k-Nearest Neighbor Classifier
  4.6 Implementation of the Proposed Scheme
    4.6.1 Choice of Parameters
    4.6.2 Frequent Items and Subsequences
    4.6.3 Compressed Data and Pruning of Subsequences
    4.6.4 Generation of Compressed Training and Test Data
  4.7 Experimental Results
  4.8 Summary
  4.9 Bibliographic Notes
  References

5 Data Compaction Through Simultaneous Selection of Prototypes and Features
  5.1 Introduction
  5.2 Prototype Selection, Feature Selection, and Data Compaction
    5.2.1 Data Compression Through Prototype and Feature Selection
  5.3 Background Material
    5.3.1 Computation of Frequent Features
    5.3.2 Distinct Subsequences
    5.3.3 Impact of Support on Distinct Subsequences
    5.3.4 Computation of Leaders
    5.3.5 Classification of Validation Data
  5.4 Preliminary Analysis
  5.5 Proposed Approaches
    5.5.1 Patterns with Frequent Items Only
    5.5.2 Cluster Representatives Only
    5.5.3 Frequent Items Followed by Clustering
    5.5.4 Clustering Followed by Frequent Items
  5.6 Implementation and Experimentation
    5.6.1 Handwritten Digit Data
    5.6.2 Intrusion Detection Data
    5.6.3 Simultaneous Selection of Patterns and Features
  5.7 Summary
  5.8 Bibliographic Notes
  References

6 Domain Knowledge-Based Compaction
  6.1 Introduction
  6.2 Multicategory Classification
  6.3 Support Vector Machine (SVM)
  6.4 Adaptive Boosting
    6.4.1 Adaptive Boosting on Prototypes for Data Mining Applications
  6.5 Decision Trees
  6.6 Preliminary Analysis Leading to Domain Knowledge
    6.6.1 Analytical View
    6.6.2 Numerical Analysis
    6.6.3 Confusion Matrix

8.6 Proposed Multiagent Systems

Fig. 8.4 Multiagent system for data access and preprocessing.

The objective is to provide a framework where different streams of data are accessed and preprocessed by autonomous agents, which also cooperate with fellow agents in generating integrated data. The data thus provided is further processed to make it amenable for application of data mining algorithms. Although in the figure the horizontal arrows indicate exchange of information between agents adjacent to each other, the exchange happens among all the agents; they are depicted as shown for brevity. The preprocessed information is thus aggregated by another agent, which makes it amenable for further processing.
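The excerpt describes Fig. 8.4 only in prose, so the following Python sketch is a loose, illustrative rendering of that pipeline rather than the book's design: each source agent acquires and preprocesses its own stream, and an aggregator agent integrates the partial results into a single dataset ready for mining. All class and source names here are invented for the example.

```python
# A minimal, illustrative sketch (not the book's implementation) of the idea in
# Fig. 8.4: independent source agents each acquire and preprocess one stream,
# and an aggregator agent integrates their outputs for downstream mining.
# The data sources are in-memory stand-ins.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SourceAgent:
    name: str
    acquire: Callable[[], List[str]]        # pulls raw records from its stream
    preprocess: Callable[[str], str]        # cleans/normalizes one record

    def run(self) -> Dict[str, List[str]]:
        cleaned = [self.preprocess(r) for r in self.acquire()]
        return {self.name: cleaned}

class AggregatorAgent:
    """Integrates the per-source outputs into one dataset for mining."""
    def integrate(self, partial_results: List[Dict[str, List[str]]]) -> List[str]:
        merged: List[str] = []
        for part in partial_results:
            for source, records in part.items():
                merged.extend(f"{source}:{rec}" for rec in records)
        return merged

if __name__ == "__main__":
    agents = [
        SourceAgent("logs",    lambda: [" GET /a ", "GET /b"], str.strip),
        SourceAgent("sensors", lambda: ["23.5", "24.1"],       lambda r: r + " C"),
    ]
    aggregator = AggregatorAgent()
    dataset = aggregator.integrate([a.run() for a in agents])
    print(dataset)   # ['logs:GET /a', 'logs:GET /b', 'sensors:23.5 C', 'sensors:24.1 C']
```

A fuller variant would run the source agents concurrently (threads, processes, or an actor framework) and let them exchange intermediate results directly, which is the cooperation the excerpt notes is only partially drawn in the figure.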
8.6.4 Multiagent System for Agile Processing

The proposed system for agile processing is part of the Data Mining process of Fig. 8.3. The system corresponds to the velocity part of Big Data. The processing in big data can be real-time, near-real-time, or batch processing. We briefly discuss some of the options for such processing. The need for agility is emphasized in view of large volumes of data, where conventional schemes may not provide the insights at such speeds. The following are some such options.

• Pattern clustering to reduce the dataset meaningfully through some validation, and operating only on such a reduced set to generate an abstraction of the entire data.
• Focusing on important attributes by removing redundant features.
• Compressing the data in some form and operating directly on such compressed datasets.
• Improving the efficiency of algorithms through massive parallel processing and Map-Reduce algorithms.

8.7 Summary

In the present chapter, we discuss the big data paradigm and its relationship with data mining. We discussed the related terminology such as agents, multiagent systems, massively parallel databases, etc. We propose to solve big data problems using multiagent systems and provide a few cases for such systems. The systems are indicative.

8.8 Bibliographic Notes

Big Data is emerging as a research and scientific topic in the peer-reviewed literature in recent years. Cohen et al. (2009) discuss new practices for Big Data analysis, called magnetic, agile, and deep (MAD) analysis. The authors contrast the big data scenario with the Enterprise Data Warehouse and bring out many insights into new practices for analytics for big data. Loukides (2011) discusses data science and related topics in the context of Big Data. Russom (2011) provides an overview of Big Data Analytics based on an industry practitioners' survey and discusses current and recommended best practices. Zikopoulos et al. (2011) provide a useful discussion and insights into big data terminology. Halevi et al. (2012) provide an overview of various aspects of big data and its trends. There are multiple commercial big data systems such as Hadoop. Dean and Ghemawat (2004) provide the Map-Reduce algorithm. An insightful discussion on the Map-Reduce algorithm and PageRank can be found in Rajaraman and Ullman (2012). The PageRank scheme was originally discussed by Brin and Page (1998) and Page et al. (1999). An insightful discussion on PageRank computation can be found in Manning et al. (2008); the work also makes interesting comments on limitations of data mining. A discussion by Herodotou et al. (2011) on a proposal for automatic tuning of Hadoop provides insights on challenges in big data systems. A case for parallel database management systems (DBMS) against Map-Reduce for large-scale data analysis is discussed in the work by Pavlo et al. (2009). A contrasting view on the superiority of Map-Reduce over parallel DBMS is provided by Abouzeid et al. (2009). Patil (2012) discusses data products and data science aspects in his work. Weiss (2000) provides an extensive
overview of multiagent systems The edited work contains theoretical or practical aspects of the multiagent systems Ferber (1999) provides an in-depth account of various characteristics of multiagent systems Cao et al (2007) discuss agent-mining integration for financial services Tozicka et al (2007) suggest a framework for agent-based machine learning and data mining A proposal and implementation of a multiagent system as a divide-and-conquer approach for large data clustering and feature selection are provided by Ravindra Babu et al (2007) Ravindra Babu et al (2010) propose a large-data clustering scheme for data mining applications Agogino and Tumer (2006) and Tozicka et al (2007) form examples of agents supporting data mining Gurruzzo and Rosaci (2008) and Wooldridge and Jennings (1994) form examples of data mining supporting agents Fayyad et al provide an early overview of Data Mining References 183 References A Abouzeid, K Bajda-Pawlikowski, D Abadi, A Silberschatz, A Rasin, HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads, in VLDB’09, France (2009) A Agogino, K Tumer, Efficient agent-based clustering ensembles, in AAMAS’06 (2006), pp 1079–1086 S Brin, L Page, The anatomy of large-scale hyper-textual Web search engine Comput Netw ISDN Syst 30, 107–117 (1998) L Cao, C Zhang, F-Trade: an agent-mining symbiont for financial services, in AAMAS’07, Hawaii, USA (2007) J Cohen, B Dolan, M Dunlap, MAD skills: new analysis practices for big data, in VLDB’09, (2009), pp 1481–1492 J Dean, S Ghemawat, MapReduce: simplified data processing on large clusters, in OSDI’04: 6th Symposium on Operating Systems Design and Implementation (2004), pp 137–149 U.M Fayyad, G Piatetsky-Shapiro, P Smyth, R Uthurusamy, Advances in Knowledge Discovery and Data Mining (AAAI Press/MIT Press, Menlo Park/Cambridge, 1996) J Ferber, Multi-agent Systems: An Introduction to Distributed Artificial Intelligence (AddisonWesley, Reading, 1999) S Gurruzzo, D Rosaci, Agent clustering based on semantic negotiation ACM Trans Auton Adapt Syst 3(2), 7:1–7:40 (2008) G Halevi, Special Issue on Big Data Research Trends, vol 30 (Elsevier, Amsterdam, 2012) H Herodotou, H Lim, G Luo, N Borisov, L Dong, F.B Cetin, S Babu, Starfish: a self-tuning system for big data analytics, in 5th Biennial Conference on Innovative Data Systems Research (CIDR’11) (USA, 2011), pp 261–272 M Loukides, What is data science, O’ Reillly Media, Inc., CA (2011) http://radar.oreilly.com/r2/ release-2-0-11.html/ C.D Manning, P Raghavan, H Schutze, Introduction to Information Retrieval (Cambridge University Press, Cambridge, 2008) L Page, S Brin, R Motwani, T Winograd, The PageRank citation ranking: bringing order to the Web Technical Report Stanford InfoLab (1999) J.J Patil, Data Jujitsu: the art of turning data into product, in O’Reilly Media (2012) A Pavlo, E Paulson, A Rasin, D.J Abadi, D.J DeWitt, S Madden, M Stonebraker, A comparison of approaches to large-scale data analysis, in SIGMOD’09 (2009) A Rajaraman, J.D Ullman, Mining of Massive Datasets (Cambridge University Press, Cambridge, 2012) T Ravindra Babu, M Narasimha Murty, S.V Subrahmanya, Multiagent systems for large data clustering, in Data Mining and Multi-agent Integration, ed by L Cao (Springer, Berlin, 2007), pp 219–238 Chapter 15 T Ravindra Babu, M Narasimha Murty, S.V Subrahmanya, Multiagent based large data clustering scheme for data mining applications, in Active Media Technology ed by A An et al LNCS, vol 6335 (Springer, Berlin, 2010), pp 116–127 P 
Russom, iBig data analytics TDWI Best Practices Report, Fourth Quarter (2011) J Tozicka, M Rovatsos, M Pechoucek, A framework for agent-based distributed machine learning and data mining, in Autonomous Agents and Multi-agent Systems (ACM Press, New York, 2007) Article No 96 G Weiss (ed.), Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence (MIT Press, Cambridge, 2000) M Wooldridge, N.R Jennings, Towards a theory of cooperative problem solving, in Proc of Workshop on Distributed Software Agents and Applications, Denmark (1994), pp 40–53 P.C Zikopoulos, C Eaton, D deRoos, T Deutsch, G Lapis, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data (McGraw Hill, Cambridge, 2011) Appendix Intrusion Detection Dataset—Binary Representation Network Intrusion Detection Data was used during KDD-Cup99 contest Even 10 %-dataset can be considered large as it consists of 805049 patterns, each of which is characterized by 38 features We use this dataset in the present study, and hereafter we refer to this dataset as a “full dataset” in the current chapter In the current chapter, we apply the algorithms and methods developed so far on the said dataset and demonstrate their efficient working With this, we aim to drive home the generality of the developed algorithms The appendix contains data description and preliminary analysis A.1 Data Description and Preliminary Analysis Intrusion Detection dataset (10 % data) that was used during KDD-Cup99 contest is considered for the study The data relates to access of computer network by authorized and unauthorized users The access by unauthorized users is termed as intrusion Different costs of misclassification are attached in assigning a pattern belonging to a class to any other class The challenge lies in detecting intrusion belonging to different classes accurately minimizing the cost of misclassification Further, whereas the feature values in the data used in the earlier chapters contained binary values, the current data set assumes floating point values The training data consists of 41 features Three of the features are binary attributes, and the remaining are floating point numerical values For effective use of these attributes along with other numerical features, the attributes need to be assigned proper weights based on the domain knowledge Arbitrary weightages could adversely affect classification results In view of this, only 38 features are considered for the study On further analysis, it is observed that values of two of the 38 features in the considered 10 %-dataset are always zero, effectively suggesting exclusion of these two features (features numbered 16 and 17, counting from feature 0) The training data consists of 311,029 patterns, and the test data consists of 494,020 patterns They are tabulated in Table A.1 A closer observation reveals that not all feaT Ravindra Babu et al., Compression Schemes for Mining Large Datasets, Advances in Computer Vision and Pattern Recognition, DOI 10.1007/978-1-4471-5607-9, © Springer-Verlag London 2013 185 186 Intrusion Detection Dataset—Binary Representation Table A.1 Attack types in training data Description No of patterns No of attack types No of features Training data 311,029 23 38 Test data 494,020 42 38 Table A.2 Attack types in training data Class No of types Attack types normal normal dos back, land, neptune, pod, smurf, teardrop u2r buffer–overflow, loadmodule, perl, rootkit r2l ftp-write, guess-password, imap, multihop, phf, spy, warezclient, warezmaster 
probe ipsweep, nmap, portsweep, satan Table A.3 Additional attack types in test data Additional attack types snmpgetattack, processtable, mailbomb, snmpguess, named, sendmail, named, sendmail, httptunnel, apache2, worm, sqlattack, ps, saint, xterm, xlock, upstorm, mscan, xsnoop Table A.4 Assignment of unknown attack types using domain knowledge Class Attack type dos processtable, mailbomb, apache2, upstorm u2r sqlattack, ps, xterm r2l snmpgetattack, snmpguess, named, sendmail, httptunnel, worm, xlock, xsnoop probe saint, mscan tures are frequent, which is also brought out in the preliminary analysis We make use of this fact during the experiments The training data consists of 23 attack types, which form 4-broad classes The list is provided in Table A.2 As noted earlier in Table A.1, test data contained more classes than those in the training data, as provided in Table A.3 Since the classification of test data depends on learning from training data, the unknown attack types (or classes) in the test data have to be assigned one of a priori known classes of training data This is carried out in two ways, viz., (a) assigning unknown attack types with one of the known types by Nearest-neighbor assignment within Test Data, or (b) assigning with the help of domain knowledge Independent exercises are carried out to assign unknown classes by both the methods The results obtained by both these methods differ significantly In view of this, assignments based on domain A.1 Data Description and Preliminary Analysis Table A.5 Class-wise numbers of patterns in training data of 494,020 patterns Table A.6 Class-wise distribution of test data based on domain knowledge Table A.7 Cost matrix 187 Class Class-label normal 97,277 u2r 52 dos 391,458 r2l 1126 probe 4107 Class Class-label normal u2r 70 dos 229,853 r2l 16,347 probe 4166 Class type normal No of patterns No of patterns 60,593 u2r dos r2l probe normal 2 u2r 2 dos 2 r2l 2 probe 2 knowledge are considered, and test data is formed accordingly Table A.4 contains assigned types based on domain knowledge One important observation that can be made from the mismatch between NN assignment and Table A.4 is that the class boundaries overlap, which leads to difficulty in classification Table A.5 contains the class-wise distribution of training data Table A.6 provides the class-wise distribution of test data based on domain knowledge assignment In classifying the data, each wrong pattern assignment is assigned a cost The cost matrix is provided in Table A.7 Observe from the table that the cost of assigning a pattern to a wrong class is not uniform For example, the cost of assigning a pattern belonging to class “u2r” to “normal” is Its cost is more than that of assigning a pattern from “u2r” to “dos”, say Feature-wise statistics of training data are provided in Table A.8 The table contains a number of interesting statistics They can be summarized below • Ranges of mean values (Column 2) of different features are different • Standard deviation (Column 3), which is a measure of dispersion, is different for different feature values • Minimum value of each feature is 0.0 (Column 4) 188 Intrusion Detection Dataset—Binary Representation Table A.8 Feature-wise statistics Feature Mean value No (1) (2) SD Min Max (3) (4) 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 707.745756 988,217.066787 33,039.967815 0.006673 0.134805 0.005510 0.782102 0.015520 0.355342 1.798324 0.010551 0.007793 2.012716 0.096416 0.011020 0.036482 0.0 0.0 0.037211 
213.147196 246.322585 0.380717 0.381016 0.231623 0.232147 0.388189 0.082205 0.142403 64.745286 106.040032 0.410779 0.109259 0.481308 0.042134 0.380593 0.380919 0.230589 0.230140 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 47.979302 3025.609608 868.529016 0.000045 0.006433 0.000014 0.034519 0.000152 0.148245 0.010212 0.000111 0.000036 0.011352 0.001083 0.000109 0.001008 0.0 0.0 0.001387 332.285690 292.906542 0.176687 0.176609 0.057433 0.057719 0.791547 0.020982 0.028998 232.470786 188.666186 0.753782 0.030906 0.601937 0.006684 0.176754 0.176443 0.058118 0.057412 (5) 58,329 693,375,616 5,155,468 3.0 3.0 30.0 5.0 1.0 884.0 1.0 2.0 993.0 28.0 2.0 8.0 0.0 0.0 1.0 511.0 511.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 255.0 255.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 Bits (VQ) (6) Resoln (494021) (7) Suprt (8) 16 30 23 4 10 4 10 0 9 4 4 4 8 4 4 4 4 1.4e−5 6.0e−10 7.32e−8 0.06 0.06 0.06 0.03 0.08 0.06 9.8e−4 0.06 0.12 9.8e−4 3.2e−2 0.12 0.04 0 0.06 2.0e−3 2.0e−3 0.06 0.06 0.06 0.06 0.06 0.06 0.06 3.9e−3 3.9e−3 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 12,350 378,679 85,762 22 1238 3192 63 63 73,236 2224 55 12 585 265 51 454 0 685 494,019 494,019 89,234 88,335 29,073 29,701 490,394 112,000 34,644 494,019 494,019 482,553 146,990 351,162 52,133 94,211 93,076 35,229 341,260 A.2 Bibliographic Notes Table A.9 Accuracy of winner and runner-up of KDD-Cup99 189 Class Winner Runner-up normal 99.5 99.4 dos 97.1 97.5 r2l 8.4 7.3 u2r 13.2 11.8 probe 83.3 84.5 Cost 0.2331 0.2356 • Maximum values of different features are different (Column 5) • Feature-wise support is different for different features (Column 8) The support is defined here as the number of times a feature assumed a nonzero value in the training data • If the real values are to be mapped to integers, the numbers of bits required along with corresponding resolution are provided in Columns and The observations made are used later in the current chapter through various sections Further, dissimilarity measure plays an important role The range of values for any feature within a class or across the classes is large Also the values assumed by different features within a pattern are also largely variant This scenario suggests use of the Euclidean and Mahalanobis distance measures We applied both the measures while carrying out exercises on samples drawn from the original dataset Based on the study on the random samples, the Euclidean distance measure provided a better classification accuracy We made use of the Euclidean measure subsequently We classified test patterns with complete dataset With full data, NNC provided a classification accuracy of 92.11 % The corresponding cost of classification cost is 0.254086 This result is useful in comparing possible improvements with proposed algorithms in the book Results reported during KDD-Cup99 are provided in Table A.9 A.2 Bibliographic Notes KDD-Cup data (1999) contains the 10 % and full datasets provided during KDDCup challenge in 1999 References KDD-Cup99 Data, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (1999) Glossary b Number of binary features per block, 69 c Sample size, 96 d Number of features, 48, 69 H Final hypothesis in AdaBoost, 140 k Number of clusters, 11, 96 n Number of patterns, 11, 48, 69 Pc Probability of cross-over, 59 Pi Probability of initialization, 59 Pm Probability of mutation, 59 q Number of blocks per pattern, 69 R ∗ (ω) True risk, 61 Remp (ω) Empirical risk, 61 r Length of subsequence, 69 ΩL Lower approximation of class, Ω, 86 ΩU Upper approximation of class, Ω, 86 v Value of block, 69 X Set of 
patterns, ε Minimum support, 69 εj Error after each iteration in AdaBoost, 140 η Dissimilarity threshold for identifying nearest neighbor to a subsequence, 69 T Ravindra Babu et al., Compression Schemes for Mining Large Datasets, Advances in Computer Vision and Pattern Recognition, DOI 10.1007/978-1-4471-5607-9, © Springer-Verlag London 2013 191 192 Glossary ψ Minimum frequency for pruning a subsequence, 69 ζ Distance threshold for computing leaders in leader clustering algorithm, 100, 140 hi ith hypothesis at iteration j in AdaBoost, 140 X Set of compressed patterns, Index A AdaBoost, 125, 126, 128–131 Agent, 177 mining interaction, 177 Agent-mining, 173 Anomaly detection, 59 Antecedent, 22 Appendix, 185 Apriori algorithm, 23 Association rule, 25 mining, 2, 4, 12, 22, 39, 68, 123 Apriori algorithm, Average probability of error, 17 Axis-parallel split, 131 B Bayes classifier, 17 Bayes rule, 17 Big data, 8, 173 analytics, 174 Big data analytics, 173 Bijective function, 53 Binary classifier, 20, 125 Binary pattern, Binary string, 155 BIRCH, 28, 123 Block length, 69 value of, 69 Boosting, 128 Breadth first search (BFS), 152 Business intelligence, 174 C Candidate itemset, 24 CART, 131 Central tendency, 161 Centroid, 13, 96, 101 CLARA, 96, 123 CLARANS, 123 Class label, 17 Classification AdaBoost, see AdaBoost binary, 126, 138, 140 binary classifier, 125 decision tree, see Decision tree definition, divide and conquer, 35 incremental, 34 intermediate abstraction, 37 kNNC, see k-nearest neighbor classifier multicategory, 126 NNC, see Nearest neighbor classifier one vs all, 144 one vs one, 126, 144 one vs rest, 126 rough set, 86 SVM, see machine at Support vector Classification accuracy, 6, 98, 110, 122, 129, 156 Classification algorithm, Cluster feature tree, 28 Cluster representative, 1, 11, 13, 96, 100, 108, 112, 136 Clustering, 4, 95, 96 algorithms, 13 CLARA, see CLARA CLARANS, see CLARANS CNN, see Condensed nearest neighbor (CNN) definition, hierarchical, 13 incremental, 28 intermediate representation, 33 T Ravindra Babu et al., Compression Schemes for Mining Large Datasets, Advances in Computer Vision and Pattern Recognition, DOI 10.1007/978-1-4471-5607-9, © Springer-Verlag London 2013 193 194 Clustering (cont.) 
k-Means, 101 leader, 28 PAM, see PAM partitional, 13 Clustering algorithm, 1, 97 Clustering feature, 28 Clustering method, 126 CNN, 126 Compact data representation, 48 Compressed data distance computation, 51 representation, 49 Compressed data representation, 49, 50 Compressed pattern, 86 Compressed training data, 91 Compressing data, Compression Huffman coding, 68, 74, 76, 77 lossless, 2, 76 run length, 49, 170 lossy, 2, 77 run length, 76 Computation time, 47, 121 Computational requirement, Condensed nearest neighbor (CNN), 7, 97, 136 Confidence, 25 Confusion matrix, 134 Consequent, 22 Constraint, 127 Convex quadratic programming, 127 Cost matrix, 120 Criterion function, 15 Crossover, 155 Curse of dimensionality, 96 D Data abstraction, 13, 48 Data analysis, 73, 125 Data compaction, 96 Data compression, Data matrix, 6, 35 Data mining, 2, 173 association rules, see mining at Association rule feature selection, 7, see also, selection at Feature, 59, 98–100 prototype selection, 7, 28, 43, 59, 95, 97, 100, 129, 140, 147, 179 prototypeselection, 99 Data mining algorithms, 11 Data science, 174 Data squashing, 96 Index Data structure, 28 Dataset hand written digits, 55, 75, 98, 110, 137, 138 intrusion detection, 110, 116, 122, 185 UCI-ML, 101, 123, 137, 144 Decision boundary, 21 Decision function, 128 Decision making, 1, 11 Decision rule, 128 Decision tree, 11, 130, 151, 152 Decision tree classifier, Dendrogram, 14 Dictionary, Dimensionality reduction, 3, 7, 96, 98, 147, 153, 154, 160, 171 Discriminative classifier, 17, 20 Discriminative model, 17 Dispersion, 161 Dissimilarity computation, 86 Distance threshold, 116, 139 Distinct subsequences, 104, 110 Divide and conquer, 5, 27, 31, 173, 175 Document classification, Domain, 53 Domain knowledge, 125, 126, 131 Dot product, E Edit distance, 47 Efficient algorithm, 130 Efficient hierarchical algorithm, Efficient mining algorithms, 27 Embedded scheme, 151 Empirical risk, 61 Encoding mechanism, 155 Ensemble, 128 Error-rate, 17 Euclidean distance, 11, 14, 54, 98 Exhaustive enumeration, Expected test error, 126 F Face detection, 177 Farthest neighbor, 19 Feature extraction, 148, 152 principal component analysis, 152, 153 random projection, 153 selection, 148, 149 genetic algorithm, 154 ranking, 149 ranking features, 150 Index Feature (cont.) 
sequential backward floating selection, 150 sequential backward selection, 149 sequential forward floating selection, 150 sequential forward selection, 149 stochastic search, 152 wrapper methods, 151 Feature extraction definition, Feature selection, 95 definition, Filter methods, 148 Fisher’s score, 7, 150 Fitness function, 155 Frequent features, 8, 99 feature selection, 160 Frequent item, 7, 23, 69, 83, 95, 100, 107 support, 99 Frequent item set, 95 Frequent items, 113 Frequent-pattern tree (FP-tree), 33, 48 Function, 53 G Generalization error, 96 Generation gap, 158 Generative model, 17 Genetic algorithm, 97, 123, 152, 154, 171 crossover, 156 probability, 166 mutation, 156 probability, 166 selection, 156 simple (SGA), 155, 157 steady state (SSGA), 158 Genetic algorithms (GAs), 6, 57 Genetic operators, 155 Global optimum, 7, 155 Growth function, 61 H Hadoop, 175 Hamming distance, 54, 98, 138 Handwritten digit data, Hard partition, Heterogeneous data, 174 Hierarchical clustering algorithm definition, High-dimensional, 4, 67, 96 High-dimensional dataset, High-dimensional space, 19, 22 195 Hybrid algorithms, 48 Hyperplane, 21, 127 I Improvement in generalization, 68 Incremental mining, 5, 27 Infrequent item, Initial centroids, 16 Inter-cluster distance, 13 Intermediate abstraction, 27 Intermediate representation, Intra-cluster distance, 13 K k-means algorithm, 15 K-means clustering, k-nearest neighbor classifier, 19, 54, 55, 105–107, 122, 129, 141, 144, 166, 169 K-nearest neighbor classifier (KNNC), 4, 78, 87 k-partition, 14 Kernel function, 128 KNNC, 83 Knowledge structure, L Lq norm, 54 Labelled training dataset, Lagrange multiplier, 127 Lagrangian, 20 Large dataset, 12, 97 Large-scale dataset, Leader, 13, 96, 100 clustering algorithm, 100 Leader clustering, 140 Leader clustering algorithm, Learn a classifier, Learn a model, Learning algorithm, 125, 128 Learning machine, 126 Linear discriminant function, 20 Linear SVM, 21 Linearly separable, 21 Local minimum, 16 Longest common subsequence, 47 Lossy compression, 11, 13, 96 Lower approximation, 6, 86 M Machine learning, 7, 8, 22 Machine learning algorithm, 11 Manhattan distance, 51–53, 160 Map-reduce, 5, 173 196 MapReduce, 174 Massive data, 173 Maximizing margin, 127 Maximum margin, 20 Minimum description length, 63 Minimum frequency, 69 Minimum support, 71 Mining compressed data, Minsup, 23 Multi-class classification, 129 Multiagent system, 173, 177, 179 agile processing, 181 attribute reduction, 179 data reduction, 178 heterogeneous data access, 180 Multiagent systems, Multiclass classification, 125 Multiple data scans, 48 Mutation, 155 Mutual information (MI), 7, 151 N Nearest neighbor, Nearest neighbor classifier (NNC), 4, 17 feature selection by wrapper methods, 151 Negative class, 17 NNC, No free lunch theorem, 125 Noise, 19 Non-linear decision boundary, 128 Non-negative matrix factorization, Nonlossy compression, Number of database scans, 11 Number of dataset scans, 27 Number of representatives, 116 Numerical taxonomy, O Objective function, 155 Oblique split, 131 One-to-one function, 53 Onto function, 53 Optimal decision tree, 131 Optimization problem, 156 Order dependence, 101 Outlier, 13 P PageRank, 176 dead ends, 176 link spam, 176 MapReduce, 176 spam mass, 176 spider traps, 176 Index teleport, 176 topic-sensitive, 176 TrustRank, 176 PAM, 96 Parallel hyperplanes, 20 Partitional algorithms, Pattern classification, 100 Pattern clustering, 95 Pattern matrix, 26 Pattern recognition, 2, 22, 130 Pattern synthesis, 36 Patterns 
representative, 136 Population of chromosomes, 155 Positive class, 17 Posterior probability, 17, 126 Prediction accuracy, 67 Principal component analysis, Prior probabilities, 17 Probability distribution, 11 Probability of crossover, 156 Probability of mutation, 156 Prototype selection, 96, 99 Prototypes CNN, 136 leader, 129 Proximity between a pair of patterns, 11 Proximity matrix, 14 Pruning of subsequences, 83 R Random number, 156 Random projections, Random surfer, 176 Range, 53 Regression, logistic, 44 Regression trees, 131 Representative pattern, 1, 7, 97, 138 Risk, 126 Robust to noise, Rough set, 6, 86 Rough set based scheme, 68 Roulette wheel selection, 156 Run dimension, 49 length encoded compression, 49 encoding, 47 string, 49 length, 49 Run length, 49 Run-length coding, Index S Sampling, 96 Scalability, 67 Scalability of algorithms, 96 Scalable mining algorithms, 12 Scan the dataset, Secondary storage, 11 Selection, 155 feature, 96 prototype, 96 Selection mechanism, 156 Selection of prototypes and features, 95 Semi-structured data, 174, 178 Sequence, 68 Sequence mining, 26 Set of prototypes, 95 Single database scan, 97, 100 Single-link algorithm, 4, 14 Singleton cluster, 13 Soft partition, SONAR, 141 Space organization, 59 Spacecraft health data, 55 Squared Euclidean distance, 37 Squared-error criterion, 15 Squashing, 48 State-of-the-art classifier, Storage space, 47, 121 Subsequence, 6, 69, 83 distinct, 70 length of, 69 Subset of items, Sufficient statistics, 47 Support, 68 minimum, 68, 83, 85, 87, 88, 100, 104, 163, 168 Support vector, 3, 20, 127, 136 machine, 20, 97, 131 Support vector machine (SVM), 4, 17, 125, 126 197 feature selection by wrapper methods, 151 Survival of the fittest, 155 SVM, T Termination condition, 155 Test pattern, 1, 12, 17, 126 Text mining, 27 Threshold, 17 Threshold value, 28 THYROID, 141 Training dataset, 136 Training phase, 18 Training samples, 126 Training set, 17 Tree CF, 28 decision, see Decision tree knowledge based (KBTree), 136 Tree classifier, 125 U UCI Repository, 141 Uncompressing the data, Unstructured data, 174, 178 Upper approximation, 6, 86 V Variety, 174 VC dimension, 6, 60 VC entropy, 61 Velocity, 174 Volume, 174 W Weak learner, 128 Weight vector, 17 WINE, 141 Wrapper methods, 148 ... deal with data mining and compression; specifically, we deal with using several data mining tasks directly on the compressed data 1.1 Data Mining and Data Compression Data mining is concerned with. .. selection of optimal feature set Divide-and-conquer has been one important direction to deal with large datasets With reducing cost and increasing ability to collect and store enormous amounts of data,... produce expected results in dealing with high-dimensional datasets Also, computational requirements in the form of time and space can increase enormously with dimensionality This prompts reduction

Date posted: 05/10/2018, 12:50
