
Data Classification: Algorithms and Applications, Aggarwal, 2014-07-25 (Data Structures and Algorithms)


DOCUMENT INFORMATION

Basic information

Format
Pages: 704
File size: 7.22 MB

Contents

Data Classification: Algorithms and Applications

Chapman & Hall/CRC Data Mining and Knowledge Discovery Series

SERIES EDITOR
Vipin Kumar
University of Minnesota
Department of Computer Science and Engineering
Minneapolis, Minnesota, U.S.A.

AIMS AND SCOPE
This series aims to capture new developments and applications in data mining and knowledge discovery, while summarizing the computational tools and techniques useful in data analysis. This series encourages the integration of mathematical, statistical, and computational methods and techniques through the publication of a broad range of textbooks, reference works, and handbooks. The inclusion of concrete examples and applications is highly encouraged. The scope of the series includes, but is not limited to, titles in the areas of data mining and knowledge discovery methods and applications, modeling, algorithms, theory and foundations, data and knowledge visualization, data mining systems and tools, and privacy and security issues.

PUBLISHED TITLES
ADVANCES IN MACHINE LEARNING AND DATA MINING FOR ASTRONOMY
  Michael J. Way, Jeffrey D. Scargle, Kamal M. Ali, and Ashok N. Srivastava
BIOLOGICAL DATA MINING
  Jake Y. Chen and Stefano Lonardi
COMPUTATIONAL BUSINESS ANALYTICS
  Subrata Das
COMPUTATIONAL INTELLIGENT DATA ANALYSIS FOR SUSTAINABLE DEVELOPMENT
  Ting Yu, Nitesh V. Chawla, and Simeon Simoff
COMPUTATIONAL METHODS OF FEATURE SELECTION
  Huan Liu and Hiroshi Motoda
CONSTRAINED CLUSTERING: ADVANCES IN ALGORITHMS, THEORY, AND APPLICATIONS
  Sugato Basu, Ian Davidson, and Kiri L. Wagstaff
CONTRAST DATA MINING: CONCEPTS, ALGORITHMS, AND APPLICATIONS
  Guozhu Dong and James Bailey
DATA CLASSIFICATION: ALGORITHMS AND APPLICATIONS
  Charu C. Aggarwal
DATA CLUSTERING: ALGORITHMS AND APPLICATIONS
  Charu C. Aggarwal and Chandan K. Reddy
DATA CLUSTERING IN C++: AN OBJECT-ORIENTED APPROACH
  Guojun Gan
DATA MINING FOR DESIGN AND MARKETING
  Yukio Ohsawa and Katsutoshi Yada
DATA MINING WITH R: LEARNING WITH CASE STUDIES
  Luís Torgo
FOUNDATIONS OF PREDICTIVE ANALYTICS
  James Wu and Stephen Coggeshall
GEOGRAPHIC DATA MINING AND KNOWLEDGE DISCOVERY, SECOND EDITION
  Harvey J. Miller and Jiawei Han
HANDBOOK OF EDUCATIONAL DATA MINING
  Cristóbal Romero, Sebastian Ventura, Mykola Pechenizkiy, and Ryan S.J.d. Baker
INFORMATION DISCOVERY ON ELECTRONIC HEALTH RECORDS
  Vagelis Hristidis
INTELLIGENT TECHNOLOGIES FOR WEB APPLICATIONS
  Priti Srinivas Sajja and Rajendra Akerkar
INTRODUCTION TO PRIVACY-PRESERVING DATA PUBLISHING: CONCEPTS AND TECHNIQUES
  Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu, and Philip S. Yu
KNOWLEDGE DISCOVERY FOR COUNTERTERRORISM AND LAW ENFORCEMENT
  David Skillicorn
KNOWLEDGE DISCOVERY FROM DATA STREAMS
  João Gama
MACHINE LEARNING AND KNOWLEDGE DISCOVERY FOR ENGINEERING SYSTEMS HEALTH MANAGEMENT
  Ashok N. Srivastava and Jiawei Han
MINING SOFTWARE SPECIFICATIONS: METHODOLOGIES AND APPLICATIONS
  David Lo, Siau-Cheng Khoo, Jiawei Han, and Chao Liu
MULTIMEDIA DATA MINING: A SYSTEMATIC INTRODUCTION TO CONCEPTS AND THEORY
  Zhongfei Zhang and Ruofei Zhang
MUSIC DATA MINING
  Tao Li, Mitsunori Ogihara, and George Tzanetakis
NEXT GENERATION OF DATA MINING
  Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, and Vipin Kumar
RAPIDMINER: DATA MINING USE CASES AND BUSINESS ANALYTICS APPLICATIONS
  Markus Hofmann and Ralf Klinkenberg
RELATIONAL DATA CLUSTERING: MODELS, ALGORITHMS, AND APPLICATIONS
  Bo Long, Zhongfei Zhang, and Philip S. Yu
SERVICE-ORIENTED DISTRIBUTED KNOWLEDGE DISCOVERY
  Domenico Talia and Paolo Trunfio
SPECTRAL FEATURE SELECTION FOR DATA MINING
  Zheng Alan Zhao and Huan Liu
STATISTICAL DATA MINING USING SAS APPLICATIONS, SECOND EDITION
  George Fernandez
SUPPORT VECTOR MACHINES: OPTIMIZATION BASED THEORY, ALGORITHMS, AND EXTENSIONS
  Naiyang Deng, Yingjie Tian, and Chunhua Zhang
TEMPORAL DATA MINING
  Theophano Mitsa
TEXT MINING: CLASSIFICATION, CLUSTERING, AND APPLICATIONS
  Ashok N. Srivastava and Mehran Sahami
THE TOP TEN ALGORITHMS IN DATA MINING
  Xindong Wu and Vipin Kumar
UNDERSTANDING COMPLEX DATASETS: DATA MINING WITH MATRIX DECOMPOSITIONS
  David Skillicorn

Data Classification: Algorithms and Applications
Edited by Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, New York, USA

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2015 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Version Date: 20140611
International Standard Book Number-13: 978-1-4665-8675-8 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

To my wife Lata, and my daughter Sayani

Contents

Editor Biography
Contributors
Preface

1 An Introduction to Data Classification
  Charu C. Aggarwal
  1.1 Introduction
  1.2 Common Techniques in Data Classification
    1.2.1 Feature Selection Methods
    1.2.2 Probabilistic Methods
    1.2.3 Decision Trees
    1.2.4 Rule-Based Methods
    1.2.5 Instance-Based Learning
    1.2.6 SVM Classifiers
    1.2.7 Neural Networks
  1.3 Handling Different Data Types
    1.3.1 Large Scale Data: Big Data and Data Streams
      1.3.1.1 Data Streams
      1.3.1.2 The Big Data Framework
    1.3.2 Text Classification
    1.3.3 Multimedia Classification
    1.3.4 Time Series and Sequence Data Classification
    1.3.5 Network Data Classification
    1.3.6 Uncertain Data Classification
  1.4 Variations on Data Classification
    1.4.1 Rare Class Learning
    1.4.2 Distance Function Learning
    1.4.3 Ensemble Learning for Data Classification
    1.4.4 Enhancing Classification Methods with Additional Data
      1.4.4.1 Semi-Supervised Learning
      1.4.4.2 Transfer Learning
    1.4.5 Incorporating Human Feedback
      1.4.5.1 Active Learning
      1.4.5.2 Visual Learning
    1.4.6 Evaluating Classification Algorithms
  1.5 Discussion and Conclusions

Educational and Software Resources for Data Classification

... [58] for nucleic acid sequences and Protein Information Resources (PIR) [56] and UniProt [55] for protein sequences. In the context of image applications, researchers in machine learning and computer vision communities have explored the problem extensively. ImageCLEF [77] and ImageNet [78] are two widely used image data repositories that are used to demonstrate the performance of image data retrieval and learning tasks. Vision and Autonomous Systems Center's Image Database [75] from Carnegie Mellon University and the Berkeley Segmentation dataset [76] can be used to test the performance of classification for image segmentation problems. An extensive list of Web sites that provide image databases is given in [79] and [84].

25.4 Summary

This chapter presents a summary of the key resources for data classification in terms of books, surveys, and commercial and non-commercial software packages. It is expected that many of these resources will evolve over time. Therefore, the reader is advised to use this chapter as a general guideline on which to base their search, rather than treating it as a comprehensive compendium. Since data classification is a rather broad area, much of the recent software and practical implementations have not kept up with the large number of recent advances in this field. This has also been true of the more general books in the field, which discuss the basic methods, but not the recent advances such as big data, uncertain data, or network classification. This chapter is an attempt to bridge the gap in this rather vast field, by creating a book that covers the different areas of data classification in detail.

Bibliography

[1] C. Aggarwal. Outlier Analysis, Springer, 2013.
[2] C. Aggarwal. Data Streams: Models and Algorithms, Springer, 2007.
[3] C. Aggarwal. Social Network Data Analytics, Springer, Chapter 5, 2011.
[4] C. Aggarwal and H. Wang. Managing and Mining Graph Data, Springer, 2010.
[5] C. Aggarwal and C. Zhai. Mining Text Data, Springer, 2012.
[6] C. Aggarwal and C. Zhai. A survey of text classification algorithms, In Mining Text Data, pages 163–222, Springer, 2012.
[7] C. Aggarwal. Towards effective and interpretable data mining by visual interaction, ACM SIGKDD Explorations, 3(2):11–22, 2002.
[8] D. Aha, D. Kibler, and M. Albert. Instance-based learning algorithms, Machine Learning, 6(1):37–66, 1991.
[9] D. Aha. Lazy learning: Special issue editorial, Artificial Intelligence Review, 11(1–5):7–10, 1997.
[10] C. Bishop. Neural Networks for Pattern Recognition, Oxford University Press, 1996.
[11] C. Bishop. Pattern Recognition and Machine Learning, Springer, 2007.
[12] W. Buntine. Learning Classification Trees, Artificial Intelligence Frontiers in Statistics, Chapman and Hall, pages 182–201, 1993.
[13] C. Burges. A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
[14] N. V. Chawla, N. Japkowicz, and A. Kotcz. Editorial: Special issue on learning from imbalanced data sets, ACM SIGKDD Explorations Newsletter, 6(1):1–6, 2004.
[15] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning, Vol. 2, Cambridge: MIT Press, 2006.
[16] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.
[17] B. V. Dasarathy. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, IEEE Computer Society Press, 1990.
[18] J. Dean and S. Ghemawat. MapReduce: A flexible data processing tool, Communications of the ACM, 53(1):72–77, 2010.
[19] R. Duda, P. Hart, and D. Stork. Pattern Classification, Wiley, 2001.
[20] A. Frank and A. Asuncion. UCI Machine Learning Repository, Irvine, CA: University of California, School of Information and Computer Science, 2010. http://archive.ics.uci.edu/ml
[21] L. Hamel. Knowledge Discovery with Support Vector Machines, Wiley, 2009.
[22] S. Haykin. Neural Networks and Learning Machines, Prentice Hall, 2008.
[23] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2013.
[24] A. Jain, R. Duin, and J. Mao. Statistical pattern recognition: A review, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4–37, 2000.
[25] N. Japkowicz and M. Shah. Evaluating Learning Algorithms: A Classification Perspective, Cambridge University Press, 2011.
[26] S. Kulkarni, G. Lugosi, and S. Venkatesh. Learning pattern classification: A survey, IEEE Transactions on Information Theory, 44(6):2178–2206, 1998.
[27] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining, Springer, 1998.
[28] R. Mayer. Multimedia Learning, Cambridge University Press, 2009.
[29] T. Mitchell. Machine Learning, McGraw Hill, 1997.
[30] B. Moret. Decision trees and diagrams, ACM Computing Surveys (CSUR), 14(4):593–623, 1982.
[31] K. Murphy. Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
[32] S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey, Data Mining and Knowledge Discovery, 2(4):345–389, 1998.
[33] S. J. Pan and Q. Yang. A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[34] J. R. Quinlan. Induction of decision trees, Machine Learning, 1(1):81–106, 1986.
[35] F. Sebastiani. Machine learning in automated text categorization, ACM Computing Surveys, 34(1):1–47, 2002.
[36] B. Settles. Active Learning, Morgan and Claypool, 2012.
[37] T. Soukup and I. Davidson. Visual Data Mining: Techniques and Tools for Data Visualization, Wiley, 2002.
[38] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Cambridge University Press, 2001.
[39] B. Schölkopf and A. J. Smola. Learning with Kernels, Cambridge, MA, MIT Press, 2002.
[40] I. Steinwart and A. Christmann. Support Vector Machines, Springer, 2008.
[41] V. Vapnik. The Nature of Statistical Learning Theory, Springer, 2000.
[42] T. White. Hadoop: The Definitive Guide, Yahoo! Press, 2011.
[43] D. Wettschereck, D. Aha, and T. Mohri. A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms, Artificial Intelligence Review, 11(1–5):273–314, 1997.
[44] Z. Xing, J. Pei, and E. Keogh. A brief survey on sequence classification, SIGKDD Explorations, 12(1):40–48, 2010.
[45] L. Yang. Distance Metric Learning: A Comprehensive Survey, 2006. http://www.cs.cmu.edu/~liuy/frame_survey_v2.pdf
[46] X. Zhu and A. Goldberg. Introduction to Semi-Supervised Learning, Morgan and Claypool, 2009.
[47] http://mallet.cs.umass.edu/
[48] http://www.cs.ucr.edu/~eamonn/time_series_data/
[49] http://www.kdnuggets.com/datasets/
[50] http://www.cs.waikato.ac.nz/ml/weka/
[51] http://www.kdnuggets.com/software/classification.html
[52] http://www-01.ibm.com/software/analytics/spss/products/modeler/
[53] http://www.sas.com/technologies/analytics/datamining/miner/index.html
[54] http://www.mathworks.com/
[55] http://www.ebi.ac.uk/uniprot/
[56] http://www-nbrf.georgetown.edu/pirwww/
[57] http://www.ncbi.nlm.nih.gov/genbank/
[58] http://www.ebi.ac.uk/embl/
[59] http://www.ebi.ac.uk/Databases/
[60] http://mips.helmholtz-muenchen.de/proj/ppi/
[61] http://string.embl.de/
[62] http://dip.doe-mbi.ucla.edu/dip/Main.cgi
[63] http://thebiogrid.org/
[64] http://www.ncbi.nlm.nih.gov/geo/
[65] http://www.gems-system.org/
[66] http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
[67] http://www.statoo.com/en/resources/anthill/Datamining/Data/
[68] http://www.csse.monash.edu.au/~dld/datalinks.html
[69] http://www.sigkdd.org/kddcup/
[70] http://lib.stat.cmu.edu/datasets/
[71] http://www.daviddlewis.com/resources/testcollections/reuters21578/
[72] http://qwone.com/~jason/20Newsgroups/
[73] http://people.cs.umass.edu/~mccallum/data.html
[74] http://snap.stanford.edu/data/
[75] http://vasc.ri.cmu.edu/idb/
[76] http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/
[77] http://www.imageclef.org/
[78] http://www.image-net.org/
[79] http://www.imageprocessingplace.com/root_files_V3/image_databases.htm
[80] http://datamarket.com/data/list/?q=provider:tsdl
[81] http://www.cs.cmu.edu/~mccallum/bow/
[82] http://trec.nist.gov/data.html
[83] http://www.kdnuggets.com/competitions/index.html
[84] http://www.cs.cmu.edu/~cil/v-images.html
[85] http://www.salford-systems.com/
[86] http://www.rulequest.com/Personal/
[87] http://www.stat.berkeley.edu/users/breiman/RandomForests/
[88] http://www.comp.nus.edu.sg/~dm2/
[89] http://www.cs.cmu.edu/~wcohen/#sw
[90] http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[91] http://www.kxen.com/
[92] http://www.tiberius.biz/
[93] http://www.esat.kuleuven.be/sista/lssvmlab/
[94] http://treparel.com/
[95] http://svmlight.joachims.org/
[96] http://www.kernel-machines.org/
[97] http://www.cs.ubc.ca/~murphyk/Software/bnsoft.html
[98] ftp://ftp.sas.com/pub/neural/FAQ.html
[99] http://www.neuroxl.com/
[100] http://www.mathworks.com/products/neural-network/index.html
[101] http://www-01.ibm.com/software/analytics/spss/
[102] http://www.kdnuggets.com/software/classification-other.html
[103] http://datamarket.com
[104] http://analyse-it.com/products/method-evaluation/
[105] http://www.oracle.com/us/products/database/options/advanced-analytics/overview/index.html

[Figure: decision tree with test nodes "tears?", "astigmatism?", "sightedness?", and "age?", and leaf labels "No", "Soft", and "Hard".]

Date posted: 29/08/2020, 18:20
