Advances in K-means Clustering: A Data Mining Thinking [Wu 2012-07-10]

Springer Theses
Recognizing Outstanding Ph.D. Research

For further volumes: http://www.springer.com/series/8790

Aims and Scope

The series “Springer Theses” brings together a selection of the very best Ph.D. theses from around the world and across the physical sciences. Nominated and endorsed by two recognized specialists, each published volume has been selected for its scientific excellence and the high impact of its contents for the pertinent field of research. For greater accessibility to non-specialists, the published versions include an extended introduction, as well as a foreword by the student’s supervisor explaining the special relevance of the work for the field. As a whole, the series will provide a valuable resource both for newcomers to the research fields described, and for other scientists seeking detailed background information on special questions. Finally, it provides an accredited documentation of the valuable contributions made by today’s younger generation of scientists.

Theses are accepted into the series by invited nomination only and must fulfill all of the following criteria:

• They must be written in good English.
• The topic should fall within the confines of Chemistry, Physics, Earth Sciences, Engineering and related interdisciplinary fields such as Materials, Nanoscience, Chemical Engineering, Complex Systems and Biophysics.
• The work reported in the thesis must represent a significant scientific advance.
• If the thesis includes previously published material, permission to reproduce this must be gained from the respective copyright holder.
• They must have been examined and passed during the 12 months prior to nomination.
• Each thesis should include a foreword by the supervisor outlining the significance of its content.
• The theses should have a clearly defined structure including an introduction accessible to scientists not expert in that particular field.

Junjie Wu

Advances in K-means Clustering
A Data Mining Thinking

Doctoral Thesis accepted by
Tsinghua University, China, with substantial expansions

Author
Prof. Dr. Junjie Wu
Department of Information Systems
School of Economics and Management
Beihang University
100191 Beijing
China

Supervisor
Prof. Jian Chen
Department of Management Science and Engineering
School of Economics and Management
Tsinghua University
100084 Beijing
China

ISSN 2190-5053
ISSN 2190-5061 (electronic)
ISBN 978-3-642-29806-6
ISBN 978-3-642-29807-3 (eBook)
DOI 10.1007/978-3-642-29807-3
Springer Heidelberg New York Dordrecht London
Library of Congress Control Number: 2012939113

© Springer-Verlag Berlin Heidelberg 2012

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice
and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Parts of this book have been published in the following articles

Wu, J., Xiong, H., Liu, C., Chen, J.: A generalization of distance functions for fuzzy c-means clustering with centroids of arithmetic means. IEEE Transactions on Fuzzy Systems, forthcoming (2012) (Reproduced with Permission)

Xiong, H., Wu, J., Chen, J.: K-means clustering versus validation measures: A data distribution perspective. IEEE Transactions on Systems, Man, and Cybernetics, Part B 39(2): 318–331 (2009) (Reproduced with Permission)

Wu, J., Xiong, H., Chen, J.: Adapting the right measures for k-means clustering. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 877–885 (2009) (Reproduced with Permission)

Wu, J., Xiong, H., Chen, J.: COG: Local decomposition for rare class analysis. Data Mining and Knowledge Discovery 20(2): 191–220 (2010) (Reproduced with Permission)

To my dearest wife Maggie, and our lovely son William

Supervisor’s Foreword

In recent years people have witnessed the fast growth of a young discipline: data mining. It aims to find unusual and valuable patterns automatically from huge volumes of data collected from various research and application domains. As a typical interdisciplinary field, data mining draws work from many well-established fields such as databases, machine learning, and statistics, and is grounded in some fundamental techniques such as optimization and visualization. Nevertheless, data mining has successfully found its own way by focusing on real-life data with very challenging characteristics. Mining
large-scale data, high-dimensional data, highly imbalanced data, stream data, graph data, multimedia data, etc., have become one exciting topic after another in data mining. A clear trend is that, with the increasing popularity of Web 2.0 applications, data mining is being advanced to build the next-generation recommender systems, and to explore the abundant knowledge inside the huge online social networks. Indeed, it has become one of the leading forces that direct the progress of business intelligence, a field and a market full of imagination.

This book focuses on one of the core topics of data mining: cluster analysis. In particular, it provides some recent advances in the theories, algorithms, and applications of K-means clustering, one of the oldest yet most widely used algorithms for clustering analysis. From the theoretical perspective, this book highlights the negative uniform effect of K-means in clustering class-imbalanced data, and generalizes the distance functions suitable for K-means clustering to the notion of point-to-centroid distance. From the algorithmic perspective, this book proposes the novel SAIL algorithm and its variants to address the zero-value dilemma of information-theoretic K-means clustering on high-dimensional sparse data. Finally, from the applicative perspective, this book discusses how to select the suitable external measures for K-means clustering validation, and explores how to make innovative use of K-means for other important learning tasks, such as rare class analysis and consensus clustering.

Most of the preliminary works of this book have been published in the proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), and the IEEE International Conference on Data Mining (ICDM), which indicates a strong data-mining thinking of the research in the book. This book is also heavily based on Dr. Wu’s Doctoral Thesis completed in the Research Center for Contemporary Management (RCCM), Key
Research Institute of Humanities and Social Sciences at Universities, Tsinghua University, which won the award of National Excellent Doctoral Dissertation of China in 2010, but with a substantial expansion based on his follow-up research. In general, this book brings together the recent research efforts of Dr. Wu in the cluster analysis field. I believe both the researchers and practitioners in the cluster analysis field and the broader data mining area can benefit from reading this book. Moreover, this book shows the research track of Dr. Wu from a Ph.D. student to a professor, which may be of interest particularly to new Ph.D. students. I want to compliment Dr. Wu for having written such an outstanding book for the data mining community.

Tsinghua University, China, March 2012
Jian Chen
Research Center for Contemporary Management

Acknowledgments

This book was partially supported by the National Natural Science Foundation of China (NSFC) under Grants 70901002, 71171007, 70890080, 70890082, 71028002, and 71031001, the Doctoral Fund of Ministry of Education of China under Grant 20091102120014, a Foundation for the Author of National Excellent Doctoral Dissertation of PR China (granted to Dr. Junjie Wu), and the Program for New Century Excellent Talents in University (granted to Dr. Junjie Wu). The author would also like to thank various research centers for their support, including: (1) Beijing Key Laboratory of Emergency Support Simulation Technologies for City Operations, Beihang University; (2) Research Center for Contemporary Management, Key Research Institute of Humanities and Social Sciences at Universities, Tsinghua University; (3) Jiangsu Provincial Key Laboratory of E-Business. Some internal funds of Beihang University also provided important support to this book, including: (1) The publication fund for graduate students’ English textbooks; (2) The high-quality curriculum construction project for “Advanced Applied Statistics” (for graduate students); (3) The
high-quality curriculum construction project for “Decision Support and Business Intelligence” (for undergraduate students).
