Data mining ebook: Survey of Text Mining: Clustering, Classification, and Retrieval (scan, OCR, 2004) (by laxxuss)

Springer
New York Berlin Heidelberg Hong Kong London Milan Paris Tokyo

Michael W. Berry, Editor

Survey of Text Mining
Clustering, Classification, and Retrieval

Scanned by Velocity

With 57 Illustrations

Springer

Michael W. Berry
Department of Computer Science
University of Tennessee
203 Claxton Complex
Knoxville, TN 37996-3450, USA
berry@cs.utk.edu

Cover illustration: Visualization of three major clusters in the L.A. Times news database when document vectors are projected into the 3-D subspace spanned by the three most relevant axes determined using COV rescale. This figure appears on p. 118 of the text.

Library of Congress Cataloging-in-Publication Data
Survey of text mining : clustering, classification, and retrieval / editor, Michael W. Berry.
p. cm.
Includes bibliographical references and index.
ISBN 0-387-95563-1 (alk. paper)
Data mining--Congresses. Cluster analysis--Congresses. Discriminant analysis--Congresses. I. Berry, Michael W.
QA76.9.D343S69 2003
006.3--dc21
2003042434

ISBN 0-387-95563-1
Printed on acid-free paper.

© 2004 Springer-Verlag New York, Inc.
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New York, NY 10010, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

SPIN 10890871
www.springer-ny.com
Springer-Verlag New York Berlin Heidelberg
A member of BertelsmannSpringer Science + Business Media GmbH

Contents

Preface
Contributors

I Clustering and Classification

1 Cluster-Preserving Dimension Reduction Methods for Efficient Classification of Text Data
Peg Howland and Haesun Park
1.1 Introduction
1.2 Dimension Reduction in the Vector Space Model
1.3 A Method Based on an Orthogonal Basis of Centroids
1.3.1 Relationship to a Method from Factor Analysis
1.4 Discriminant Analysis and Its Extension for Text Data
1.4.1 Generalized Singular Value Decomposition
1.4.2 Extension of Discriminant Analysis
1.4.3 Equivalence for Various and
1.5 Trace Optimization Using an Orthogonal Basis of Centroids
1.6 Document Classification Experiments
1.7 Conclusion
References

2 Automatic Discovery of Similar Words
Pierre P. Senellart and Vincent D. Blondel
2.1 Introduction
2.2 Discovery of Similar Words from a Large Corpus
2.2.1 A Document Vector Space Model
2.2.2 A Thesaurus of Infrequent Words
2.2.3 The SEXTANT System
2.2.4 How to Deal with the Web
2.3 Discovery of Similar Words in a Dictionary
2.3.1 Introduction
2.3.2 A Generalization of Kleinberg's Method
2.3.3 Other Methods
2.3.4 Dictionary Graph
2.3.5 Results
2.3.6 Future Perspectives
2.4 Conclusion
References

3 Simultaneous Clustering and Dynamic Keyword Weighting for Text Documents
Hichem Frigui and Olfa Nasraoui
3.1 Introduction
3.2 Simultaneous Clustering and Term Weighting of Text Documents
3.3 Simultaneous Soft Clustering and Term Weighting of Text Documents
3.4 Robustness in the Presence of Noise Documents
3.5 Experimental Results
3.5.1 Simulation Results on Four-Class Web Text Data
3.5.2 Simulation Results on 20 Newsgroups Data
3.6 Conclusion
References

4 Feature Selection and Document Clustering
Inderjit Dhillon, Jacob Kogan, and Charles Nicholas
4.1 Introduction
4.2 Clustering Algorithms
4.2.1 k-Means Clustering Algorithm
4.2.2 Principal Direction Divisive Partitioning
4.3 Data and Term Quality
4.4 Term Variance Quality
4.5 Same Context Terms
4.5.1 Term Profiles
4.5.2 Term Profile Quality
4.6 Spherical Principal Directions Divisive Partitioning
4.6.1 Two-Cluster Partition of Vectors on the Unit Circle
4.6.2 Clustering with sPDDP
4.7 Future Research
References

II Information Extraction and Retrieval

5 Vector Space Models for Search and Cluster Mining
Mei Kobayashi and Masaki Aono
5.1 Introduction
5.2 Vector Space Modeling (VSM)
5.2.1 The Basic VSM Model for IR
5.2.2 Latent Semantic Indexing (LSI)
5.2.3 Covariance Matrix Analysis (COV)
5.2.4 Comparison of LSI and COV
5.3 VSM for Major and Minor Cluster Discovery
5.3.1 Clustering
5.3.2 Rescaling: Ando's Algorithm
5.3.3 Dynamic Rescaling of LSI
5.3.4 Dynamic Rescaling of COV
5.4 Implementation Studies
5.4.1 Implementations with Artificially Generated Datasets
5.4.2 Implementations with L.A. Times News Articles
5.5 Conclusions and Future Work
References

6 HotMiner: Discovering Hot Topics from Dirty Text
Malu Castellanos
6.1 Introduction
6.2 Related Work
6.3 Technical Description
6.3.1 Preprocessing
6.3.2 Clustering
6.3.3 Postfiltering
6.3.4 Labeling
6.4 Experimental Results
6.5 Technical Description
6.5.1 Thesaurus Assistant
6.5.2 Sentence Identifier
6.5.3 Sentence Extractor
6.6 Experimental Results
6.7 Mining Case Excerpts for Hot Topics
6.8 Conclusions
References

7 Combining Families of Information Retrieval Algorithms Using Metalearning
Michael Cornelson, Ed Greengrass, Robert L. Grossman, Ron Karidi, and Daniel Shnidman
7.1 Introduction
7.2 Related Work
7.3 Information Retrieval
7.4 Metalearning
7.5 Implementation
7.6 Experimental Results
7.7 Further Work
7.8 Summary and Conclusion
References

III Trend Detection

8 Trend and Behavior Detection from Web Queries
Peiling Wang, Jennifer Bownas, and Michael W. Berry
8.1 Introduction
8.2 Query Data and Analysis
8.2.1 Descriptive Statistics of Web Queries
8.2.2 Trend Analysis of Web Searching
8.3 Zipf's Law
8.3.1 Natural Logarithm Transformations
8.3.2 Piecewise Trendlines
8.4 Vocabulary Growth
8.5 Conclusions and Further Studies
References

9 A Survey of Emerging Trend Detection in Textual Data Mining
April Kontostathis, Leon M. Galitsky, William M. Pottenger, Soma Roy, and Daniel J. Phelps
9.1 Introduction
9.2 ETD Systems
9.2.1 Technology Opportunities Analysis (TOA)
9.2.2 CIMEL: Constructive, Collaborative Inquiry-Based Multimedia E-Learning
9.2.3 TimeMines
9.2.4 New Event Detection
9.2.5 ThemeRiver™
9.2.6 PatentMiner
9.2.7 HDDI™
9.2.8 Other Related Work
9.3 Commercial Software Overview
9.3.1 Autonomy
9.3.2 SPSS LexiQuest
9.3.3 ClearForest
9.4 Conclusions and Future Work
9.5 Industrial Counterpoint: Is ETD Useful?
Dr. Daniel J. Phelps, Leader, Information Mining Group, Eastman Kodak
References

Bibliography
Index

... topic areas in text mining: I Clustering and Classification, II Information Extraction and Retrieval, and III Trend Detection. In Part I (Clustering and Classification), Howland and Park present...

Format
Pages	262
Size	5.83 MB