ACTIVE MINING

Frontiers in Artificial Intelligence and Applications
Series Editors: J. Breuker, R. Lopez de Mantaras, M. Mohammadian, S. Ohsuga and W. Swartout

Volume 79

Volume in the subseries Knowledge-Based Intelligent Engineering Systems
Editor: L.C. Jain

Previously published in this series:
Vol. 78, T. Vidal and P. Liberatore (Eds.), STAIRS 2002
Vol. 77, F. van Harmelen (Ed.), ECAI 2002
Vol. 76, P. Sinčák et al. (Eds.), Intelligent Technologies - Theory and Applications
Vol. 75, I.F. Cruz et al. (Eds.), The Emerging Semantic Web
Vol. 74, M. Blay-Fornarino et al. (Eds.), Cooperative Systems Design
Vol. 73, H. Kangassalo et al. (Eds.), Information Modelling and Knowledge Bases XIII
Vol. 72, A. Namatame et al. (Eds.), Agent-Based Approaches in Economic and Social Complex Systems
Vol. 71, J.M. Abe and J.I. da Silva Filho (Eds.), Logic, Artificial Intelligence and Robotics
Vol. 70, B. Verheij et al. (Eds.), Legal Knowledge and Information Systems
Vol. 69, N. Baba et al. (Eds.), Knowledge-Based Intelligent Information Engineering Systems & Allied Technologies
Vol. 68, J.D. Moore et al. (Eds.), Artificial Intelligence in Education
Vol. 67, H. Jaakkola et al. (Eds.), Information Modelling and Knowledge Bases XII
Vol. 66, H.H. Lund et al. (Eds.), Seventh Scandinavian Conference on Artificial Intelligence
Vol. 65, In production
Vol. 64, J. Breuker et al. (Eds.), Legal Knowledge and Information Systems
Vol. 63, I. Gent et al. (Eds.), SAT2000
Vol. 62, T. Hruska and M. Hashimoto (Eds.), Knowledge-Based Software Engineering
Vol. 61, E. Kawaguchi et al. (Eds.), Information Modelling and Knowledge Bases XI
Vol. 60, P. Hoffman and D. Lemke (Eds.), Teaching and Learning in a Network World
Vol. 59, M. Mohammadian (Ed.), Advances in Intelligent Systems: Theory and Applications
Vol. 58, R. Dieng et al. (Eds.), Designing Cooperative Systems
Vol. 57, M. Mohammadian (Ed.), New Frontiers in Computational Intelligence and its Applications
Vol. 56, M.I. Torres and A. Sanfeliu (Eds.), Pattern Recognition and Applications
Vol. 55, G. Cumming et al. (Eds.), Advanced Research in Computers and Communications in Education
Vol. 54, W. Horn (Ed.), ECAI 2000
Vol. 53, E. Motta, Reusable Components for Knowledge Modelling
Vol. 52, In production
Vol. 51, H. Jaakkola et al. (Eds.), Information Modelling and Knowledge Bases X
Vol. 50, S.P. Lajoie and M. Vivet (Eds.), Artificial Intelligence in Education
Vol. 49, P. McNamara and H. Prakken (Eds.), Norms, Logics and Information Systems
Vol. 48, P. Navrat and H. Ueno (Eds.), Knowledge-Based Software Engineering
Vol. 47, M.T. Escrig and F. Toledo, Qualitative Spatial Reasoning: Theory and Practice
Vol. 46, N. Guarino (Ed.), Formal Ontology in Information Systems
Vol. 45, P.-J. Charrel et al. (Eds.), Information Modelling and Knowledge Bases IX

ISSN: 0922-6389

Active Mining
New Directions of Data Mining

Edited by
Hiroshi Motoda
Division of Intelligent Systems Science, The Institute of Scientific and Industrial Research, Osaka University, Osaka, Japan

IOS Press
Ohmsha
Amsterdam • Berlin • Oxford • Tokyo • Washington, DC

© 2002, Hiroshi Motoda

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission from the publisher.

ISBN 58603 264 X (IOS Press)
ISBN 274 90521 C3055 (Ohmsha)
Library of Congress Control Number: 2002106944

Publisher
IOS Press
Nieuwe Hemweg 6B
1013 BG Amsterdam
The Netherlands
fax: +31 20 620 3419
e-mail: order@iospress.nl

Distributor in the UK and Ireland
IOS Press/Lavis Marketing
73 Lime Walk
Headington
Oxford OX3 7AD
England
fax: +44 1865 75 0079

Distributor in the USA and Canada
IOS Press, Inc.
5795-G Burke Centre Parkway
Burke, VA 22015
USA
fax: +1 703 323 3668
e-mail: iosbooks@iospress.com

Distributor in Germany, Austria and Switzerland
IOS Press/LSL.de
Gerichtsweg 28
D-04103 Leipzig
Germany
fax: +49 341 995 4255

Distributor in Japan
Ohmsha, Ltd.
3-1 Kanda Nishiki-cho
Chiyoda-ku
Tokyo 101-8460
Japan
fax: +81 3233 2426

LEGAL NOTICE
The publisher is not responsible for the use which might be made of the
following information.

PRINTED IN THE NETHERLANDS

Preface

Our ability to collect data, be it in business, government, science, or perhaps personal life, has been increasing at a dramatic rate. However, our ability to analyze and understand massive data lags far behind our ability to collect it. The value of data no longer lies in how much of it we have. Rather, the value lies in how quickly and how effectively the data can be reduced, explored, manipulated and managed.

Knowledge Discovery and Data Mining (KDD) is an emerging technique that extracts implicit, previously unknown, and potentially useful information (or patterns) from data. Recent advances made through extensive studies and real-world applications reveal that, no matter how powerful computers are now or will be in the future, KDD researchers and practitioners must consider how to manage three things: ever-growing data, which is, ironically, due to the extensive use of computers and the ease of data collection; ever-increasing forms of data, which different applications require us to handle; and ever-changing requirements for new data and mining targets as new evidence is collected and new findings are made. In short, the need for 1) identifying and collecting the relevant data from a huge information search space, 2) mining useful knowledge from different forms of massive data efficiently and effectively, and 3) promptly reacting to situation changes and giving necessary feedback to both the data collection and mining steps, is ever increasing in this era of information overload.

Active mining is a collection of activities, each solving a part of the above need, but collectively achieving the various mining objectives. By "collectively achieving" we mean that the total effect outperforms the simple sum of the effects that each individual effort can bring. Said differently, a spiral effect of these three interleaving steps is the target to be pursued. To achieve this goal, the initial action is to explore mechanisms of 1) active information
collection, where necessary information is effectively searched and preprocessed, 2) user-centered active mining, where various forms of information sources are effectively mined, and 3) active user reaction, where the mined knowledge is easily assessed and prompt feedback is made possible.

This book is a joint effort from leading and active researchers in Japan on the theme of active mining. It provides a forum for presenting a wide variety of research work, ranging from theories, methodologies and algorithms to their applications. It is a timely report on the forefront of data mining. It offers a contemporary overview of modern solutions with real-world applications, shares hard-learned experiences, and sheds light on the future development of active mining.

This collection evolved from a project on active mining, and the papers in it were selected from among over 40 submissions. The book consists of three parts. Each part corresponds to one of the three mechanisms mentioned above. Namely, Part I consists of chapters on Data Collection, Part II on User-centered Mining, and Part III on User Reaction and Interaction. Some of the chapters span more than one of these topics, but each had to be placed in one of the three parts. The topics covered in the 27 chapters include online text mining, clustering for information gathering, online monitoring of Web page updates, technical term classification, active information gathering, substructure mining from Web and graph-structured data, Web community discovery and classification, spatial data mining, automatic configuration of mining tools, worst-case analysis of exceptional rule mining, data squashing applied to boosting, outlier detection, meta-learning for evidence-based medicine, knowledge acquisition from both human experts and data, data visualization, active mining in the business application world, meta-analysis, and many more.

This book is intended for a wide audience, from graduate students who wish to learn basic concepts and principles of data
mining to seasoned practitioners and researchers who want to take advantage of state-of-the-art developments in active mining. The book can be used as a reference to find recent techniques and their applications, as a starting point to find other related research topics on data collection, data mining and user interaction, or as a stepping stone to develop novel theories and techniques that meet the exciting challenges ahead of us. Active mining is a new direction in the knowledge discovery process for real-world applications that handle huge amounts of data with actual user needs.

Hiroshi Motoda

Acknowledgments

As the field of data mining advances, both the interest in and the need for integrating its various components intensify for effective and successful data mining. A lot of research ensues. This book project resulted from the active mining initiatives that started during 2001 as a grant-in-aid for scientific research on priority areas by the Japanese Ministry of Education, Culture, Sports, Science and Technology. We received many suggestions and much support from researchers in the machine learning, data mining and database communities from the very beginning of this book project. The completion of this book is particularly due to the contributors from all areas of data mining research in Japan and to their ardent and creative research work.

The editorial members of this project kindly provided detailed and constructive comments and suggestions to help clarify terms, concepts, and writing in this truly multi-disciplinary collection. I wish to express my sincere thanks to the following members: Masayuki Numao, Yukio Ohsawa, Einoshin Suzuki, Takao Terano, Shusaku Tsumoto and Takahira Yamaguchi. We are also grateful to the editorial staff of IOS Press, especially Carry Koolbergen and Anne Marie de Rover, for their swift and timely help in bringing this book to a successful conclusion. During the development of this book, I was generously supported by our colleagues and
friends at Osaka University.

Contents

Preface, Hiroshi Motoda
Acknowledgments

I. Data Collection
Toward Active Mining from On-line Scientific Text Abstracts Using Pre-existing Sources, TuanNam Tran and Masayuki Numao
Data Mining on the WAVEs - Word-of-mouth-Assisting Virtual Environments, Masayuki Numao, Masashi Yoshida and Yusuke Ito
Immune Network-based Clustering for WWW Information Gathering/Visualization, Yasufumi Takama and Kaoru Hirota
Interactive Web Page Retrieval with Relational Learning-based Filtering Rules, Masayuki Okabe and Seiji Yamada
Monitoring Partial Update of Web Pages by Interactive Relational Learning, Seiji Yamada and Yuki Nakai
Context-based Classification of Technical Terms Using Support Vector Machines, Masashi Shimbo, Hiroyasu Yamada and Yuji Matsumoto
Intelligent Tickers: An Information Integration Scheme for Active Information Gathering, Yasukiro Kitamura

II. User Centered Mining
Discovery of Concept Relation Rules Using an Incomplete Key Concept Dictionary, Shigeaki Sakurai, Yumi Ichimura and Akihiro Suyama
Mining Frequent Substructures from Web, Kenji Abe, Shinji Kawasoe, Tatsuya Asai, Hiroki Arimura, Hiroshi Sakamoto and Setsuo Arikawa
Towards the Discovery of Web Communities from Input Keywords to a Search Engine, Tsuyoshi Murata
Temporal Spatial Index Techniques for OLAP in Traffic Data Warehouse, Hiroyuki Kawano
Knowledge Discovery from Structured Data by Beam-wise Graph-Based Induction, Takashi Matsuda, Hiroshi Motoda, Tetsuya Yoshida and Takashi Washio
PAGA Discovery: A Worst-Case Analysis of Rule Discovery for Active Mining, Einoshin Suzuki
Evaluating the Automatic Composition of Inductive Applications Using StatLog Repository of Data Set, Hidenao Abe and Takahira Yamaguchi
Fast Boosting Based on Iterative Data Squashing, Yuta Choki and Einoshin Suzuki
Reducing Crossovers in Reconciliation Graphs Using the Coupling Cluster Exchange Method with a Genetic Algorithm, Hajime Kitakami and Yasuma Mori
Outlier Detection using Cluster Discriminant Analysis, Arata Sato, Takashi Suenaga and Hitoshi Sakano