DATA MINING IN TIME SERIES DATABASES SERIES IN MACHINE PERCEPTION AND ARTIFICIAL INTELLIGENCE* Editors: H Bunke (Univ Bern, Switzerland) P S P Wang (Northeastern Univ., USA) Vol 43: Agent Engineering (Eds Jiming Liu, Ning Zhong, Yuan Y Tang and Patrick S P Wang) Vol 44: Multispectral Image Processing and Pattern Recognition (Eds J Shen, P S P Wang and T Zhang) Vol 45: Hidden Markov Models: Applications in Computer Vision (Eds H Bunke and T Caelli) Vol 46: Syntactic Pattern Recognition for Seismic Oil Exploration (K Y Huang) Vol 47: Hybrid Methods in Pattern Recognition (Eds H Bunke and A Kandel ) Vol 48: Multimodal Interface for Human-Machine Communications (Eds P C Yuen, Y Y Tang and P S P Wang) Vol 49: Neural Networks and Systolic Array Design (Eds D Zhang and S K Pal ) Vol 50: Empirical Evaluation Methods in Computer Vision (Eds H I Christensen and P J Phillips) Vol 51: Automatic Diatom Identification (Eds H du Buf and M M Bayer) Vol 52: Advances in Image Processing and Understanding A Festschrift for Thomas S Huwang (Eds A C Bovik, C W Chen and D Goldgof) Vol 53: Soft Computing Approach to Pattern Recognition and Image Processing (Eds A Ghosh and S K Pal) Vol 54: Fundamentals of Robotics — Linking Perception to Action (M Xie) Vol 55: Web Document Analysis: Challenges and Opportunities (Eds A Antonacopoulos and J Hu) Vol 56: Artificial Intelligence Methods in Software Testing (Eds M Last, A Kandel and H Bunke) Vol 57: Data Mining in Time Series Databases (Eds M Last, A Kandel and H Bunke) Vol 58: Computational Web Intelligence: Intelligent Technology for Web Applications (Eds Y Zhang, A Kandel, T Y Lin and Y Yao) Vol 59: Fuzzy Neural Network Theory and Application (P Liu and H Li) *For the complete list of titles in this series, please write to the Publisher Series in Machine Perception and Artificial Intelligence - Vol, 57 DATA MINING IN TIME SERIES DATABASES Editors Mark Last Ben-Gurion LIniversity of the Negeu, Israel Abraham Kandel Zl-Auiv University, Israel University of South Florida, Tampa, LISA Horst Bunke University of Bern, Switzerland vp World Scientific N E W JERSEY * LONDON * SINGAPORE BElJlNG SHANGHAI HONG KONG TAIPEI CHENNAI Published by World Scientific Publishing Co Pte Ltd Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library DATA MINING IN TIME SERIES DATABASES Series in Machine Perception and Artificial Intelligence (Vol 57) Copyright © 2004 by World Scientific Publishing Co Pte Ltd All rights reserved This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA In this case permission to photocopy is not required from the publisher ISBN 981-238-290-9 Typeset by Stallion Press Email: enquiries@stallionpress.com Printed in Singapore by World Scientific Printers (S) Pte Ltd Dedicated to The Honorable Congressman C W Bill Young House of Representatives For his vision and continuous support in creating the National Institute for Systems Test and Productivity at the Computer Science and Engineering Department, University of South Florida This page intentionally left blank Preface Traditional data mining methods are designed to deal with “static” databases, i.e databases where the ordering of records (or other database objects) has nothing to with the patterns of interest Though the assumption of order irrelevance may be sufficiently accurate in some applications, there are certainly many other cases, where sequential information, such as a time-stamp associated with every record, can significantly enhance our knowledge about the mined data One example is a series of stock values: a specific closing price recorded yesterday has a completely different meaning than the same value a year ago Since most today’s databases already include temporal data in the form of “date created”, “date modified”, and other time-related fields, the only problem is how to exploit this valuable information to our benefit In other words, the question we are currently facing is: How to mine time series data? The purpose of this volume is to present some recent advances in preprocessing, mining, and interpretation of temporal data that is stored by modern information systems Adding the time dimension to a database produces a Time Series Database (TSDB) and introduces new aspects and challenges to the tasks of data mining and knowledge discovery These new challenges include: finding the most efficient representation of time series data, measuring similarity of time series, detecting change points in time series, and time series classification and clustering Some of these problems have been treated in the past by experts in time series analysis However, statistical methods of time series analysis are focused on sequences of values representing a single numeric variable (e.g., price of a specific stock) In a real-world database, a time-stamped record may include several numerical and nominal attributes, which may depend not only on the time dimension but also on each other To make the data mining task even more complicated, the objects in a time series may represent some complex graph structures rather than vectors of feature-values vii viii Preface Our book covers the state-of-the-art research in several areas of time series data mining Specific problems challenged by the authors of this volume are as follows Representation of Time Series Efficient and effective representation of time series is a key to successful discovery of time-related patterns The most frequently used representation of single-variable time series is piecewise linear approximation, where the original points are reduced to a set of straight lines (“segments”) Chapter by Eamonn Keogh, Selina Chu, David Hart, and Michael Pazzani provides an extensive and comparative overview of existing techniques for time series segmentation In the view of shortcomings of existing approaches, the same chapter introduces an improved segmentation algorithm called SWAB (Sliding Window and Bottom-up) Indexing and Retrieval of Time Series Since each time series is characterized by a large, potentially unlimited number of points, finding two identical time series for any phenomenon is hopeless Thus, researchers have been looking for sets of similar data sequences that differ only slightly from each other The problem of retrieving similar series arises in many areas such as marketing and stock data analysis, meteorological studies, and medical diagnosis An overview of current methods for efficient retrieval of time series is presented in Chapter by Magnus Lie Hetland Chapter (by Eugene Fink and Kevin B Pratt) presents a new method for fast compression and indexing of time series A robust similarity measure for retrieval of noisy time series is described and evaluated by Michail Vlachos, Dimitrios Gunopulos, and Gautam Das in Chapter Change Detection in Time Series The problem of change point detection in a sequence of values has been studied in the past, especially in the context of time series segmentation (see above) However, the nature of real-world time series may be much more complex, involving multivariate and even graph data Chapter (by Gil Zeira, Oded Maimon, Mark Last, and Lior Rokach) covers the problem of change detection in a classification model induced by a data mining algorithm from time series data A change detection procedure for detecting abnormal events in time series of graphs is presented by Horst Bunke and Miro Kraetzl in Chapter The procedure is applied to abnormal event detection in a computer network Classification of Time Series Rather than partitioning a time series into segments, one can see each time series, or any other sequence of data points, as a single object Classification and clustering of such complex Preface ix “objects” may be particularly beneficial for the areas of process control, intrusion detection, and character recognition In Chapter 7, Carlos J Alonso Gonz´ alez and Juan J Rodr´ıguez Diez present a new method for early classification of multivariate time series Their method is capable of learning from series of variable length and able of providing a classification when only part of the series is presented to the classifier A novel concept of representing time series by median strings (see Chapter 8, by Xiaoyi Jiang, Horst Bunke, and Janos Csirik) opens new opportunities for applying classification and clustering methods of data mining to sequential data As indicated above, the area of mining time series databases still includes many unexplored and insufficiently explored issues Specific suggestions for future research can be found in individual chapters In general, we believe that interesting and useful results can be obtained by applying the methods described in this book to real-world sets of sequential data Acknowledgments The preparation of this volume was partially supported by the National Institute for Systems Test and Productivity at the University of South Florida under U.S Space and Naval Warfare Systems Command grant number N00039-01-1-2248 We also would like to acknowledge the generous support and cooperation of: Ben-Gurion University of the Negev, Department of Information Systems Engineering, University of South Florida, Department of Computer Science and Engineering, Tel-Aviv University, College of Engineering, The Fulbright Foundation, The US-Israel Educational Foundation January 2004 Mark Last Abraham Kandel Horst Bunke ... Artificial Intelligence Methods in Software Testing (Eds M Last, A Kandel and H Bunke) Vol 57: Data Mining in Time Series Databases (Eds M Last, A Kandel and H Bunke) Vol 58: Computational Web Intelligence:... Cataloguing -in- Publication Data A catalogue record for this book is available from the British Library DATA MINING IN TIME SERIES DATABASES Series in Machine Perception and Artificial Intelligence... challenges include: finding the most efficient representation of time series data, measuring similarity of time series, detecting change points in time series, and time series classification and clustering