DSpace at VNU: A parallel dimensionality reduction for time-series data and some of its applications
Int. J. Intelligent Information and Database Systems, Vol. 5, No. 1, 2011

A parallel dimensionality reduction for time-series data and some of its applications

Hoang Chi Thanh*
Department of Informatics, Hanoi University of Science, VNUH, 334 Nguyen Trai Rd., Hanoi, Vietnam
E-mail: thanhhc@vnu.vn
*Corresponding author

Nguyen Quang Thanh
Da Nang Department of Information and Communication, 15 Quang Trung Str., Da Nang, Vietnam
E-mail: thanhnq@dsp.vn

Abstract: Subsequence matching in a large time-series database is a well-studied problem, and many methods have been proposed that address it to an adequate extent. One effective idea is to reduce the dimensionality of the time-series data properly. In this paper, we propose a new method for reducing the dimensionality of high-dimensional time-series data. The method is simpler than existing ones based on the discrete Fourier transform and the discrete cosine transform. Furthermore, our dimensionality reduction can be executed in parallel. The method is applied to the time-series data matching problem, where it drastically decreases the complexity of the corresponding algorithm. The method preserves planar geometric blocks, and it is also applied to minimum bounding rectangles.

Keywords: time-series data; database; dimensionality reduction; matching problem; minimum bounding rectangle; MBR.

Reference to this paper should be made as follows: Thanh, H-C. and Thanh, N-Q. (2011) ‘A parallel dimensionality reduction for time-series data and some of its applications’, Int. J. Intelligent Information and Database Systems, Vol. 5, No. 1, pp.39–48.

Biographical notes: Hoang Chi Thanh is an Associate Professor at Hanoi University of Science, Vietnam. He received his PhD in Computer Science from Warsaw Technical University, Poland and his BSc in Computational Mathematics from The University of Hanoi, Vietnam. Since 1974 he has been working for The University of Hanoi (currently Hanoi University of Science). From 2000 to 2008 he was the Head of
the Department of Informatics. Since 2004 he has been the Director of Science Co., Ltd. He has published more than 40 refereed papers and eight books. He is the supervisor of three PhD students. His current research interests include concurrency theory, combinatorics, data mining and knowledge-based systems.

Nguyen Quang Thanh is a PhD student at Hanoi University of Science, Vietnam. He received his MSc in Information Technology and his BSc in Mathematics from Can Tho University, Vietnam. Since 1999 he has been working for Da Nang Department of Information and Communication, Vietnam. His research interests include data mining, knowledge-based systems and network security.

Copyright © 2011 Inderscience Enterprises Ltd.

1 Introduction

Time-series data are sequences of real numbers representing values at specific points in time. Bid and ask prices of stock items, exchange rates, weather data and human speech signals are typical examples of time-series data. The data stored in a database are called data sequences. The aim of the subsequence matching problem in a large time-series database is to find the data sequences in the database that are similar to a given query sequence. This problem has attracted considerable interest because of its applications, and many methods have been proposed that address it to an adequate extent (Agrawal et al., 1993; Keogh et al., 2000; Keogh et al., 2001; Faloutsos et al., 2001; Moon et al., 2002).

One good way to increase the matching speed is a proper dimensionality reduction for high-dimensional time-series data. In 2007, Moon proposed a data transformation based on the discrete Fourier transform, and Moon and Kim then presented a data transformation based on the discrete cosine transform.

In this paper we present another dimensionality reduction for high-dimensional time-series data. The method splits a high-dimensional time series into parts that are as equal in time scale as possible and then
takes the average of each part. This reduction is simpler than the existing ones mentioned above, and it can be performed in parallel. It therefore decreases the time needed for ‘narrowing’ the data and speeds up the matching process in a large time-series database. We also use this dimensionality reduction for a special type of time-series data: minimum bounding rectangles (MBRs).

This paper is organised as follows. In Section 2 we present a dimensionality reduction function for high-dimensional time-series data and point out some of its properties. Section 3 presents applications of the dimensionality reduction function to time-series data matching and to MBRs. When applying this reduction function to MBRs, we show that it becomes safe. Some concluding remarks are given in the last section.

2 Dimensionality reduction for time-series data

Let T[1..n] be a time series. It consists of n real numbers, so it is called n-dimensional data. The higher the dimensionality n, the more difficult the data are to store, search and match. The question is therefore how to ‘narrow’ the data. In other words, we have to construct an operation that transforms high-dimensional time-series data with hundreds or thousands of dimensions into low-dimensional time-series data with only a few dimensions. Instead of working on the high-dimensional data, one can then do the same work on the low-dimensional data with high performance. To do so, we construct dimensionality reduction functions for time-series data. Each such function is a mapping $F: \mathbb{R}^n \to \mathbb{R}^m$.

Let F be any dimensionality reduction function transforming n-dimensional time-series data into m-dimensional time-series data, with $m < n$. We are interested only in those functions that satisfy the following requirement.

Definition 2.1: A dimensionality reduction function F is proper if, for any pair of n-dimensional time-series data X and Y,

$$D_m(F(X), F(Y)) \le D_n(X, Y) \qquad (2.1)$$

where $D_n$ and $D_m$ are the distance functions of the n-dimensional space and the m-dimensional space, respectively.

Each proper dimensionality reduction function on time-series data is therefore a shrinking (non-expanding) mapping. The properness of a reduction function guarantees no false dismissals for range queries: if $D_n(X, Y) \le \varepsilon$ for a query tolerance $\varepsilon$, then by (2.1) $D_m(F(X), F(Y)) \le \varepsilon$ as well, so every true match also passes the test in the reduced space.

Let T[1..n] be an n-dimensional time series and let m be a positive integer such that < m