An efficient implementation of EMD algorithm for motif discovery in time series data


Int J Data Mining, Modelling and Management, Vol. 8, No. 2, 2016

An efficient implementation of EMD algorithm for motif discovery in time series data

Duong Tuan Anh* and Nguyen Van Nhat
Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, 268 Ly Thuong Kiet, Dist. 10, Ho Chi Minh City, Vietnam
Email: dtanh@cse.hcmut.edu.vn
Email: nhatbk@gmail.com
*Corresponding author

Abstract: The extended motif discovery (EMD) algorithm, developed by Tanaka et al. (2005), is a well-known time series motif discovery algorithm based on the minimum description length principle. One unique feature of the EMD algorithm is that it can determine the suitable length of the motif and hence does not require the length of the motif as a parameter supplied by the user. Another interesting feature of the EMD is that the lengths of the instances of a motif can differ slightly from each other, and hence Tanaka et al. suggested that dynamic time warping (DTW) distance should be used to calculate the distances between the motif instances in this case. Due to this second feature, the EMD algorithm has a high computational complexity and is not easy to implement in practice. In this paper, we present an efficient implementation of the EMD algorithm in which we apply a homothetic transformation to convert all pattern instances of different lengths into the same length, so that we can easily calculate Euclidean distances between them. This modification accelerates the execution of the EMD remarkably and makes it easier to implement. Experimental results on seven real-world time series datasets demonstrate the effectiveness of our EMD implementation method in time series motif discovery.

Keywords: time series; motif discovery; MD; EMD algorithm; homothetic transformation.

Reference to this paper should be made as follows: Anh, D.T. and Nhat, N.V. (2016) 'An efficient implementation of EMD algorithm for motif discovery in time series data', Int J Data Mining, Modelling and
Management, Vol. 8, No. 2, pp.180–194.

Biographical notes: Duong Tuan Anh received his Doctorate of Engineering in Computer Science from the School of Advanced Technologies at the Asian Institute of Technology, Bangkok, Thailand, where he also received his Master of Engineering in the same branch. He is currently an Associate Professor of Computer Science at the Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology. His research is in the fields of metaheuristics, temporal databases and time series data mining. He is currently the head of the Time Series Data Mining Research Group in his faculty. He has authored more than 70 scientific papers.

Nguyen Van Nhat received his BEng in Computer Science from the Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, Vietnam, where he also recently received his Master degree in the same branch. His main research interest is in time series data mining.

Copyright © 2016 Inderscience Enterprises Ltd.

1 Introduction

A time series is a sequence of real numbers measured at equal time intervals. Time series data arise in many applications of various areas ranging from science, engineering, business, finance, economy and medicine to government. Nowadays, time series datasets in several applications have become very large, with the scale of multi-terabytes, and data mining tasks on time series data of such scale become very challenging.

A motif is a previously unknown pattern that appears frequently in a long time series. Time series motif discovery (MD) is an important problem with applications in a variety of areas that range from finance to medicine. Since the first formalisation by Lin et al. (2002), several time series MD algorithms have been proposed (Lin et al., 2002; Chiu et al., 2003; Mueen et al., 2009; Tanaka and Uehara, 2003; Tanaka et al., 2005; Tang and Liao, 2008; Yankov et al., 2007). The first algorithm that can find motifs in linear time
is Random Projection (RP), developed by Chiu et al. (2003). This algorithm is based on research on MD from the bioinformatics community (Tompa and Buhler, 2001). It is an iterative approach and uses as its base structure a collision matrix whose rows and columns are the symbolic aggregate approximation (SAX) representations of each time series subsequence (TSS). The subsequences are obtained using a sliding window approach. At each iteration, it selects certain positions of each word as the mask and traverses the word list; for each match, the corresponding collision matrix entry is incremented. In the end, the largest entries in the collision matrix are selected as motif candidates.

Mueen et al. (2009) proposed the first exact MD algorithm, called the MK algorithm, which works directly on raw time series data. This algorithm uses the 'nearest neighbour' definition of motif and applies three techniques to speed up the algorithm: exploiting the symmetry of Euclidean distance, early abandoning, and using reference points. One of the major disadvantages of the RP algorithm and the MK algorithm is that they still execute very slowly on large time series data.

With most of the above-mentioned algorithms, the user has to determine in advance the length of the motif and the distance threshold (range) for subsequence matching, which are the two parameters of most MD algorithms. Since the motif is previously unknown, determining the length of the motif in advance is actually a very difficult task. A few algorithms have been proposed for finding time series motifs with different or variable lengths (Tang and Liao, 2008). However, so far, to the best of our knowledge, there has been no time series MD algorithm that can automatically determine the suitable length of the motif in a time series.

Tanaka and Uehara (2003) proposed the MD algorithm, an algorithm that can find motifs in time series data using the minimum description length (MDL) principle. First, it transforms the data into a sequence of
symbols. Next, it discovers the motif by calculating a description length of a pattern based on the MDL principle. That means the suitable length of the motif is determined automatically by the MD algorithm. The MD algorithm is useful and effective under the assumption that the lengths of all the instances of the motif are identical. However, in the real world, the lengths of the instances of a motif are often slightly different from each other. To overcome this limitation, Tanaka et al. (2005) proposed the extended variant of MD, called the extended motif discovery (EMD) algorithm, which includes the two following modifications. First, the EMD algorithm transforms the symbol sequence that represents the behaviour of a given time series into a form from which motif instances of different lengths (DLs) can be extracted. Second, it uses a new definition of the description length of a time series in order to process not only motif instances of the same length but also motif instances of different lengths. Since in the EMD algorithm the lengths of the instances of a motif can differ slightly from each other, Tanaka et al. (2005) suggested that dynamic time warping (DTW) distance should be used to calculate the distances between the motif instances in this case. Due to this suggestion, the EMD algorithm becomes a complicated algorithm with high computational complexity that is not easy to implement in practice.

In this paper, we present an efficient implementation of the EMD algorithm in which we apply a homothetic transformation to convert all motif instances of DLs into the same length, so that we can easily calculate Euclidean distances between them. This modification, which avoids using DTW distance, brings a remarkable improvement to the EMD algorithm in terms of time efficiency and makes the EMD algorithm much easier to implement. Experimental results on seven real-world time series datasets demonstrate the effectiveness of our EMD implementation method in time series motif discovery.
The rest of the paper is organised as follows. In Section 2, we give some essential definitions and briefly explain the basic ideas of the EMD algorithm. Section 3 introduces our proposed implementation method for the EMD algorithm. Section 4 reports on the experiments with the EMD algorithm using our implementation method, in comparison to the EMD algorithm using DTW distance and to RP. Section 5 gives some conclusions and remarks for future work.

2 Background

In this section, we provide some essential definitions and briefly describe the EMD algorithm and the DTW distance.

2.1 Definitions

Definition 1 (Time series): A time series T = t1, …, tN is an ordered set of N real values measured at equal intervals.

Definition 2 (Similarity distance): D(s1, s2) is a positive value used to measure the difference between two time series s1 and s2, and depends on the distance measure used. If D(s1, s2) < ε, then s1 is similar to s2.

Definition 3 (TSS): Given a time series T of length N, a subsequence C of T is a sampling of length n < N of contiguous positions from T, that is, C = tp, …, tp+n–1 for 1 ≤ p ≤ N – n + 1.

Definition 4 (Time series motif): Given a time series T, a subsequence C is called the (most significant) motif of T if it has the highest count of subsequences that are similar to it. All the subsequences that are similar to the motif are called instances of the motif.

All instances of a motif must conform to the following constraints:

• Behaviour constraint: Each instance has the same behaviour (temporal variation).

• Distance constraint: The distances between all possible pairs of the instances of a motif are lower than a threshold R, where R is a user-defined distance threshold.

• Non-overlapping constraint: Instances of a motif should not overlap each other.

2.2 The main ideas of the EMD algorithm

In this subsection, we present an overview of the EMD algorithm. The algorithm dynamically detects a motif from one-dimensional time series data based on MDL, an
information-theoretic criterion.

First, we prepare the analysis window of length Tmin, where Tmin is the minimum length of the motif for the data. By shifting the analysis window, we obtain all TSSs of length Tmin in the data. Each TSS is represented by its piecewise aggregate approximation (PAA) representation (Keogh et al., 2001). The PAA representation is a vector expression obtained by dividing a time series into equal-sized segments and calculating the average value in each segment. This PAA representation is then transformed into a SAX symbol sequence by applying the SAX discretisation method (Lin et al., 2003). After this step, we obtain a SAX symbol sequence for every TSS.

To obtain a sequence of symbols that represents the behaviour of a time series T, every SAX symbol sequence extracted from T (also called a SAX word) is transformed into a single unique symbol. This symbol is called a 'behaviour symbol (BS)', since every SAX symbol sequence represents the behaviour of one TSS. Finally, we obtain a 'behaviour symbol sequence (BS sequence)' of the time series T. For example, in Figure 1, the BS 'A' is assigned to the SAX symbol sequence 'cbba', 'B' is assigned to the SAX symbol sequence 'bbac' and 'C' is assigned to the SAX symbol sequence 'bacb'. In this way, by introducing behaviour, the EMD can detect TSSs with the same behaviour. Another reason for introducing BSs is that they reduce the memory space needed for symbol sequences and the search space for discovering a motif, compared with SAX symbol sequences.

Figure 1 Illustration of transforming a time series into a symbol sequence, (a) each TSS is transformed into a SAX symbol sequence (b) a 'BS' is assigned for every SAX symbol sequence

The BS sequence C represents the behaviour of the given time series T. Each BS in C represents the behaviour of one TSS of length Tmin. All TSSs that correspond to the BS subsequence pattern 'ABC' have the length Tmin + (3 – 1) (in SAX symbols), and it is a pattern of instances of the same length.
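As an illustration of the pipeline just described, PAA reduction, SAX discretisation, and mapping each SAX word to a unique behaviour symbol, the following Python sketch may help. The function names, the breakpoint table and the sliding step are our own illustrative choices, not part of the EMD specification.

```python
from collections import defaultdict

import numpy as np

# Gaussian breakpoints for common SAX alphabet sizes (Lin et al., 2003).
BREAKPOINTS = {3: [-0.43, 0.43], 4: [-0.67, 0.0, 0.67]}

def paa(ts, n_segments):
    """Piecewise aggregate approximation: mean of each equal-sized segment."""
    segments = np.array_split(np.asarray(ts, dtype=float), n_segments)
    return np.array([s.mean() for s in segments])

def sax_word(ts, n_segments, alphabet_size=4):
    """Z-normalise, reduce with PAA, then discretise against the breakpoints."""
    ts = np.asarray(ts, dtype=float)
    ts = (ts - ts.mean()) / (ts.std() + 1e-12)
    cuts = BREAKPOINTS[alphabet_size]
    return "".join("abcdefghij"[np.searchsorted(cuts, v)] for v in paa(ts, n_segments))

def behaviour_symbols(ts, t_min, n_segments, alphabet_size=4):
    """Assign one unique behaviour symbol (BS) per distinct SAX word,
    sliding a window of length t_min one PAA segment at a time.
    (Sketch: assumes fewer than 26 distinct SAX words.)"""
    step = t_min // n_segments  # one PAA segment per shift
    table = defaultdict(lambda: chr(ord("A") + len(table)))
    return [table[sax_word(ts[i:i + t_min], n_segments, alphabet_size)]
            for i in range(0, len(ts) - t_min + 1, step)]
```

On a periodic series, identical window contents receive the same BS, which is exactly what lets the EMD group subsequences with the same behaviour.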
To make the MD algorithm able to work with pattern instances of DLs, the concept of BS subsequence should be modified. First, BS subsequences in which the same BS appears repeatedly are identified and the number of its occurrences is counted (this number is called the BS length). Then each such BS subsequence is transformed into one unique BS; for example, 'CC' is transformed into 'C' with BS length 2. The result is called a 'modified BS sequence'. By extracting BS subsequence patterns from a modified BS sequence, we can detect DL patterns. In Figure 2, the first BS 'A' represents the behaviour of the TSS under one analysis-window position, and the TSS at the next window position has almost the same behaviour, so the second BS is also assigned 'A'. Therefore, we can transform the two symbols 'AA' into a single BS 'A'. Similarly, we can transform the rest of such BS subsequences into single BSs.

Figure 2 (a) An original BS sequence (b) A modified BS sequence. Source: Tanaka et al. (2005)

When DL patterns are extracted, we have to calculate the distances between every two TSSs of the DL patterns. Tanaka et al. (2005) suggested that DTW distance should be used rather than Euclidean distance, since with DTW we can calculate the distance between two TSSs whose lengths are different. After calculating the DTW distances between all pairs of TSSs, the EMD algorithm can create a distance matrix and use the matrix to identify the motif candidates. The procedure that identifies a motif candidate from a distance matrix is described in Figure 3.

The above analysis is repeated until we find all the motif candidates of arbitrary length in C. When it is finished, the motif candidate with the smallest value of the MDL estimation function is considered the MDL pattern in C. Using the length Li of the MDL pattern, the length of the motif, Tmotif, is calculated as follows: Tmotif = Tmin + (Li − 1). The length of the motif Tmotif is measured in the number of SAX symbols.
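The run-length collapse that produces the modified BS sequence can be sketched as a small helper; this is our own illustration, not code from the paper.

```python
def modified_bs_sequence(bs_seq):
    """Collapse runs of the same behaviour symbol into (symbol, BS length)
    pairs, e.g. 'AABCCC' -> [('A', 2), ('B', 1), ('C', 3)]."""
    out = []
    for symbol in bs_seq:
        if out and out[-1][0] == symbol:
            # Same BS as the previous one: extend the current run.
            out[-1] = (symbol, out[-1][1] + 1)
        else:
            out.append((symbol, 1))
    return out
```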
The outline of the EMD algorithm for discovering time series motifs is given in Figure 4.

Figure 3 The algorithm for extracting motif candidates

Algorithm Extract_Motif_Candidate
Step 1: For each TSS in the distance matrix, identify all the other TSSs which are similar to it (i.e., the distance between them is less than the threshold R).
Step 2: Select as the pattern instances the TSSs which have the highest count of similar subsequences.
Step 3: Among the pattern instances selected in step 2, determine the instance which has the smallest sum of distances to all the other instances. This one is considered the centre of the pattern (that means a motif candidate).

In the EMD algorithm, the minimum description length of a time series consists of three costs: the data encoding cost, the parameter encoding cost and the segmentation cost. The formulas for computing the three costs are given in Tanaka et al. (2005). Based on these three formulas, the EMD algorithm can compute the MDL estimation for a given time series.

Figure 4 The EMD algorithm for discovering time series motifs

Algorithm EMD
Step 1: Transform the original time series to its PAA representation.
Step 2: Transform the PAA-reduced time series to a SAX symbol sequence.
Step 3: Transform the SAX symbol sequence to a BS sequence.
Step 4: Transform the BS sequence to a modified BS sequence.
Step 5:
5.1 Set the size of the analysis window, W, to Tmin.
5.2 Extract all modified BS subsequences under the analysis window, by sliding the window from left to right. Find some DL pattern from these subsequences, if any. If no DL pattern is found and the window is now at the end of the modified BS sequence, go to step 6.
5.3 From the set of all pattern instances found in 5.2, establish the distance matrix for them (using DTW distance to calculate the distances between them). Call the procedure Extract_Motif_Candidate to find the motif candidate from the distance matrix,
and calculate the MDL value of the candidate.
5.4 Add the motif candidate to the result list along with its MDL value.
5.5 Increase the size of the analysis window (i.e., set W := W + 1) and go to 5.2.
Step 6: From the result list, find the motif candidate with the smallest MDL value. The found motif is the returned result.

2.3 DTW distance

Inspired by the need to handle time warping in similarity computation, Berndt and Clifford (1994) introduced DTW, a classical speech recognition tool, to the data mining community, in order to allow a time series to be 'stretched' or 'compressed' to provide a better match with another time series. To compute the DTW distance between two time series, we have to use dynamic programming to solve an optimisation problem, and this solution method incurs a quadratic computational cost. To speed up the DTW distance calculation, practitioners using DTW apply some temporal constraint on the warping window size of DTW (Itakura, 1975; Sakoe and Chiba, 1978) and/or utilise lower-bounding techniques such as LB_Keogh (Keogh, 2002), LB_Improved (Lemire, 2009) and LB_PAA (Zhu and Shasha, 2003). However, lower-bounding techniques often incur post-processing overhead, while the exact DTW calculation is still unavoidable in practice. All these complications hinder the use of DTW distance in real-world applications.

3 Our implementation method of the EMD algorithm

The main idea of our implementation method of the EMD algorithm is to apply a homothetic transformation to convert all pattern instances of DLs into the same length, so that we can easily calculate Euclidean distances between them rather than using the computationally complicated DTW distance. This idea was considered in our previous work, which proposed an algorithm to discover time series motifs based on significant extreme points and clustering (Truong et al., 2012). Now, this idea can be applied to the EMD algorithm in order to accelerate its execution and make it much easier to implement.
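For contrast, the quadratic-cost DTW computation that EMD|DTW must perform for every pair of instances can be sketched as follows. This is the standard dynamic-programming formulation with an optional Sakoe-Chiba warping window, not code from Tanaka et al. (2005).

```python
import math

def dtw_distance(a, b, window=None):
    """Classic dynamic-programming DTW; O(len(a) * len(b)) time.
    `window` optionally bounds |i - j| (Sakoe-Chiba band)."""
    n, m = len(a), len(b)
    # The band must be at least |n - m| wide for a path to exist.
    w = max(window if window is not None else max(n, m), abs(n - m))
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return math.sqrt(cost[n][m])
```

Note that, unlike Euclidean distance, DTW accepts inputs of different lengths, which is why Tanaka et al. (2005) proposed it for DL pattern instances.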
3.1 Homothetic transformation and minimum Euclidean distance

In step 5.3 of the EMD algorithm, we apply a homothety to transform the TSSs of DLs in the distance matrix into TSSs of the same length, so that we can calculate Euclidean distances between them rather than using the DTW distance measure, which incurs a costly computational complexity. The homothetic transform is a simple and effective technique that can transform subsequences of DLs into subsequences of the same length.

A homothety is a transformation in affine space. Given a point O and a value k ≠ 0, a homothety with centre O and ratio k transforms M to M′ such that OM′ = k × OM. Figure 5 shows a homothety with centre O and ratio k = 1/2 which transforms the triangle MNP into the triangle M′N′P′. A homothety preserves the shape of any curve under the transformation; therefore, it can be used to align a longer motif candidate to a shorter one. The algorithm that performs a homothety to transform a motif candidate T of length N (T = {Y1, …, YN}) into a motif candidate of length N′ is given as follows:

Figure 5 Homothetic transformation

1 Let Y_Max = Max{Y1, …, YN} and Y_Min = Min{Y1, …, YN}.
2 Find the centre I of the homothety with the coordinates X_Centre = N / 2, Y_Centre = (Y_Max + Y_Min) / 2.
3 Perform the homothety with centre I and ratio k = N′ / N.

Notice that, thanks to the homothety, step 5.3 of the EMD algorithm (see Figure 4) can capture similarities when the pattern instances are uniformly scaled along the time axis. Therefore, the EMD algorithm using homothety can detect time series motifs under uniform scaling, like the algorithm proposed by Yankov et al. (2007). Besides, we note that two 'similar' subsequences will not be recognised when a vertical offset exists between them. To make step 5.3 of the EMD algorithm able to handle not only uniform scaling but also shifting along the vertical axis, we modify our Euclidean distance by using the minimum Euclidean distance as a method of negating the differences caused by vertical-axis offsets.
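The three-step homothety listed above can be sketched in Python as follows. The paper does not spell out how the transformed curve is resampled to exactly N′ points, so linear interpolation at N′ equally spaced positions is assumed here, and 0-based x-coordinates are used.

```python
import numpy as np

def homothety_rescale(ts, new_len):
    """Transform a subsequence of length N into one of length N' by a
    homothety with centre I = (N/2, (y_max + y_min)/2) and ratio k = N'/N."""
    ts = np.asarray(ts, dtype=float)
    n = len(ts)
    k = new_len / n
    cx = n / 2.0
    cy = (ts.max() + ts.min()) / 2.0
    # Apply the homothety to every point (x, y) of the curve.
    xs = cx + k * (np.arange(n) - cx)
    ys = cy + k * (ts - cy)
    # Resample the transformed curve at new_len equally spaced positions
    # (linear interpolation; an assumption, the paper leaves this implicit).
    grid = np.linspace(xs[0], xs[-1], new_len)
    return np.interp(grid, xs, ys)
```

Because the same ratio k scales both axes, the shape of the curve is preserved up to uniform scaling; the remaining vertical offset between instances is then handled by the minimum Euclidean distance below.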
Given two subsequences T′ = {T′1, T′2, …, T′N′} and Q′ = {Q′1, Q′2, …, Q′N′}, the Euclidean distance between them is normally given by the following formula:

D(T′, Q′) = sqrt( Σi=1..N′ (T′i − Q′i)² )   (1)

In this work, we use the minimum Euclidean distance, which is defined by the following equation:

D(T′, Q′) = min b ∈ ℝ { sqrt( Σi=1..N′ (T′i − Q′i − b)² ) }   (2)

From equation (2), we can derive the optimal value of the shifting parameter b, such that we find the best match between the two subsequences Q′ and T′, as follows:

b = (1 / N′) Σi=1..N′ (T′i − Q′i)   (3)

Whenever we are given two subsequences Q′ and T′, we compute the optimal value of the shifting parameter b using equation (3) before finding the minimum Euclidean distance between them using equation (2).

3.2 Other EMD implementation issues

Besides accelerating the EMD algorithm by applying the homothetic transform and the minimum Euclidean distance rather than DTW, we have to tackle several other implementation issues that were not addressed by Tanaka et al. (2005). This subsection explains how we handle some of these issues.

3.2.1 How to shift the analysis window

Tanaka et al. (2005) did not mention how far the analysis window should be shifted across the time series during the process of MD. In our implementation of the EMD algorithm, when shifting the analysis window to extract time series subsequences, we slide the window one PAA segment at a time; that is, we shift the window one SAX symbol at a time. For example, if the length of one PAA segment is l data points, then we shift the analysis window l data points across the time series at a time. Notice that the analysis window in the EMD algorithm is shifted at a faster rate than the sliding window in RP (i.e., one data point at a time). Due to this fact, our EMD implementation (without using DTW distance) can perform faster than the RP algorithm.
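Equations (2) and (3) above combine into a few lines of code; a minimal sketch, with a hypothetical function name of our own:

```python
import math

def min_euclidean_distance(t, q):
    """Minimum Euclidean distance between equal-length subsequences:
    compute the optimal vertical offset b (equation (3)), then the
    Euclidean distance of the offset-corrected differences (equation (2))."""
    assert len(t) == len(q)
    n = len(t)
    b = sum(ti - qi for ti, qi in zip(t, q)) / n  # optimal vertical offset
    return math.sqrt(sum((ti - qi - b) ** 2 for ti, qi in zip(t, q)))
```

For example, two subsequences that differ only by a constant vertical shift have a minimum Euclidean distance of zero.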
3.2.2 How to check the non-overlapping constraint

In step 5.2 of the EMD algorithm, we have to identify all the pattern instances that belong to the same DL pattern. These pattern instances may overlap each other, so we have to modify them so that they satisfy the non-overlapping constraint mentioned in Subsection 2.1. For example, in Figure 6, the first BS subsequence 'ABC' (starting at position 1) and the second BS subsequence 'ABC' (starting at position 41) are two pattern instances that are considered 'similar' to each other. But we can see that they overlap, since the length of the first BS subsequence is 48 data points and the second BS subsequence starts at position 41. So they violate the non-overlapping constraint on motif instances. To make the two pattern instances satisfy the non-overlapping constraint, we cut off the overlapping part from one of them by pruning the suffix of the first BS subsequence and/or the prefix of the second BS subsequence. The pruning technique can be described as follows:

Step 1 (Cut off the suffix of the first BS subsequence): Examine the last BS of the first pattern instance. If the BS length of this BS is greater than 1, cut off this BS one data point at a time from the rightmost point towards the left, until the two pattern instances satisfy the non-overlapping constraint or the BS length of this BS becomes 1.

Step 2 (Cut off the prefix of the second BS subsequence): If step 1 cannot make the two pattern instances satisfy the non-overlapping constraint, examine the first BS of the second pattern instance. If the BS length of this BS is greater than 1, cut off this BS one data point at a time from the leftmost point towards the right, until the two pattern instances satisfy the non-overlapping constraint or the BS length of this BS becomes 1.
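The two pruning steps can be sketched as follows. The per-instance bookkeeping used here, a start position, a length in data points, and how many points the boundary BS can spare before its BS length drops to 1, is a simplified representation assumed for illustration, not a structure from the paper.

```python
def prune_overlap(first, second):
    """Enforce the non-overlapping constraint between two pattern instances.
    Each instance is a mutable list [start, length, spare], where `spare` is
    how many data points may still be cut from its boundary BS (an assumed,
    simplified representation). Returns True if the instances no longer
    overlap after pruning."""
    def overlap():
        # Positions are 1-based and inclusive, as in the paper's example.
        return first[0] + first[1] - second[0]

    # Step 1: cut the suffix of the first instance, one data point at a time.
    while overlap() > 0 and first[2] > 0:
        first[1] -= 1
        first[2] -= 1
    # Step 2: if still overlapping, cut the prefix of the second instance.
    while overlap() > 0 and second[2] > 0:
        second[0] += 1
        second[1] -= 1
        second[2] -= 1
    return overlap() <= 0
```

Running this on the paper's example (first instance at position 1 with length 48, second at position 41) shortens the first instance to length 40 and leaves the second untouched, matching the walkthrough that follows.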
Now, we apply the pruning technique to the above-mentioned example (Figure 6). Since the length of the first pattern instance is 48 while the starting position of the second pattern instance is 41, we proceed as follows.

Figure 6 An example with two pattern instances that violate the non-overlapping constraint

We cut off the suffix of the first pattern instance: we cut off data points one at a time from the end of the first pattern instance, reducing its length to 40 and reducing the BS length of its last BS (i.e., 'C') accordingly. Since the starting position of the second pattern instance is 41, the two pattern instances now satisfy the non-overlapping constraint and we do not need to cut off the prefix of the second pattern instance. After the pruning, the pattern 'ABC' has two new instances: the first instance starts at position 1 and its length is 40; the second instance starts at position 41 and its length is 48.

4 Experimental evaluation

We implemented all the time series MD algorithms in Microsoft Visual C# and conducted the experiments on an Intel Core i3 550 CPU at 3.20 GHz. In this experiment, we compare the two versions of the EMD algorithm to the RP algorithm. RP is selected for comparison due to its popularity: it is the most cited algorithm for finding time series motifs to date and is the basis of many current approaches that tackle this problem (Tang and Liao, 2008; Yankov et al., 2007). For the EMD algorithm, we implement two variants in order to compare EMD using DTW distance to EMD using the homothetic transform and the minimum Euclidean distance. We denote the two variants as follows:

• EMD|DTW: the EMD algorithm using DTW distance

• EMD|HT: the EMD algorithm using the homothetic transform

In this experiment, we tested the algorithms on seven publicly available datasets: ECG (512 data points), ECG (8,000 data points), ECG (144,000 data points), Power
(35,040 data points), Memory (6,875 data points), EEG (512 data points) and ERP (6,400 data points). The first three datasets are obtained from the webpage http://www.physionet.org/physiobank/database. The last four datasets are obtained from the UCR Time Series Data Mining Archive (Keogh and Folias, 2014). We also implemented the brute-force algorithm for MD given in Lin et al. (2002) in order to evaluate the efficiency of the EMD|HT algorithm.

The parameter settings for the three algorithms (brute-force, RP and EMD) on each dataset are given in Table 1. For the RP algorithm, the parameters are: the length of each PAA segment l, the SAX alphabet size a, the number of iterations in RP i, the length of one SAX word w, and the number of errors allowed by RP d. For the brute-force algorithm, we need two parameters: the length of the motif n and the error threshold R. For the EMD algorithm, the parameters are: the minimum length of the analysis window Tmin (in data points), the SAX alphabet size a, the number of SAX symbols in one BS w, and the size of the analysis window aw. The size of the analysis window is measured in the number of BSs.

Table 1 Parameter settings of the three MD algorithms on the seven datasets

Dataset | Brute-force | RP | EMD
ECG (512) | n = 112, R = 30 | l = 16, a = 4, i = 2, w = 7, d = – | Tmin = 64, a = 4, w = 4, aw = –
ECG (8,000) | n = 336, R = 30 | l = 16, a = 4, i = 10, w = 21, d = – | Tmin = 80, a = 4, w = 6, aw = –
ECG (144,000) | N/A | N/A | Tmin = 64, a = 3, w = 4, aw = –
Power (35,040) | N/A | N/A | Tmin = 96, a = 4, w = 6, aw = –
Memory (6,875) | n = 256, R = 30 | l = 16, a = 5, i = 10, w = 16, d = – | Tmin = 80, a = 4, w = 6, aw = –
EEG (512) | n = 112, R = 30 | l = 16, a = 3, i = 2, w = 7, d = – | Tmin = 64, a = 4, w = 3, aw = –
ERP (6,400) | n = 160, R = 30 | l = 16, a = 4, i = 2, w = 11, d = – | Tmin = 96, a = 4, w = 6, aw = –

The experimental results obtained on the seven datasets are shown in Table 2.

Table 2 Experimental results on the performances of the three MD algorithms

Dataset | Length | Algorithm | # of motif instances | Runtime (sec)
ECG | 512 | RP | – | 0.030
ECG | 512 | EMD|DTW | – | 0.034
ECG | 512 | EMD|HT | – | 0.003
ECG | 8,000 | RP | 21 | 24.387
ECG | 8,000 | EMD|DTW | 14 | 56.067
ECG | 8,000 | EMD|HT | 21 | 0.113
ECG | 144,000 | RP | N/A | N/A
ECG | 144,000 | EMD|DTW | N/A | N/A
ECG | 144,000 | EMD|HT | 49 | 9.037
Power | 35,040 | RP | N/A | N/A
Power | 35,040 | EMD|DTW | 33 | 420.175
Power | 35,040 | EMD|HT | 30 | 1.160
Memory | 6,875 | RP | 22 | 0.420
Memory | 6,875 | EMD|DTW | – | 1.211
Memory | 6,875 | EMD|HT | 10 | 0.071
EEG | 512 | RP | – | 0.027
EEG | 512 | EMD|DTW | – | 0.025
EEG | 512 | EMD|HT | – | 0.001
ERP | 6,400 | RP | 31 | 0.155
ERP | 6,400 | EMD|DTW | – | 2.013
ERP | 6,400 | EMD|HT | – | 0.048

Figure 7 (a) ECG dataset (8,000 points) (b) A motif discovered in the dataset by EMD|DTW (c) A motif discovered in the dataset by EMD|HT (see online version for colours)

From the experimental results, we can see that:

• The EMD algorithm with DTW distance (EMD|DTW) runs slower than Random Projection on most of the seven datasets.

• The number of motif instances found by RP is greater than that found by the EMD algorithm. The reason for this is that RP, which is based on SAX symbols, extracts more subsequences during MD than the EMD algorithm, which is based on BSs. However, the accuracy of the EMD algorithm is better than that of RP.

• The EMD algorithm can identify motif instances of DLs, while RP can identify only motif instances of the same length.

• On all the datasets, the EMD algorithm with homothetic transform and minimum Euclidean distance (EMD|HT) is remarkably faster than EMD|DTW while bringing the same accuracy.

• On large datasets such as Power (35,040 data points) and ECG (144,000 data points), RP cannot discover motifs since it cannot tackle large datasets, while EMD|HT can find the motif in a very short time (less than 10 seconds).

• On the large dataset ECG (144,000 data points), EMD|DTW cannot work, while EMD|HT can find the motif in a very short time (9 seconds).

Now, we show some motifs discovered in two datasets by the EMD|DTW and EMD|HT algorithms. Figure 7 shows an example of the motifs
discovered in the heart beat (ECG) dataset of length 8,000 by the EMD|DTW and EMD|HT algorithms. In Figure 8, we can see the plot of the Power dataset of length 35,040 and the motif instances found in this dataset by EMD|DTW and EMD|HT.

Figure 8 (a) Power dataset (b) A motif discovered in the dataset by EMD|DTW (c) A motif discovered in the dataset by EMD|HT (see online version for colours)

We conclude this section with an evaluation of the efficiency of the EMD|HT algorithm. We can evaluate the efficiency of the EMD|HT algorithm by simply considering the ratio of how many times the Euclidean distance function must be called by EMD|HT over the number of times it must be called by the brute-force algorithm given in Lin et al. (2002):

efficiency ratio = P / B

where P is the number of times the proposed algorithm calls the Euclidean distance and B is the number of times the brute-force algorithm calls the Euclidean distance. The range of the efficiency ratio is from 0 to 1; the method with the lower efficiency ratio is better. This efficiency criterion, which is independent of any specific implementation and computational power, was first suggested by Lin et al. (2002). Table 3 shows the efficiency of the EMD|HT algorithm on various datasets.

Table 3 The efficiency ratios of EMD|HT on various datasets

Dataset | Efficiency ratio
ECG 8,000 | 0.0001822
ERP 6,400 | 0.0000339
Memory 6,875 | 0.0000133
Power 35,040 | 0.0001337
ECG 144,000 | 0.0000162

The efficiency ratios in Table 3 indicate that EMD|HT brings a three to four order of magnitude speedup over the brute-force algorithm. Furthermore, in terms of implementation cost, EMD|HT is much easier to implement than EMD|DTW.

5 Conclusions

We have introduced an efficient implementation of the EMD algorithm for discovering motifs in time series, which can automatically determine the suitable length of the motif based on the MDL principle. This implementation method, called EMD|HT, hinges on using homothetic
transformation to convert all motif instances of DLs into the same length so that we can easily calculate Euclidean distances between them This modification that avoids using DTW distance brings out a remarkable improvement for the EMD algorithm in terms of time efficiency without compromising motif accuracy Experimental results on seven real world time series datasets demonstrate that EMD|HT implementation method outperforms EMD|DTW in terms of time efficiency Efficiency is achieved because Euclidean distance is much faster to compute than DTW distance Besides, in terms of implementation cost, our EMD|HT is much easier to implement than EMD with DTW distance As for future work, we plan to apply our implementation of the EMD algorithm in some time series data mining tasks such as association rule mining or classification 194 D.T Anh and N.V Nhat References Berndt, D and Clifford, J (1994) ‘Using dynamic time warping to find patterns in time series’, AAAI-94 Workshop on Knowledge Discovery in Databases, pp.229–248 Chiu, B., Keogh, E and Lonardi, S (2003) ‘Probabilistic discovery of time series motifs’, ACM SIGKDD 2003, pp.493–498 Itakura, F (1975) ‘Minimum prediction residual principle applied to speech recognition’, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol 23, No 1, pp.67–72 Keogh, E (2002) ‘Exact indexing of dynamic time warping’, Proceedings of 28th International Conference on Very Large Data Bases, pp.406–417, Hong Kong Keogh, E and Folias, T (2014) ‘The UCR time series data mining archive’ [online] http://www.cs.ucr.edu/~eamonn/TSDMA/index.html (accessed 24 February 2014) Keogh, E., Chakrabarti, K., Pazzani, M and Mehrotra, S (2001) ‘Dimensionality reduction for fast similarity search in large time series databases’, Journal of Knowledge and Information Systems, Vol 3, No 3, pp.263–286 Lemire, D (2009) ‘Faster retrieval with a two-pass dynamic-time-warping lower bound’, Pattern Recognition, Vol 42, No 9, pp.2169–2180 Lin, J., Keogh, 
E., Lonardi, S. and Chiu, B. (2003) 'A symbolic representation of time series, with implications for streaming algorithms', Proceedings of 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2003), pp.2–11, California, USA.

Lin, J., Keogh, E., Lonardi, S. and Patel, S. (2002) 'Finding motifs in time series', 2nd Workshop on Temporal Data Mining (KDD'02).

Mueen, A., Keogh, E., Zhu, Q. and Westoeve, B. (2009) 'Exact discovery of time series motifs', Proceedings of SIAM International Conference on Data Mining, pp.473–484.

Sakoe, H. and Chiba, S. (1978) 'Dynamic programming algorithm optimization for spoken word recognition', IEEE Trans. Acoustics, Speech, and Signal Proc., Vol. ASSP-26, pp.43–49.

Tanaka, Y. and Uehara, K. (2003) 'Discover motifs in multi-dimensional time series using the principal component analysis and the MDL principle', Proceedings of 3rd International Conference MLDM 2003, Leipzig, Germany, 5–7 July, pp.252–265.

Tanaka, Y., Iwamoto, K. and Uehara, K. (2005) 'Discovery of time-series motif from multidimensional data based on MDL principle', Journal Machine Learning, Vol. 58, Nos. 2–3, pp.269–300.

Tang, H. and Liao, S.S. (2008) 'Discovering original motifs with different lengths from time series', Journal of Knowledge Based Systems, Vol. 21, No. 7, pp.666–671.

Tompa, M. and Buhler, J. (2001) 'Finding motifs using random projections', Proceedings of 5th Int. Conf. on Computational Molecular Biology, Montreal, Canada, 22–25 April, pp.67–74.

Truong, C.D., Tin, H.N. and Anh, D.T. (2012) 'Combining motif information and neural network for time series prediction', Int. Journal of Business Intelligence and Data Mining, Vol. 7, No. 4, pp.318–339.

Yankov, D., Keogh, E., Medina, J., Chiu, B. and Zordan, D. (2007) 'Detecting time series motifs under uniform scaling', Proceeding of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.844–853.

Zhu, Y. and Shasha, D. (2003) 'Warping indexes with envelope transform for query by
humming', Proc. of ACM SIGMOD Int. Conf. on Management of Data, USA, pp.181–192.

Date posted: 16/12/2017, 14:56

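As a closing illustration of the length-normalisation idea described in the conclusions, the sketch below shows one way a homothetic transformation can map motif instances of slightly different lengths onto a common length (by linearly rescaling the time axis), after which a plain Euclidean distance applies. This is our own minimal sketch, not code from the paper; the function names and sample data are assumptions, and the authors' EMD|HT implementation may differ in detail.

```python
import numpy as np

def homothety_resample(seq, target_len):
    """Map a subsequence of arbitrary length onto target_len points by
    linear interpolation, i.e., a homothetic scaling of the time axis."""
    seq = np.asarray(seq, dtype=float)
    src = np.linspace(0.0, 1.0, num=len(seq))   # original time axis in [0, 1]
    dst = np.linspace(0.0, 1.0, num=target_len) # rescaled time axis in [0, 1]
    return np.interp(dst, src, seq)

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Two motif instances whose lengths differ by one point (toy data).
x = [0.0, 1.0, 2.0, 1.0, 0.0]
y = [0.0, 0.9, 1.9, 2.1, 1.1, 0.1]

# Rescale both to a common length, then compare with Euclidean distance.
n = max(len(x), len(y))
d = euclidean(homothety_resample(x, n), homothety_resample(y, n))
print(d)  # d is approximately 0.67 for this toy pair
```

In the EMD|HT setting, each candidate instance of a pattern would be rescaled to a common length before distances are computed, replacing the far costlier DTW alignment that EMD|DTW needs to compare instances of different lengths.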