2016 IEEE International Conference on Knowledge Engineering and Applications
978-1-5090-3471-0/16/$31.00 ©2016 IEEE

A Novel Hybrid Method for Time Series Subsequence Join

Vo Duc Vinh, Nguyen Phuc
Faculty of Information Technology, Ton Duc Thang University, Vietnam
e-mail: vdvinh@ittdt.edu.vn, 51303240@studenttdt.edu.vn

Duong Tuan Anh
Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, Vietnam National University, Vietnam
e-mail: dtanh@cse.hcmut.edu.vn

Abstract: The exact method JOCOR, proposed by Mueen et al., is the first method for joining two time series on subsequence correlation. Although JOCOR requires the time complexity $O(n^2 \lg n)$, where $n$ is the length of the time series, it is still time-consuming even for medium-size time series. In this paper, we propose a hybrid method which can run faster than JOCOR. Our method consists of four main steps. First, a list of subsequences is extracted from the raw time series based on important extrema. Second, we apply a nested loop join using a sliding window and Dynamic Time Warping distance to find all the matching subsequences in the two time series. Third, we concatenate all matching subsequences whose indexes are adjacent into longer ones and find the pair of subsequences which has the smallest distance between them. Finally, we apply JOCOR to find the most correlated segments in the two time series. In comparison to JOCOR, our hybrid method performs much faster while high accuracy is guaranteed.

Keywords: correlation coefficients, dynamic time warping, important extrema, JOCOR, nested loop join, subsequence join, time series

I. INTRODUCTION

Subsequence join over time series is considered one of the basic problems in time series data mining. This problem appears in many practical applications such as entertainment, meteorology, economy, finance, medicine, and engineering [6]. Subsequence join over time series can be defined in different ways. The first definition of subsequence join is based on a nested-loop algorithm and some distance function. This joining approach returns all pairs of subsequences drawn from two time series that satisfy a given similarity threshold. The second definition of subsequence join is to find join segments with maximum correlation coefficient and requires only one parameter, the minimum length of the segments.

The approach based on the first definition has some disadvantages, such as time overhead, because the distance function is called many times over a large number of iterations. To reduce the runtime of the nested-loop algorithm, some works approximately estimate the similarity between two time series by dividing the time series into segments. In this approach, Y. Lin et al. [6] introduced solutions for joining two time series based on a non-uniform segmentation and a similarity function over a feature set. Their method is not only difficult to implement but also has high computational complexity, especially for large time series data. To avoid these drawbacks, in our previous work [11], we proposed a method, called EP-M, for time series subsequence join which is based on important extrema to segment time series. Then, we apply a nested loop join approach which uses a sliding window and Dynamic Time Warping (DTW) distance to find all the matching subsequences in the two time series within a similarity threshold. Although this method executes very fast, it is an approximate method and may have some false dismissals because it might ignore some data points when shifting the sliding window several points at a time.

Another recent work on subsequence join using the second definition, proposed by Mueen et al. in 2014 [7], can find the exact correlated subsequence. Mueen et al. introduced an exhaustive search method, JOCOR, for discovering the most correlated subsequence based on maximizing Pearson's correlation coefficient in two given time series. Although the authors incorporated several speed-up techniques to reduce the complexity from $O(n^4)$ to $O(n^2 \lg n)$, where $n$ is the length of the two time series, the runtime of JOCOR is still unacceptable even for many time series datasets of moderate size.

In this paper, we propose a hybrid method for time series subsequence join (using the second definition) by combining the EP-M algorithm and the JOCOR algorithm. First, we apply EP-M to find the set of all pairs of matching subsequences in the two time series within a similarity threshold. From this set we derive the pair of subsequences which has the smallest distance between them. This pair of subsequences is used as the candidates for the most correlated subsequences in the two time series. Finally, we apply the JOCOR algorithm to post-process the candidates. The hybrid method helps to speed up the process of finding the most correlated subsequence without causing any false dismissals. The experimental results demonstrate that our new hybrid method is not only nearly as accurate as exact JOCOR but also more time-efficient than exact JOCOR.

II. BACKGROUND

A. Basic Concepts

Definition 1. A time series $T = t_1, t_2, \ldots, t_n$ is a sequence of $n$ data points measured at equal periods, where $n$ is the length of the time series. For most applications, each data point is usually represented by a real value.

Definition 2. Given a time series $T = t_1, t_2, \ldots, t_n$ of length $n$, a subsequence $T[i : i + m - 1] = t_i, t_{i+1}, \ldots, t_{i+m-1}$ is a continuous subsequence of $T$, starting at position $i$ and of length $m$ ($m \le n$).

This work aims to join two time series $T_1$ and $T_2$ of lengths $n_1$ and $n_2$, respectively. The problem of time series join was defined by Mueen et al. [7] and is described as follows.

Problem (Max-Correlation Join): Given two time series $T_1$ and $T_2$ of lengths $n_1$ and $n_2$, respectively (assume $n_1 \ge n_2$), find the most correlated subsequences of $T_1$ and $T_2$ with length $\ge$ minLength.

When joining two time series, we find the most correlated subsequences by calculating Pearson's correlation coefficient. The
correlation coefficient is defined as follows:

    $$C(x, y) = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{n \sigma_x \sigma_y} \qquad (1)$$

where $x$ and $y$ are two given time series of equal length $n$, with average values $\mu_x$ and $\mu_y$, and standard deviations $\sigma_x$ and $\sigma_y$, respectively. The value of Pearson's correlation coefficient ranges in $[-1, 1]$.

Besides, the z-normalized Euclidean distance is also a commonly used measure in time series data mining. The distance between two time series $X = x_1, x_2, \ldots, x_n$ and $Y = y_1, y_2, \ldots, y_n$ with the same length $n$ is calculated by:

    $$d(x, y) = \sqrt{\sum_{i=1}^{n} (\hat{x}_i - \hat{y}_i)^2} \qquad (2)$$

where $\hat{x}_i = \frac{1}{\sigma_x}(x_i - \mu_x)$ and $\hat{y}_i = \frac{1}{\sigma_y}(y_i - \mu_y)$.

Because we just pay attention to maximizing positive correlations and ignore the negatively correlated subsequences, we can take advantage of the relationship between Euclidean distance and positive correlation as follows:

    $$C(x, y) = 1 - \frac{d^2(x, y)}{2n} \qquad (3)$$

In this work, we will take advantage of statistics for computing the correlation coefficient as follows:

    $$C(x, y) = \frac{\sum_{i=1}^{n} x_i y_i - n \mu_x \mu_y}{n \sigma_x \sigma_y} \qquad (4)$$

    $$d(x, y) = \sqrt{2n(1 - C(x, y))} \qquad (5)$$

This approach brings us two advantages. Firstly, the algorithm takes just one pass to compute all of these statistic variables. Secondly, it enables us to reuse computations and reduce the amortized time complexity to constant instead of linear [7]. In the paper [7], the above formulas are used for computing the correlation coefficient and the z-normalized Euclidean distance between two subsequences.

B. JOCOR Algorithm

The main idea of JOCOR is to add some improvements to the naive algorithm, Join, which finds the most correlated join segments. The algorithm Join computes the correlation of all possible pairs of segments of all lengths. The pseudo-code of the Join algorithm is given in Fig. 1. To improve the Join algorithm, JOCOR tries to reuse the sufficient statistics for overlapping correlation computation and then prune unnecessary correlation computation admissibly (Fig. 2).

    Algorithm Join(x, y)
    // return the locations and length of the most correlated segments of x and y
    x := (x - mean(x)) / stdv(x)
    y := (y - mean(y)) / stdv(y)
    n := length(x); m := length(y)
    best := 0
    for i := 1 to m - minLength + 1
        for j := 1 to n - minLength + 1
            maxLength := min(m - i + 1, n - j + 1)
            for len := minLength to maxLength
                c := Correlation(x[j : j+len-1], y[i : i+len-1])
                if c > best then best := c

Figure 1. The brute-force algorithm for subsequence join over time series

    Algorithm JOCOR(x, y)
    // return the locations and length of the most correlated segments of x and y
    1   x := (x - mean(x)) / stdv(x)
    2   y := (y - mean(y)) / stdv(y)
    3   n := length(x); m := length(y)
    4   for i := 1 to m
    5       Z_i := multiply(x, y[i:m])
    6   best := 0
    7   for i := 1 to m - minLength + 1
    8       for j := 1 to n - minLength + 1
    9           maxLength := min(m - i + 1, n - j + 1)
    10          len := minLength
    11          while len <= maxLength
    12              sumXY := Z_i[m-i+j] - Z_{i+len}[m-i+j]
    13              c := (sumXY - len * mu_x * mu_y) / (len * sigma_x * sigma_y)
    14              if c > best then best := c
    15              compute stepSize
    16              if stepSize <= 0 or stepSize >= len then
    17                  stepSize := 1
    18              len := len + stepSize

Figure 2. The JOCOR algorithm for subsequence join over time series

JOCOR applies the Fast Fourier Transform (FFT) to compute the shifted cross product between two time series. The procedure multiply produces an array that contains the sum of the products of the elements in x and y for different shifts of x. The output z of the procedure multiply is expressed more precisely as follows:

    $$z_k = \sum_{i=1}^{m'} y_i \, x_{k - m' + i}$$

Here $m'$ is the length of y and $n' > m'$ is the length of x.

    Procedure multiply(x, y)
    // return the shifted dot products for x and y (stored in z)
    n' := length(x), m' := length(y)
    x := append(x, n'-zeros)
    y := append(reverse(y), (2n' - m')-zeros)
    X := FFT(x); Y := FFT(y)
    Z := X . Y
    z := iFFT(Z)

Figure 3. The procedure for computing the shifted cross product between two time series

The procedure multiply aims to calculate $\sum xy$ in the formula for computing the correlation coefficient (4). Lines 4-5 in JOCOR populate a set of cross products Z, where Z_i = multiply(x, y[i:m]). The cross products in Z are the most important statistics for any correlation computation between any pair of segments (Fig. 3). The dot product of two subsequences of x and y starting at the j-th and i-th locations, respectively, with length len can be computed as Z_i[m-i+j] - Z_{i+len}[m-i+j].

The second important improvement of JOCOR is a mechanism to skip some of the lengths in the loop of line 11. JOCOR computes the step size dynamically instead of incrementing the len variable by one. Details of how to compute this step size are given in [7].

C. Dynamic Time Warping Measure

In this work, we use the Dynamic Time Warping distance since this distance measure allows non-linear alignments between two time series to accommodate sequences that are similar but locally out of phase. Regarding the calculation of the DTW distance, the major issue is that, when implemented in the classical way, the comparison of two time series of length $l$ requires the calculation of the entries of an $l \times l$ matrix using dynamic programming, and therefore the comparison has a complexity of $O(l^2)$.

To speed up the DTW distance calculation, all practitioners using DTW constrain the warping path in a global manner by limiting how far it may stray from the diagonal. The subset of the matrix that the warping path is allowed to visit is called the warping window or band. Two of the most frequently used global constraints in the literature are the Sakoe-Chiba band proposed by Sakoe and Chiba, 1978 [10] and the Itakura Parallelogram proposed by Itakura, 1975 [3]. The Sakoe-Chiba band is the area defined by two straight lines parallel to the diagonal, and the Itakura Parallelogram is the area defined by the parallelogram which is symmetric over the diagonal. In this work, we use the DTW distance with Sakoe-Chiba band $r$.

Two time series subsequences $Q$ and $C$ are similar to each other within threshold $th$ if $DTW(Q, C) \le th$.

D. Important Extrema

Important extrema in a time series contain important change points of the time series. The algorithm for identifying important extrema was first introduced by Pratt and Fink, 2002 [8]. Fink and Gandhi, 2007 [2] proposed an improved variant of the algorithm for finding important extrema. The concepts of important extrema of time series in both papers were used for time series compression. In this work, however, we exploit the concepts of important extrema for segmenting time series into subsequences.

In [2], Fink and Gandhi give the definition of important extrema and the algorithm that can identify important extrema in a given time series. Intuitively, an important minimum is the minimum value of some segment whose endpoints are much larger than it. Similarly, an important maximum is the maximum value of some segment whose endpoints are much smaller than it. The definition of important extrema requires a positive parameter $R$, which is called the compression rate. An increase of $R$ leads to the selection of fewer important extreme points.

Given a time series $T$ of length $n$, starting at the beginning of the time series, all important minima and maxima of the time series are identified by using the algorithm given in [2]. The algorithm takes linear computational time and constant memory.

III. THE HYBRID METHOD FOR TIME SERIES SUBSEQUENCE JOIN

Now we present our new hybrid method for time series subsequence join, which combines our previous method, EP-M [11], and the JOCOR algorithm [7]. The hybrid method exploits the relationship between the two definitions of time series subsequence join. The first definition of subsequence join has the spirit of range search, which finds all pairs of subsequences drawn from two time series that satisfy a given similarity threshold $th$, while the second definition has the spirit of nearest neighbor search, which finds only the most correlated pair of subsequences. From the result of range search, we can derive the result of nearest neighbor search.

Based on this rationale, the hybrid method consists of the following main ideas. Given two time series, first we apply the EP-M approach, which finds all the matching subsequences in the two time series. Second, we concatenate all matching subsequences whose indexes are adjacent into longer ones. Then, we find the pair of subsequences which has the smallest distance between them. Finally, we apply the JOCOR algorithm to the pair of subsequences obtained in the preceding step to find the most correlated pair of subsequences.

A. The Proposed Method

The hybrid algorithm for subsequence join over time series consists of the following steps.

Step 1: We extract all important extrema of the two time series $T_1$ and $T_2$. The results of this step are two lists of important extrema $EP_1 = (ep1_1, ep1_2, \ldots, ep1_{m_1})$ and $EP_2 = (ep2_1, ep2_2, \ldots, ep2_{m_2})$, where $m_1$ and $m_2$ are the numbers of important extrema in $T_1$ and $T_2$, respectively. Afterward, when extracting subsequences from a time series ($T_1$ or $T_2$), we extract the subsequence bounded by the extrema $ep_i$ and $ep_{i+2}$.

Step 2: We keep time series $T_1$ fixed and, for each subsequence $s$ extracted from $T_1$, we find all its matching subsequences in $T_2$ by shifting a sliding window of length equal to the length of $s$ along $T_2$ one data point at a time. We store all the resulting subsequences in the result set $S$.

Step 3: At this step, we concatenate all the resulting matching pairs from Step 2 if the indexes of these pairs are adjacent. Then, we find the pair of subsequences which has the smallest distance between them.

Step 4: At the final step, we apply the JOCOR algorithm to calculate Pearson's correlation coefficient and find the most correlated subsequence among the candidate subsequences found in Step 3.

Notice that our previous algorithm EP-M consists of Step 1 and Step 2 in the hybrid algorithm. The pseudo-code describing Step 1, Step 2, Step 3 and Step 4 in the hybrid method is given in Fig. 4 and Fig. 5. Procedure Subsequence_Matching invokes the procedure DTW_EA, which computes the DTW distance. The DTW_EA procedure applies the Early Abandoning technique as mentioned in the next subsection.

subsequence, the runtime of algorithm and the length of the resulting subsequence

A. Datasets

    Algorithm Hybrid(T1[1..n1], T2[1..n2])
    Input: Two time series T1 and T2
    Output: The pair of subsequences which has the maximum Pearson's correlation coefficient
    // Step 1: segment both series at their important extrema
    EP1 := Important_Extrema(T1); EP2 := Important_Extrema(T2)
    for i := 1 to length(EP1) - 2
        Subsequence_T1(i) := T1[EP1(i) to EP1(i+2)]
    for i := 1 to length(EP2) - 2
        Subsequence_T2(i) := T2[EP2(i) to EP2(i+2)]
    // Step 2: find all matches in T2 for each subsequence of T1
    for i := 1 to length(EP1) - 2
        s := Subsequence_T1(i)
        Subsequence_Matching(s, T2, threshold)
        Store all the resulting pairs of subsequences in S
    // Step 3: merge adjacent matches and pick the closest pair
    for each resulting pair of subsequences in S
        if the indexes of the subsequences are adjacent then
            concatenate these subsequences into longer subsequences, and update the result set S
    From the result set S, find the pair of subsequences (s1, s2) which has the smallest DTW distance between them
    // Step 4: refine the candidate pair with JOCOR
    Apply JOCOR algorithm to find the most correlated subsequences in the pair of candidate subsequences (s1, s2) found in Step 3

    Procedure Subsequence_Matching(s[1..m], T[1..n], threshold)
    Input: T is a time series, s is a subsequence and threshold is the similarity threshold
    Output: The set S of all matching subsequences
    for i = 1 to n - m + 1
        segment_of_T = subsequence T[i..i+m-1]
        dtw_distance = DTW_EA(s, segment_of_T, threshold)
        if (dtw_distance