Computing
DOI 10.1007/s00607-013-0333-1

A technique for extracting behavioral sequence patterns from GPS recorded data

Thi Hong Nhan Vu · Yang Koo Lee · The Duy Bui

Received: 29 December 2012 / Accepted: 14 May 2013
© Springer-Verlag Wien 2013

Abstract The mobile wireless market has been attracting many customers. Technically, the paradigm of anytime-anywhere connectivity raises previously unthinkable challenges, including the management of millions of mobile customers, their profiles, profile-based selective information dissemination, and server-side computing infrastructure design issues to support such a large pool of users automatically and intelligently. In this paper, we propose a data mining technique for discovering frequent behavioral patterns from a collection of trajectories gathered by the Global Positioning System. Although searching the space of spatiotemporal knowledge is extremely challenging, imposing spatial and temporal constraints on spatiotemporal sequences makes the computation feasible. Specifically, the mined patterns incorporate syntactic constraints, namely spatiotemporal sequence length restriction, minimum and maximum timing gap between events, time window of occurrence of the whole pattern, inclusion or exclusion event constraints, and frequent movement patterns predictive of one or more classes. The algorithm for mining all frequent constrained patterns is named cAllMOP. Moreover, a clustering algorithm is exploited to control the density of pattern regions. The proposed method is efficient and scalable. It outperforms the previous algorithms AllMOP and GSP with respect to the compactness of discovered knowledge, execution time, and memory requirement.

T. H. N. Vu (✉) · T. D. Bui
Human Machine Interaction Laboratory, Vietnam National University, 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam
e-mail: vthnhan@gmail.com

Y. K. Lee
Robot/Cognitive System Research Department, Electronics and Telecommunication Research Institute,
Daejeon, Republic of Korea

Keywords Behavioral sequence patterns · Location-based services · Trajectory mining

Mathematics Subject Classification 68

1 Introduction

The availability and increasingly high accuracy of Global Positioning System (GPS) receivers attached to vehicles in today’s transportation technology allow recording all the trajectories that are the traces of moving users as well as their portable devices. These trajectories contain detailed information about personal and vehicular mobile behaviors and therefore reveal interesting practical opportunities to find behavioral patterns to be used, for example, in traffic and sustainable mobility management and to study the accessibility of services.

The continuing miniaturization of electronics in display devices and in wireless communications, together with the improved performance of general computing technologies, enables the deployment of mobile Location-based Services (LBSs). These services integrate data derived from the users’ requests with other user information in a multidimensional database [5,8]. Accumulated data are used for later modification of the services and for long-term decision making. LBSs, such as information systems (e.g., shopping, tourist, and traffic information systems supporting queries pertaining to the physical user location) or server-side selective information dissemination (e.g., targeted advertisement based on user profiles and location information), are emerging application scenarios with far-reaching implications.

To aid decision making and customization, data mining techniques can be applied to discover interesting knowledge about the behaviors of users. For example, classes of users that exhibit similar behaviors can be identified. These classes can be characterized by various attributes of the class members or the services they requested. Sequences of service requests made by users can also be analyzed to discover regularities in such
sequences. These regularities can later be applied to make intelligent predictions about users’ future behavior given the requests the user made in the past [20].

In this paper, we focus on techniques for discovering frequent movement patterns from a spatiotemporal database. We present a new algorithm called cAllMOP for discovering all frequent movement patterns with the following constraints: (1) length restriction; (2) minimum timing gap between events; (3) maximum timing gap; (4) a time window of occurrence of the whole pattern; (5) inclusion and exclusion event constraints; (6) patterns predictive of one or more classes.

To fulfill the task of finding frequent unconstrained patterns, trajectories are frequently modeled as discrete moving points. However, knowledge of movements is limited by the ability of the devices used to measure them. Complete knowledge of a movement is impossible, but movement can be detected, stored, modeled, and analyzed with some degree of accuracy. To reduce the error in the observed locations, trajectories are reconstructed by re-sampling their positions and are then generalized. In the mining process, the use of syntactic constraints and the spatiotemporal proximity feature of the application domain makes the computation feasible. Moreover, because users move in a thematically partitioned space, we take into account the concept of a graph and the transitive property of the similarity measure of paths in the graph during candidate generation, which helps avoid unnecessary candidate pattern production. In addition, to control the density of the regions in patterns and automatically adjust the shape and size of the regions, we employ a grid-based clustering technique. The performance of cAllMOP is better than that of the algorithms AllMOP in [18] and GSP in [11] in terms of the compactness of the discovered knowledge, execution time,
and memory requirement. It can therefore be applied well to LBS systems.

2 Related work

Sequential pattern mining is informally described as the discovery of inter-transaction patterns in large customer databases [2,6,11,14,21]. A sequence is a set of temporally ordered itemsets. Since the set of frequent sequences is a superset of frequent itemsets, sequential pattern mining algorithms often utilize some of the ideas proposed for the discovery of association rules [1,19]. One can divide approaches for finding frequent itemsets based on two criteria: their strategy to traverse the search space and their strategy to determine the support values of itemsets. Based on the first criterion, today’s common approaches are either breadth-first search or depth-first search. A comparison of these approaches revealed that each method has some types of data for which it performs better than the others [19]. This data mining task, except for transaction time, is in a sense dimensionless. Nevertheless, most data describing events in the real world are associated with space and time.

Thus far, work on spatiotemporal data mining has mainly focused on the models and structures for indexing spatiotemporal objects [4,7,10] rather than on discovering movement patterns. Spatiotemporal pattern mining has been treated as a generalization of pattern mining in time-series data [3,5,13,16,17]. The algorithm offered in [9] discovers spatiotemporal periodic patterns from trajectories of equal length, which are then exploited in an index structure to support the execution of spatiotemporal queries. We are concerned with trajectories of arbitrary length and the problem of imprecisely sampled points. Besides, another method, DFS_MINE in [12], was proposed to discover spatiotemporal sequential patterns for weather prediction. It seeks the relationships between time-varying attributes for a fixed location, but does not show how to apply them to movement pattern mining, in which one needs to seek the relationships
between time-varying locations of objects with stable attributes. This problem was treated by our algorithm maxMOP in [18].

An unconstrained search can produce millions of rules or may not even be tractable in some cases. Discovery of sequences incorporating constraints has already received some attention in categorical domains [11,15], but to the best of our knowledge this problem has not been addressed in continuous domains, especially for spatiotemporal data. The algorithm GSP in [11] was the first one to consider minimum and maximum gaps as well as a time window. GSP is an iterative algorithm, which counts candidate sequences of length k in the kth database scan. GSP therefore requires as many full data scans as the length of the longest frequent sequence.

3 Problem definition

Definition 1 (Trajectory) The trajectory of a moving object with identifier $o_j$ is defined as a finite sequence of points $\{(p_1, vt_1), (p_2, vt_2), \ldots, (p_n, vt_n)\}$ in the $X \times Y \times VT$ space, where point $p_i$ is represented by coordinates $(x_i, y_i)$ at the sampled time $vt_i$ for $1 \le i \le n$.

Assume that there are $N_o$ distinct moving objects. DB is defined as the union of time series of positions, $DB = \bigcup_{j=1}^{N_o} D_j$, where $D_j$ is a time series containing quadruples $(o_j, x_i^j, y_i^j, vt_i)$ for $1 \le j \le N_o$ and $1 \le i$.

The spatial organization of the map $M$ is represented as a set of regions. Each region is related to a specific thematic interpretation of space. So, $M$ is represented as a finite set of regions $\{a_1, \ldots, a_n\}$ such that $\bigcup_{i=1}^{n} a_i = M$ with $a_i \cap a_j = \emptyset$ for $i \ne j$. The moving possibility of an object from region to region is represented by a directed graph. After decomposing $M$, we get a hierarchical structure as introduced in [3]. However, in this study we assume that a region of the lower level is ‘fully contained’ in a region of the higher level.

Let $T$ be the maximal timestamp among the timestamps of the trajectories in the moving object database DB. Let $o_i^j$ denote the position of the moving object $o_j$, for $1 \le j \le N_o$, at timestamp $vt_i$
for $1 \le i \le T$. The trajectory of an object can be defined by the sequence of points $o_1^j\, o_2^j \ldots o_K^j$ for $1 \le K \le T$.

Definition 2 (Spatiotemporal sequence) Given a minimal temporal interval $\tau$, a spatiotemporal sequence is a list of temporally ordered region labels $S = \langle (a_1, t_1), (a_2, t_2), \ldots, (a_q, t_q) \rangle$, where $a_i$ contains $o_i^j$ for $q \le T$ and $1 \le i \le q$. The length of $S$ is $q$, and this length is returned by the function length(S). A location at time $t$ is called an event. A sequence composed of $k$ events is denoted as a $k$-sequence. For example, $\langle (R1, t_1), (R2, t_2), (R2, t_3) \rangle$ is a 3-sequence.

Definition 3 (Subsequence) For a sequence $S_1$, if region $a_1$ occurs before $a_2$, we denote it as $a_1 < a_2$. We say $S_1$ is a subsequence of another sequence $S_2$ if there exists a one-to-one order-preserving function $f$ that maps regions in $S_1$ to regions in $S_2$ such that for every $a_i \in S_1$: (1) $a_i \cap f(a_i) \ne \emptyset$, (2) if $a_i < a_j$ then $f(a_i) < f(a_j)$, and (3) $t_{a_{i+1}} - t_{a_i} = t_{f(a_{i+1})} - t_{f(a_i)}$.

Definition 4 (Frequent movement pattern) A trajectory is said to comply with a moving sequence $S$ if, for each region $a_i \in S$ at $vt_i$, the point $o_i^j$ of the trajectory is in $a_i$ at the same time. The support support(S) of the sequence $S$ is defined as the number of trajectories in DB complying with it. If support(S) ≥ min_sup, where min_sup is a user-specified minimum support threshold, then $S$ is called a frequent pattern.

Fig. 1 Spatiotemporal unit (labels: r, γ, M, time)

To control the density of a pattern region, the density-based partitioning method is exploited. Each region $a_i$ of pattern $S$ is dense if the set of positions $A_i = \{o_i^j \mid o_i^j \in a_i\}$ forms a dense cluster. According to the definition of [16], a dense cluster is defined with two parameters, $r$ and MinPts. We apply a modified version of the partitioning method that considers a multi-level spatiotemporal grid. Progressing from finer to coarser levels, one can find locally dense cells, which can later be combined with dense nearby grid cells to form
clusters. The size of a cell at the lowest level is decided based on the imprecision degree of the moving points, which will be presented below. In our case, MinPts is equal to the value min_sup × $N_o$. So, if all regions in $S$ are dense, then $S$ is frequent.

Problem definition: Given a database DB of trajectories along with the maximum speed $v_{max}$ of the moving objects, the sampling rate $t$, a reference map $M \subseteq \mathbb{R}^2$ decomposed into regions and accompanied by a directed graph, and a minimum support min_sup, the problem is (1) to discover all frequent movement patterns from the database and (2) to discover patterns with syntactic constraints.

4 Process of discovering behavioral movement patterns

4.1 Movement summarization

To make the representation of a trajectory more precise, we need to re-sample the moving points. The sampling error across time was proved to be an ellipse [15], given the object’s maximal speed $v_{max}$ and two consecutive moving points. The error ellipse is used as a measure for the size of the sampling error per line segment. In the worst case, the error is a circle, and this is the case we deal with here. To make the operation more flexible and simpler, we operate on its minimum bounding rectangle (MBR), which is also the cell of the map explained below.

For a grid threshold $r$ and without time, the reference map $M$ is decomposed into an $n_x \times n_y$ array of equal-sized cells. When including time, $M$ is decomposed into uniform spatiotemporal units (see Fig. 1). The choice of cell size $r$ will affect the accuracy of the obtained result. In fact, the object’s maximum velocity $v_{max}$ and the chosen re-sampling rate $\rho$ influence this choice. The re-sampling rate and cell size must be selected so that a trajectory produces at least one hit in each cell that it visits. As a rule of thumb, the parameters $r$ and $\rho$ must be selected such that $v_{max}/\rho$ is bounded in terms of $r\sqrt{2}$. Additionally, the temporal extent $\gamma$ is determined a priori and may change depending on the application. As a rule of thumb, it should be chosen such that ρ γ,
as ρ γ is a measure for the expected number of hits per cell [14].

The reference map $M$, whose origin is the point with coordinates $(x_0, y_0)$, is represented as a regular grid and stored in an array $D[1 : n_x, : n_y]$. Each element $D[i, j]$ corresponds to one cell $D_{ij}$, which is also a page to which the moving points are assigned. For a movement, we eliminate all consecutive points falling in the same cell and keep only the first point with its corresponding timestamp. Assume that after projecting all the moving points in the database DB into cells (pages), we obtain the result presented in Fig. 2a. We find that there are two cells, D20 and D12, containing more than one point, so we remove the second point in each of them, (25,7) and (16,29), respectively. Finally, the preprocessed database is represented in Fig. 2b.

4.2 Data set transformation

Physically, the data structure of each cell in the spatiotemporal sequence is constructed in the form $(D_{ij}, o_j, vt_i)$, in which $D_{ij}$ contains a pointer to the page $D[i, j]$ where the position of object $o_j$ at time $vt_i$ is stored. In case the lifespans of all trajectories belong to the same weekdays, we omit the date when representing timestamps. Figure 3 is an example of transforming time series of locations into spatiotemporal sequences with the minimum temporal interval $\gamma = 30$. Ultimately, the database of trajectories is converted into a set of spatiotemporal sequences, each associated with a distinct identifier $o_j$.

4.3 Strategy for mining all frequent patterns with syntactic constraints

The considered constraints include minimum gap, maximum gap, time window of validity of the pattern, and classes of frequent and confident rules. We directly prune candidates that violate syntactic constraints while finding frequent patterns. The task is accomplished by extending the algorithm AllMOP; the method here is named cAllMOP. It takes as its input the set MS of spatiotemporal sequences. The candidate generating mechanism of our technique
is based on the breadth-first search strategy used by GSP, with an additional temporal join operation and a technique for pruning candidates. Moreover, due to the complexity of the data type here, a clustering method is exploited to control the dense regions of the patterns. The concept of a directed graph also helps avoid the creation of redundant candidate patterns. The algorithm makes multiple passes, producing longer patterns on the basis of shorter ones, until no more patterns can be created.

Fig. 2 Example trajectories and result of generalization: (a) moving points stored in pages; (b) trajectories generalization (columns: oid, vt (Date, t), x, y, Page)

Fig. 3 Example spatiotemporal sequences (columns: oid, spatiotemporal sequence)

First, we explain how to find the frequent 1-patterns from which longer ones will be generated. Unlike the concept of items defined in GSP, not only the labels of pattern regions but also their shapes and sizes play an important role in the process of frequent pattern discovery. The shape and size of a region change from pattern to pattern; they therefore need to be automatically adjusted at each pass. This is the reason why the prior techniques cannot be directly applied to our problem. The issue is dealt with in the following way. First, the set of generalized trajectories is decomposed into groups of moving points, each $A_i = \{o_i^j \mid o_i^j \text{ is a position inside } a_i\}$ for one timestamp $vt_i$ ($1 \le i \le T$). Frequent 1-patterns are exactly the dense regions explained above. They are
obtained by clustering the points in the groups $A_i$. Specifically, to find them, for each timestamp $vt_i$ we scan the set MS to determine the frequency of each cell and keep only the frequent ones. Next, the consecutive dense cells belonging to the same region are merged into larger regions, which might be merged further to form clusters. The points lying in the sparse cells are assigned to the found clusters by applying a range query with diameter $r/\sqrt{2}$. The points belonging to no cluster are called outliers and are eliminated from the cells as soon as they are found. The empty cells are discarded at the same time. Frequent 1-patterns are maintained in the set $F_1$.

Figure 4a depicts a set of trajectories in a 2D space after passing through the trajectory generalization operation. Assume fixed values for the maximal timestamp $T$ and the min_sup count, which are used in the illustration throughout the algorithm. In this example, the reference map $M$ consists of six regions denoted by Rj (0 ≤ j ≤ 5). The numbers marked in each region Rj index the cells belonging to that region. A cell is denoted by the combination of the region’s index and a number (e.g., the cell indexed by the number 1 in region R1 is denoted as R11, and the merge of R11 and R12 is denoted as R112). Because we consider only the spatial relationship “full containment”, a cell is contained in just one region Rj. Therefore, each region Rj can be represented by a set of distinct cells (e.g., R1 is composed of three cells R11, R12, R13).

Moving points are first projected into cells, which have pointers to pages $D[i, j]$, one for each cell (denoted by the dashed line in the figure). The points of trajectories can then be retrieved by accessing the pages in which they are physically stored. From Fig. 4a, we can see that the starting point of an object at time $t_1$ lies logically in the cell R13 and is physically stored in the element D[2,2]. Its next position at time $t_2$ falls in the cell R01 and is stored in
D[3,2]. The dense regions are then found.

Figure 5a shows the groups of moving points obtained after partitioning the trajectories in Fig. 4a. Different groups are denoted by different shapes of points. Consider the example of finding dense regions at time $t_4$. We found three dense cells R31, R32, R33 referring to three different pages, namely D[3,1], D[4,4], and D[5,1]. Since all of them are neighboring cells and belong to the same spatial region R3, they are merged to create a cluster R3. However, the cluster R3 still points to the three pages corresponding to the original cells that created it. That is, regions are logically combined while physical pages are preserved. In contrast to R3, at time $t_4$ only one point falls into the cell R41 of the region R4. This point also cannot be assigned to any existing cluster, so it is an outlier and is discarded. The same operation is performed for the other groups of points; all frequent 1-patterns $F_1$ are finally obtained and displayed in Fig. 5b.

Fig. 4 Example dataset: (a) representation of trajectories in map
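The discovery of dense regions at a single timestamp can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `frequent_one_patterns`, the `(i, j)` cell coordinates, the 4-neighbour adjacency rule, and the sample data are all assumptions made for illustration. The sketch also omits the step in which points from sparse cells are assigned to clusters by a range query, so every point in a non-dense cell is simply treated as an outlier.

```python
from collections import defaultdict


def frequent_one_patterns(hits, region_of, min_count):
    """Return {timestamp: [(region, frozenset_of_cells), ...]} of dense regions.

    hits      -- iterable of (object_id, timestamp, cell), cell = (i, j) grid index
    region_of -- dict mapping each cell to the label of the region containing it
    min_count -- minimum number of distinct objects for a cell to be dense
    """
    # Collect the distinct objects that hit each cell at each timestamp.
    objects = defaultdict(set)
    for oid, t, cell in hits:
        objects[(t, cell)].add(oid)

    # Keep only dense cells; points in all other cells are dropped as outliers.
    dense = defaultdict(set)
    for (t, cell), oids in objects.items():
        if len(oids) >= min_count:
            dense[t].add(cell)

    patterns = {}
    for t, cells in dense.items():
        unvisited = set(cells)
        clusters = []
        while unvisited:
            seed = unvisited.pop()
            region = region_of[seed]
            cluster, stack = {seed}, [seed]
            # Flood fill over 4-neighbours that are dense and in the same region.
            while stack:
                i, j = stack.pop()
                for nb in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                    if nb in unvisited and region_of[nb] == region:
                        unvisited.remove(nb)
                        cluster.add(nb)
                        stack.append(nb)
            clusters.append((region, frozenset(cluster)))
        patterns[t] = clusters
    return patterns


# Hypothetical data loosely modelled on the t4 example: three neighbouring
# cells of region R3 are hit by two objects each, one cell of R4 by one object.
region_of = {(3, 1): "R3", (4, 1): "R3", (5, 1): "R3", (6, 1): "R4"}
hits = [
    ("o1", 4, (3, 1)), ("o2", 4, (3, 1)),
    ("o3", 4, (4, 1)), ("o4", 4, (4, 1)),
    ("o5", 4, (5, 1)), ("o6", 4, (5, 1)),
    ("o7", 4, (6, 1)),
]
patterns = frequent_one_patterns(hits, region_of, min_count=2)
```

With these inputs the three R3 cells end up in one cluster that still remembers its constituent cells, mirroring the remark that regions are combined logically while pages are preserved, and the lone point in the R4 cell is dropped.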