An Efficient Algorithm for Mining Maximal Co-location Pattern Using Instance-trees

2021 8th NAFOSTED Conference on Information and Computer Science (NICS)

An Efficient Algorithm for Mining Maximal Co-location Pattern Using Instance-trees

Dai Phong Le, Cao Dai Pham, Van Tuan Luu (Institute of System Integration, Le Quy Don Technical University, Hanoi, Vietnam; ledaiphong.isi@lqdtu.edu.vn, daipc.isi@lqdtu.edu.vn, tuanlv.isi@lqdtu.edu.vn); Vanha Tran (Dept. of Information Technology Specialization, FPT University, Hanoi, Vietnam; hatv14@fe.edu.vn); Dang Hai Nguyen (Institute of System Integration, Le Quy Don Technical University, Hanoi, Vietnam; nguyendanghai.mta@gmail.com)

Abstract—Prevalent co-location patterns, which refer to groups of features whose instances frequently appear together in nearby geographic space, are one of the main branches of spatial data mining. As the data volume continues to increase, discovering all patterns produces a redundant result. Maximal co-location patterns (MCPs) are a compressed representation of all these patterns, and they provide new insight into the interaction among different spatial features for discovering more valuable knowledge from data sets. The increasing volume of spatial data sets makes discovering MCPs very challenging. We dedicate this study to designing an efficient MCP mining algorithm. First, the features in size-2 patterns are regarded as a sparse graph, and MCP candidates are generated by enumerating maximal cliques from this graph. Second, we design two instance-tree structures, the star neighbor-based and sibling node-based instance-trees, to store the neighbor relationships of instances. All maximal co-location instances of the candidates are yielded efficiently from these instance-tree structures. Finally, an MCP candidate is marked as prevalent if its participation index, which is calculated based on the maximal co-location instances, is not smaller than a minimum prevalence
threshold given by users. The efficiency of the proposed algorithm is proved by comparison with previous algorithms on both synthetic and real data sets.

Index Terms—data mining, maximal co-location pattern, star neighbor, instance-tree

I. INTRODUCTION

With the development of global positioning system (GPS)-enabled mobile and hand-held devices, many applications are designed based on geo-location data, e.g., peer-to-peer ridesharing, ride-hailing services, and food delivery. The valuable knowledge discovered from spatial data makes these application services more and more accurate, and they can provide personalized services. Prevalent co-location patterns (PCPs), which are groups of spatial features (e.g., hotels, restaurants, and convenience stores in point of interest data) with their instances (e.g., a specific hotel, restaurant, or convenience store), are one of the main branches of spatial data mining.

Fig. 1 shows the distribution of a point of interest data set in Tokyo, Japan. As can be seen, the instances of hotels, restaurants, and convenience stores are frequently located in neighborhoods of each other. Four PCPs are formed in this data set: {Hotel, Restaurant}, {Hotel, Convenience store}, {Restaurant, Convenience store}, and {Hotel, Restaurant, Convenience store}. {Hotel, Restaurant, Convenience store} is an MCP because it has no super-PCPs. PCP mining has been proved to be an effective tool for discovering valuable knowledge from spatial data sets, and it is applied to many fields such as environmental management [1], mobile communications [2], social science [3], and location-based services [4].

Fig. 1: The distribution of a point of interest data set.

If a PCP has no super-patterns, it is an MCP. Discovering MCPs is challenging when the numbers of features and instances are large and/or the distribution of the data is dense. In this study, we focus on developing an efficient MCP mining algorithm by employing two efficient instance-tree structures. The two structures are designed to store the neighbor relationships of instances, so that the co-location instances of MCP candidates can be collected efficiently from them; therefore, the efficiency of the mining process is improved.

The remainder of this study is organized as follows. Section II states the problem and reviews related work. Section III presents the proposed algorithms. Section IV verifies the mining efficiency of our algorithms by experiments. Section V concludes this work.

978-1-6654-1001-4/21/$31.00 ©2021 IEEE

Fig. 2: An example of co-location pattern mining. The left part is the point distribution (f.i denotes the i-th instance of feature f; lines denote neighbor relationships); the right part lists each candidate c with its table instance T(c), participation ratios PR(c, f), and participation index PI(c):

Candidate c | T(c) | PR | PI(c)
{A, B} | {A.2, B.2}, {A.3, B.1}, {A.3, B.4}, {A.4, B.3} | 3/4, 4/4 | 0.75
{A, C} | {A.1, C.3}, {A.2, C.4}, {A.3, C.2} | 3/4, 3/4 | 0.75
{A, D} | {A.1, D.2}, {A.2, D.1}, {A.3, D.3} | 3/4, 3/3 | 0.75
{A, E} | {A.2, E.2} | 1/4, 1/2 | 0.25
{B, C} | {B.1, C.2}, {B.2, C.4}, {B.3, C.1}, {B.4, C.3} | 4/4, 4/4 | 1
{B, D} | {B.1, D.3}, {B.2, D.1}, {B.4, D.2}, {B.4, D.3} | 3/4, 3/3 | 0.75
{C, D} | {C.1, D.1}, {C.1, D.2}, {C.1, D.3}, {C.2, D.3}, {C.4, D.1} | 3/4, 3/3 | 0.75
{C, E} | {C.3, E.1} | 1/4, 1/2 | 0.25
{A, B, C} | {A.2, B.2, C.4}, {A.3, B.1, C.2} | 2/4, 2/4, 2/4 | 0.5
{A, B, D} | {A.2, B.2, D.1}, {A.3, B.1, D.3}, {A.3, B.4, D.3} | 2/4, 3/4, 2/3 | 0.5
{A, C, D} | {A.2, C.4, D.1}, {A.3, C.2, D.3} | 2/4, 2/4, 2/3 | 0.5
{B, C, D} | {B.1, C.2, D.3}, {B.2, C.4, D.1} | 2/4, 2/4, 2/3 | 0.5
{A, B, C, D} | {A.2, B.2, C.4, D.1}, {A.3, B.1, C.2, D.3} | 2/4, 2/4, 2/4, 2/3 | 0.5

II. PROBLEM STATEMENT AND RELATED WORK

A. Problem statement

Given: (1) a set of spatial features F = {f1, ..., fm} and a set of their instances I = {I1, ..., Im}, where Ii (1 ≤ i ≤ m) corresponds to the instances of feature fi and each instance in Ii is a triple ⟨feature type, instance ID, location⟩; (2) a neighbor relationship R on the instance set I, where R normally uses a Euclidean distance metric with a distance threshold d: if the distance between two instances that belong to different feature types is smaller than or equal to d, the two instances have a neighbor relationship; and (3) a minimum prevalence threshold minprev to evaluate the prevalence of a pattern.

A subset of F, c = {f1, ..., fk} (1 ≤ k ≤ m), is a size-k co-location pattern. I(c) is a co-location instance of c whose instances have the neighbor relationship R with each other. The set of all I(c) is called the table instance of c, T(c). The participation ratio of feature fi in c, denoted PR(c, fi), is the fraction of the instances of fi that participate in T(c). The participation index of c is PI(c) = min{PR(c, fi) | fi ∈ c}. If PI(c) is not smaller than minprev, c is marked as a PCP. If a PCP c has no prevalent super-patterns, c is called an MCP.

Fig. 2 shows an example of co-location pattern mining. There are five features, A, B, C, D, and E; the instances of A are A.1, A.2, A.3, and A.4. Assume that c = {A, B, C, D} is a candidate and the co-location instances of c are {A.2, B.2, C.4, D.1} and {A.3, B.1, C.2, D.3}. The participation ratio of each feature in c is PR(c, A) = 2/4, PR(c, B) = 2/4, PR(c, C) = 2/4, and PR(c, D) = 2/3. Thus, PI(c) = min{2/4, 2/4, 2/4, 2/3} = 0.5. If users set minprev = 0.4, then PI(c) > minprev, hence c is a PCP. Similarly, {A, B}, {A, C}, {A, D}, {B, C}, {C, D}, {B, D}, {A, B, C}, {A, C, D}, and {B, C, D} are prevalent. Since {A, B, C, D} has no prevalent super-patterns, it is an MCP, while {A, B, C} is not an MCP since it has a prevalent super-pattern. The problem of co-location pattern mining is to discover all PCPs from a given data set; furthermore, to represent the mining result compactly, the set of MCPs is required.

B. Related work

Join-based [5] is known as the first algorithm in the PCP mining domain. It uses an expensive join operation to collect table instances. To tackle this weakness, many algorithms that no longer use join operations have been developed [6]–[8]. However, the algorithms mentioned above have difficulty handling the increasing volume of data.
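As a concrete companion to the worked example in Section II-A, the participation-ratio and participation-index computation can be sketched in Python. This is a minimal illustration of the definitions, not the authors' implementation; all names are ours, and the instance counts come from the Fig. 2 example.

```python
from collections import defaultdict

def participation_index(candidate, table_instance, n_instances):
    """PI(c) = min over features f in c of PR(c, f), where PR(c, f) is the
    fraction of instances of f that appear in the table instance T(c)."""
    participating = defaultdict(set)
    for row in table_instance:          # one co-location instance per row
        for inst in row:                # "A.2" -> feature "A"
            participating[inst.split(".")[0]].add(inst)
    return min(len(participating[f]) / n_instances[f] for f in candidate)

# Example of Section II-A: c = {A, B, C, D} with two co-location instances.
T = [("A.2", "B.2", "C.4", "D.1"), ("A.3", "B.1", "C.2", "D.3")]
counts = {"A": 4, "B": 4, "C": 4, "D": 3}
pi = participation_index(("A", "B", "C", "D"), T, counts)
# min(2/4, 2/4, 2/4, 2/3) = 0.5
```

With minprev = 0.4, the candidate passes the prevalence test, matching the worked example above.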
To cope with this, many PCP mining algorithms on big data have been proposed [9]–[11].

The mining result normally contains a large number of PCPs, and it is difficult for users to absorb, understand, and apply them; hence, the notion of MCPs was proposed. Yoo et al. [12] designed an MCP mining algorithm called MAXColoc. It converts instance neighborhood transactions to feature neighborhood transactions and then builds a feature type tree to generate candidates. The table instance of each candidate is collected by a star instance mechanism. However, this mechanism becomes very time-consuming when data sets are dense or large, since it needs to examine the neighbor relationships of the instances in all subsets of a candidate. An order-clique-based (OCB) approach for discovering MCPs has also been developed [13]. Its candidates are generated by a P2-tree. To collect co-location instances, it constructs two tree structures: a Neib-tree to save the neighbor relationships of instances and an Ins-tree to collect co-location instances. However, when data sets are dense or big, these trees become very luxuriant: copying all sub-trees of a candidate from the Neib-tree to the Ins-tree takes a lot of time, and a large amount of memory is needed since both trees must be kept in memory during the whole mining process. A sparse-graph and condensed tree-based (SGCT) algorithm [14] was developed recently to mine MCPs. Its candidates are generated by a maximal clique enumeration algorithm [15], and the table instance of each candidate is collected through a hierarchical verification scheme that constructs a condensed instance-tree. However, the scheme is a one-by-one inspection; when data is dense and candidates are long, it becomes very expensive and the performance of SGCT drops sharply.

In summary, the mentioned MCP algorithms address two aspects: (1) reducing the number of MCP candidates, and (2) building various data structures to collect table instances efficiently. However, each algorithm has its own disadvantages when dealing with dense and/or large data sets. Regarding the first aspect, because the number of features in practical applications is small (generally within 100) [13], there is little difference in efficiency between the various methods of generating MCP candidates. We therefore devote our full attention to the second aspect by developing two instance-tree structures.

III. THE PROPOSED ALGORITHMS

Fig. 3: The proposed mining framework (inputs: (1) a data set, (2) d, (3) minprev; phases: materialize neighbor relationships, find size-2 patterns, generate maximal candidates, construct instance-trees, calculate PIs and filter maximal patterns).

Fig. 3 shows the framework of the proposed algorithm. The first phase requires users to input a spatial data set, a distance threshold d, and a minimum prevalence threshold minprev. The neighbor relationships of instances are materialized under d in the second phase. The third phase finds size-2 PCPs. A set of MCP candidates is generated from the size-2 patterns in the fourth phase. The fifth phase collects the table instance of each candidate by constructing an instance-tree. The sixth phase calculates participation indexes and filters prevalent MCPs. In this study, we mainly focus on the instance-tree construction phase, for which two efficient instance-trees are devised.

A. Star neighbors

Definition 1: The star neighbor (SN) of an instance iq is defined as the set of instances jp that have a neighbor relationship with iq, SN(iq) = {jp | jp > iq, p ≠ q, 1 ≤ p, q ≤ m}; iq is called the center instance.

For example, Table I lists the star neighbor of each instance in the data set shown in Fig. 2.

TABLE I: Star neighbors of instances in Fig. 2 ("-" denotes an empty set)

Center instance | Star neighbor instances
A.1 | C.3, D.2
A.2 | B.2, C.4, D.1, E.2
A.3 | B.1, B.4, C.2, D.3
A.4 | B.3
B.1 | C.2, D.3
B.2 | C.4, D.1
B.3 | C.1
B.4 | C.3, D.2, D.3
C.1 | D.1, D.2, D.3
C.2 | D.3
C.3 | E.1
C.4 | D.1
D.1 | -
D.2 | -
D.3 | -
E.1 | -
E.2 | -

B. Generating candidates

According to the anti-monotonicity property of PCPs [6], if a size-k (k > 2) pattern c is prevalent, all size-2 patterns generated by the features in c must be prevalent. Hence, size-k candidates can be generated based on size-2 PCPs. It is easy to see that the relationship of the features in size-2 PCPs can be plotted as an undirected graph G2F whose vertices are the features of the size-2 PCPs and whose edges are these size-2 PCPs.

Definition 2: A size-2 feature graph G2F(V, E) has a set of vertices V = {fi | fi is a feature of a size-2 PCP} and a set of edges E = {(fi, fj) | {fi, fj} is a size-2 PCP}.

For example, for the data set in Fig. 2, if users set minprev = 0.2, the size-2 PCPs are {A, B}, {A, C}, {A, D}, {A, E}, {B, C}, {B, D}, {C, D}, and {C, E}. Fig. 4 illustrates the G2F graph constructed from these size-2 PCPs. It can be seen that the MCP candidates are exactly the maximal cliques enumerated in G2F.

Fig. 4: The relationship of the features in size-2 CPs (an undirected graph over A, B, C, D, E).

To enumerate all maximal cliques from G2F, we employ a maximal clique enumeration algorithm developed in [12], [15]. Algorithm 1 describes the process of generating MCP candidates from size-2 PCPs, where Γ(fi) is the set of vertices directly connected to fi. For details of Algorithm 1, please refer to [15]. For example, running Algorithm 1 on the graph in Fig. 4 yields two maximal cliques, {A, B, C, D} and {A, C, E}; the two are the candidates for discovering prevalent MCPs.

Algorithm 1: Generating candidate maximal patterns
Input: an undirected graph constructed by size-2 prevalent patterns, G2F(V, E);
Output: a set of candidate maximal patterns, CMPs;
1 Initialize P = V, Q = ∅, X = ∅;
2 for fi in a degeneracy ordering f1, ..., fm of (V, E)
3   P = Γ(fi) ∩ {fi+1, ..., fm};
4   X = Γ(fi) ∩ {f1, ..., fi−1};
5   BronKerboschPivot(P, {fi}, X);
6 end
7 return CMPs;
BronKerboschPivot(P, Q, X):
8 if P ∪ X = ∅ then
9   CMPs.add(Q);
10 end
11 Choose a pivot u in P ∪ X with |P ∩ Γ(u)| = max over v ∈ P ∪ X of |P ∩ Γ(v)|;
12 for v ∈ P \ Γ(u)
13   BronKerboschPivot(P ∩ Γ(v), Q ∪ {v}, X ∩ Γ(v));
14   P = P \ {v};
15   X = X ∪ {v};
16 end
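Algorithm 1 is essentially Bron–Kerbosch maximal clique enumeration with pivoting over G2F. A compact Python sketch, run on the graph of Fig. 4, is given below; it omits the degeneracy-ordering outer loop, which only affects efficiency, not the output. This is our own illustration, not the authors' code.

```python
def bron_kerbosch_pivot(P, Q, X, graph, out):
    """Enumerate maximal cliques: Q is the growing clique, P the candidate
    vertices, X the already-processed vertices."""
    if not P and not X:
        out.append(frozenset(Q))
        return
    # Pivot u maximizes |P ∩ Γ(u)| over P ∪ X, as in line 11 of Algorithm 1.
    pivot = max(P | X, key=lambda u: len(P & graph[u]))
    for v in set(P - graph[pivot]):
        bron_kerbosch_pivot(P & graph[v], Q | {v}, X & graph[v], graph, out)
        P = P - {v}
        X = X | {v}

# G2F from Fig. 4: the edges are the size-2 PCPs of the running example.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
         ("B", "C"), ("B", "D"), ("C", "D"), ("C", "E")]
graph = {v: set() for e in edges for v in e}
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)

cliques = []
bron_kerbosch_pivot(set(graph), set(), set(), graph, cliques)
# yields the two MCP candidates {A, B, C, D} and {A, C, E}
```

The pivot step prunes the recursion so that each maximal clique is reported exactly once, which is why the candidate-generation phase contributes so little to the total cost in Table II.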
C. A star neighbor-based instance-tree to collect co-location instances

Definition 3: A star neighbor-based instance-tree (STN-IT) of a candidate is defined as follows: (1) the tree has one root, which is the center instance determined by an item in the SNs; (2) each node is an instance in the star neighbor of the center instance; (3) a qualified node is in the intersection of the star neighbor of its parent with the star neighbor of the center instance; (4) the tree-depth of the STN-IT is equal to k − 1, where k is the size of the candidate.

Algorithm 2 shows the pseudocode for constructing a star neighbor-based instance-tree to collect the table instance of an MCP candidate c. The first phase initializes a tree STNIT using an item it whose center instance has the feature type of the first feature in c and whose star neighbor contains all the remaining feature types in c; the instances in the star neighbor whose feature type equals the second feature in c are added as children (Step 1). A variable depth holds the tree-depth. The second phase iterates over each leaf leaf in STNIT and computes the intersection of the star neighbor of it and the star neighbor of leaf (Steps 2–5). The third phase adds the result of the intersection as the children of leaf (Steps 6–7); if the intersection is empty, the leaf is removed (Step 9). The fourth phase deletes all leaves whose depths are smaller than the size of c (Step 10).

For example, SN(A.2) = {B.2, C.4, D.1, E.2} and it is a satisfied item of c = {A, B, C, D}. The STN-IT of A.2 is plotted in Fig. 5. First, A.2 is added as the root and B.2 is added as a child of A.2 (Fig. 5a). Then the intersection of it and the star neighbor of B.2 is computed, com = it ∩ SN(B.2) = {B.2, C.4, D.1, E.2} ∩ {C.4, D.1} = {C.4, D.1}; thus C.4 and D.1 are added as children of B.2 (Fig. 5b). Fig. 5c shows the tree after appending the children of C.4. Next, D.1, which is a child of B.2, is deleted since it is a leaf whose depth is smaller than the size of c (k = 4). Finally, {A.2, B.2, C.4, D.1} is regarded as a co-location instance of {A, B, C, D}.

Fig. 5: The star neighbor-based instance-tree of A.2. (a) Initialize a STN-IT tree. (b) Add the children of B.2. (c) Add the children of C.4. (d) The final STN-IT tree.

Algorithm 2: Constructing a STN-IT tree
Input: a candidate maximal pattern, c; an item it in the SCSNIs of c; the star neighbors, SN;
Output: a star neighbor-based instance-tree, STNIT;
1 STNIT = initialTree(c, it);
2 depth = STNIT.getDepth;
3 while depth < k
4   for leaf ∈ STNIT.getLeaves
5     if leaf.getDepth == depth then
6       com = getIntersection(it, SN(leaf));
7       if com ≠ ∅ then
8         STNIT.addChildren(leaf, com);
9       else
10        STNIT.delete(leaf);
11      end
12    end
13  end
14  depth = STNIT.getDepth;
15 end
16 STNIT = refinementTree(STNIT);
17 return STNIT;

D. A sibling node-based instance-tree to collect co-location instances

The star neighbor-based instance-tree is constructed for each instance in the satisfied candidate star neighbor items. In this section, a new instance-tree that deals with all the instances simultaneously is designed.

Definition 4: A sibling node-based instance-tree (SBN-IT) of a candidate is defined as follows: (1) the tree has one root named Root; (2) the children of the root are the center instances in the SNs; (3) a qualified node is in the intersection of the sibling nodes of its parent with the star neighbor of its parent; (4) the tree-depth of the SBN-IT is equal to the size of the candidate.

Algorithm 3 describes the pseudocode of the process of constructing an SBN-IT. A sibling node-based instance-tree SBNIT is initialized in the first phase: the children of the root are all the center instances in the satisfied candidate star neighbor items, and all the star neighbors of the center instances are added as their children (Step 1). The second phase iterates over each leaf of SBNIT to get the intersection between the sibling nodes and the star neighbor of the leaf (Steps 3–6). Note that only a leaf whose feature type equals the feature at index (depth − 1) in the candidate is processed, where depth is the tree-depth of SBNIT; the other leaves can be deleted directly. In the third phase, the intersection is appended as children of the leaf if it is not empty (Steps 7–8). Finally, a refinement function is called to delete all leaves whose depths are not equal to the size of the candidate (Step 14).

Algorithm 3: Constructing a SBN-IT tree
Input: a candidate maximal pattern, c; all items in the SCSNIs, items;
Output: a sibling node-based instance-tree, SBNIT;
1 SBNIT = initialTree(c, items);
2 depth = SBNIT.getDepth;
3 while depth ≤ k
4   for leaf ∈ SBNIT.getLeaves
5     if leaf.feature == c[depth − 1] then
6       sibl = getSibling(leaf);
7       com = getIntersection(sibl, SN(leaf));
8       if com ≠ ∅ then
9         SBNIT.addChildren(com);
10      else
11        SBNIT.delete(leaf);
12      end
13    else
14      SBNIT.delete(leaf);
15    end
16  end
17  depth = SBNIT.getDepth;
18 end
19 SBNIT = refinementTree(SBNIT);
20 return SBNIT;

Fig. 6: Constructing the SBN-IT for {A, B, C, D}. (a) Initialize a SBN-IT tree. (b) Add children. (c) The final SBN-IT tree.
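The effect of both instance-trees can be illustrated by a small sketch that collects the table instance of a candidate by repeatedly intersecting star neighbors, in the spirit of Definition 3. This is our own simplified recursion rather than the paper's tree structures; the star neighbors are taken from Table I (only the entries needed for this candidate).

```python
def feature(inst):
    return inst.split(".")[0]          # "A.2" -> "A"

def table_instance(candidate, SN):
    """Collect all co-location instances of `candidate` by growing paths whose
    nodes stay inside the intersection of the star neighbors seen so far."""
    rows = []

    def grow(path, allowed, idx):
        if idx == len(candidate):
            rows.append(tuple(path))
            return
        for inst in sorted(allowed):
            if feature(inst) == candidate[idx]:
                grow(path + [inst], allowed & SN.get(inst, set()), idx + 1)

    for center in sorted(SN):
        if feature(center) == candidate[0]:
            grow([center], SN[center], 1)
    return rows

# Star neighbors from Table I.
SN = {"A.2": {"B.2", "C.4", "D.1", "E.2"}, "A.3": {"B.1", "B.4", "C.2", "D.3"},
      "B.1": {"C.2", "D.3"}, "B.2": {"C.4", "D.1"}, "B.4": {"C.3", "D.2", "D.3"},
      "C.2": {"D.3"}, "C.4": {"D.1"}}

rows = table_instance(("A", "B", "C", "D"), SN)
# -> [("A.2", "B.2", "C.4", "D.1"), ("A.3", "B.1", "C.2", "D.3")]
```

The intersection step prunes E.2 and the dead-end branch through B.4 without any pairwise one-by-one verification, which mirrors how the leaf deletions in Figs. 5 and 6 shrink the search space.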
For example, Fig. 6 presents the process of constructing the sibling node-based instance-tree for the candidate pattern c = {A, B, C, D}. First, the satisfied candidate star neighbor items are {A.2: B.2, C.4, D.1, E.2} and {A.3: B.1, B.4, C.2, D.3}. In the first phase, a SBN-IT tree is constructed as shown in Fig. 6a, with a root whose two children are A.2 and A.3; all instances in the star neighbors of A.2 and A.3 are added as their children, respectively. Iterating over each leaf in the SBN-IT tree, consider B.2: the sibling nodes of B.2 are sibl = {C.4, D.1, E.2} and the star neighbor of B.2 is SN(B.2) = {C.4, D.1}, so com = sibl ∩ SN(B.2) = {C.4, D.1}; thus C.4 and D.1 are added as children of B.2. In the next iteration, C.4 is considered. Since the feature type of C.4 is C and it differs from c[depth − 1] = B (the tree-depth is now 2), C.4 is removed directly. Fig. 6b shows the result when all sibling nodes of feature B are processed. The complete SBN-IT tree is plotted in Fig. 6c; it can be seen that each branch of the tree is a co-location instance of the candidate.

IV. COMPUTATIONAL EXPERIMENTS

A set of experiments is designed to evaluate the performance of the proposed algorithm. When the framework in Fig. 3 uses the star neighbor-based instance-tree and the sibling node-based instance-tree, we name the mining algorithm MCPM-STN-IT and MCPM-SBN-IT, respectively. SGCT [14] is chosen for comparison since it is the most recent MCP mining algorithm and has been proven superior to MAXColoc [12] and OCB [13]. All algorithms are implemented in C++ and run on an Intel Core i7-3770 3.40 GHz PC running Windows with 16 GB main memory.

A. Data sets

Two synthetic data sets are generated by a synthetic data generator similar to [5]. The numbers of features and instances of both are 50 and 20,000, respectively. The spatial areas are set to 500×500 for the dense data and 1000×1000 for the sparse data. Moreover, two real POI data sets are used in our experiments. They are collected from facilities such as banks, parking lots, and hotels in Guangzhou (49,566 instances, 44 features) and Shanghai (67,824 instances, 50 features), China. Their distributions are plotted in Fig. 7.

Fig. 7: The distribution of (a) Guangzhou. (b) Shanghai.

B. Performance study

a) The effectiveness: Table II lists the execution time of each phase of each algorithm. For the sparse data set, the distance and prevalence thresholds are set to 16 and 0.4, respectively; for the dense data set, they are set to 13 and 0.6. It can be seen that: (1) the proportion of time spent generating MCP candidates is very small in the total cost; (2) the largest fraction of the computation time is devoted to constructing instance-trees to collect table instances. The neighbor relationships of instances in SGCT are verified one by one, which takes more execution time than the proposed algorithms, which effectively reduce the search space. The gap in computation time becomes larger when the data sets are dense.

TABLE II: The execution time (s) of each phase of the algorithms

Factor | SGCT sparse | SGCT dense | MCPM-STN-IT sparse | MCPM-STN-IT dense | MCPM-SBN-IT sparse | MCPM-SBN-IT dense
T_gen_neighbors | 0.159 | 0.247 | 0.149 | 0.212 | 0.195 | 0.21
T_find_size2_patterns | 0.321 | 0.625 | 0.28 | 0.498 | 0.287 | 0.494
T_gen_candidates | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 | 0.003
T_constr_inst_trees | 29.541 | 316.139 | 1.137 | 124.912 | 1.924 | 19.889
T_calc_PI_filter_patterns | 0.294 | 1.537 | 0.068 | 0.403 | 0.081 | 0.324
T_total | 30.318 | 318.551 | 1.637 | 126.028 | 2.490 | 20.920

Fig. 8: The execution times on different numbers of instances. (a) Sparse. (b) Dense.

b) The scalability: First, we compare the effect of different numbers of instances. As shown in Fig. 8, as the number
of instances increases, the proposed algorithms show better performance.

Fig. 9: The scalability under different distance thresholds on (a) synthetic sparse data, (b) dense data, (c) Guangzhou, and (d) Shanghai.

Second, we evaluate the performance of the proposed algorithms under different distance thresholds. Figs. 9a and 9b show the results on the synthetic data sets when the prevalence thresholds are fixed at 0.4 and 0.6 for the sparse and dense data sets, respectively. Figs. 9c and 9d compare the computation time of the algorithms on the two real data sets with the prevalence threshold set to 0.4. As can be seen, MCPM-STN-IT and MCPM-SBN-IT require less execution time.

Third, the scalability of the proposed algorithms with respect to the minimum prevalence threshold is examined. We set the distance thresholds to 20 and 13 for the sparse and dense data sets, respectively; Figs. 10a and 10b show the results. Overall, as the prevalence threshold increases, the execution times of all algorithms decrease; however, SGCT takes much more execution time at small prevalence thresholds. When the algorithms are run on the two real data sets, the distance thresholds are set to 300 m and 250 m, respectively; the comparison of the execution times is shown in Figs. 10c and 10d. As can be seen, the proposed algorithms again show better performance.

Fig. 10: The scalability under different prevalence thresholds on (a) sparse data, (b) dense data, (c) Guangzhou, and (d) Shanghai.

V. CONCLUSION AND FUTURE WORK

Two efficient instance-trees, named STN-IT and SBN-IT, are designed in this study to collect the table instances of candidate maximal co-location patterns. The two instance-tree structures effectively reduce the search space when examining the neighbor relationships of instances, and their fast construction speeds up the collection of table instances; therefore, the performance of discovering maximal co-location patterns is improved. Experiments on both synthetic and real data sets show that the proposed algorithms are more efficient than the existing algorithms.

REFERENCES

[1] W. Liu, Q. Liu, M. Deng, J. Cai, and J. Yang, "Discovery of statistically significant regional co-location patterns on urban road networks," International Journal of Geographical Information Science, pp. 1–24, 2021.
[2] V. Tran, L. Wang, and H. Chen, "Discovering spatial co-location patterns by automatically determining the instance neighbor," in Fuzzy Systems and Data Mining V. IOS Press, 2019, pp. 583–590.
[3] Z. He, M. Deng, Z. Xie, L. Wu, Z. Chen, and T. Pei, "Discovering the joint influence of urban facilities on crime occurrence using spatial co-location pattern mining," Cities, vol. 99, p. 102612, 2020.
[4] V. Tran and L. Wang, "Delaunay triangulation-based spatial colocation pattern mining without distance thresholds," Statistical Analysis and Data Mining, vol. 13, no. 3, pp. 282–304, 2020.
[5] Y. Huang, S. Shekhar, and H. Xiong, "Discovering colocation patterns from spatial data sets: a general approach," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 12, pp. 1472–1485, 2004.
[6] J. S. Yoo and S. Shekhar, "A joinless approach for mining spatial colocation patterns," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 10, pp. 1323–1337, 2006.
[7] V. Tran, L. Wang, and L. Zhou, "A spatial co-location pattern mining framework insensitive to prevalence thresholds based on overlapping cliques," Distributed and Parallel Databases, pp. 1–38, 2021.
[8] V. Tran, L. Wang, and L. Zhou, "Mining spatial co-location patterns based on overlap maximal clique partitioning," in 20th IEEE International Conference on Mobile Data Management, 2019, pp. 467–472.
[9] A. M. Sainju and Z. Jiang, "Mining colocation from big geo-spatial event data on GPU," 2021.
[10] J. S. Yoo, D. Boulware, and D. Kimmey, "Parallel co-location mining with MapReduce and NoSQL systems," Knowledge and Information Systems, pp. 1–31, 2019.
[11] A. M. Sainju, D. Aghajarian, Z. Jiang, and S. Prasad, "Parallel grid-based colocation mining algorithms on GPUs for big spatial event data," IEEE Transactions on Big Data, vol. 6, no. 1, pp. 107–118, 2018.
[12] J. S. Yoo and M. Bow, "Mining maximal co-located event sets," in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2011, pp. 351–362.
[13] L. Wang, L. Zhou, J. Lu, and J. Yip, "An order-clique-based approach for mining maximal co-locations," Information Sciences, vol. 179, no. 19, pp. 3370–3382, 2009.
[14] X. Yao and L. Peng, "A fast space-saving algorithm for maximal co-location pattern mining," Expert Systems with Applications, vol. 63, pp. 310–323, 2016.
[15] D. Eppstein and D. Strash, "Listing all maximal cliques in large sparse real-world graphs," in International Symposium on Experimental Algorithms. Springer, 2011, pp. 364–375.
