Combination of dynamic bit vectors and transaction information for mining frequent closed sequences efficiently

Engineering Applications of Artificial Intelligence 38 (2015) 183–189

Minh-Thai Tran (a), Bac Le (b), Bay Vo (c,*)

(a) Faculty of Information Technology, Information Technology College, Ho Chi Minh City, Vietnam
(b) Department of Computer Science, University of Science, VNU-Ho Chi Minh, Vietnam
(c) Faculty of Information Technology, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam

Article history: Received 24 May 2014. Received in revised form 23 October 2014. Accepted 28 October 2014.

Keywords: Dynamic bit vector; Frequent closed sequence; CloFS-DBV

Abstract

Sequence mining algorithms attempt to mine all possible frequent sequences. These algorithms produce redundant results, increasing the required storage space and runtime, especially for large sequence databases. In recent years, many studies have shown that mining frequent closed sequences is more efficient than mining all frequent sequences. The desired information can be fully extracted from frequent closed sequences. Most algorithms for mining frequent closed sequences use a candidate maintenance-and-test paradigm. The present paper proposes an algorithm called CloFS-DBV that uses dynamic bit vectors. Various methods are employed to reduce memory usage and runtime. Experimental results show that CloFS-DBV is more efficient than the BIDE and CloSpan algorithms in terms of execution time and memory usage.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Sequential pattern mining is a fundamental problem in knowledge discovery and data mining with broad applications, including the analysis of customer purchase behavior, web access patterns, scientific experiments, disease treatment, natural disaster prevention, and protein
formation. Sequential pattern mining includes two main stages: frequent pattern mining and rule mining. Many studies have modified the AprioriAll algorithm (Agrawal and Srikant, 1995) for mining frequent sequential patterns. Unlike the general mining of frequent sequences, the mining of frequent closed sequences has not been extensively studied. Although some algorithms have been proposed, such as CloSpan (Yan et al., 2003), CLOSET+ (Wang et al., 2003), and BIDE (Wang et al., 2007), their performance is poor for large databases. BIDE detects frequent sequences that are not closed and prunes such candidates early, instead of using a maintenance-and-test paradigm. Recently, many authors have proposed techniques that present data in a vertical format (Song et al., 2005), use database projection (Pei et al., 2001), or use bit vector data structures (Song et al., 2008), all of which have been shown to be effective. However, the storage space and execution time can be further reduced in the mining process for large sequence databases. The present study proposes the CloFS-DBV algorithm, which uses a vertical data format and data compression, and divides the search space to reduce the required storage space and execution time for mining frequent closed sequences.

The rest of the paper is organized as follows. Section 2 gives the problem definition. Section 3 summarizes related work. Sections 4 and 5 present the proposed algorithm and experimental results, respectively. The conclusions and future work are given in Section 6.

* Corresponding author. E-mail addresses: minhthai@itc.edu.vn (M.-T. Tran), lhbac@fit.hcmus.edu.vn (B. Le), bayvodinh@gmail.com (B. Vo).
http://dx.doi.org/10.1016/j.engappai.2014.10.021

2. Problem definition

Consider a sequence database with a set of distinct events $I = \{i_1, i_2, i_3, \ldots, i_n\}$, where $i_j$ is an event (or an item) and $1 \le j \le n$. A set of unordered events is called an itemset. Each itemset is put in brackets, for example
$(ABC)$. To simplify notation, the brackets are omitted for itemsets that contain only a single item, for example $B$. A sequence $S = \langle e_1, e_2, e_3, \ldots, e_m \rangle$ is an ordered list of events, where each $e_j$ $(1 \le j \le m)$ is an itemset. Let $\ell$ be the number of events (items) in a sequence. A sequence with length $\ell$ is called an $\ell$-sequence. For example, $AB(AE)CB$ is a 6-sequence.

A sequence $S_a = \langle a_1, a_2, \ldots, a_m \rangle$ is contained in another sequence $S_b = \langle b_1, b_2, \ldots, b_n \rangle$ if there exist integers $1 \le i_1 < i_2 < \cdots < i_m \le n$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_m \subseteq b_{i_m}$. If sequence $S_a$ is contained in sequence $S_b$, $S_a$ is called a subsequence of $S_b$ and $S_b$ is called a supersequence of $S_a$, denoted as $S_a \subseteq S_b$.

A sequence database is denoted as $D = \{s_1, s_2, s_3, \ldots, s_{|D|}\}$, where $|D|$ is the number of sequences in $D$ and each $s_i$ $(1 \le i \le |D|)$ is a transaction of the form $\langle ID, Sequence \rangle$, where the attribute $ID$ describes the transaction information of $s_i$ over time.

The absolute support (support) of a sequence $S_a$ in a sequence database $D$ is the number of transactions of $D$ in which $S_a$ occurs, denoted as $sup_D(S_a)$. The support of a sequence is given in the notation sequence : support. For example, a sequence $AB$ with support 3 is represented as $AB : 3$.

Table 1. Example sequence database D.
ID   Sequence
1    CAA(AC)
2    AB(ABC)B
3    A(BC)ABCE
4    AB(BC)AD

Given a minimum support threshold $minSup$, a sequence $S_a$ is a frequent sequence on $D$ if $sup_D(S_a) \ge minSup$. If sequence $S_a$ is frequent and there exists no proper supersequence $S_b$ of $S_a$ with the same support, $S_a$ is called a frequent closed sequence, i.e., there does not exist a proper supersequence $S_b$ such that $S_a \subseteq S_b$ and $sup_D(S_a) = sup_D(S_b)$. The problem of mining frequent closed sequences is to find the complete set of frequent closed sequences for an input sequence database $D$ and a given minimum support threshold $minSup$.

Example 1. Consider the sequence database in Table 1. The database has five unique items, $I = \{A, B, C, D, E\}$, and four transactions, i.e., $|D| = 4$. Assume that the minimum support threshold is $minSup = 2$ (50%). If all frequent sequences of $D$ are mined with the given $minSup$, the following 32 sequences are obtained: $S_{FS}$ = {A:4, AA:4, AB:3, AC:4, (AC):2, AAB:2, AAC:2, A(AC):2, ABA:3, ABB:3, ABC:3, A(BC):3, ACA:2, ACB:2, ABAB:2, AB(BC):2, A(BC)A:2, A(BC)B:2, B:3, BA:3, BB:3, BC:3, (BC):3, BAB:2, B(BC):2, (BC)A:2, (BC)B:2, C:4, CA:3, CB:2, CC:2, CAC:2}. In contrast, mining the frequent closed sequences yields $S_{FCS}$ = {AA:4, AC:4, AAC:2, A(AC):2, ABA:3, ABB:3, ABC:3, A(BC):3, ABAB:2, AB(BC):2, A(BC)A:2, A(BC)B:2, CA:3, CAC:2}, which has only 14 sequences. The frequent closed sequences $S_{FCS}$ are thus more compact than the general frequent sequences $S_{FS}$. This is because a subsequence $S_a$ with the same support as a supersequence $S_b$ is absorbed by $S_b$ without affecting the mining results. For example, sequence $(BC)A : 2$ is absorbed by sequence $A(BC)A : 2$ because $(BC)A \subseteq A(BC)A$ and $sup_D((BC)A) = sup_D(A(BC)A) = 2$.

At first, the frequent sequences with length 1 are mined from the sequence database. After that, these frequent sequences are combined with (or extended by) each other to form new candidates with length 2. This process is repeated until no new frequent sequences are generated. In general, the sequences with length $k$ are used to generate sequences with length $k + 1$. Besides the generation of candidates, the check for frequent closed sequences is applied in each step. The following definitions are used in the process of extending sequences and checking frequent closed sequences.

Definition 1 (substring of a sequence). Let $S$ be a sequence. $sub_{i,j}(S)$ $(i \le j)$ is defined as the substring of length $(j - i + 1)$ from position $i$ to position $j$ of $S$. For example, $sub_{1,3}(BABC)$ is $BAB$ and $sub_{4,4}(BABC)$ is $C$.

Definition 2 (extending a sequence from a 1-sequence). Let $\alpha$ and $\beta$ be two frequent 1-sequences. $\{t_\alpha : p_\alpha\}$ and $\{t_\beta : p_\beta\}$ are the transactions and positions of sequences $\alpha$ and $\beta$, respectively. There are two forms of sequence extension.

Itemset extension: $\langle(\alpha\beta)\rangle\{t_\beta : p_\beta\}$, if $(\alpha < \beta) \wedge (t_\alpha = t_\beta) \wedge (p_\alpha = p_\beta)$. (2.1)

Sequence extension: $\langle\alpha\beta\rangle\{t_\beta : p_\beta\}$, if $(t_\alpha = t_\beta) \wedge (p_\alpha < p_\beta)$. (2.2)

Definition 3 (extending a sequence from a k-sequence). Let $\alpha$ and $\beta$ be two frequent k-sequences $(k > 1)$, $u = sub_{k,k}(\alpha)$, and $v = sub_{k,k}(\beta)$. $\{t_\alpha : p_\alpha\}$ and $\{t_\beta : p_\beta\}$ are the transactions and positions of sequences $\alpha$ and $\beta$, respectively. There are two forms of sequence extension.

Itemset extension: $\alpha +_i \beta = sub_{1,k-1}(\alpha)(uv)\{t_\beta : p_\beta\}$, if $(u < v) \wedge (t_\alpha = t_\beta) \wedge (p_\alpha = p_\beta) \wedge (sub_{1,k-1}(\alpha) = sub_{1,k-1}(\beta))$. (3.1)

Sequence extension: $\alpha +_s \beta = \alpha v\{t_\beta : p_\beta\}$, if $(t_\alpha = t_\beta) \wedge (p_\alpha < p_\beta) \wedge (sub_{1,k-1}(\alpha) = sub_{1,k-1}(\beta))$. (3.2)

Definition 4. Let $S = e_1 e_2 \cdots e_n$. An item $e'$ can be added to form a pattern extension $S'$ of $S$ in one of three positions:

$S' = e_1 e_2 \cdots e_n e'$ and $sup_D(S') = sup_D(S)$; (4.1)

$\exists i$ $(1 \le i < n)$ such that $S' = e_1 e_2 \cdots e_i e' \cdots e_n$ and $sup_D(S') = sup_D(S)$; (4.2)

$S' = e' e_1 e_2 \cdots e_n$ and $sup_D(S') = sup_D(S)$. (4.3)

In (4.1), item $e'$ appears after $e_n$, so item $e'$ is called a forward-extension and $S'$ is called a forward-extension sequence. For example, sequence $AC : 4$ is a forward-extension of sequence $A : 4$ because $C$ is extended after sequence $A$ and the support of both is 4. In (4.2) and (4.3), item $e'$ appears before $e_n$, so item $e'$ is called a backward-extension and $S'$ is called a backward-extension sequence. For example, sequence $CAC : 2$ is a backward-extension of sequence $CC : 2$ because $A$ is extended in the middle of sequence $CC$ and the support of both is 2.

Definition 5. Let $S = e_1 e_2 \cdots e_n$. The starting position of sequence $S$ is the position of the first appearance of itemset $e_1$. For example, in the sequence $AB(ABC)CB$, the starting position of sequence $(ABC)$ is 3, and that of sequence $ABB$ is 1.

3. Related work

Mining frequent sequences was first proposed in 1995 by Agrawal and Srikant with their AprioriAll algorithm, which is based on the Apriori property. Agrawal and Srikant then generalized the mining problem with the GSP algorithm (Srikant and Agrawal, 1996). Since then, many frequent sequence
mining algorithms have been proposed to improve mining efficiency. The algorithms use various approaches for organizing data and storing mined information. Typical algorithms include SPADE (Zaki, 2001), PrefixSpan (Pei et al., 2001), SPAM (Ayres et al., 2002), and LAPIN-SPAM (Yang and Kitsuregawa, 2005). The SPAM algorithm organizes data in a vertical bitmap format and uses a dictionary tree structure to store mined information. PrefixSpan uses database projection for sequence extension to reduce the search space, with the data presented horizontally. The LAPIN-SPAM algorithm uses a list to store the final positions of items and a set of boundary positions of the prefix to reduce the scope of the search space.

Various algorithms have been proposed for mining non-redundant frequent sequences to reduce the required storage space and runtime for mining rules. Frequent closed sequence mining and frequent closed itemset mining algorithms include A-CLOSE (Pasquier et al., 1999), CLOSET (Pei et al., 2000), CHARM (Zaki and Hsiao, 2002), and CLOSET+ (Wang et al., 2003). Most of these algorithms maintain the mined frequent itemsets in order to test frequent closed sequences, which requires a lot of memory. CLOSET+ uses a two-level hash-index structure and a tree structure for storing the itemsets to reduce the memory space and the time required for testing closed itemsets. CloSpan (Yan et al., 2003) uses a maintain-and-test method and combines a hash-index structure with a tree structure for storing sequences. This algorithm prunes patterns using techniques such as Common Prefix and Backward Sub-Pattern to reduce the search space. The ClaSP algorithm (Gomariz et al., 2013) uses a vertical database format strategy, as done by the SPADE algorithm, and a heuristic to prune non-closed sequences, as done by the CloSpan algorithm. However, the algorithm maintains previous candidates to test the closure of
sequences and removes them later. The maintenance of candidates increases memory consumption, and the number of test candidates increases with the number of generated frequent closed sequences. To overcome these problems, the BIDE algorithm (Wang et al., 2007) does not keep track of historical frequent closed sequences for checking the closure of new patterns. Instead, it uses bi-directional extension techniques to examine frequent closed patterns as candidates before extending a sequence. Moreover, the algorithm uses a BackScan process to identify candidates that cannot be extended, which reduces mining time. The algorithm uses pseudo-projection techniques to reduce database storage space and is efficient for low support thresholds. However, in the process of mining, it has to project and scan databases many times for each prefix, making it inefficient.

4. Proposed algorithm

This section describes the proposed CloFS-DBV algorithm, which uses a dynamic bit vector (DBV) structure combined with the location information of transactions, stored in the CloFS-DBVPattern structure, to mine frequent closed sequences.

4.1. DBV data structure

Sequence mining algorithms based on a vertical data format have proven to be more efficient than those based on a horizontal data format. Typical algorithms that use a vertical format include SPADE (Zaki, 2001), DISC-all (Chiu et al., 2004), HVSM (Song et al., 2005), and MSGPs (Pham et al., 2012). These algorithms scan the database only once and calculate the support of a sequence quickly. However, their disadvantage is that they consume much more memory to store additional information. BitTableFI (Dong and Han, 2007) and Index-BitTableFI (Song et al., 2008) have addressed this problem by compressing data using a bit table (BitTable). The main drawback of the bit vector structure is its fixed size, which depends on the number of transactions in the sequence database. ‘1’ indicates that the item appears in the transaction and ‘0’ indicates otherwise. In practice,
there are usually many ‘0’ bits in a bit vector, because items often appear only sparsely and randomly across the sequence database. In addition, during the extension of sequences (using bitwise AND), even more ‘0’ bits appear. This increases the required memory and processing time. To overcome this problem, the dynamic bit vector architecture is used (Vo et al., 2012).

Let $A$ and $B$ be two bit vectors, and let $p_1$ and $p_2$ be the probabilities of ‘1’ bits in $A$ and $B$, respectively. Assume that $k$ is the probability of ‘0’ bits after joining $A$ and $B$ to get $AB$ in the sequence extension process. The probability of ‘1’ bits in the bit vector $AB$ is then $\min(p_1, p_2) - k$, where $\min(p_1, p_2)$ is the minimum of $p_1$ and $p_2$. Clearly, the probability of ‘1’ bits in $AB$ decreases while the probability of ‘0’ bits increases. Moreover, the gap between $p_1$ and $p_2$ grows quickly after several sequence extensions.

Suppose there are 16 transactions in a sequence database. An item $i$ exists in transactions 7, 9, 10, 11, and 13. The bit vector for the item $i$ needs 16 bytes, as shown in Table 2. The first non-zero byte appears at index 7. The DBV only stores the starting index and the sequence of bytes from the first non-zero byte to the last non-zero byte, as shown in Table 3. Only the bytes from index 7 to index 13, together with the starting index, need to be stored using the DBV structure.

Table 2. Example of a 16-byte bit vector.
Index   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16
Byte    0  0  0  0  0  0  1  0  1  1   1   0   1   0   0   0

Each DBV consists of two parts: (1) Start bit: the position of the first appearance of ‘1’; and (2) Bit vector: the sequence of bits from the first non-zero byte to the last non-zero byte. The DBV structure is used to store transactions in a vertical format. Sequence supports can easily be calculated by counting the number of ‘1’ bits.

Example 2. Consider database $D$ in Table 1. Sequence $A$ exists in transactions 1, 2, 3, and 4, so the start bit is 1 and the bit vector is 1111. The bit vector has four ‘1’ bits, so the support of sequence $A$ is 4. Sequence $B$ exists in
transactions 2, 3, and 4, so the start bit is 2 and the bit vector is 111. The bit vector has three ‘1’ bits, so the support of sequence $B$ is 3. Table 4 shows the conversion of database $D$ in Table 1 to DBV format.

Table 3. Conversion of the bit vector in Table 2 to DBV.
Start index   7
Bit vector    1  0  1  1  1  0  1

Table 4. Conversion of database D in Table 1 to DBV format.
Item   ID            Bit vector   DBV start bit   DBV bit vector   DBV value
A      1, 2, 3, 4    1111         1               1111             15
B      2, 3, 4       0111         2               111              7
C      1, 2, 3, 4    1111         1               1111             15
D      4             0001         4               1                1
E      3             0010         3               1                1

4.2. CloFS-DBVPattern data structure

The CloFS-DBVPattern structure combines a DBV structure with a representation of sequence information. Each CloFS-DBVPattern consists of two parts: (1) Sequence: the sequence information; and (2) BlockInfo: a DBV and, for each transaction, a list of the positions at which the sequence appears. The position list of each transaction is represented in the form startPos : {list positions}, where startPos is the first appearance of the sequence in that transaction.

Example 3. In database $D$ (Table 1), sequence $A$ exists in transactions 1, 2, 3, and 4. For the first transaction, sequence $A$ appears at positions {2, 3, 4}. The starting position is 2, and thus 2 : {2, 3, 4} is stored. For the second transaction, sequence $A$ appears at positions {1, 3}. The starting position is 1, and thus 1 : {1, 3} is stored. For the third transaction, sequence $A$ appears at positions {1, 3}. The starting position is 1, and thus 1 : {1, 3} is stored. Similarly, for the last transaction, sequence $A$ appears at positions {1, 4} and the starting position is 1, thus 1 : {1, 4} is stored. Table 5 presents the CloFS-DBVPattern for sequence $A$ in Table 1.

Table 5. CloFS-DBVPattern for sequence A in Table 1.
Sequence: A    Start bit: 1    Value: 15
Index       4          3          2          1
Positions   1:{1,4}    1:{1,3}    1:{1,3}    2:{2,3,4}

The CloFS-DBV tree is used to store CloFS-DBVPatterns. The CloFS-DBV tree is an extension of the prefix tree. The prefix tree can be constructed in the following way. The root node of the tree is at the
top level and labeled NULL. Recursively, each node $X$ at level $k$ in the tree can be extended by adding one item to get a child node at level $k + 1$. The children of node $X$ are generated and arranged in lexicographical order. By using the prefix tree, the generation of sequence rules becomes more efficient. Typical algorithms for building a prefix tree include CloGen (Pham et al., 2013), IMSR_PreTree (Van et al., 2014), and MNSR_PreTree (Pham et al., 2014). In the CloFS-DBV tree, each node is a CloFS-DBVPattern: a sequence, a DBV, and a list of positions of the sequence in each transaction. Each node in the tree is extended in two forms: sequence extension and itemset extension. Fig. 1 shows the candidates for the database in Table 1 obtained using the CloFS-DBV algorithm.

[Fig. 1. CloFS-DBV tree for the database in Table 1. Shaded rectangles represent candidates that are not closed. Unshaded rectangles represent frequent closed sequences. Lines with symbol s indicate sequence extension. Lines with symbol i indicate itemset extension.]

4.3. CloFS-DBV algorithm

Proposition 1 (checking sequence closure). If there exists a sequence $S_b$ that is a forward-extension or backward-extension of sequence $S_a$, sequence $S_a$ is not closed, and $S_a$ can be safely absorbed by $S_b$.

Considering the above example, suppose that $S_a = CC : 2$ and $S_b = CAC : 2$. Then, $CC : 2$ is absorbed by $CAC$ because $CC \subseteq CAC$ and $sup_D(CC) = sup_D(CAC) = 2$.

Proposition 2 (pruning a prefix). Consider a prefix $S_p = e_1 e_2 \cdots e_n$. If there exists an item $e'$ before the starting position of prefix $S_p$ in each of the transactions containing $S_p$ in sequence database $D$, the extension of prefix $S_p$ can be pruned.

For example, consider the database $D$ in Table 1. There is no need to extend prefix $B$ because there exists a sequence $A$ that occurs before $B$ in each transaction that contains prefix $B$. If prefix $B$ were extended, the results obtained would be absorbed, because the extensions of prefix $A$ already contain $B$ and have the same support.

The CloFS-DBV algorithm consists of four main phases: (1) conversion of the sequence database to the CloFS-DBVPattern structure, (2) examination of the closure of frequent sequences, (3) early pruning of prefix sequences, and (4) extension of sequences. Since CloFS-DBV uses the CloFS-DBVPattern structure, it can check the backward-extension and forward-extension quickly. For each transaction, CloFS-DBV considers only the start position or the last position of the sequence. Therefore, if the sequence occurs in $N$ transactions, CloFS-DBV takes only $N$ operations to check each candidate. In contrast, the BIDE algorithm, which is more efficient than CloSpan in almost all cases (Wang et al., 2007), uses a local database to check the backward-extension and a projected local database to check the forward-extension, i.e., it has to scan each item of each transaction in this database. Let $k$ be the sequence length and $N$ be the number of transactions of the sequence. BIDE thus requires $k \times N$ operations to check each candidate.

Table 6. CloFS-DBV algorithm.
Method: CloFS-DBV (D, minSup)
Input: A sequence database D and a support threshold minSup
Output: A complete set of frequent closed sequences FCS
1   Let FCS.root = NULL;
2   Let fcs_1 = { CloFS-DBVPattern(i) | i ∈ I ∧ sup(i) ≥ minSup };
3   Sort (fcs_1) in increasing order by item i;
4   Add fcs_1 to child node of FCS.root;
5   For (each child node subNode in FCS.root)
6       Call DBV-Pattern-Extension (subNode, minSup);
7   End For

Table 7. DBV-Pattern-Extension algorithm.
Method: DBV-Pattern-Extension (root, minSup)
Input: A root of prefix tree root and a minSup
Output: A set of frequent closed sequences root
1   Let list_node = child node of root;
2   For (each Sp in list_node)
3       If (Sp is not pruned) then
4           For (each Sa in list_node)
5               If (sup (Let Spa = Sequence-Extension (Sp, Sa)) ≥ minSup) then
6                   Add Spa to child node of Sp;
7               End If
8               If (sup (Let Spa = Itemset-Extension (Sp, Sa)) ≥ minSup) then
9                   Add Spa to child node of Sp;
10              End If
11          End For
12          Call DBV-Pattern-Extension (Sp, minSup)
13      End If
14      If (Sp is not a frequent closed sequence) then
15          Let Sp = NULL;
16      End If
17  End For

Table 6 shows the pseudocode of the proposed CloFS-DBV algorithm. The algorithm first scans database $D$ to find the frequent 1-sequences and stores them in $fcs_1$ as CloFS-DBVPatterns (line 2). Then, the items in $fcs_1$ are sorted in ascending order (line 3) to reduce the number of steps in the itemset extension phase. On line 6, the algorithm performs the sequence extension for the child nodes of FCS.root.

Table 7 shows the DBV-Pattern-Extension algorithm called by the CloFS-DBV algorithm. The extension is performed in two forms: sequence extension (line 5) and itemset extension (line 8). Before sequence extension, the algorithm tests and eliminates prefixes that cannot be extended to frequent closed sequences using Proposition 2 (line 3). The process executes recursively (line 12) until no frequent closed sequences are generated. Line 14 uses Proposition 1 to check the prefix $S_p$. If $S_p$ is not a frequent closed sequence, it is set to NULL.

Example 4. This example demonstrates sequence extension for the CloFS-DBV algorithm with sequence database $D$ in Table 1 and $minSup = 2$ (50%). After line 2 (Table 6) is executed, three frequent 1-sequences are stored, i.e., $fcs_1 = \{A : 4, B : 3, C : 4\}$ (Table 8). In this example, prefix $A$ is not a closed sequence after the backward-extension process and prefix $B$ can be pruned after the prefix pruning process. The algorithm performs sequence extension to create new frequent closed 2-sequences. Starting with prefix $A$, the extension proceeds with sequences $A$, $B$, and $C$ in the forms of sequence extension (Table 9a) and itemset extension (Table 9b). The positions of itemset $(AB)$ at indices 3 and 4 are empty, so the bits corresponding to those positions are set to ‘0’ and this itemset is removed ($sup_D((AB)) = 1 < minSup$). The extension process continues until no candidate is generated. The results obtained are shown in Fig. 1.

Table 8. Sequences A, B, and C in the sample database after conversion to CloFS-DBVPattern.
Sequence   Start bit   Value   Index 4    Index 3    Index 2      Index 1
A          1           15      1:{1,4}    1:{1,3}    1:{1,3}      2:{2,3,4}
B          2           7       2:{2,3}    2:{2,4}    2:{2,3,4}    ∅
C          1           15      3:{3}      2:{2,5}    3:{3}        1:{1,4}

Table 9. Example of (a) sequence extension and (b) itemset extension for prefix A.

(a) Sequence-extension

Sequence AA (start bit 1, value 15)
Index          4          3          2          1
Positions A    1:{1,4}    1:{1,3}    1:{1,3}    2:{2,3,4}
Positions A    1:{1,4}    1:{1,3}    1:{1,3}    2:{2,3,4}
Positions AA   1:{4}      1:{3}      1:{3}      2:{3,4}

Sequence AB (start bit 2, value 7)
Index          4          3          2            1
Positions A    1:{1,4}    1:{1,3}    1:{1,3}      2:{2,3,4}
Positions B    2:{2,3}    2:{2,4}    2:{2,3,4}    ∅
Positions AB   1:{2,3}    1:{2,4}    1:{2,3,4}    ∅

Sequence AC (start bit 1, value 15)
Index          4          3          2          1
Positions A    1:{1,4}    1:{1,3}    1:{1,3}    2:{2,3,4}
Positions C    3:{3}      2:{2,5}    3:{3}      1:{1,4}
Positions AC   1:{3}      1:{2,5}    1:{3}      2:{4}

(b) Itemset-extension

Sequence (AB) (start bit 2, value 1)
Index            4          3          2            1
Positions A      1:{1,4}    1:{1,3}    1:{1,3}      2:{2,3,4}
Positions B      2:{2,3}    2:{2,4}    2:{2,3,4}    ∅
Positions (AB)   ∅          ∅          3:{3}        ∅

Sequence (AC) (start bit 1, value 3)
Index            4          3          2          1
Positions A      1:{1,4}    1:{1,3}    1:{1,3}    2:{2,3,4}
Positions C      3:{3}      2:{2,5}    3:{3}      1:{1,4}
Positions (AC)   ∅          ∅          3:{3}      4:{4}

5. Experimental results

Experiments were performed to evaluate the proposed algorithm. All algorithms were implemented and run on a personal computer with an Intel Core Duo 2.0-GHz CPU and GB of RAM running Windows 8.1. The BIDE and CloSpan algorithms, currently well-known state-of-the-art methods, were used for comparison. The databases used for comparison were generated using the IBM synthetic data generator. The definitions of the parameters used to generate the databases are shown in Table 10.

Table 10. Definitions of parameters for standard databases from IBM.
Parameter   Definition
C           Average number of itemsets per sequence
T           Average number of items per itemset
S           Average number of itemsets in maximal sequences
I           Average number of items in maximal sequences
N           Number of distinct items
D           Number of sequences

The comparisons of runtime and memory usage were performed on three databases: C6T5S4I4N1kD10k, T10I4D100k, and N10kD10k.

[Fig. 2. Comparison of runtime for various minSup values for (a) C6T5S4I4N1kD10k, (b) T10I4D100k, and (c) N1kD10k databases.]

[Fig. 3. Comparison of memory usage for various minSup values for (a) C6T5S4I4N1kD10k, (b) T10I4D100k, and (c) N1kD10k databases.]

First, experiments were conducted to compare the execution time of the three algorithms. The results are shown in Fig. 2. Fig. 2a shows the runtimes for minSup values of 6% to 10% for the C6T5S4I4N1kD10k database, Fig. 2b shows those for minSup values of 3.5% to 5.5% for the T10I4D100k database, and Fig. 2c shows those for minSup values of 5% to 9% for the N10kD10k database. As minSup decreases, more frequent sequences are obtained, so the number of closure checks for frequent closed sequences also increases. The execution time of the algorithms therefore increases quickly. Fig. 2 shows that the execution time of the
three algorithms increases with decreasing minSup, with CloFS-DBV being faster in all cases. For example, Fig. 2a shows the mining time of CloFS-DBV, CloSpan, and BIDE. With minSup = 6%, the mining time of BIDE is 7432 ms, that of CloSpan is 7825 ms, and that of CloFS-DBV is 5040 ms. Most of the execution time of CloFS-DBV is spent in the first stage of the mining process, in which CloFS-DBV scans the sequence database to construct a CloFS-DBVPattern structure for each item. After that stage, the algorithm takes very little time because its operations mostly work on bit manipulation.

Next, experiments were conducted to compare the total memory usage (in MB) of the three algorithms. Fig. 3 shows the memory usage of the three algorithms for various minSup values. Fig. 3a–c shows the memory usage for the C6T5S4I4N1kD10k, T10I4D100k, and N10kD10k databases, respectively. With decreasing minSup, the number of generated candidates and the required memory increase for the three algorithms. CloFS-DBV requires less storage space than BIDE or CloSpan due to its use of a compressed data structure. For example, Fig. 3c shows the total memory usage of CloFS-DBV, CloSpan, and BIDE for the N1kD10k database. With minSup = 5%, the total memory usage of BIDE is 29.5 MB, that of CloSpan is 60.2 MB, and that of CloFS-DBV is 7.37 MB. The total memory usage of CloFS-DBV is lower than that of BIDE or CloSpan because CloFS-DBV uses a DBV data structure and stores the needed information in the mining process. In the mining process, CloFS-DBV uses neither a hash table nor database projection, as CloSpan and BIDE do. Moreover, extending sequences, counting the support of sequences, and the other operations of CloFS-DBV are mainly based on bit manipulation, so the algorithm consumes less memory in the process.

6. Conclusion and future work

This paper proposed the CloFS-DBV algorithm, which uses DBVs and transaction information to mine frequent closed sequences. The CloFS-DBV algorithm is divided into two main stages: (1) the
original sequence database is transformed into a vertical data format called CloFS-DBVPattern, where each CloFS-DBVPattern stores the positions of the frequent closed sequences that appear in the database; (2) frequent closed sequences are generated and tested, and prefixes are pruned early. The CloFS-DBV algorithm scans the database only once and calculates the supports based on the DBV to generate new patterns. Due to its use of a compressed structure, the CloFS-DBV algorithm is more efficient than the BIDE and CloSpan algorithms in terms of memory usage and runtime.

The CloFS-DBV algorithm has a few limitations that will be addressed in the future. Frequent closed inter-sequences will be mined to reduce the number of redundant patterns. Based on mining frequent closed inter-sequences, the generation of rules will be made more compact and efficient. In addition, mining maximal frequent sequences has been proposed in recent years (Guan et al., 2005; García-Hernández et al., 2006; Lin et al., 2007; Fournier-Viger et al., 2013). The DBV data structure will be applied for the efficient mining of such sequences.

Acknowledgment

This work was funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under Grant no. 102.05-2013.20.

References

Agrawal, R., Srikant, R., 1995. Mining sequential patterns. In: Proceedings of the IEEE International Conference on Data Engineering, pp. 3–14.
Ayres, J., Gehrke, J., Yiu, T., Flannick, J., 2002. Sequential pattern mining using a bitmap representation. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, pp. 429–435.
Chiu, D.Y., Wu, Y.H., Chen, A.L.P., 2004. An efficient algorithm for mining frequent sequences by a new strategy without support counting. In: Proceedings of the 20th International Conference on Data Engineering, pp. 375–386.
Dong, J., Han, M., 2007. BitTableFI: an efficient mining frequent itemsets algorithm. Knowl.-Based Syst. 20 (4), 329–335.
Fournier-Viger, P., Wu, C.W., Tseng, V.S., 2013. Mining maximal sequential patterns without candidate maintenance. In: Advanced Data Mining and Applications, Lecture Notes in Computer Science, vol. 8346, pp. 169–180.
Gomariz, A., Campos, M., Marin, R., Goethals, B., 2013. ClaSP: an efficient algorithm for mining frequent closed sequences. In: Advances in Knowledge Discovery and Data Mining, LNAI, vol. 7818, pp. 50–61.
Guan, E.Z., Chang, X.Y., Wang, Z., Zhou, C.G., 2005. Mining maximal sequential patterns. In: Proceedings of the Second International Conference on Neural Networks and Brain, pp. 525–528.
García-Hernández, R.A., Martínez-Trinidad, J.F., Carrasco-Ochoa, J.A., 2006. A new algorithm for fast discovery of maximal sequential patterns in a document collection. In: Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science, vol. 3878, pp. 514–523.
Lin, N.P., Hao, W.H., Chen, H.J., Chueh, H.E., Chang, C.I., 2007. Fast mining maximal sequential patterns. In: Proceedings of the 7th International Conference on Simulation, Modeling and Optimization, September 15–17, Beijing, China, pp. 405–408.
Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L., 1999. Discovering frequent closed itemsets for association rules. In: Proceedings of the International Conference on Database Theory (ICDT’99), pp. 398–416.
Pei, J., Han, J., Mao, R., 2000. CLOSET: an efficient algorithm for mining frequent closed itemsets. In: Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD’00), pp. 21–30.
Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., Hsu, M.-C., 2001. PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth. In: Proceedings of the International Conference on Data Engineering, pp. 215–224.
Pham, T.T., Luo, J., Hong, T.P., Vo, B., 2012. MSGPs: a novel algorithm for mining sequential generator patterns. In: Computational Collective Intelligence, Technologies and Applications, Lecture Notes in Computer Science, vol. 7654, pp. 393–401.
Pham, T.T., Luo, J., Vo, B., 2013. An effective algorithm for mining closed sequential patterns and their minimal generators based on prefix trees. Int. J. Intell. Inf. Database Syst. (4), 324–339.
Pham, T.T., Luo, J., Hong, T.P., Vo, B., 2014. An efficient method for mining non-redundant sequential rules using attributed prefix-trees. Eng. Appl. Artif. Intell. 32, 88–99.
Srikant, R., Agrawal, R., 1996. Mining sequential patterns: generalizations and performance improvements. In: Proceedings of the International Conference on Extending Database Technology, pp. 3–17.
Song, W., Yang, B., Xu, Z., 2008. Index-BitTableFI: an improved algorithm for mining frequent itemsets. Knowl.-Based Syst. 21 (6), 507–513.
Song, S., Hu, H., Jin, S., 2005. HVSM: a new sequential pattern mining algorithm using bitmap representation. In: Proceedings of the Advanced Data Mining and Applications, pp. 455–463.
Vo, B., Hong, T.P., Le, B., 2012. DBV-Miner: a dynamic bit-vector approach for fast mining frequent itemsets. Expert Syst. Appl. 39 (8), 7196–7206.
Van, T.T., Vo, B., Le, B., 2014. IMSR_PreTree: an improved algorithm for mining sequential rules based on the prefix-tree. Vietnam J. Comput. Sci. (2), 97–105.
Wang, J., Han, J., Pei, J., 2003. CLOSET+: searching for the best strategies for mining frequent closed itemsets. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’03), pp. 236–245.
Wang, J., Han, J., Li, C., 2007. Frequent closed sequence mining without candidate maintenance. IEEE Trans. Knowl. Data Eng. 19 (8), 1042–1056.
Yan, X., Han, J., Afshar, R., 2003. CloSpan: mining closed sequential patterns in large datasets. In: Proceedings of the SIAM International Conference on Data Mining, pp. 166–177.
Yang, Z., Kitsuregawa, M., 2005. LAPIN-SPAM: an improved algorithm for mining sequential pattern. In: Proceedings of the ICDE Workshops 2005, p. 1222.
Zaki, M., 2001. SPADE: an efficient algorithm for mining frequent sequences. Mach. Learn. 42 (1–2),
31–60 Zaki, M., Hsiao, C.,2002 CHARM: an efficient algorithm for closed itemset mining In: Proceedings of the SIAM International Conference on Data Mining (SDM’02), pp 457–473 ... transformed into a vertical data format called CloFS-DBVPattern, where each CloFSDBVPattern stores the position of frequent closed sequences which appear in the database; (2) frequent closed sequences. .. previous candidates to test the closure of sequences and removes them later The maintenance of candidates increases memory consumption, and the number of test candidates increases with the number of. .. dynamic bit vector (DBV) structure combined with location information in the structure of the transaction CloFSDBVPattern to mine frequent closed sequences 4.1 DBV data structure Sequence mining
