Appl Intell, DOI 10.1007/s10489-016-0765-3

Mining non-redundant sequential rules with dynamic bit vectors and pruning techniques

Minh-Thai Tran1 · Bac Le2 · Bay Vo3,4 · Tzung-Pei Hong5,6

© Springer Science+Business Media New York 2016

Abstract  Most algorithms for mining sequential rules focus on generating all sequential rules. These algorithms produce an enormous number of redundant rules, making mining inefficient in intelligent systems. In order to solve this problem, the mining of non-redundant sequential rules was recently introduced. Most algorithms for mining such rules depend on patterns obtained from existing frequent sequence mining algorithms. Several steps are required to organize the data structure of these sequences before rules can be generated. This process requires a great deal of time and memory. The present study proposes a technique for mining non-redundant sequential rules directly from sequence databases. The proposed method uses a dynamic bit vector data structure and adopts a prefix tree in the mining process. In addition, some pruning techniques are used to remove unpromising candidates early in the mining process. Experimental results show the efficiency of the algorithm in terms of runtime and memory usage.

Keywords  Data mining · Dynamic bit vector · Non-redundant rule · Sequential rule

Bay Vo: bayvodinh@gmail.com, vodinhbay@tdt.edu.vn
Minh-Thai Tran: minhthai@huflit.edu.vn
Bac Le: lhbac@fit.hcmus.edu.vn
Tzung-Pei Hong: tphong@nuk.edu.tw

1 Faculty of Information Technology, University of Foreign Languages - Information Technology, Ho Chi Minh, Vietnam
2 Department of Computer Science, University of Science, VNU-HCM, Vietnam
3 Division of Data Science, Ton Duc Thang University, Ho Chi Minh, Vietnam
4 Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh, Vietnam
5 Department of CSIE, National University of Kaohsiung, Kaohsiung, Taiwan, Republic of China
6 Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan, Republic of China

1 Introduction

The goal of sequential rule mining is to find the relationships between occurrences of sequential items in sequence databases. A sequential rule is expressed in the form X → Y; i.e., if X occurs in a sequence of a database, then Y also occurs in that sequence, following X, with high confidence. In general, the mining process is divided into two main phases: (1) mining of frequent sequences and (2) generation of sequential rules based on those sequences. Since Agrawal and Srikant proposed the AprioriAll algorithm [1], the mining of frequent sequences has been widely studied. The mining of frequent sequences is a necessary step before the generation of sequential rules, and thus researchers mainly focus on improving the efficiency of this step. Several algorithms that use different strategies for data organization, data structures, and mining techniques have been proposed. Algorithms for mining sequential rules include Full [15] and MSR_PreTree [18].

With a large sequence database, the number of frequent sequences is very large, which affects the efficiency of mining sequential rules. Some scholars have thus attempted to remove sequences that do not affect the final mining results in order to make the set of sequences compact. Examples include the mining of frequent closed sequences and the mining of non-redundant sequential rules. Typical algorithms for mining frequent closed sequences are CloSpan [21], BIDE [20] and CloGen [11]. Efficient algorithms for mining non-redundant sequential rules include CNR [8] and MNSR_PreTree [12].
However, these algorithms generate sequential rules based on the results of existing frequent sequence mining algorithms. Thus, they depend entirely on the data structure of the mined frequent sequences. Some algorithms build a prefix tree of frequent sequences before generating sequential rules.

The present study proposes the NRD-DBV algorithm for mining non-redundant sequential rules based on dynamic bit vectors and pruning techniques. The algorithm adopts a prefix tree and uses a dynamic bit vector structure to compress the data. It traverses the search space efficiently in a depth-first order while pruning prefixes. The pruning techniques reduce the storage and execution time required to mine non-redundant sequential rules directly from sequence databases.

The rest of this paper is organized as follows. Section 2 defines the problem. Section 3 summarizes related work. Section 4 presents the proposed algorithm. Section 5 shows the experimental results. Conclusions and suggestions for future work are given in Section 6.

2 Problem definitions

Consider a sequence database with a set I of distinct events, where I = {i1, i2, i3, ..., in}, ij is an event (or an item), and 1 ≤ j ≤ n. A set of unordered events is called an itemset. Each itemset is written in brackets; for example, (ABC) represents an itemset with three items, namely A, B, and C. The brackets are omitted to simplify the notation for itemsets with only a single item; for example, the notation B represents an itemset containing only item B. A sequence S = ⟨e1, e2, e3, ..., em⟩ is an ordered list of events, where ej is an itemset and 1 ≤ j ≤ m. The size of a sequence is the number m of itemsets in the sequence. The length of a sequence is the number of items in the sequence. A sequence with length k is called a k-sequence.

Definition 1 (Subsequence and supersequence) Let Sa = ⟨a1, a2, ..., am⟩ and Sb = ⟨b1, b2, ..., bn⟩ be two sequences. The sequence Sa is a subsequence of Sb if there exist m integers i1, ..., im with 1 ≤ i1 < i2 < ... < im ≤ n such that a1 ⊆ bi1, a2 ⊆ bi2, ..., am ⊆ bim. In this case, Sb is also called a supersequence of Sa, denoted as Sa ⊆ Sb.

Definition 2 (Sequence database) A sequence database D is composed of sequences and is denoted as D = {s1, s2, s3, ..., s|D|}, where |D| is the number of sequences in D and si (1 ≤ i ≤ |D|) is the i-th sequence in D. For example, the database D in Table 1 includes five sequences, i.e., |D| = 5.

Definition 3 (Support of a sequence) The support of a sequence Sa in a sequence database D is the number of sequences in D that contain at least one occurrence of Sa, divided by |D|, and is denoted as sup(Sa). A sequence Sa with support sup(Sa) is written in the form Sa: sup(Sa) to simplify the notation. For example, in Table 1, the sequence (AC) appears in three sequences; thus the support of (AC) is 60 %, denoted (AC): 60 %.

Definition 4 (Frequent sequence) Given a minimum support threshold minSup, a sequence Sa is called a frequent sequence in D if sup(Sa) ≥ minSup. The problem of mining frequent sequences is to find the complete set of frequent subsequences for an input sequence database D and a given minimum support threshold minSup.

Definition 5 (Frequent closed sequence) Let Sa be a frequent sequence. Sa is called a frequent closed sequence if there is no frequent sequence Sb ≠ Sa such that Sa ⊆ Sb and sup(Sa) = sup(Sb). Different from the problem of mining frequent sequences, the problem of mining frequent closed sequences is to find the complete set of frequent closed sequences for an input sequence database D and a given minimum support threshold minSup.
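As a concrete reading of Definitions 1, 3, and 4, the short Python sketch below (the function names and the data layout are ours, not part of the paper) tests subsequence containment and computes the support of a pattern over the example database D of Table 1.

    from fractions import Fraction

    def is_subsequence(sa, sb):
        # Definition 1: each itemset of sa must be contained, in order,
        # in some itemset of sb (greedy left-to-right matching suffices).
        j = 0
        for itemset in sa:
            while j < len(sb) and not itemset <= sb[j]:
                j += 1
            if j == len(sb):
                return False
            j += 1
        return True

    def support(seq, database):
        # Definition 3: fraction of data sequences containing seq.
        return Fraction(sum(is_subsequence(seq, s) for s in database),
                        len(database))

    # Example database D of Table 1; each sequence is a list of itemsets.
    D = [
        [{'A'}, {'A'}, {'A', 'C'}],             # 1: AA(AC)
        [{'A'}, {'A', 'C'}],                     # 2: A(AC)
        [{'A'}, {'B', 'C'}, {'C'}],              # 3: A(BC)C
        [{'A'}, {'B'}, {'B', 'C'}, {'A', 'C'}],  # 4: AB(BC)(AC)
        [{'A'}, {'B'}, {'A', 'B'}],              # 5: AB(AB)
    ]

    print(support([{'A', 'C'}], D))    # 3/5, i.e., (AC): 60 %
    print(support([{'A'}, {'C'}], D))  # 4/5, i.e., AC: 80 %

With minSup = 50 %, both patterns in the example would be frequent by Definition 4.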
Frequent closed sequences are more compact than general frequent sequences because a subsequence Sa that has the same support as its supersequence Sb is absorbed by Sb without affecting the mining results. For example, in Table 1, the sequence A(BC) is absorbed by the sequence A(BC)C because A(BC) ⊆ A(BC)C and sup(A(BC)) = sup(A(BC)C) = 40 %.

Table 1  Example sequence database

  ID   Sequence
  1    AA(AC)
  2    A(AC)
  3    A(BC)C
  4    AB(BC)(AC)
  5    AB(AB)

Definition 6 (Substring of a sequence) Let S be a sequence. A substring of S, denoted as subi,j(S) (i ≤ j), is the segment from position i to position j of S. Its length is (j − i + 1). For example, sub1,2(AA(AC)) is AA and sub4,4(AA(AC)) is C.

Definition 7 (Concatenation) Let Sa and Sb be two sequences. The sequence Sa + Sb denotes the concatenation of Sa and Sb, obtained by appending Sb after Sa. For example, AB + AC = ABAC.

Definition 8 (Sequential rule) A sequential rule r is denoted by pre → post (sup, conf), where pre and post are frequent sequences, and sup and conf are the support and confidence values of r, respectively, with sup = sup(pre + post) and conf = sup(pre + post) / sup(pre).

Definition 9 (Frequent sequential rule and strong sequential rule) Given a minimum support threshold minSup and a minimum confidence threshold minConf, a rule whose support value is higher than or equal to minSup is a frequent sequential rule, and a rule whose confidence value is higher than or equal to minConf is a strong sequential rule. For each frequent sequence f of size k, (k − 1) rules can be generated. For example, from the frequent sequence A(BC)C, the two possible rules are A → (BC)C and A(BC) → C.

Definition 10 (Rule inference and redundant rule) Let D be a sequence database, Si be the i-th sequence in D (1 ≤ i ≤ |D|), and r1 and r2 be two sequential rules. r1 infers r2 if and only if both of the following conditions hold: (1) for every Si ∈ D (1 ≤ i ≤ |D|), if r1.pre + r1.post ⊆ Si then r2.pre + r2.post ⊆ Si, and (2) sup(r1) = sup(r2) ∧ conf(r1) = conf(r2). A sequential rule is said to be redundant if it can be inferred by another rule. For example, assume that the two rules r1: A → (BC)C and r2: A → C have the same support and confidence values; r2 is then redundant since it can be inferred by r1.

Definition 11 (Prefixed generator) A frequent sequence P is a prefixed generator if there is no other sequence P′ such that P′ ⊆ P and sup(P′) = sup(P).

Definition 12 (Non-redundant rule) Based on the definitions above, a rule r: pre → post is said to be non-redundant if pre + post is a frequent closed sequence and pre is a prefixed generator.

Given the two minimum thresholds minSup and minConf, the goal of this study is to find the non-redundant sequential rules from a sequence database.
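Continuing the illustrative sketch above (and reusing its hypothetical support helper and database D), the listing below enumerates the (size − 1) candidate rules of a frequent sequence and computes their support and confidence exactly as stated in Definitions 8 and 9; it is a transcription of the definitions, not the paper's rule generator.

    def rules_from_sequence(seq, database):
        # Definitions 8-9: a frequent sequence of size m yields (m - 1)
        # candidate rules pre -> post with sup = sup(pre + post) and
        # conf = sup(pre + post) / sup(pre).
        full_sup = support(seq, database)
        rules = []
        for cut in range(1, len(seq)):
            pre, post = seq[:cut], seq[cut:]
            rules.append((pre, post, full_sup,
                          full_sup / support(pre, database)))
        return rules

    # The two rules obtainable from the frequent sequence A(BC)C:
    for pre, post, sup, conf in rules_from_sequence([{'A'}, {'B', 'C'}, {'C'}], D):
        print(pre, '->', post, 'sup =', sup, 'conf =', conf)
    # A -> (BC)C:  sup = 2/5, conf = 2/5
    # A(BC) -> C:  sup = 2/5, conf = 1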
3 Related work

In order to mine sequential rules, frequent sequences need to be mined before the rules can be generated. The mining of frequent sequences was first proposed by Agrawal and Srikant with their AprioriAll algorithm [1], based on the downward closure property. Agrawal and Srikant then generalized the mining problem with the GSP algorithm [16]. Since then, several frequent sequence mining algorithms have been proposed to improve mining efficiency. These algorithms use various approaches for organizing data and storing mined information. Most of them transform the original database into a vertical format or use the database projection technique to reduce the search space and thus the execution time. Typical algorithms include SPADE [23], PrefixSpan [9], SPAM [2], LAPIN-SPAM [22], and CMAP [5].

The mining of frequent closed sequences has also been studied. Its runtime and required storage are relatively low due to the compact representation, and the original set of frequent sequences can be entirely retrieved from the frequent closed sequences. Frequent closed sequence mining and frequent closed itemset mining algorithms include CloSpan [21], BIDE [20], ClaSP [6], and CloFS-DBV [17]. These algorithms prune non-closed sequences using techniques such as the common prefix and backward sub-pattern methods. BIDE differs from the other algorithms in that it does not keep track of previously obtained frequent closed sequences for checking the downward closure of new patterns. Instead, it uses bi-directional extension techniques to examine candidate frequent closed sequences before extending a sequence. Moreover, it uses a back-scan process to determine candidates that cannot be extended, which reduces mining time. The algorithm also uses the pseudo-projection technique to reduce database storage space, which is efficient for low support thresholds.

In addition to frequent sequence mining algorithms, many researchers have proposed sequential and non-redundant sequential rule mining algorithms. The latter have the advantage of producing a complete and compact set of rules, because only non-redundant rules are derived, reducing runtime and memory usage. For example, Spiliopoulou [15] proposed a method for generating a complete set of sequential rules from frequent sequences that removes redundant rules in a post-mining phase. Lo et al. [8] proposed the compressed non-redundant algorithm (CNR) for mining a compressed set of non-redundant sequential rules generated from two types of sequence sets: LS-Closed and CS-Closed. The premise of a rule is a sequence in the LS-Closed set and the consequence is a sequence in the CS-Closed set; the generation of sequential rules is based on a prefix tree to increase efficiency. Other typical algorithms include CloGen [11], MNSR_PreTree [12] and IMSR_PreTree [19].

Most sequential or non-redundant sequential rule mining algorithms, however, use a set of frequent sequences or frequent closed sequences mined by existing frequent sequence miners. A lot of time is required to transform or construct the frequent sequences before sequential rules can be generated. An efficient method is proposed here to mine non-redundant sequential rules directly. It uses a compressed data structure in a vertical data format and some pruning techniques to mine frequent closed sequences and generate non-redundant sequential rules directly from the sequence database.

4 Proposed algorithm

This section describes the proposed NRD-DBV algorithm, which uses the DBVPattern structure to mine frequent closed sequences. A prefix tree is used for storing all frequent closed sequences, and based on the properties of this prefix tree, non-redundant sequential rules can be generated efficiently. Before the proposed approach is discussed, the DBVPattern data structure is first described.

Sequence mining algorithms based on a vertical data format have proven to be more efficient than those based on a horizontal data format. Typical algorithms that use a vertical data format include SPADE [23], DISC-all [3], HVSM [13], and MSGPs [10]. These algorithms scan the database only once and quickly calculate the support of a sequence. However, a disadvantage is that they require a great deal of memory to store the additional information.
BitTableFI [4] and Index-BitTableFI [14] have addressed this problem by compressing the data using a bit table (BitTable). The main drawback of the bit vector structure is its fixed size, which depends on the number of transactions in the sequence database: a '1' indicates that a given item appears in the corresponding transaction and a '0' indicates that it does not. In practice, there are usually many '0' bits in a bit vector, because the items of a sequence database tend to appear in only some of the sequences. In addition, during the extension of sequences (using bitwise AND), '0' bits appear more and more often, which increases the required memory and processing time. In order to overcome this problem, a dynamic bit vector (DBV) structure is used here (Tran et al. [17]; Le et al. [7]).

Let A and B be two bit vectors, and let p1 and p2 be the probabilities of '1' bits in A and B, respectively. Let k be the probability of the additional '0' bits that appear after joining A and B to obtain AB when extending the sequence. The probability of '1' bits in the bit vector AB is then min(p1, p2) − k, where min(p1, p2) is the minimum value of p1 and p2. The probability of '1' in AB therefore decreases as that of '0' increases, and the gap between them grows quickly after several sequence extensions.

For example, suppose that there are 16 transactions in a sequence database and that an item i exists in transactions 7, 9, 10, 11, and 13. The plain bit vector for item i needs 16 bits, as shown in Table 2, and its first non-zero bit appears at index 7. The DBV stores only the starting index and the bits from the first non-zero bit to the last non-zero bit, as shown in Table 3; only 7 bits are required to store the information using the DBV structure.

Table 2  Example of the bit vector of item i for 16 transactions

  Transaction   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16
  Bit vector    0  0  0  0  0  0  1  0  1  1   1   0   1   0   0   0

Table 3  Conversion of the bit vector in Table 2 to a DBV

  Start index = 7
  DBV = {7, 1011101}

To efficiently find frequent closed sequences, the DBVPattern data structure is used in the proposed algorithm. It combines a DBV with the location information of the sequences. The DBV part stores the sequences in a vertical format, so the support of a sequence pattern can easily be calculated by counting the number of '1' bits. Each DBV consists of two parts: (1) start bit: the position of the first appearance of '1', and (2) bit vector: the sequence of bits from the first non-zero bit to the last non-zero bit. Table 4 shows the conversion of the example database D in Table 1 into the DBV format. Take item B as an example. It appears in sequences 3, 4 and 5, so the bit vector for B is (0, 0, 1, 1, 1). Since its leading '1' bit is at the third position, the DBV representation is (3, 111). Note that if a bit vector were (0, 1, 0, 1, 0), its DBV would be (2, 101).

Table 4  Conversion of database D in Table 1 into the DBV format

  Item   Sequences containing the item   Bit vector   DBV (start bit, bit vector)
  A      1, 2, 3, 4, 5                   11111        (1, 11111)
  B      3, 4, 5                         00111        (3, 111)
  C      1, 2, 3, 4                      11110        (1, 1111)

Since a pattern may appear multiple times in a sequence, its starting position and all of its appearance positions are stored in the form startPos: {list of positions}. For example, item B in sequence 4 first appears at the second position and then at the third position; the list of positions for B in that sequence is thus 2: {2, 3}, where the first number represents the first appearance position. Table 5 shows the DBVPattern of item B in the example. The index field represents the sequences whose bit is '1' in Table 4.

Table 5  DBVPattern for item B in Table 1

  Item: B   Start bit: 3

  Index (sequence ID)   List of positions
  3                     2: {2}
  4                     2: {2, 3}
  5                     2: {2, 3}
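The DBV idea can be sketched in a few lines of Python (an illustration under our own naming and a simplified bit-level encoding; the DBVPattern of the paper additionally carries the per-sequence position lists shown in Table 5). A DBV keeps the position of the first '1' bit and the bits up to the last '1' bit, the support is the number of '1' bits, and two DBVs are joined by a bitwise AND after aligning their start positions.

    class DBV:
        # Dynamic bit vector: 1-based start position of the first '1' bit
        # plus the bits from the first to the last '1' bit.
        def __init__(self, bits):
            ones = [i for i, b in enumerate(bits) if b]
            self.start = ones[0] + 1 if ones else 0
            self.bits = bits[ones[0]:ones[-1] + 1] if ones else []

        def support_count(self):
            return sum(self.bits)  # number of '1' bits

        def and_join(self, other):
            # Align the two vectors on absolute positions, then AND them.
            lo = min(self.start, other.start)
            hi = max(self.start + len(self.bits), other.start + len(other.bits))
            joined = []
            for pos in range(lo, hi):
                a = self.bits[pos - self.start] if 0 <= pos - self.start < len(self.bits) else 0
                b = other.bits[pos - other.start] if 0 <= pos - other.start < len(other.bits) else 0
                joined.append(a & b)
            return DBV([0] * (lo - 1) + joined)

    # Item B of Table 4: present in sequences 3, 4 and 5 of database D.
    b = DBV([0, 0, 1, 1, 1])
    print(b.start, b.bits, b.support_count())      # 3 [1, 1, 1] 3

    # Item C: sequences 1-4; the AND marks the sequences containing both items.
    c = DBV([1, 1, 1, 1, 0])
    bc = b.and_join(c)
    print(bc.start, bc.bits, bc.support_count())   # 3 [1, 1] 2

Note that the AND alone only tells which data sequences contain both items; the position lists of the DBVPattern are still needed to decide whether the join corresponds to an itemset extension or a sequence extension, as described next.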
The NRD-DBV algorithm consists of five main steps: (1) conversion of the sequence database to the DBVPattern structure; (2) early pruning of prefix sequences; (3) examination of the downward closure of frequent sequences; (4) sequence extension; and (5) generation of non-redundant sequential rules. The proposed algorithm uses several kinds of sequence extension.

1-sequence extension: Assume that α and β are two frequent 1-sequences represented in the DBVPattern form, and let {DBVα, pα} and {DBVβ, pβ} be the DBVs and the lists of positions of α and β, respectively. A bitwise AND operator on two DBVs over the same indices (data sequences) is defined as DBVαβ = DBVα ∧ DBVβ. There are two forms of 1-sequence extension: (a) Itemset extension: α +i β = (αβ) with {DBVαβ, pβ}, if (α < β) ∧ (pα = pβ), and (b) Sequence extension: α +s β = αβ with {DBVαβ, pβ}, if (pα < pβ).

k-sequence extension: Assume that α and β are two frequent k-sequences (k > 1) represented in the DBVPattern form. Let u = subk,k(α), v = subk,k(β), and let {DBVα, pα} and {DBVβ, pβ} represent the DBVs and the lists of positions of α and β, respectively. There are two forms of extension: (a) Itemset extension: α +i,k β = sub1,k−1(α)(uv) with {DBVαβ, pβ}, if (u < v) ∧ (pα = pβ) ∧ (sub1,k−1(α) = sub1,k−1(β)), and (b) Sequence extension: α +s,k β = αv with {DBVαβ, pβ}, if (pα < pβ) ∧ (sub1,k−1(α) = sub1,k−1(β)).

Backward extension and forward extension: Let S = e1e2···en be a sequence. An item e′ can be added to S to form a sequence S′ in one of three positions: (a) S′ = e1e2···ene′ with sup(S′) = sup(S), (b) ∃i (1 ≤ i < n) such that S′ = e1e2···eie′···en with sup(S′) = sup(S), and (c) S′ = e′e1e2···en with sup(S′) = sup(S). Case (a) is called a forward extension, and cases (b) and (c) are called backward extensions.

Fig. 1  Frequent closed sequences for database D in Table 1. Nodes with dashed borders correspond to pruned prefixes, and the shaded node corresponds to a frequent sequence that is not closed and is therefore removed. (The nodes of the tree and their supports are: A: 100 %, AA: 80 %, A(AC): 60 %, AB: 60 %, ABA: 40 %, ABB: 40 %, ABC: 40 %, AC: 80 %, (AC): 60 %, ACC: 40 %, A(BC): 40 %, A(BC)C: 40 %, B: 60 %, C: 80 %.)
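The difference between the two extension forms can be illustrated with a small sketch that works directly on per-sequence position lists (a simplification we adopt for readability; in NRD-DBV these lists accompany the DBV bit vectors of the DBVPattern, and the hypothetical occ dictionary below was read off Table 1 by hand). Itemset extension requires the two items to share an itemset position, whereas sequence extension requires the second item to occur after the first.

    # occ maps item -> {sequence id: sorted itemset positions}, from Table 1.
    occ = {
        'A': {1: [1, 2, 3], 2: [1, 2], 3: [1], 4: [1, 4], 5: [1, 3]},
        'B': {3: [2], 4: [2, 3], 5: [2, 3]},
        'C': {1: [3], 2: [2], 3: [2, 3], 4: [3, 4]},
    }

    def itemset_extend(a, b):
        # (ab): a and b occur together in the same itemset of a sequence.
        return {sid for sid in occ[a].keys() & occ[b].keys()
                if set(occ[a][sid]) & set(occ[b][sid])}

    def sequence_extend(a, b):
        # ab: b occurs in an itemset strictly after an occurrence of a.
        return {sid for sid in occ[a].keys() & occ[b].keys()
                if max(occ[b][sid]) > min(occ[a][sid])}

    print(sorted(itemset_extend('A', 'C')))   # [1, 2, 4]    -> (AC): 60 %
    print(sorted(sequence_extend('A', 'C')))  # [1, 2, 3, 4] -> AC: 80 %
    print(sorted(sequence_extend('A', 'B')))  # [3, 4, 5]    -> AB: 60 %

The supports produced by the sketch match the corresponding nodes of Fig. 1.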
In order to prune and check candidates early, the proposed approach uses the following three operations.

Checking a sequence closure: If there is a sequence Sb that is a forward extension or a backward extension of a sequence Sa, then Sa is not closed and can safely be absorbed by Sb. For example, suppose that Sa = A(BC): 40 % and Sb = A(BC)C: 40 %, where the number represents the support value. According to the cases above, Sb is a forward extension of Sa, so A(BC): 40 % is absorbed by A(BC)C: 40 %, because A(BC) ⊆ A(BC)C and sup(A(BC)) = sup(A(BC)C) = 40 %.

Pruning a prefix: Consider a prefix Sp = e1e2···en. If there is an item e′ that occurs before the starting position of Sp in each of the data sequences containing Sp in the sequence database D, then extension by prefix Sp can be pruned. Based on the starting position (startPos) of each sequence in the DBVPattern, the proposed algorithm can check this quickly by comparing the start positions of two sequences. For example, consider the database D in Table 1. There is no need to extend prefix B, because the pattern A (startPos = 1) occurs before B (startPos = 2) in each data sequence that contains prefix B. If prefix B were extended, the resulting patterns would be absorbed, since the extensions of prefix A already contain B and have the same support. Figure 1 illustrates frequent closed sequence mining using the two operations above for the example database in Table 1.

Stopping the generation of sequential rules for a subtree of a prefix: Consider three nodes n, n1, and n2, where n1 is a child node of n and n2 is a child node of n1. Since sup(n2) < sup(n1), if sup(n1) / sup(n) < minConf then sup(n2) / sup(n) < minConf. Thus, if the confidence of a rule r = pre → post is less than minConf, we can safely stop generating rules for all child nodes of post. For example, suppose that minConf = 65 % in Fig. 1; then there is no need to generate rules for the nodes ABA, ABB, and ABC (the child nodes of AB), because the confidence of the rule A → B is 60 % (less than minConf).

Table 6 shows the pseudo-code of the proposed NRD-DBV algorithm, which is based on the above principles.

Table 6  NRD-DBV algorithm: mining non-redundant sequential rules

  Algorithm: NRD-DBV (D, minSup, minConf)
  Input: Sequence database D with item set I, minSup, and minConf
  Output: Set of non-redundant sequential rules nr-SeqRule
  1   root = root node with value {NULL};
  2   nr-SeqRule = ∅;
  3   fcs = {Convert pattern i to DBVPattern | i ∈ I in D and sup(i) ≥ minSup};
  4   Add fcs as the child nodes of root;
  5   For (each child node c of root)
  6       Call ClosedPattern-Extension (c, minSup);
  7   For (each child node c of root)
  8       Call Generate-NRRule (c, minConf, nr-SeqRule);

The algorithm first scans the given sequence database D to find the frequent 1-sequences and stores them in fcs as DBVPatterns (line 3). The 1-sequences in fcs are then added to a prefix tree whose root is set to NULL (line 4). On line 6, the algorithm performs the sequence extension for each child node of the root of the tree by calling the ClosedPattern-Extension method in Table 7. After finding all frequent closed sequences, the algorithm generates all significant sequential rules by calling the Generate-NRRule method in Table 8.

Table 7  ClosedPattern-Extension method: mining frequent closed sequences

  Method: ClosedPattern-Extension (root, minSup)
  Input: Prefix tree root and minSup
  Output: Set of frequent closed sequences in prefix tree root
  9   listNode = child nodes of root;
  10  For (each Sp in listNode)
  11      If (Sp is not pruned) then
  12          For (each Sa in listNode)
  13              If (sup(Spa = Sequence-extension(Sp, Sa)) ≥ minSup) then
  14                  Add Spa as a child node of Sp;
  15              If (sup(Spa = Itemset-extension(Sp, Sa)) ≥ minSup) then
  16                  Add Spa as a child node of Sp;
  17          End For
  18          Call ClosedPattern-Extension (Sp, minSup);
  19      End If
  20      Check and set the attribute of Sp: closed, prefixed generator or NULL;
  21  End For

In Table 7, the ClosedPattern-Extension method is used to extend the sequences of a given prefix group. The method calls line 18 recursively until no more frequent closed sequences are generated. The sequence extension is performed in two forms: sequence extension (line 13) and itemset extension (line 15). Before the extension, the algorithm tests and eliminates prefixes that cannot be used to extend frequent closed sequences, using the second judgment above (line 11). If the resulting sequences are frequent, they are stored as child nodes of the prefix. The prefix Sp is then checked and marked as a frequent closed sequence or a prefixed generator using the first judgment above and Definition 11 (line 20); otherwise, it is set to NULL.

Table 8  Generate-NRRule method: generating non-redundant rules from the prefix tree

  Method: Generate-NRRule (root, minConf, nr-SeqRule)
  Input: Prefix tree root and minConf
  Output: Set of non-redundant sequential rules nr-SeqRule
  22  pre = sequence of root;
  23  subNode = child nodes of root;
  24  For (each node Sr in subNode)
  25      If (Sr is a prefixed generator) then
  26          For (each node Sn in the subtree with root Sr)
  27              r = pre → post, where pre + post = the sequence of Sn;
  28              If ((sup(r) / sup(pre)) ≥ minConf) then
  29                  nr-SeqRule = nr-SeqRule ∪ {r};
  30              Else Stop generating rules for the child nodes of Sn;
  31          End For
  32      End If
  33      Call Generate-NRRule (Sr, minConf, nr-SeqRule);
  34  End For

After finding all frequent closed sequences, the algorithm generates all significant sequential rules by calling the Generate-NRRule method in Table 8. For a prefixed generator at a node of the prefix tree, the algorithm generates all rules within the subtree for which that node is the prefix (line 25). In this process, the third judgment above is used to stop generating rules for child nodes that do not meet the minConf value (line 30). The method is executed recursively for all nodes in the prefix tree (line 33).
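The confidence-based stop (the third judgment above) can be sketched as follows. This is our own simplification over a hand-built fragment of the prefix tree of Fig. 1; the node and function names are ours, and the prefixed-generator and closure checks performed by Table 8 are deliberately omitted.

    class Node:
        def __init__(self, seq, sup, children=()):
            self.seq, self.sup, self.children = seq, sup, list(children)

    def rules_below(pre, node, min_conf, out):
        # Emit pre -> post for descendants whose confidence reaches min_conf.
        # A failing node's subtree is skipped: supports only shrink deeper in
        # the tree, so the confidence with respect to pre cannot recover.
        for child in node.children:
            conf = child.sup / pre.sup
            if conf >= min_conf:
                # child.seq is the full pattern pre + post.
                out.append((pre.seq, child.seq, child.sup, conf))
                rules_below(pre, child, min_conf, out)

    # Fragment of Fig. 1: prefix A (100 %) with AB (60 %) and AC (80 %);
    # AB has children ABA/ABB/ABC (40 % each), AC has child ACC (40 %).
    a = Node('A', 1.0, [
        Node('AB', 0.6, [Node('ABA', 0.4), Node('ABB', 0.4), Node('ABC', 0.4)]),
        Node('AC', 0.8, [Node('ACC', 0.4)]),
    ])
    out = []
    rules_below(a, a, 0.65, out)
    print(out)   # only ('A', 'AC', 0.8, 0.8); AB fails, so ABA/ABB/ABC are never visited

With minConf = 65 %, the sketch reproduces the behaviour described above: the rule A → B has confidence 60 %, so the subtree below AB is skipped entirely.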
Table 9  Relation between the number of nodes (n) and the average number of child nodes (k) in a prefix tree

  Database          minSup (%)   Number of nodes (n)
  C6T5S4I4N1kD1k    0.6          5278
                    0.7          3401
                    0.8          2320
                    0.9          1720
  C6T5S4I4N1kD10k   0.6          14541
                    0.7          10845
                    0.8          8488
                    0.9          6818
  Gazelle           0.06         87273
                    0.07         20447
                    0.08         9391
                    0.09         5706

  Average number of child nodes (k): 11, 10, 5, 5, 4, 4, ..., 171, 42, 15

Table 10  Definitions of parameters for generating databases using the IBM synthetic data generator

  C   Average number of itemsets per sequence
  T   Average number of items per itemset
  S   Average number of itemsets in maximal sequences
  I   Average number of items in maximal sequences
  N   Number of distinct items
  D   Number of sequences

Fig. 2  Comparison of runtime for (a) C6T5S4I4N1kD1k and (b) C6T5S4I4N1kD10k with various minConf values (minSup = 0.5 %)

Fig. 3  Comparison of runtime for (a) C6T5S4I4N1kD1k and (b) C6T5S4I4N1kD10k with various minSup values (minConf = 50 %)

Suppose that n is the number of nodes in a prefix tree (i.e., the complete set of frequent closed sequences) and let k be the average number of child nodes in the prefix tree. For each node that is a prefixed generator, the rule-generation process is performed once on its child nodes. Thus, based on the prefix tree structure, the rule-generation procedure is performed n × k times. However, if the set of frequent sequences is not enumerated on a prefix tree, then (n − 1) operations have to be performed for each sequence to check and generate sequential rules. The complexity of generating rules is therefore O(n × k). Since k