An efficient method for mining frequent


This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the author's institution and sharing with colleagues. Other uses, including reproduction and distribution, or selling or licensing copies, or posting to personal, institutional or third party websites are prohibited. In most cases authors are permitted to post their version of the article (e.g. in Word or TeX form) to their personal website or institutional repository. Authors requiring further information regarding Elsevier's archiving and manuscript policies are encouraged to visit: http://www.elsevier.com/authorsrights

Engineering Applications of Artificial Intelligence 27 (2014) 148–154
Journal homepage: www.elsevier.com/locate/engappai

An efficient method for mining frequent itemsets with double constraints

Hai Duong a, Tin Truong a, Bay Vo b,*
a Department of Mathematics and Computer Science, University of Dalat, Dalat, Vietnam
b Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh, Vietnam

Article history: Received 23 March 2013. Received in revised form 11 July 2013. Accepted September 2013. Available online 18 October 2013.

Abstract

Constraint-based frequent itemset mining is necessary when the needs and interests of users are the top priority. In this task, two opposite types of constraint are studied, namely anti-monotone and monotone constraints. Previous approaches have mainly mined frequent itemsets that satisfy one of these two types of constraint. Mining frequent itemsets that satisfy both types is of interest. The present study considers the problem of mining frequent itemsets with the following two conditions: they include a set $C_0$ (monotone) and contain no items of set $C'_1$ (anti-monotone), where the intersection of
$C_0$ and $C'_1$ is empty and they are changed regularly. A unique representation of frequent itemsets restricted on $C_0$ and $C'_1$ using closed itemsets and their generators is proposed. Then, an algorithm called MFS_DoubleCons is developed to quickly and distinctly generate all frequent itemsets that satisfy the constraints from the lattice of closed itemsets and generators instead of mining them directly from the database. The theoretical results are proven to be reliable. Extensive experiments on a broad range of synthetic and real databases that compare MFS_DoubleCons to dEclat-DC (a modified version of dEclat utilized to mine frequent itemsets with constraints) show the effectiveness of our approach.

© 2013 Elsevier Ltd. All rights reserved.

Keywords: Frequent itemsets; Closed frequent itemsets; Closed itemset lattice; Generators; Constraint mining

1. Introduction

Frequent itemsets play an important role in many data mining tasks such as the mining of association rules and classification. Therefore, many frequent itemset mining algorithms, such as Apriori (Agrawal and Srikant, 1994), Eclat (Zaki et al., 1997), FP-Growth (Han et al., 2000), FP-Growth* (Grahne and Zhu, 2005), BitTable-FI (Dong and Han, 2007) and Index-BitTableFI (Song et al., 2008), have been proposed. Some condensed representations of frequent itemsets, such as frequent closed itemsets and maximal frequent itemsets, have also been proposed (Burdick et al., 2001; Lin and Kedem, 2002; Pasquier et al., 1999a, 1999b, 2005; Pei et al., 2000; Vo et al., 2013; Vo et al., 2012; Wang et al., 2003; Zaki and Hsiao, 2005).

Although the set of all frequent itemsets is quite large, users only care about a small number that satisfy some given constraints. A model of constraint-based mining has thus been developed (Bayardo et al., 2000; Nguyen et al., 1998; Srikant et al., 1997). Constraints help to focus on interesting knowledge and to reduce the number of patterns extracted to those of potential interest. In addition, they are used for
decreasing the search space and enhancing the mining efficiency.

* Corresponding author. Tel.: +84 083974186. E-mail addresses: haidv@dlu.edu.vn (H. Duong), tintc@dlu.edu.vn (T. Truong), vdbay@it.tdt.edu.vn (B. Vo). 0952-1976/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.engappai.2013.09.006

Two important types of constraint have been studied, namely anti-monotone constraints (Nguyen et al., 1998), denoted as $C_{am}$, and monotone constraints (Pei et al., 2001), denoted as $C_m$. A constraint is anti-monotone (respectively, monotone) if, whenever an itemset satisfies it, every subset (respectively, superset) of that itemset also satisfies it. $C_{am}$ is simple and well suited to Apriori-like algorithms, so it is often integrated into them to prune candidates. $C_m$ is more complicated to exploit and less effective for pruning the search space.

Most previous approaches mine frequent itemsets with either $C_{am}$ or $C_m$. Mining frequent itemsets that satisfy both $C_{am}$ and $C_m$ is of interest. This can be accomplished by first mining frequent itemsets that satisfy $C_{am}$ using an algorithm such as Apriori (Agrawal and Srikant, 1994; Mannila and Toivonen, 1997), Eclat (Zaki and Hsiao, 2005), or FP-growth (Pei and Han, 2002), and then filtering the ones matching $C_m$ in a post-processing step. This approach is inefficient because it often has to test a large number of itemsets. A more complicated solution is to integrate both $C_{am}$ and $C_m$ into the algorithm to find all frequent itemsets that satisfy them. However, Jeudy and Boulicaut (2002) showed that integrating $C_m$ can reduce the pruning achieved by $C_{am}$. Therefore, there is a tradeoff between $C_{am}$ and $C_m$ pruning.

The present study considers problems that include many $C_m$ and $C_{am}$ constraints. Such a problem is stated as follows. Let $\mathcal{A}$ be a set of attributes or items and $\mathcal{T}$ be a database of transactions, where each transaction contains a set
of items. A set $X \subseteq \mathcal{A}$ is called an itemset. The support of an itemset $X$, denoted as $supp(X)$, is the ratio of the number of transactions containing $X$ to $N$, the number of transactions in $\mathcal{T}$. Given two threshold values, $s_0$ and $s_1$, and two subsets, $C_0$ and $C_1$, such that $0 < s_0 < s_1 \le 1$ and $C_0 \subseteq C_1 \subseteq \mathcal{A}$, the task is to find the class $\mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1)$ of all the frequent itemsets that (1) include a subset $C_0$ and (2) contain no items of a subset $C'_1 \subseteq \mathcal{A}$. Criterion (2) implies that the itemsets are contained in $C_1 = \mathcal{A} \setminus C'_1$ (the complement of $C'_1$ in $\mathcal{A}$). Formally, the goal is to discover all elements $L'$ of $\mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1)$ that match the following criteria, called the double-constraint:

(1) the support of $L'$ is in $[s_0, s_1]$: $s_0 \le supp(L')$ ($C_{am}$) and $supp(L') \le s_1$ ($C_m$);
(2) $L'$ includes $C_0$ ($C_m$) and $L'$ is contained in $C_1$ ($C_{am}$): $C_0 \subseteq L' \subseteq C_1$.

1.1. Contributions

The contributions of this paper are as follows. First, the mining of frequent itemsets whose supports are not too high, based on the $s_1$ value, is proposed. Such itemsets can be valuable. For instance, they can help to discover association rules that are not very frequent but are unique, or to find abnormal rules with high confidence values from itemsets of low frequency. Second, an efficient method for discovering frequent itemsets that satisfy both $C_{am}$ and $C_m$ constraints is proposed. Third, a structure and unique representation of frequent itemsets with double-constraint are proposed. Based on these, the MFS_DoubleCons algorithm is developed to quickly and distinctly mine all constraint-based frequent itemsets. Moreover, the algorithm accesses the database only once, even if the $C_{am}$ and $C_m$ constraints are changed regularly. This considerably enhances mining performance.

The rest of this paper is organized as follows. Some studies related to constraint-based frequent itemset and association rule mining are reviewed in Section 2. Section 3 presents some basic concepts in frequent itemset mining and notations. In Section 4, a unique representation of frequent
itemsets with double-constraint and a procedure for quickly determining all closed frequent itemsets and their generators are described. An efficient algorithm for finding all frequent itemsets with double-constraint is also proposed. Experimental results are discussed in Section 5. Finally, the conclusion and future work are presented in Section 6.

2. Related work

Constraint mining was first defined by Nguyen et al. (1998). They proposed an Apriori-like algorithm called CAP to reduce the computation of frequent itemsets. Pei and Han (2002) proposed the concept of convertible constraints and integrated them into the mining process of the FP-growth algorithm. Srikant et al. (1997) considered the problem of integrating constraints that are Boolean expressions over the presence or absence of items in the association rules. Bayardo et al. (2000) restricted the problem of mining association rules to two constraints, namely the consequent and the minimum improvement. Bucila et al. (2003) integrated both $C_{am}$ and $C_m$ into the algorithm DualMiner. Unfortunately, it has to scan the database many times and perform a large number of useless tests on long itemsets, especially when the minimum support is low. An Apriori-like algorithm called ExAMiner (Bonchi et al., 2003) uses both of these constraints to reduce the input data and the search space.

A shortcoming of these algorithms is that they need to be rerun whenever the constraints are changed. This slows down mining for users. A solution is to mine and save, only once, all frequent itemsets with a small minimum support. Then, when the constraints are changed, the system extracts the frequent itemsets that satisfy them. The computational and memory costs for this second step are very high because the number of frequent itemsets generated in the first step is usually enormous.

Recently, many authors have focused on combining the advantages of constrained mining and a condensed
representation of frequent itemsets. Instead of mining all frequent itemsets, only the condensed ones are extracted. Using condensed frequent itemsets has three primary advantages. First, they are easier to store because their number is much smaller than the size of the class of all frequent itemsets, especially for dense databases. Second, they are mined only once from the database even when the constraints are changed. Third, they can be used to generate all frequent itemsets without access to the database.

There are two types of condensed representation. The first type is maximal frequent itemsets (Burdick et al., 2001; Lin and Kedem, 2002). Since their cardinality is very small, they can be discovered quickly. All frequent itemsets can be generated from the maximal frequent itemsets. However, the generation often produces duplications. In addition, the generated frequent itemsets can lose information about their supports. Therefore, the supports need to be recomputed when mining association rules. The second type is closed frequent itemsets, called maximal ones, and their generators, called minimal ones (Bonchi and Lucchese, 2004; Boulicaut and Bykowski, 2000; Boulicaut et al., 2003; Pasquier et al., 2005). Each closed frequent itemset represents a class of all frequent itemsets that have the same closure. Thus, it (with its generators) can be used to uniquely determine all frequent itemsets in the class without losing their supports.

3. Some concepts and notations

For a database $\mathcal{T}$, let $\mathcal{O}$ be a non-empty set containing transactions, $\mathcal{A}$ be a set of items appearing in those transactions, and $\mathcal{R}$ be a binary relation on $\mathcal{O} \times \mathcal{A}$. A set of items is called an itemset. Consider two operators $\lambda: 2^{\mathcal{O}} \to 2^{\mathcal{A}}$ and $\rho: 2^{\mathcal{A}} \to 2^{\mathcal{O}}$ defined as follows (with $\lambda(\emptyset) := \mathcal{A}$ and $\rho(\emptyset) := \mathcal{O}$): $\forall O \subseteq \mathcal{O}$, $A \subseteq \mathcal{A}$,

$\lambda(O) = \{a \in \mathcal{A} \mid (o, a) \in \mathcal{R}, \forall o \in O\}$, $\rho(A) = \{o \in \mathcal{O} \mid (o, a) \in \mathcal{R}, \forall a \in A\}$.

A closure operator $h$ in $2^{\mathcal{A}}$ (Birkhoff, 1948; Wille, 1992) is defined as $h = \lambda \circ \rho$. Denote $h(A)$ as the closure of subset $A \subseteq \mathcal{A}$. $A$ is called a closed itemset iff $h(A) = A$. Let
$\mathcal{CS}$ be the class of all closed itemsets. The minimum and maximum support thresholds are denoted as $s_0$ and $s_1$, respectively, where $1/|\mathcal{O}| \le s_0 \le s_1 \le 1$. Only non-trivial items in the subset $\mathcal{A}^F := \{a \in \mathcal{A}: supp(\{a\}) \ge s_0\}$ of $\mathcal{A}$ are considered. A non-empty itemset $A$ (subset of $\mathcal{A}^F$) is called frequent iff $s_0 \le supp(A) \le s_1$. It is noted that if $s_1$ is equal to 1, then the traditional frequent itemset concept is obtained. Briefly, $\mathcal{FS} := \{L' \subseteq \mathcal{A}: s_0 \le supp(L') \le s_1\}$ denotes the class of all frequent itemsets and $\mathcal{FCS} := \mathcal{FS} \cap \mathcal{CS}$ denotes the class of all closed frequent itemsets.

For any two sets $G$, $A$ with $\emptyset \ne G \subseteq A \subseteq \mathcal{A}$, $G$ is called a generator (Mannila and Toivonen, 1997) of $A$ if $h(G) = h(A)$ and ($\forall \emptyset \ne G' \subset G \Rightarrow h(G') \subset h(G)$). Let $G(A)$ be the class of all generators of $A$ and $\mathcal{LGA}$ be the lattice of all closed itemsets and their generators.

Given $C_0$ and $C_1$ as two constraint subsets such that $\emptyset \ne C_0 \subseteq C_1 \subseteq \mathcal{A}$:

$\mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1) := \{L' \in \mathcal{FS}: C_0 \subseteq L' \subseteq C_1\}$ denotes the class of all frequent itemsets with double-constraint,
$\mathcal{FCS}_{\supseteq C_0}(s_0, s_1) := \{L \in \mathcal{FCS}: L \supseteq C_0\}$ denotes the class of all closed frequent itemsets which include $C_0$, and
$\mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1) := \{L_{C_1} \equiv L \cap C_1: L \in \mathcal{FCS}_{\supseteq C_0}(s_0, s_1)$ and $(\exists L_i \in G(L): L_i \subseteq C_1)\}$ denotes the class of all closed frequent itemsets with double-constraint.

Fig. 1. (a) Example database and (b) corresponding lattice of closed itemsets. The frequent itemsets are underlined and their generators are in italics. Supports are shown on the right of the lattice.

Remark 1. $\forall L' \in \mathcal{FS}$, if $C_0 \subseteq L' \subseteq C_1$, then $supp(C_0) \ge supp(L') \ge s_0$ and $supp(C_1) \le supp(L') \le s_1$. Thus, only $C_0$ and $C_1$ where $supp(C_0) \ge s_0$ and $supp(C_1) \le s_1$ are considered.

4. Mining frequent itemsets with double-constraint

Definition 1 (Tin and Anh, 2010) (Equivalence relation $\sim_h$ over $\mathcal{FS}$). $\forall A, B \in \mathcal{FS}$: $A \sim_h B \Leftrightarrow h(A) = h(B)$.

For each $A \in \mathcal{FS}$, $[A] := \{B \in \mathcal{FS}: h(B) = h(A)\}$ denotes the equivalence class of all frequent itemsets with the same closure. If $L \in \mathcal{FCS}$, then $[L] := \{L' \subseteq L: h(L') = L\}$. Using this relation, $\mathcal{FS}$ is partitioned into disjoint equivalence classes. Each class contains frequent itemsets with the same closure $L \in \mathcal{FCS}$ and the same support. The following theorem is derived:

Theorem 1 (Tin and Anh, 2010) (A partition of $\mathcal{FS}$). $\mathcal{FS} = \sum_{L \in \mathcal{FCS}} [L]$.

This partition allows us to independently mine frequent itemsets with double-constraint in each equivalence class.

Example 1. The rest of this paper considers database $\mathcal{T}$ shown in Fig. 1(a). For $s_0 = 1/4$ and $s_1 = 3/4$, Charm-L (Zaki and Hsiao, 2005) and MinimalGenerators (Zaki, 2004) are used to mine a lattice of all closed frequent itemsets and their generators. The results are shown in Fig. 1(b). Then, $\mathcal{FS} = [a] + [c] + [ceg] + [ac] + [afh] + [aceg] + [bceg] + [acfh] + [adfh]$. (The symbol $+$ denotes the union of two disjoint sets.)

With this disjoint partition, the duplication in the mining process of frequent itemsets in different classes is greatly reduced. However, Theorem in Anh et al. (2011) showed that frequent itemsets generated in each class can still be duplicated.

4.1. Partition of the class of all frequent itemsets with double-constraint

Let $\mathcal{FS}_{C_0 \subseteq L_{C_1}} := \{L' \subseteq L_{C_1}: C_0 \subseteq L', h(L') = h(L_{C_1})\}$ be the class of all frequent itemsets in $[L]$ which include $C_0$ and are contained in $L_{C_1}$, where $L_{C_1} \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$. Based on the idea of the above partition and Proposition in Hai et al. (2013), the following proposition is derived.

Proposition 1 (The disjoint partition of the class of all frequent itemsets with double-constraint). $\forall C_0 \subseteq C_1 \subseteq \mathcal{A}$:

$\mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1) = \sum_{L_{C_1} \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)} \mathcal{FS}_{C_0 \subseteq L_{C_1}}$.

Proof.
The sets on the right-hand side are disjoint. In fact, $\forall L'_i \in \mathcal{FS}_{C_0 \subseteq L_{C_1,i}}$, where $L_{C_1,i} := L_i \cap C_1 \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$, $L_i \in \mathcal{FCS}_{\supseteq C_0}(s_0, s_1)$, $i = 1, 2$ and $L_1 \ne L_2$. Then, $h(L'_1) = h(L_{C_1,1}) = L_1 \ne L_2 = h(L_{C_1,2}) = h(L'_2)$. Thus, $L'_1$ and $L'_2$ are in different equivalence classes $[L_1]$ and $[L_2]$. Hence, $L'_1 \ne L'_2$.

"$\subseteq$": $\forall L' \in \mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1)$, let $L = h(L')$ and $L_{C_1} = L \cap C_1$; then $C_0 \subseteq L' \subseteq h(L') = L$ and $s_0 \le supp(L') = supp(L) \le s_1$. Let $L_i \in G(L')$ (because $G(L') \ne \emptyset$ (Anh et al., 2012)); then $L_i \subseteq L' \subseteq C_1$, $C_0 \subseteq L_{C_1}$, $h(L_i) = h(L') = L$ and $L_i \in G(L') \subseteq G(L)$. Thus, $L \in \mathcal{FCS}_{\supseteq C_0}(s_0, s_1)$ and $L_{C_1} \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$. Moreover, it is known that $C_0 \subseteq L' \subseteq L_{C_1} \subseteq L$ and $L = h(L') = h(L_i) \subseteq h(L_{C_1}) \subseteq L$. Therefore, $h(L_{C_1}) = h(L')$. It is thus concluded that $L' \in \mathcal{FS}_{C_0 \subseteq L_{C_1}}$.

"$\supseteq$": $\forall L_{C_1} \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$, $\forall L' \in \mathcal{FS}_{C_0 \subseteq L_{C_1}}$, then $h(L') = h(L_{C_1})$, $C_0 \subseteq L' \subseteq L_{C_1} \subseteq C_1$ and there exists $L_i \in G(L)$: $L_i \subseteq C_1$. Then, $L_i \subseteq L_{C_1} \subseteq L$ and $L = h(L_i) \subseteq h(L_{C_1}) \subseteq h(L) = L$. Thus, $h(L') = h(L_{C_1}) = L$ and $s_0 \le supp(L') = supp(L) \le s_1$. Hence, $L' \in \mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1)$. □

Remark 2. The disjoint partition of the class of all frequent itemsets allows parallel algorithms to be designed to independently exploit each class, significantly reducing mining time.

Example 2. Set $s_0 = 1/4$, $s_1 = 3/4$, $C_0 = a$, and $C_1 = adefg$. Then, consider $L = aceg$, $G(L) = \{L_1 = ae, L_2 = ag\}$. From Theorem in Anh et al. (2011), $X_U = aeg$, $X_{U,1} = g$, $X_{U,2} = e$, $X_{\_} = c$. Thus, $[aceg] = \{ae, aeg, aegc, aec, ag, agc\}$. Similarly, $[bceg] = \{b, bc, be, bg, bce, bcg, beg, bceg\}$, $[acfh] = \{cf, acf, cfh, acfh, ch, ach\}$, $[adfh] = \{d, ad, df, dh, adf, adh, dfh, adfh\}$, $[ceg] = \{e, ce, eg, ceg, g, cg\}$, $[ac] = \{ac\}$, $[afh] = \{f, af, fh, afh, h, ah\}$, $[c] = \{c\}$, and $[a] = \{a\}$. Frequent itemsets $L'$ of all classes are tested by the condition $C_0 \subseteq L' \subseteq C_1$, yielding $\mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1) = \{ae, aeg, ag, ad, adf, af, a\}$.

In Example 2, testing the condition $C_0 \subseteq L' \subseteq C_1$ for every frequent itemset is very expensive. In the next section, a method for efficiently mining the frequent itemsets with double-constraint in each equivalence class is proposed.

4.2. Extracting closed frequent itemsets and generators with double-constraint

The elements of $\mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$ can be determined as follows. From the lattice $\mathcal{LGA}$, the class of all closed frequent itemsets including $C_0$ is first determined. Second, for each element $L$ of them, if there exists a generator of $L$ contained in $C_1$, then the
closed frequent itemset that satisfies the double-constraint, called $L_{C_1}$, is the intersection of $L$ and $C_1$, i.e. $L_{C_1} = L \cap C_1 \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$. Otherwise, $L$ is discarded. In this way, the generators of $L_{C_1}$ are determined by finding all generators (of $L$) contained in $C_1$. Similar to Proposition in Anh et al. (2011), the following proposition is derived:

Proposition 2 (Generate closed frequent itemsets and their generators with double-constraint).
(a) $\forall L_{C_1} = L \cap C_1 \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$: $G(L_{C_1}) = \{L_i \in G(L): L_i \subseteq C_1\}$.
(b) The elements of $\mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$ are generated distinctly.

From this proposition, the procedure for generating all closed frequent itemsets with double-constraint and their generators using $\mathcal{LGA}$ is shown in Fig. 2.

Fig. 2. MFCS_DoubleCons procedure.

Example 3. With the values of $C_0$, $C_1$, $s_0$, and $s_1$ as those in Example 2, closed frequent itemsets with double-constraint and their generators are mined using MFCS_DoubleCons. Table 1 shows the process.

Table 1. Mined closed frequent itemsets and their generators with double constraint.
$L$ | $G(L)$ | $L \supseteq C_0$ | $L_i \in G(L)$ and $C_1 \supseteq L_i$ | $L_{C_1} = L \cap C_1$ | $G(L_{C_1})$
acfh | cf, ch | acfh | - | - | -
aceg | ae, ag | aceg | ae, ag | aeg | ae, ag
adfh | d | adfh | d | adf | d
bceg | b | - | - | - | -
afh | f, h | afh | f | af | f
ceg | e, g | - | - | - | -
ac | ac | ac | - | - | -
c | c | - | - | - | -
a | a | a | a | a | a

Remark 3. In the MFCS_DoubleCons procedure, if $L$ includes $C_0$, then all its supersets also include $C_0$ (the property of constraint $C_m$). A similar property holds for the test $supp(L) \le s_1$. Thus, if a candidate passes the tests, its supersets do not need to be tested, significantly reducing the execution time of the algorithm. A similar remark can be made for constraint $s_0$ (constraint $C_{am}$).

4.3. Mining all frequent itemsets with double-constraint

This section describes the unique representation and the structure of frequent itemsets with double-constraint. For each $L_{C_1} \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$, let $K_{\min,C_0,C_1} := Minimal\{K_i = L_i \setminus C_0 \mid L_i \in G(L_{C_1})\}$ be the class of all the minimal itemsets of $\{L_i \setminus C_0 \mid L_i \in G(L_{C_1})\}$ in terms of the inclusion order "$\subseteq$" on sets. Assign $K_{U,C_0,C_1} := \bigcup_{K_i \in K_{\min,C_0,C_1}} K_i$, $K_{U,C_0,C_1,i} := K_{U,C_0,C_1} \setminus K_i$, $K_{\_,C_0,C_1} := L_{C_1} \setminus (K_{U,C_0,C_1} + C_0)$, and define:

$\mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}} := \{L' = C_0 + K_i + K'_i + K'' \mid K_i \in K_{\min,C_0,C_1},\ K'_i \subseteq K_{U,C_0,C_1,i},\ K'' \subseteq K_{\_,C_0,C_1}$ and $(K_j \not\subset K_i + K'_i,\ \forall K_j \in K_{\min,C_0,C_1}: 1 \le j < i)\}$.   (*)

Remark 4. If there exists $L_i \in G(L_{C_1})$ such that $K_i = L_i \setminus C_0 = \emptyset$, then $\mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}} = \{L' = C_0 + K'',\ K'' \subseteq L_{C_1} \setminus C_0\}$. In this way, the frequent itemsets with double-constraint are generated more quickly. Indeed, if there exists $L_i \in G(L_{C_1})$ such that $K_i = L_i \setminus C_0 = \emptyset$, which implies that $L_i \subseteq C_0$, then $K_{\min,C_0,C_1} = \{\emptyset\}$, $K_{U,C_0,C_1} = \emptyset$, $K_{U,C_0,C_1,i} = \emptyset$. This implies that $K_{\_,C_0,C_1} = L_{C_1} \setminus C_0$. Thus, $L' = C_0 + K'' \in \mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$, where $K'' \subseteq L_{C_1} \setminus C_0$.

Theorem 2 (Distinctly generate the elements of $\mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$). For each $L_{C_1} \in \mathcal{FCS}_{C_0 \subseteq C_1}(s_0, s_1)$:
(a) $\mathcal{FS}_{C_0 \subseteq L_{C_1}} = \mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$.
(b) The elements of $\mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$ are generated distinctly.

Proof. (a) "$\subseteq$": if $L' \in \mathcal{FS}_{C_0 \subseteq L_{C_1}}$, according to Theorem in Anh et al. (2011), $L' = L_i + L'_i + L^{\sim}$, where $L_i \in G(L_{C_1})$, $L_U = \bigcup_{L_i \in G(L_{C_1})} L_i$, $L'_i \subseteq L_{U,i} = L_U \setminus L_i$, $L^{\sim} \subseteq L_{\_} = L_{C_1} \setminus L_U$. Thus,

$L' = C_0 \cap (L_i + L'_i + L^{\sim}) + (L_i \setminus C_0) + (L'_i \setminus C_0) + (L^{\sim} \setminus C_0) = C_0 + K_i + (L'_i \setminus C_0) + (L^{\sim} \setminus C_0) = C_0 + K_i + K'_i + K''$,

where $K_i = L_i \setminus C_0 \in K_{\min,C_0,C_1}$,
$K'_i = (L'_i \setminus C_0) \cap K_{U,C_0,C_1} \subseteq K_{U,C_0,C_1} \cap (L_U \setminus L_i) \subseteq K_{U,C_0,C_1} \setminus L_i \subseteq K_{U,C_0,C_1} \setminus K_i = K_{U,C_0,C_1,i}$, and
$K'' = [(L'_i \setminus C_0) \setminus K_{U,C_0,C_1} + (L^{\sim} \setminus C_0)] \subseteq (L_{C_1} \setminus (C_0 + K_{U,C_0,C_1})) + (L_{C_1} \setminus K_{U,C_0,C_1}) \setminus C_0 \subseteq L_{C_1} \setminus (C_0 + K_{U,C_0,C_1}) = K_{\_,C_0,C_1}$.

The lowest index $i$ is always selected such that $K_i = L_i \setminus C_0 \in K_{\min,C_0,C_1}$ is the minimal set and $L'$ has the representation in the form $L' = C_0 + K_i + K'_i + K''$. Moreover, if there exists an index $k < i$ such that $K_k \in K_{\min,C_0,C_1}$ and $K_k \subset K_i + K'_i$, then $L'$ also has a different representation: $L' = C_0 + K_k + K'_k + K''$, where $K'_k = (K_i + K'_i) \setminus K_k \subseteq K_{U,C_0,C_1,k}$. This contradicts the selection of index $i$. Thus, $K_k \not\subset K_i + K'_i$, $\forall K_k \in K_{\min,C_0,C_1}$: $1 \le k < i$. Hence, $L' \in \mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$.

"$\supseteq$": if $L' \in \mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$, there exist $K_i = L_i \setminus C_0 \in K_{\min,C_0,C_1}$, $L_i \in G(L_{C_1})$, $K'_i \subseteq K_{U,C_0,C_1,i}$, $K'' \subseteq K_{\_,C_0,C_1}$: $C_0 \subseteq L' = C_0 + K_i + K'_i + K'' \subseteq L_{C_1}$. Thus, $h(L') \subseteq h(L_{C_1})$. On the other hand, since $L_i = K_i + (L_i \cap C_0) \subseteq K_i + C_0 \subseteq L'$, we have $h(L_{C_1}) = h(L_i) \subseteq h(L')$. It is inferred that $h(L') = h(L_{C_1})$. Hence, $L' \in \mathcal{FS}_{C_0 \subseteq L_{C_1}}$.

(b) Assume that there exist $i$ and $k$, where $i > k \ge 1$, such that $L_k \equiv L_i$, where $L_k := C_0 + K_k + K'_k + K''_k$, $L_i := C_0 + K_i + K'_i + K''_i$, $K_k, K_i \in K_{\min,C_0,C_1}$, $K_k \ne K_i$, $K''_k, K''_i \subseteq K_{\_,C_0,C_1}$, $K'_k \subseteq K_{U,C_0,C_1,k}$, $K'_i \subseteq K_{U,C_0,C_1,i}$. Since $K_k \cap C_0 = \emptyset$ and $K_k \cap K''_i = \emptyset$, it follows that $K_k \subset K_i + K'_i$ (the equality does not hold because $K_i$ and $K_k$ are two different minimal sets). This contradicts the way that index $i$ is selected. Therefore, all elements of $\mathcal{FS}^{*}_{C_0 \subseteq L_{C_1}}$ are generated distinctly. □

Example 4. For $s_0 = 1/4$, $s_1 = 3/4$, $C_0 = a$, and $C_1 = adefg$, consider $L = aceg$, $G(L) = \{L_1 = ae, L_2 = ag\}$. The following results are obtained: $C_0 \subseteq L$, $L_{C_1} = L \cap C_1 = aeg$, $G(L_{C_1}) = \{L_1 = ae, L_2 = ag\}$, $K_1 = L_1 \setminus C_0 = e$, $K_2 = L_2 \setminus C_0 = g$, $K_{\min,C_0,C_1} = Minimal\{K_1, K_2\} = \{K_1, K_2\} = \{e, g\}$, $K_{U,C_0,C_1} = eg$, $K_{U,C_0,C_1,1} = g$, $K_{U,C_0,C_1,2} = e$, $K_{\_,C_0,C_1} = \emptyset$. With $K_1 = e$, $a + e$ and $a + e + g$ belong to $\mathcal{FS}_{C_0 \subseteq aceg \cap C_1}(s_0, s_1)$, and with $K_2 = g$, $a + g \in \mathcal{FS}_{C_0 \subseteq aceg \cap C_1}(s_0, s_1)$. Thus, $\mathcal{FS}_{C_0 \subseteq aceg \cap C_1}(s_0, s_1) = \{ae, aeg, ag\}$. Similarly, $\mathcal{FS}_{C_0 \subseteq adfh \cap C_1}(s_0, s_1) = \{ad, adf\}$, $\mathcal{FS}_{C_0 \subseteq afh \cap C_1}(s_0, s_1) = \{af\}$ and $\mathcal{FS}_{C_0 \subseteq a \cap C_1}(s_0, s_1) = \{a\}$. Hence, $\mathcal{FS}_{C_0 \subseteq C_1}(s_0, s_1) = \{ae, aeg, ag, ad, adf, af, a\}$.

Fig. 3. MFS_DoubleCons algorithm.

Fig. 4. MFS_DoubleCons_OneClass procedure.

Table 2. Database characteristics.
Database | #Items | #Records | Avg. length
Connect (C) | 129 | 67,557 | 43
Mushroom (M) | 119 | 8124 | 23
Pumsb (P) | 7117 | 49,046 | 74
T10I4D100K (T10) | 1000 | 100,000 | 10
T40I10D100K (T40) | 1000 | 100,000 | 40

According to Theorem 2, the procedure MFS_DoubleCons_OneClass (pseudo code shown in Fig. 4) is used for mining frequent
itemsets with double-constraint in a class. Using Proposition 2 and this procedure, the algorithm MFS_DoubleCons, shown in Fig. 3, is proposed for mining all frequent itemsets with double-constraint.

5. Experimental results

Experiments were performed on a PC with an i5-2400 CPU, 3.10 GHz @ 3.09 GHz, and 3.16 GB of memory, running Windows XP. The algorithms were coded in C#. To compare the performance, the source code for Charm-L, MinimalGenerators and dEclat (Anon, 2010) was converted to C#. Charm-L and MinimalGenerators were used to mine the lattice of the closed itemsets and their generators. dEclat was used to mine all frequent itemsets. Here, it was modified for mining frequent itemsets with double-constraint. In its first step, only frequent itemsets contained in $C_1$ are taken. Then, the output is filtered to determine the frequent itemsets that satisfy $s_1$ and $C_0$. This modified version is called dEclat-DC. For the post-processing approach, Gen_Itemsets_DC is a modification of Gen_Itemsets (Anh et al., 2011).

For the performance test, the following benchmark databases in FIMDR (2009) were chosen: Pumsb, Connect, Mushroom, T10I4D100K, and T40I10D100K.

Table 3. Time reductions of MFS_DoubleCons compared to Gen_Itemsets_DC and dEclat-DC.
DB, MS R_MG (%) R_ME (%) RR (%) DB, MS R_MG (%) R_ME (%) RR (%) M,15 M,10 P,75 P,70 P,65 C,65 C,60 C,55 0.96 1.4 36.5 40.5 57.8 91.8 60.5 57.8 2.66 1.62 30.3 18.1 13.2 4.3 2.2 3.2 11.7 20.4 40.6 5.9 5.1 7.7 99.94 99.96 99.37 99.65 99.9 96.77 99.63 95.94 M,5 M,3 T10,0.09 T10,0.07 T10,0.04 T40,2 T40,0.9 T40,0.6 3.13 3.77 2.5 2.4 2.2 2.01 8.04 99.28 98.88 99.96 99.99 99.97 99.98 99.96 99.94

Fig. Performance results for T10I4D100K.

Fig. Performance results for Pumsb.

Fig. Performance results for Mushroom.

Pumsb, Connect, and Mushroom are real and dense, i.e., they produce many long frequent itemsets even for very high support values. The others are synthetic
and sparse. Table 2 shows their characteristics. The support threshold $s_1$ is fixed at 0.95. Assuming that the size of $C_0$ is $m$, $C_1$ with a size of $m + d \cdot |\mathcal{A}^F|/100$ ($d \in [1, 100]$) is chosen. With dense databases, for each pair of database (DB) and minimum support (MS), $m$ ranges from 4% to 14% of $|\mathcal{A}^F|$ (step 2%) and $d = 81$. For sparse ones, $m$ ranges from 0.2% to 2% of $|\mathcal{A}^F|$ (step 0.2%) and $d = 93$. For each pair of $C_0$'s size and $C_1$'s size, 10 double-constraints are selected for the dense ones and 6 double-constraints are selected for the sparse ones (each of them contains items randomly selected from $\mathcal{A}^F$), giving 60 double-constraints in both cases.

Let T_MD, T_GID, and T_DED be the average execution times of MFS_DoubleCons, Gen_Itemsets_DC, and dEclat-DC for the 60 selected double-constraints. Table 3 shows the experimental evaluation of MFS_DoubleCons against Gen_Itemsets_DC and dEclat-DC, where column R_MG shows the ratios of T_MD and T_GID, and column R_ME shows the ratios of T_MD and T_DED. NCIn is the number of frequent itemsets contained in $C_1$ and NNC is the number of frequent itemsets contained in $C_1$ without including $C_0$. Column RR gives the average of the ratios of NNC and NCIn.

Compared to Gen_Itemsets_DC, MFS_DoubleCons is faster for dense databases. The time is reduced by 30.3–0.96%. For sparse databases, the reduction is lower because the number of frequent itemsets is small and their size is small, leading to a low cost for testing the constraints. Compared to dEclat-DC, MFS_DoubleCons is much faster for all databases. The time is reduced by 40.6–2.01%. The reason for the reduction is that a large number of candidates (RR ranges from 95.94% to 99.99%) fail the last test of dEclat-DC, which lowers its performance.

Figs. 5–8 show the comparisons of the average execution times for various support values. The performance and scalability of MFS_DoubleCons are superior to those of Gen_Itemsets_DC and dEclat-DC. Figs. 9–11 show the performance results for various numbers of double-constraints. The
performance gap between MFS_DoubleCons and dEclat-DC increases with the number of double-constraints. The main reason is that, when the double-constraint changes, MFS_DoubleCons executes without creating the lattice of closed itemsets and their generators again from the database. MFS_DoubleCons outperforms both Gen_Itemsets_DC and dEclat-DC, especially when the minimum support is lower and the number of double-constraints is high.

Fig. Performance results for T40I10D100K.

Fig. 9. Performance results for Mushroom for various numbers of double-constraints.

Fig. 10. Performance results for real and dense databases for various numbers of double-constraints.

Fig. 11. Performance results for synthetic and sparse databases for various numbers of double-constraints.

6. Conclusion and future work

This paper presented a unique representation and structure of frequent itemsets with double-constraint. The correctness of the theoretical results was proven. An efficient algorithm was developed for mining all frequent itemsets with double-constraint. Tests on the benchmark databases show the efficiency of the proposed approach. Moreover, the tests show that the algorithm outperforms existing algorithms, especially when the minimum support values are very low and the number of double-constraints is high. In the future, the mining of frequent itemsets with more complicated types of constraint will be studied and association rules based on these frequent itemsets will be derived.

Acknowledgments

This work was funded by Dalat University and Vietnam's National Foundation for Science and Technology Development (NAFOSTED) under Grant no. 102.01-2012.17.

References

Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases, pp. 478–499.
Anh, T., Hai, D., Tin, T., Bac, L., 2011. Efficient algorithms for mining
frequent itemsets with constraint. In: Proceedings of the Third International Conference on Knowledge and Systems Engineering, pp. 19–25.
Anh, T., Hai, D., Tin, T., Bac, L., 2012. Mining frequent itemsets with dualistic constraints. In: Proceedings of the PRICAI 2012, LNAI, vol. 7458, Springer, pp. 807–813.
Anh, T., Tin, T., Bac, L., Hai, D., 2012. Mining association rules restricted on constraint. In: Proceedings of the IEEE-RIVF 2012, pp. 51–56.
Anon. 〈http://www.cs.rpi.edu/~zaki/wwwnew/pmwiki.php/Software/Software#patutils〉 (accessed 2010).
Bayardo, R.J., Agrawal, R., Gunopulos, D., 2000. Constraint-based rule mining in large, dense databases. Data Mining and Knowledge Discovery (2–3), 217–240.
Birkhoff, G., 1948. Lattice Theory. American Mathematical Society, New York.
Bonchi, F., Lucchese, C., 2004. On closed constrained frequent pattern mining. In: Proceedings of the IEEE ICDM'04, pp. 35–42.
Bonchi, F., Giannotti, F., Mazzanti, A., Pedreschi, D., 2003. Examiner: optimized level-wise frequent pattern mining with monotone constraints. In: Proceedings of the IEEE ICDM'03, pp. 11–18.
Boulicaut, J.F., Bykowski, A., 2000. Frequent closures as a concise representation for binary data mining. In: Proc. PAKDD'00, vol. 1805, Springer, pp. 62–73.
Boulicaut, J.F., Bykowski, A., Rigotti, C., 2003. Free-sets: a condensed representation of boolean data for the approximation of frequency queries. Data Mining and Knowledge Discovery (1), 5–22.
Bucila, C., Gehrke, J.E., Kifer, D., White, W., 2003. Dualminer: a dual-pruning algorithm for itemsets with constraints. Data Mining and Knowledge Discovery (3), 241–272.
Burdick, D., Calimlim, M., Gehrke, J., 2001. MAFIA: a maximal frequent itemset algorithm for transactional databases. In: Proceedings of the IEEE ICDE'01, pp. 443–452.
Dong, J., Han, M., 2007. BitTable-FI: an efficient mining frequent itemsets algorithm. Knowledge Based Systems 20 (4), 329–335.
Frequent Itemset Mining Dataset Repository (FIMDR), 〈http://fimi.cs.helsinki.fi/data/〉 (accessed 2009).
Grahne, G., Zhu, J.,
2005 Fast algorithms for frequent itemset mining using fp-trees IEEE Transactions on Knowledge and Data Engineering 17 (10), 1347–1362 Hai, D., Tin, T., Bac, L., 2013 An efficient algorithm for mining frequent itemsets with single constraint In: Proceedings of ICCSAMA 2013, Advanced Computational Methods for Knowledge Engineering, Springer, pp 367–378 Han, J., Pei, J., Yin, Y., 2000 Mining frequent patterns without candidate generation SIGMOD’00, 1–12 Jeudy, B., Boulicaut, J.F., 2002 Optimization of association rule mining queries Intelligent Data Analysis (4), 341–357 Lin, D.I., Kedem, Z.M., 2002 Pincer search: an efficient algorithm for discovering the maximum frequent sets IEEE Transactions on Knowledge and Data Engineering 14 (3), 553–566 Mannila, H., Toivonen, H., 1997 Levelwise search and borders of theories in knowledge discovery Data Mining and Knowledge Discovery (3), 241–258 Nguyen, R.T., Lakshmanan, V.S., Han, J., Pang, A., 1998 Exploratory mining and pruning optimizations of constrained association rules In Proceedings of the 1998 ACM-SIG-MOD International Conference on the Management of Data, pp 13–24 Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L., 1999 Discovering frequent closed itemsets for association rules ICDT’12, 398–416 Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L., 1999 Efficient mining of association rules using closed itemset lattices In: Information Systems 24 (1), 25–46 Pasquier, N., Taouil, R., Bastide, Y., Stumme, G., Lakhal, L., 2005 Generating a condensed representation for association rules Journal of Intelligent Information Systems 24 (1), 29–60 Pei, J., Han, J., 2002 Constrained frequent pattern mining: A Pattern-Growth view ACM SIGKDD Explorations (1), 31–39 Pei, J., Han, J., & Mao, R., 2000 CLOSET: an efficient algorithm for mining frequent closed itemsets SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, p 11–20 Pei, J., Han, J., Lakshmanan, L.V.S., 2001 Mining frequent itemsets with convertible 
constraints In Proceedings of the IEEE ICDE’01, pp 433–442 Song, W., Yang, B., Xu, Z., 2008 Index-BitTableFI: an improved algorithm formining frequent itemsets Knowledge Based Systems 21 (6), 507–513 Srikant, R., Vu, Q., Agrawal, R., 1997 Mining association rules with item constraints In Proc KDD’97, pp 67–73 Tin, T., Anh, T., 2010 Structure of set of association rules based on concept lattice Adv in Intelligent Infor and Database Systems, SCI, 283, Springer, p 217–227 Vo, B., Hong, T.P., Le, B., 2012 DBV-Miner: a dynamic bit-vector approach for fast mining frequent closed itemsets Expert Systems with Applications 39 (8), 7196–7206 Vo, B., Hong, T.P, Le, B., 2013 A lattice-based approach for mining most generalization association rules Knowledge-Based Systems 45, 20–30 Wang, J., Han, J., Pei, J., 2003 CLOSET ỵ: searching for the best strategies for mining frequent closed itemsets SIGKDD’03, 236–245 Wille, R., 1992 Concept lattices and conceptual knowledge systems Computers & Mathematics with Applications 23, 493–515 Zaki, M.J 2004 Mining Non-Redundant Association Rules Data Mining and Knowledge Discovery, pp 223–248 Zaki, M.J., Hsiao, C.J., 2005 Efficient algorithms for mining closed itemsets and their lattice structure IEEE Transactions on Knowledge and Data Engineering 17 (4), 462–478 Zaki, M.J., Parthasarathy, S., Ogihara, M., Li, W., 1997 New algorithms for fast discovery of association rules In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD’97), pp 283–296 ... Constraint mining Introduction Frequent itemsets play an important role in many data mining tasks such as the mining of association rules and classification Therefore, a lot of frequent itemset mining. .. the algorithm outperforms existing Fig Performance results for Mushroom for various numbers of double-constraints Fig 10 Performance results for real and dense databases for various numbers of... 
in frequent itemset mining and notations In Section 4, a unique representation of frequent itemsets with double-constraint and a procedure for quickly determining all closed frequent itemsets and
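The reuse argument made in the performance discussion (the expensive structure is built once and is not rebuilt when the double-constraint changes) can be illustrated with a minimal Python sketch. This is not MFS_DoubleCons: it does not build the lattice of closed itemsets and generators, and it enumerates frequent itemsets by brute force. The function names, the toy database, and the parameter names `must_include` (playing the role of C0) and `must_exclude` (the set of forbidden items) are assumptions introduced only for illustration.

```python
from itertools import combinations


def mine_frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of frequent itemsets (illustration only).

    Returns a dict mapping each frequent itemset (frozenset) to its
    support count. This stands in for the expensive, constraint-independent
    step that is performed once per database and support threshold.
    """
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            support = sum(1 for t in transactions if itemset <= t)
            if support >= min_support:
                frequent[itemset] = support
                found_any = True
        if not found_any:
            # Support is anti-monotone: if no itemset of this size is
            # frequent, no larger itemset can be frequent either.
            break
    return frequent


def filter_double_constraint(frequent, must_include, must_exclude):
    """Keep itemsets that contain every item of must_include (monotone
    constraint) and no item of must_exclude (anti-monotone constraint)."""
    must_include = frozenset(must_include)
    must_exclude = frozenset(must_exclude)
    return {
        itemset: support
        for itemset, support in frequent.items()
        if must_include <= itemset and not (itemset & must_exclude)
    }


# Hypothetical toy database of five transactions.
TRANSACTIONS = [
    frozenset("abc"),
    frozenset("ab"),
    frozenset("ac"),
    frozenset("bc"),
    frozenset("abc"),
]

# The mining step runs once...
FREQUENT = mine_frequent_itemsets(TRANSACTIONS, min_support=3)

# ...and each new double-constraint is answered from the stored result,
# without scanning the database again.
for include, exclude in [({"a"}, {"c"}), ({"b"}, set())]:
    result = filter_double_constraint(FREQUENT, include, exclude)
    print(sorted(include), sorted(exclude),
          sorted("".join(sorted(x)) for x in result))
```

The design point is only the separation of the two steps: the cost of `mine_frequent_itemsets` is paid once, while each additional double-constraint costs a pass over the stored result. The paper's method answers constraint changes from the lattice of closed itemsets and their generators rather than from a flat list of all frequent itemsets.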

