DSpace at VNU: Mining erasable itemsets with subset and superset itemset constraints


Accepted Manuscript

Mining Erasable Itemsets with Subset and Superset Itemset Constraints

Bay Vo, Tuong Le, Witold Pedrycz, Giang Nguyen, Sung Wook Baik

PII: S0957-4174(16)30559-0
DOI: 10.1016/j.eswa.2016.10.028
Reference: ESWA 10933
To appear in: Expert Systems With Applications
Received date: 19 July 2016
Revised date: 12 October 2016
Accepted date: 13 October 2016

Please cite this article as: Bay Vo, Tuong Le, Witold Pedrycz, Giang Nguyen, Sung Wook Baik, Mining Erasable Itemsets with Subset and Superset Itemset Constraints, Expert Systems With Applications (2016), doi: 10.1016/j.eswa.2016.10.028

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights
• We state the problem of mining EIs with subset and superset itemset constraints.
• Two propositions that support quick pruning of nodes are proposed.
• The pMEIC algorithm, based on these two propositions, is proposed.
• Experiments were conducted to show the effectiveness of pMEIC.

Bay Vo (1, 2), Tuong Le (3, 4, *), Witold Pedrycz (5, 6, 7), Giang Nguyen (1), Sung Wook Baik (2)

(1) Faculty of Information Technology, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
(2) College of Electronics and Information Engineering, Sejong University, Seoul, Republic of Korea
(3) Division of Data Science, Ton Duc Thang University, Ho Chi Minh City, Vietnam
(4) Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam
(5) Department of Electrical and Computer Engineering, University of Alberta, Edmonton, T6R 2V4 AB, Canada
(6) Department of Electrical and Computer Engineering, Faculty of Engineering, King Abdulaziz University, Jeddah, 21589, Saudi Arabia
(7) Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Email: Bay Vo (bayvodinh@gmail.com), Tuong Le (lecungtuong@tdt.edu.vn and tuonglecung@gmail.com), Witold Pedrycz (wpedrycz@ualberta.ca), Giang Nguyen (nh.giang@hutech.edu.vn), Sung Wook Baik (sbaik@sejong.ac.kr)

Abstract

Erasable itemset (EI) mining, a branch of pattern mining, helps managers to establish new plans for the development of new products. Since the problem of mining EIs was first proposed in 2009, many efficient algorithms for mining them have been developed. However, these algorithms usually require a lot of time and memory. In reality, users only need a small number of EIs which satisfy a particular condition. With this observation in mind, in this study we develop an efficient algorithm for mining EIs with subset and superset itemset constraints (C0 ⊆ X ⊆ C1). Firstly, based on the MEI (Mining Erasable Itemsets) algorithm, we present the MEIC (Mining Erasable Itemsets with subset and superset itemset Constraints) algorithm, in which each EI is checked against the constraints before being added to the results. Next, two propositions supporting quick pruning of nodes that do not satisfy the constraints are established. Based on these, we propose an efficient algorithm for mining EIs with subset and superset itemset constraints (called pMEIC, where p stands for pruning). The experimental results show that pMEIC outperforms MEIC in terms of mining time and memory usage.

Keywords: data mining, erasable itemset, subset and superset itemset constraint, pruning techniques

1 Introduction

Data mining and knowledge discovery is the process of discovering interesting patterns and rules in large databases. This process combines a variety of
methods stemming from artificial intelligence, machine learning, and statistics. Many problems in data mining have attracted the attention of researchers, such as mining association rules (Lin et al., 2016; Sahoo et al., 2015), application of association rules (Cheng et al., 2016; Parkinson et al., 2016; Khader et al., 2016), classification (Sun et al., 2015; Jia et al., 2015; Wang et al., 2015), and clustering (Agarwal & Bharadwaj, 2015; Das & Maji, 2015; Nanda & Panda, 2015). Pattern mining is the fundamental approach to solving the above problems, and there are currently many methods of mining frequent patterns, including Apriori (Agrawal et al., 1993), FP-growth (Han et al., 2000), dEclat (Zaki & Hsiao, 2005), NSFI (Vo et al., 2016), and FIN+ (Deng, 2016).

In 2009, Deng et al. (2009) formulated the problem of mining erasable patterns (EPs), a variant of pattern mining. In this problem, a factory produces many products, each of which is composed of a number of items (components). Each product generates revenue, and each item has a cost of purchase and storage. During a financial crisis, the factory may not have enough money to purchase all the required components as usual. The problem of EP mining is thus to find the patterns that can be removed while keeping the loss to the factory's profit within a given bound. Managers can then use the knowledge obtained from EPs to make a new production plan.

Although many algorithms have been developed for mining EIs, the problem of mining EIs with itemset constraints remains unexplored. In this study we thus propose an efficient algorithm for mining EIs with subset and superset itemset constraints. The results of this problem take the form Y = {X ⊆ I | X is an EI and C0 ⊆ X ⊆ C1}, where I is the set of items in the database and C0 ⊆ C1 ⊆ I. Mining EIs with the subset and superset itemset constraint (Y) can be achieved simply by mining all EIs and then filtering the results which satisfy this constraint.
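
To make this mine-then-filter baseline concrete, the following Python sketch enumerates every itemset of the small example database used in Section 2, keeps those whose gain stays within the threshold, and only afterwards applies C0 ⊆ X ⊆ C1. The names `gain` and `mine_then_filter`, the dictionary layout, and the integer percentage threshold are illustrative assumptions, and the brute-force enumeration merely stands in for a real miner such as MEI.

```python
from itertools import combinations

# Example product database (Table 1): product id -> (items, profit in USD).
DB = {
    1: ("abc", 2100), 2: ("ab", 1000), 3: ("ac", 1000), 4: ("bce", 150),
    5: ("be", 50), 6: ("ce", 100), 7: ("cdefg", 200), 8: ("defh", 100),
    9: ("df", 50), 10: ("bfh", 150), 11: ("cf", 100),
}


def gain(itemset, db):
    """Sum the profit of every product that uses at least one item of `itemset`."""
    items = set(itemset)
    return sum(val for prod_items, val in db.values() if items & set(prod_items))


def mine_then_filter(db, xi_percent, c0, c1):
    """Hypothetical baseline: find all erasable itemsets, then filter by C0 <= X <= C1."""
    total = sum(val for _, val in db.values())  # T, the total profit
    universe = sorted({i for prod_items, _ in db.values() for i in prod_items})
    c0, c1 = frozenset(c0), frozenset(c1)

    # Step 1: brute-force enumeration of all erasable itemsets (exponential cost).
    erasable = {}
    for k in range(1, len(universe) + 1):
        for combo in combinations(universe, k):
            g = gain(combo, db)
            # Integer arithmetic avoids rounding issues in g <= T * xi.
            if g * 100 <= total * xi_percent:
                erasable[frozenset(combo)] = g

    # Step 2: the constraint is applied only after all the mining work is done.
    return {x: g for x, g in erasable.items() if c0 <= x <= c1}


result = mine_then_filter(DB, 16, "f", "fdh")
for itemset, g in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), g)
```

On this database, with a 16% threshold, C0 = {f} and C1 = {fdh}, the sketch keeps only four itemsets even though it has to examine all 255 non-empty itemsets first, which is the cost the constrained algorithms aim to avoid.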
However, this approach is inefficient in both time and memory usage. We thus develop two pruning techniques that reduce the search space, based on two propositions about the parent-child relationships of the nodes in the search tree. The main contributions of this study are as follows: (1) presenting the problem of mining EIs with subset and superset itemset constraints, (2) proposing two pruning techniques to reduce the search space, (3) presenting an efficient algorithm for mining EIs with subset and superset itemset constraints, and (4) conducting experiments to show the effectiveness of the proposed algorithm.

The paper is organized as follows. In Section 2, we present the underlying concept of EI mining and the problem statement of mining erasable itemsets with subset and superset itemset constraints. Section 3 reviews related works on EI mining. In Section 4, a modified version of the MEI algorithm for mining erasable itemsets with subset and superset itemset constraints, named the MEIC algorithm, is presented. The main contribution of this article, the pMEIC algorithm, is then proposed in Section 5. We show the results of the performance evaluation of our algorithm and MEIC in Section 6, while the conclusions of this work are presented in Section 7.

2 Basic concepts and problem statement

Let I = {i1, i2, ..., im} be the set of all items, which are the abstract representations of the components of products. A product dataset is denoted by DB = {P1, P2, ..., Pn}, where each Pi is a product presented in the form ⟨Items, Val⟩, where Items are the items (or components) that constitute the product Pi, and Val is the profit that the factory generates by selling Pi. A set X ⊆ I is also called an itemset, and an itemset with k items is called a k-itemset. The example dataset shown in Table 1 will be used throughout this article. In this dataset, {a, b, c, d, e, f, g, h} is the set of items, and {P1, P2, ..., P11} is the set of products.

Table 1. An example database (DBe)

Product   Items            Val (USD)
P1        a, b, c          2,100
P2        a, b             1,000
P3        a, c             1,000
P4        b, c, e          150
P5        b, e             50
P6        c, e             100
P7        c, d, e, f, g    200
P8        d, e, f, h       100
P9        d, f             50
P10       b, f, h          150
P11       c, f             100

Definition 1. Let X ⊆ I be an itemset. The gain of X is defined as:

$$ g(X) = \sum_{\{P_k \,\mid\, X \cap P_k.\mathit{Items} \neq \emptyset\}} P_k.\mathit{Val} \qquad (1) $$

For example, X = {ac} is an itemset. The products that contain a, c, or both are {P1, P2, P3, P4, P6, P7, P11}. Therefore, g(X) = P1.Val + P2.Val + P3.Val + P4.Val + P6.Val + P7.Val + P11.Val = 4,650 USD.

Definition 2. Given a threshold ξ and a product dataset DB, pattern X is said to be erasable if and only if g(X) ≤ T × ξ, where T is the total profit of product dataset DB:

$$ T = \sum_{P_k \in DB} P_k.\mathit{Val} $$

The total gain of the factory is the sum of the gain of all products. For DBe, T = 5,000 USD. An itemset X is an EI if and only if g(X) ≤ T × ξ, where ξ is a user-given threshold. For example, let ξ = 16%; then g(e) = 600 USD and e is an EI because g(e) = 600 ≤ 5,000 × 16% = 800.

From Definitions 1 and 2, the problem of mining EIs is to find all itemsets X that have g(X) smaller than or equal to T × ξ. Table 2 shows all EIs for DBe with ξ = 16%.

Table 2. All EIs for DBe with ξ = 16%

Erasable itemset   Val (USD)
e                  600
f                  600
d                  350
h                  250
g                  200
ed                 650
eh                 750
eg                 600
fd                 600
fh                 600
fg                 600
dh                 500
dg                 350
hg                 450
edh                800
edg                650
ehg                750
fdh                600
fdg                600
fhg                600
dhg                500
edhg               800
fdhg               600

Le and Vo (2014) defined the index of gain as an array G, where G[k] = Pk.Val for 1 ≤ k ≤ n. For DBe, the index of gain is shown in Table 3. For example, the gain of product P4 is the value of the element at position 4 in G, denoted by G[4] = 150 dollars.

Table 3. Index of gain for DBe

Index   1       2       3       4     5    6     7     8     9    10    11
Gain    2,100   1,000   1,000   150   50   100   200   100   50   150   100

In addition, Le and Vo (2014) proposed the pidset and dPidset structures for mining EIs, which are summarized as follows. For an itemset X, the pidset p(X) (the set of product identifiers) is computed by

$$ p(X) = \bigcup_{A \in X} p(A) $$

where A is an item in X and p(A) is the pidset of item A, i.e., the set of identifiers of the products which include A. Let XA and XB be two itemsets with the same prefix X. The pidset
of XAB is computed by p(XAB) = p(XB) ∪ p(XA), where p(XA) and p(XB) are the pidsets of XA and XB, respectively. The dPidset of pidsets p(XA) and p(XB), denoted as dP(XAB), is computed by dP(XAB) = p(XB) \ p(XA). Moreover, assume that dP(XA) and dP(XB) are the dPidsets of XA and XB, respectively. The dPidset of XAB is then computed by dP(XAB) = dP(XB) \ dP(XA). The gain of XAB can be obtained from the gain of XA by adding G[k] for every k in dP(XAB). For example, in DBe, p(e) = {4, 5, 6, 7, 8} and p(d) = {7, 8, 9}, so dP(ed) = p(d) \ p(e) = {9} and g(ed) = g(e) + G[9] = 600 + 50 = 650 USD.

Definition 3. Given two itemsets C0 and C1 with C0 ⊆ C1 ⊆ I, the problem of mining EIs with a subset and superset itemset constraint is to find all EIs with:

$$ Y = \{\, X \subseteq I \mid g(X) \le T \times \xi \;\wedge\; C_0 \subseteq X \subseteq C_1 \,\} \qquad (2) $$

For example, consider DBe (with T = 5,000 USD), and let ξ = 16%, C0 = {f} and C1 = {fdh}. Table 4 shows all EIs satisfying this constraint.

Table 4. All EIs for DBe with ξ = 16% with a subset and superset itemset constraint (C0 = {f} and C1 = {fdh})

Erasable itemset   Val (USD)
f                  600
fd                 600
fh                 600
fdh                600

3 Related studies

Many algorithms have been proposed for solving the problem of mining EIs, as summarized by Le et al. (2014), including META (Mining Erasable iTemsets with the Anti-monotone property) (Deng et al., 2009), MERIT (Fast Mining ERasable ITemsets) (Deng & Xu, 2012), MEI (Mining Erasable Itemsets) (Le & Vo, 2014), and EIFDD (Erasable Itemsets for very Dense Datasets) (Nguyen et al., 2015). First, META is an Apriori-based algorithm, which is slow because it generates candidate patterns level by level. An erasable (k-1)-itemset X is checked against all the remaining erasable (k-1)-itemsets to generate candidate erasable k-itemsets, although only the small number of them that have the same prefix as X are actually combined. Second, MERIT uses the NC_Sets structure to reduce memory usage, which is its main advantage. However, there are still some disadvantages with regard to storing its structure, as this leads to high memory consumption and long execution time. Third, MEI uses a divide-and-conquer strategy and the concept of the difference pidset
(dPidset). Some theorems for efficiently computing itemset information to reduce mining time and memory usage were derived for MEI. Although its mining time and memory usage are better than those of META and MERIT, MEI's performance in mining EIs from very dense databases is relatively weak. Fourth, EIFDD was thus proposed to overcome this weakness of MEI for very dense databases by using the subsume concept. This concept helps determine information about a large number of EIs early, without the usual computational cost. In summary, EIFDD is now generally used to mine EIs for very dense databases, while MEI is used to mine EIs for the remaining types of databases.

Besides the problem of mining EIs, a number of related problems have been proposed, as follows. (1) The problem of mining top-rank-k EIs (Deng, 2013; Nguyen et al., 2014) is to find only the EIs whose gains rank among the top k, rather than all EIs. Deng (2013) first proposed solving this problem with a basic algorithm named VM, which uses the PID_list structure. Nguyen et al. (2014) then presented an improved version of PID_list named dPID_list. Based on this, the authors proposed a fast algorithm called dVM for mining top-rank-k EIs. (2) For the problem of mining erasable closed itemsets (ECIs), Nguyen et al. (2015) first represented and compressed the mined EIs without loss of information. They then proposed an effective algorithm to deal with this problem, named the MECP algorithm. (3) The problem of mining weighted erasable patterns was first proposed in Lee et al. (2015), which considers the distinct weight of each item according to its quality, size, price, and so on. In 2016, the same group of authors (Yun & Lee, 2016) proposed a new approach for mining weighted erasable patterns for streaming data applications.

In this study, we present the problem of mining EIs with subset and superset itemset constraints. The following three approaches can be applied to solve this
problem:
(1) Using one of the existing algorithms to mine all EIs satisfying the threshold and then singling out the EIs satisfying the constraint. In the experiment section, we use MEI to mine all EIs and single out the EIs satisfying the constraint. We call this approach MEI-N.
(2) In the process of mining EIs, we check whether an EI satisfies the constraint when it is created. If so, it is added to the results. This approach will be presented in Section 4.
(3) Only the itemsets that satisfy the constraint are expanded. This approach will be proposed in Section 5.

{}
  e × {4, 5, 6, 7, 8}: 600
    ed × {9}: 650
      edh × {10}: 800
        edhg × ∅: 800
      edg × ∅: 650
    eh × {10}: 750
      ehg × ∅: 750
    eg × ∅: 600
  f × {7, 8, 9, 10, 11}: 600
    fd × ∅: 600
      fdh × ∅: 600
        fdhg × ∅: 600
      fdg × ∅: 600
    fh × ∅: 600
      fhg × ∅: 600
    fg × ∅: 600
  d × {7, 8, 9}: 350
    dh × {10}: 500
      dhg × ∅: 500
    dg × ∅: 350
  h × {8, 10}: 250
    hg × {7}: 450
  g × {7}: 200

Fig. The search tree of MEIC for DBe with ξ = 16%, C0 = {f} and C1 = {fdh}. (Each node shows an itemset, its pidset at the first level or its dPidset at deeper levels, and its gain in USD.)

The search tree of MEI-N is the same as the search tree of MEIC in the figure above, but without the set of EIs that satisfy the constraints C0 and C1. Therefore, MEI-N has to scan the whole search tree again to find the EIs that satisfy the constraints C0 and C1, whereas MEIC gives the result without scanning this tree. Therefore, MEIC is better than MEI-N. However, in the process of searching for EIs satisfying the constraints C0 and C1, MEIC creates many redundant nodes. For example, the branches of {e}, {d}, {h} and {g} are redundant. Therefore, we propose the pMEIC algorithm to alleviate this disadvantage.

5 pMEIC algorithm

In this section we propose two propositions for fast mining of EIs with subset and superset itemset constraints C0 and C1. The two propositions are as follows.

Proposition 1. Let EIC be the set of EIs that satisfy the constraint, and EC(C0) be the equivalence class of C0 in the search tree. If X ∉ EC(C0), then X does not belong to EIC.

Proof. Based on the definition of equivalence class, EC(C0) = {C0Y | Y is an itemset and Y can be an empty set} (1). By
assumption, X ∉ EC(C0) (2). From (1) and (2), C0 ⊈ X, so X does not belong to EIC. Therefore, Proposition 1 is proven.

Proposition 2. Let X ∈ EC(C0) be an EI. If X does not satisfy C1, then any Y which is a superset of X does not belong to EIC.

Proof. By assumption, X does not satisfy C1; therefore, there are two cases: X ⊄ C1 or X ⊃ C1.
For the first case: because X ⊄ C1, X \ C1 ≠ ∅. Therefore, if Y is a superset of X, then Y \ C1 ≠ ∅, and Y does not belong to EIC.
For the second case: X ⊃ C1 and Y is a superset of X, i.e., X ⊆ Y, so C1 ⊂ Y, and thus Y does not belong to EIC.
Therefore, Proposition 2 is proven.

5.1 The algorithm

Based on Propositions 1 and 2, the pMEIC algorithm improves on MEIC through the following ideas: (i) start with C0; (ii) when the EI candidates are created, the algorithm only adds EIs which satisfy the constraint C1 to EInext. From (i) and (ii), pMEIC does not generate candidates that do not satisfy the constraint.

First, pMEIC scans the database to determine EI'1, plus its gain and pidsets (Line 1). If C0 is not an EI, the algorithm returns empty results (Lines 2-3). pMEIC then sorts EI'1 in descending order of pidsets' size (Line 4). In the next step we combine C0 with EI'1. Only itemsets which are EIs and are a subset of C1 are added to EInext and the results (Lines 6-12). The algorithm stops when no more itemsets are created. In the second step, for EIk (k ≥ 1) in the same equivalence class (Line 14 for k = 1 and Line 26 for k ≥ 2), the algorithm combines each element with the elements that follow it to create the set of candidate (k+1)-itemsets. If the gain of itemset X does not exceed T × ξ and X is a subset of C1 (Line 22), the algorithm will: (i) add X to EIk+1 (Line 23); (ii) add X to EIC (Line 24); (iii) combine the elements in EIk+1 together to create EIk+2 (Line 26). The pMEIC algorithm is presented below.

Algorithm: pMEIC algorithm
Input: Database DB, threshold ξ and the constraints C0, C1
Output: EIC (the erasable itemsets satisfying the constraints C0, C1)
1   Scan DB to calculate T, G, EI'1 (EI'1 = EI1 \ C0) with their pidsets and g(C0)
2   if g(C0) > T × ξ then
3       return EIC = ∅
4   Sort EI'1 in descending order of pidsets' size
5   EInext ← ∅
6   for k ← 0 to |EI'1| - 1
7       E.items = C0 ∪ EI'1[k].Item
8       (E.pidset, Gain) ← Sub_pidsets(C0.pidset, EI'1[k].pidset)
9       E.gain = C0.gain + Gain
10      if E.gain ≤ T × ξ and E.items ⊆ C1 then
11          EInext ← EInext ∪ E
12          EIC ← EIC ∪ E
13  if |EInext| > 1
14      call Expand_E(EInext)
15  procedure Expand_E(EIv)
16  for k ← 0 to |EIv| - 2
17      EInext ← ∅
18      for j ← (k+1) to |EIv| - 1
19          E.Items = EIv[k].Items ∪ EIv[j].Items
20          (E.pidset, Gain) ← Sub_pidsets(EIv[k].pidset, EIv[j].pidset)
21          E.gain = EIv[k].gain + Gain
22          if E.gain ≤ T × ξ and E.items ⊆ C1 then
23              EInext ← EInext ∪ E
24              EIC ← EIC ∪ E
25      if |EInext| > 1
26          call Expand_E(EInext)

5.2 Illustration of the pMEIC process

The execution of the pMEIC algorithm for DBe with ξ = 16%, C0 = {f} and C1 = {fdh} is described in the following steps:

1. pMEIC scans DBe to calculate T = 5,000 USD; the index of gain (G); and EI'1 = {e, d, h, g} with their pidsets (Line 1).
2. pMEIC sorts EI'1 in descending order of pidsets' size. Once this step has been completed, the order of EI'1 is {e, d, h, g}.
3. pMEIC combines C0 with the itemsets in EI'1 to create {fe}, {fd}, {fh}, and {fg}. Of these, {fe} is discarded because its gain (900 USD) exceeds the threshold (800 USD), and {fg} is discarded because it is not a subset of C1. Itemsets {fd} and {fh} are used to generate the itemsets in the next step. The algorithm finishes when {fdh} is created, because constraint C1 allows no larger itemset.

The results, EIC, will be {f, fd, fh, fdh}. The figure below shows the search tree of pMEIC for DBe with ξ = 16%, C0 = {f} and C1 = {fdh}. pMEIC applies Propositions 1 and 2, enabling faster mining compared with MEIC. It does not generate redundant itemsets, and thereby improves the time and memory usage of the mining process.

{}
  e × {4, 5, 6, 7, 8}: 600
  f × {7, 8, 9, 10, 11}: 600
    fe × {9, 10, 11}: 900 (discarded)
    fd × ∅: 600
      fdh × ∅: 600
    fh × ∅: 600
    fg × ∅: 600 (discarded)
  d × {7, 8, 9}: 350
  h × {8, 10}: 250
  g × {7}: 200

Fig. The search tree of pMEIC for DBe with ξ =
16%, C0 = {f} and C1 = {fdh}

6 Experimental studies

All the experiments presented in this section were performed on a laptop with an Intel Core i3-6100U 2.3-GHz CPU and GB of RAM running Windows 10. All algorithms were coded in C# and .NET Framework (version 4.5.50709) in Microsoft Visual Studio 2012.

The experiments were conducted on six datasets downloaded from http://fimi.cs.helsinki.fi/data/. To make these datasets look like product datasets, a column was added to store the profit related to each product. The profit was generated following the normal distribution N(100, 50). The characteristics of these datasets are shown in Table 5.

Table 5. Characteristics of datasets used in the experiments

Dataset (1)    # of Products   # of Items
Chess          3,196           76
Connect        67,557          130
Mushroom       8,124           120
T10I4D100K     100,000         870
Accidents      340,183         468
Pumsb          49,046          7,117

(1) These datasets are available at http://sdrv.ms/14eshVm

6.1 Mining time

Fig. 4. Mining time of (A) MEI-N, MEIC and pMEIC with (|C0|=1, |C1|=14); (B) MEI-N, MEIC and pMEIC with (|C0|=2, |C1|=18) for the Chess dataset.

Fig. 5. Mining time of (A) pMEIC with (|C0|=1, |C1|=14), (|C0|=2, |C1|=14) and (|C0|=3, |C1|=14); and (B) pMEIC with (|C0|=2, |C1|=14), (|C0|=2, |C1|=16) and (|C0|=2, |C1|=18) for the Chess dataset.

Figure 4 shows the results of the experiment to compare the mining times of the MEI-N, MEIC and pMEIC algorithms (Figs 4A, 4B), and Figure 5
shows the mining time of pMEIC with several pairs of C0 and C1 (Figs 5A, 5B) for the Chess dataset. Figures 4A and 4B show that pMEIC is faster than MEI-N and MEIC at every tested threshold for the Chess dataset. In particular, with large thresholds (ξ = 70%, 80%) the time differences are very large. For example, in Fig. 4A, with ξ = 80%, the mining time of MEIC is 284 s and that of MEI-N is 301 s, while that of pMEIC is only 0.08 s. With |C1| fixed at 14 and |C0| varied (Fig. 5A), the best mining time is obtained with |C0|=3. With |C0| fixed at 2 and |C1| varied (Fig. 5B), the best mining time is obtained with |C1|=14.

Figures 6A and 6B show that pMEIC is faster than MEI-N and MEIC at every tested threshold for the Connect dataset. With |C1| fixed at 14 and |C0| varied (Fig. 7A), the pair |C1|=14 and |C0|=3 is the best in terms of mining time. With |C0| fixed at 2 and |C1| varied (Fig. 7B), the best mining time is obtained with |C1|=14.

Fig. 6. Mining time of (A) MEI-N, MEIC and pMEIC with (|C0|=1, |C1|=14); (B) MEI-N, MEIC and pMEIC with (|C0|=2, |C1|=18) for the Connect dataset.

Fig. 7. Mining time of (A) pMEIC with (|C0|=1, |C1|=14), (|C0|=2, |C1|=14) and (|C0|=3, |C1|=14); and (B) pMEIC with (|C0|=2, |C1|=14), (|C0|=2, |C1|=16) and (|C0|=2, |C1|=18) for the Connect dataset.

As seen in Figures 8, 10, 12 and 14, pMEIC is also faster than MEI-N and MEIC at every tested threshold for the Mushroom, T10I4D100K, Accidents and Pumsb datasets. Figures 9, 11, 13 and 15 show that the mining time is smallest when the gap |C1| - |C0| is smallest.

Fig. 8. Mining time of (A) MEI-N, MEIC and pMEIC with (|C0|=1, |C1|=14); (B) MEI-N, MEIC and pMEIC with (|C0|=2, |C1|=18) for the Mushroom dataset.

Fig. 9. Mining time of (A) pMEIC with (|C0|=1, |C1|=14), (|C0|=2, |C1|=14) and (|C0|=3, |C1|=14); and (B) pMEIC with (|C0|=2, |C1|=14), (|C0|=2, |C1|=16) and (|C0|=2, |C1|=18) for the Mushroom dataset.

Fig. 10. Mining time of (A) MEI-N, MEIC and pMEIC with (|C0|=1, |C1|=14); (B) MEI-N, MEIC and pMEIC with (|C0|=2, |C1|=18) for the T10I4D100K dataset.

Fig. 11. Mining time of (A) pMEIC with (|C0|=1, |C1|=14), (|C0|=2, |C1|=14) and (|C0|=3, |C1|=14); and (B) pMEIC with (|C0|=2, |C1|=14), (|C0|=2, |C1|=16) and (|C0|=2, |C1|=18) for the T10I4D100K dataset.

Fig. 12. Mining time of (A) MEI-N, MEIC and pMEIC with (|C0|=1, |C1|=14); (B) MEI-N, MEIC and pMEIC with (|C0|=2, |C1|=18) for the Accidents dataset.

Fig. 13. Mining time of (A) pMEIC with (|C0|=1, |C1|=14), (|C0|=2, |C1|=14) and (|C0|=3, |C1|=14); and (B) pMEIC with (|C0|=2, |C1|=14), (|C0|=2, |C1|=16) and (|C0|=2, |C1|=18) for the Accidents dataset.

Fig. 14. Mining time of (A) MEI-N, MEIC and pMEIC with (|C0|=1, |C1|=14); (B) MEI-N, MEIC and pMEIC with (|C0|=2, |C1|=18) for the Pumsb dataset.

Fig. 15. Mining time of (A) pMEIC with (|C0|=1, |C1|=14), (|C0|=2, |C1|=14) and (|C0|=3, |C1|=14); and (B) pMEIC with (|C0|=2, |C1|=14), (|C0|=2, |C1|=16) and (|C0|=2, |C1|=18) for the Pumsb dataset.

6.2 Memory usage

The memory usage of MEI-N and MEIC is the same for all datasets. Therefore, in this section, we only
compare the memory usage of the MEIC and pMEIC algorithms. The memory usage is determined by summing the memory used for all the information in the search tree. Figures 16-21 show that pMEIC uses less memory than MEIC for Chess, Connect, Mushroom, T10I4D100K, Accidents and Pumsb, respectively.

Fig. 16. Memory usage of (a) MEIC and pMEIC with (|C0|=1, |C1|=14); (b) MEIC and pMEIC with (|C0|=2, |C1|=18) for the Chess dataset.

Fig. 17. Memory usage of (a) MEIC and pMEIC with (|C0|=1, |C1|=14); (b) MEIC and pMEIC with (|C0|=2, |C1|=18) for the Connect dataset.

Fig. 18. Memory usage of (a) MEIC and pMEIC with (|C0|=1, |C1|=14); (b) MEIC and pMEIC with (|C0|=2, |C1|=18) for the Mushroom dataset.

Fig. 19. Memory usage of (a) MEIC and pMEIC with (|C0|=1, |C1|=14); (b) MEIC and pMEIC with (|C0|=2, |C1|=18) for the T10I4D100K dataset.

Fig. 20. Memory usage of (a) MEIC and
pMEIC with (|C0|=1, |C1|=14); (b) MEIC and pMEIC with (|C0|=2, |C1|=18) for the Accidents dataset.

Fig. 21. Memory usage of (a) MEIC and pMEIC with (|C0|=1, |C1|=14); (b) MEIC and pMEIC with (|C0|=2, |C1|=18) for the Pumsb dataset.

7 Conclusions and future studies

This study presented the problem of mining EIs with subset and superset itemset constraints (C0 ⊆ X ⊆ C1). First, based on the MEI algorithm, we presented the MEIC algorithm, in which each EI is checked against the constraints before being added to the results. Second, two propositions for quick pruning of the nodes that do not satisfy the constraints were developed. Using these, the pMEIC algorithm was proposed for mining EIs with a subset and superset itemset constraint. The experimental results show that pMEIC outperforms MEIC in terms of mining time and memory usage.

In future studies, it is worth examining further issues related to EIs, such as mining EIs from huge datasets, mining top-rank-k ECIs, and mining maximal EIs. Moreover, one could also study mining EIs involving other types of constraints, as well as mining erasable closed itemsets with constraints.

Acknowledgments

This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05-2015.10.

References

Agrawal R., Imielinski T., Swami A.N.: Mining association rules between sets of items in large databases. In SIGMOD KDD'93, 207-216, 1993.
Agarwal V., Bharadwaj K.K.: Predicting the dynamics of social circles in ego networks using pattern analysis and GA K-means clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(3), 113-141, 2015.
Cheng P., Lee I., Lin C.W., Pan J.S.: Association rule hiding based on evolutionary multi-objective optimization. Intelligent Data Analysis, 20(3), 495-514, 2016.
Das C., Maji P.: Possibilistic biclustering algorithm for discovering value-coherent overlapping δ-biclusters. International Journal of Machine Learning and Cybernetics, 6(1), 95-107, 2015.
Deng Z.H.: DiffNodesets: An efficient structure for fast mining frequent itemsets. Applied Soft Computing, 41, 214-223, 2016.
Deng Z.H.: Mining top-rank-k erasable itemsets by PID_lists. International Journal of Intelligent Systems, 28(4), 366-379, 2013.
Deng Z.H., Fang G., Wang Z., Xu X.: Mining erasable itemsets. In ICMLC'09, pp. 67-73, 2009.
Deng Z.H., Xu X.R.: Fast mining erasable itemsets using NC_sets. Expert Systems with Applications, 39(4), 4453-4463, 2012.
Khader N., Lashier A., Yoon S.W.: Pharmacy robotic dispensing and planogram analysis using association rule mining with prescription data. Expert Systems with Applications, 57, 296-310, 2016.
Han J., Pei J., Yin Y.: Mining frequent patterns without candidate generation. In SIGMOD KDD'00, 1-12, 2000.
Jia W., Zhao D., Shen T., Ding S., Zhao Y., Hu C.: An optimized classification algorithm by BP neural network based on PLS and HCA. Applied Intelligence, 43(1), 176-191, 2015.
Le T., Vo B.: MEI: An efficient algorithm for mining erasable itemsets. Engineering Applications of Artificial Intelligence, 27, 155-166, 2014.
Le T., Vo B., Nguyen G.: A survey of erasable itemset mining algorithms. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5), 356-379, 2014.
Lee G., Yun U., Ryang H.: Mining weighted erasable patterns by using underestimated constraint-based pruning technique. Journal of Intelligent and Fuzzy Systems, 28(3), 1145-1157, 2015.
Lin C.W., Gan W., Fournier-Viger P., Hong T.P., Tseng V.S.: Weighted frequent itemset mining over uncertain databases. Applied Intelligence, 44(1), 232-250, 2016.
Nanda S.J., Panda G.: Design of computationally efficient density-based clustering algorithms. Data & Knowledge Engineering, 95, 23-38, 2015.
Nguyen G., Le T., Vo B., Le B.: A new approach for mining top-rank-k erasable itemsets. In ACIIDS'14, pp. 73-82, 2014.
Nguyen G., Le T., Vo B., Le B.: Discovering erasable closed patterns. In ACIIDS'15, pp. 368-376, 2015.
Nguyen G., Le T., Vo B., Le B.: EIFDD: An efficient approach for erasable itemset mining of very dense datasets. Applied Intelligence, 43(1), 85-94, 2015.
Parkinson S., Somaraki V., Ward R.: Auditing file system permissions using association rule mining. Expert Systems with Applications, 55, 274-283, 2016.
Sahoo J., Das A.K., Goswami A.: An efficient approach for mining association rules from high utility itemsets. Expert Systems with Applications, 42(13), 5754-5778, 2015.
Sun L., Mu W., Qi B., Zhou Z.: A new privacy-preserving proximal support vector machine for classification of vertically partitioned data. International Journal of Machine Learning and Cybernetics, 6(1), 109-118, 2015.
Vo B., Le T., Coenen F., Hong T.P.: Mining frequent itemsets using the N-list and subsume concepts. International Journal of Machine Learning and Cybernetics, 7(2), 253-265, 2016.
Wang G., Song Q., Zhu X.: An improved data characterization method and its application in classification algorithm recommendation. Applied Intelligence, 43(4), 892-912, 2015.
Yun U., Lee G.: Sliding window based weighted erasable stream pattern mining for stream data applications. Future Generation Computer Systems, 59, 1-20, 2016.
Zaki M.J., Hsiao C.J.: Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Transactions on Knowledge and Data Engineering, 17(4), 462-478, 2005.
underlying concept of EI PT mining and the problem statement of mining erasable itemsets with subset and superset CE itemset constraints Section reviews related works on EI mining In Section 4, a... (k-1) -itemset ED X is checked with all the remaining erasable (k-1) -itemsets for combination to generate candidate erasable k -itemsets Only a small number of the remaining erasable (k-1)-itemsets
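To make the constraint check C0 ⊆ X ⊆ C1 concrete, the following minimal Python sketch mines erasable itemsets from a toy product database and then keeps only those that satisfy the constraints, in the spirit of the MEIC post-check. It rests on several assumptions that are not taken from this section: the usual definition from the erasable itemset literature (the gain of X is the total profit of the products that use at least one item of X, and X is erasable when its gain does not exceed the threshold times the total profit), a made-up four-product dataset, and hypothetical function names. The enumeration is brute force; it is neither MEI nor pMEIC and does not implement the two pruning propositions.

```python
from itertools import combinations

# Hypothetical toy product database: one (items, profit) pair per product.
PRODUCTS = [
    ({"a", "b"}, 100),
    ({"b", "c"}, 200),
    ({"c", "d"}, 300),
    ({"a", "d"}, 400),
]


def gain(itemset, products):
    """Total profit of the products that use at least one item of `itemset`.

    Assumed definition, taken from the general erasable itemset literature.
    """
    return sum(profit for items, profit in products if items & itemset)


def erasable_itemsets(products, threshold):
    """Enumerate every erasable itemset by brute force (illustration only)."""
    total = sum(profit for _, profit in products)
    universe = sorted(set().union(*(items for items, _ in products)))
    result = []
    for k in range(1, len(universe) + 1):
        for combo in combinations(universe, k):
            x = frozenset(combo)
            if gain(x, products) <= threshold * total:
                result.append(x)
    return result


def satisfies(x, c0, c1):
    """Subset and superset constraint check: C0 is in X and X is in C1."""
    return c0 <= x <= c1


def constrained_erasable_itemsets(products, threshold, c0, c1):
    """Mine all EIs first, then check each one against the constraints."""
    return [
        x for x in erasable_itemsets(products, threshold)
        if satisfies(x, c0, c1)
    ]


if __name__ == "__main__":
    for x in constrained_erasable_itemsets(
        PRODUCTS, 0.75, {"b"}, {"a", "b", "c"}
    ):
        print(sorted(x))
```

With a threshold of 0.75 on this toy data, six itemsets are erasable, and three of them ({b}, {a, b} and {b, c}) satisfy C0 = {b} and C1 = {a, b, c}. Filtering after mining is the simplest design but still explores itemsets that can never satisfy the constraints, which is the inefficiency that pruning during the search is meant to remove.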

Date posted: 12/12/2017, 07:52