A Fast Algorithm for Classification Based on Association Rules

2012 IEEE International Conference on Granular Computing

Loan T.T. Nguyen, Faculty of Information Technology, VOV Broadcasting College II, Ho Chi Minh City, Viet Nam, nguyenthithuyloan@vov.org.vn
Bay Vo, Information Technology College, Ho Chi Minh City, Viet Nam, vdbay@itc.edu.vn
Tzung-Pei Hong, Department of CSIE, National University of Kaohsiung, Kaohsiung City, Taiwan, R.O.C., tphong@nuk.edu.tw
Hoang Chi Thanh, Department of Informatics, Ha Noi University of Science, Ha Noi, Viet Nam, thanhhc@vnu.vn

Abstract — In this paper, we propose a new method for mining class-association rules (CARs) using a tree structure. Firstly, we design a tree structure for storing the frequent itemsets of a dataset. Some theorems for pruning nodes and for computing the information in the tree are then developed, and based on them we propose an efficient algorithm for mining CARs. Experimental results show that our approach is more efficient than previous ones.

Keywords — accuracy, classification, class-association rules, data mining, tree structure

I. INTRODUCTION

Many methods for mining classification rules, such as C4.5 and ILA, have been developed in recent years. These methods, however, rely on heuristic and greedy approaches and generate rule sets that are either too general or too overfitted for a given dataset, so they often yield high error ratios. Recently, a new classification method from data mining, called Classification Based on Associations (CBA), has been proposed for mining class-association rules (CARs). It has advantages over the heuristic and greedy methods in that it can easily remove noise, and its accuracy is therefore higher. It can additionally generate a rule set that is more complete than those of C4.5 and ILA. Several algorithms for mining classification rules based on association-rule mining have since been proposed, including CPAR [17], CMAR [4], CBA [6-7], MMAC [10], MCAR [11], ACME [12], Noah [1], LOCA and PLOCA [8], and the Equivalence Class Rule-tree (ECR-tree) approach [16]. Some researchers have also reported that classifiers based on class-association rules are more accurate than those of traditional methods such as C4.5 [9] and ILA [13-14], both theoretically [15] and experimentally [6].

All of the above methods focus on designing algorithms for mining CARs or building classifiers, but say little about mining time. Vo and Le [16] proposed a method for mining CARs using an ECR-tree, together with an efficient algorithm named ECR-CARM. ECR-CARM scans the dataset only once and uses object identifiers to determine the supports of itemsets quickly, applying a tree structure for fast mining of CARs. It was, however, time-consuming in generating and testing candidates, because all values of the same attribute were grouped into one node of the tree. In this paper, we improve on this method by modifying the tree structure so that each node contains a single attribute value instead of a group of values. Some theorems are also derived; based on the tree and these theorems, we propose an algorithm for mining CARs efficiently.

II. PRELIMINARY CONCEPTS

Let D be a set of training data with n attributes A1, A2, ..., An and |D| objects (cases). Let C = {c1, c2, ..., ck} be the list of class labels. A specific value of attribute Ai and a specific class label are denoted by the lowercase letters a and c, respectively.

Definition 1: An itemset is a set of pairs, each consisting of an attribute and a specific value, denoted {(Ai1, ai1), (Ai2, ai2), ..., (Aim, aim)}.

Definition 2: A class-association rule r has the form {(Ai1, ai1), ..., (Aim, aim)} → c, where {(Ai1, ai1), ..., (Aim, aim)} is an itemset and c ∈ C is a class label.

Definition 3: The actual occurrence ActOcc(r) of a rule r in D is the number of rows of D that match r's condition.

Definition 4: The support of r, denoted Sup(r), is the number of rows of D that match r's condition and belong to r's class.

For example, consider r = {(A, a1)} → y on the dataset in Table 1. We have ActOcc(r) = 3 and Sup(r) = 2, because three objects have A = a1 and two of them have class y.

Table 1. An example of a training dataset

  OID   A    B    C    class
   1    a1   b1   c1   y
   2    a1   b2   c1   n
   3    a2   b2   c1   n
   4    a3   b3   c1   y
   5    a3   b1   c2   n
   6    a3   b3   c1   y
   7    a1   b3   c2   y
   8    a2   b2   c2   n
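As a concrete illustration of Definitions 3 and 4, the following sketch encodes the Table 1 dataset and computes ActOcc and Sup for the example rule above. It is a minimal Python rendering of ours (the authors' own implementation is in C#; see Section IV), and the helper names act_occ and sup are our own.

```python
# Each row of Table 1: (OID, {attribute: value}, class label).
DATASET = [
    (1, {"A": "a1", "B": "b1", "C": "c1"}, "y"),
    (2, {"A": "a1", "B": "b2", "C": "c1"}, "n"),
    (3, {"A": "a2", "B": "b2", "C": "c1"}, "n"),
    (4, {"A": "a3", "B": "b3", "C": "c1"}, "y"),
    (5, {"A": "a3", "B": "b1", "C": "c2"}, "n"),
    (6, {"A": "a3", "B": "b3", "C": "c1"}, "y"),
    (7, {"A": "a1", "B": "b3", "C": "c2"}, "y"),
    (8, {"A": "a2", "B": "b2", "C": "c2"}, "n"),
]

def act_occ(itemset, dataset):
    """Definition 3: number of rows that match the rule's condition."""
    return sum(1 for _, attrs, _ in dataset
               if all(attrs.get(a) == v for a, v in itemset))

def sup(itemset, label, dataset):
    """Definition 4: number of matching rows that also carry the rule's class."""
    return sum(1 for _, attrs, cls in dataset
               if cls == label and all(attrs.get(a) == v for a, v in itemset))

# The paper's example rule r = {(A, a1)} -> y:
r_condition = [("A", "a1")]
print(act_occ(r_condition, DATASET))   # 3 (OIDs 1, 2, 7 have A = a1)
print(sup(r_condition, "y", DATASET))  # 2 (OIDs 1 and 7 also have class y)
```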
III. MINING CLASS-ASSOCIATION RULES

A. Tree structure

We modify the ECR-tree structure [16] into an MECR-tree structure (M stands for Modified). In the ECR-tree, all itemsets with the same attributes are arranged into one group, and itemsets in different groups are joined together; this consumes more time in generating and checking itemsets. In our work, each node in the tree contains one itemset along with the following information:

a) Obidset: the set of object identifiers of the objects that contain the itemset;
b) (c1, c2, ..., ck), where ci is the number of records in Obidset that belong to class ci; and
c) pos: the position of the class with the maximum count, i.e., pos = argmax{ci | i ∈ [1, k]}.

The ECR-tree does not store the ci and pos, so the algorithm must compute them for every node. In the MECR-tree, by the theorems presented in Section III.B, this information need not be computed for some nodes.

For example, consider the node containing the itemset X = {(A, a3), (B, b3)}. X is contained in objects 4 and 6, all of which belong to class y. We therefore have a node in the tree written as {(A, a3), (B, b3)} with Obidset 46 and counts (2, 0), or more compactly as 3 × a3b3, 46(2, 0). Its pos is 1 (marked by underlining in the paper's notation) because the count of class y is the maximum (2 as compared to 0). The compact form saves memory when the tree is used to store itemsets: the itemset's attributes are stored in a bit representation, so the attribute pair AB becomes 11 in binary and is stored as the value 3. With this representation, bitwise operations make the joining of itemsets faster.
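The node layout just described can be sketched as follows. This is an illustrative Python structure of ours, not the authors' code; the names MECRNode and ATTR_BIT are assumptions, and pos is 0-based here while the paper counts positions from 1.

```python
from dataclasses import dataclass

ATTR_BIT = {"A": 1, "B": 2, "C": 4}  # A = 001, B = 010, C = 100 in binary

@dataclass
class MECRNode:
    att: int            # bitwise OR of the itemset's attributes (AB -> 1 | 2 = 3)
    values: tuple       # the attribute values, e.g. ("a3", "b3")
    obidset: frozenset  # Obidset: OIDs of the objects containing the itemset
    count: list         # count[i] = number of objects in obidset with class i
    pos: int = 0        # index of the majority class (0-based here)

    def finalize(self):
        """Set pos = argmax over the class counts."""
        self.pos = max(range(len(self.count)), key=self.count.__getitem__)

# The paper's example node {(A, a3), (B, b3)}: objects 4 and 6, both class y,
# written compactly as 3 x a3b3, 46(2, 0).
node = MECRNode(att=ATTR_BIT["A"] | ATTR_BIT["B"],  # 11 in binary, i.e. 3
                values=("a3", "b3"),
                obidset=frozenset({4, 6}),
                count=[2, 0])                       # (class y, class n)
node.finalize()
print(node.att, node.pos)  # 3 0 -> attributes AB, majority class y
```

Storing att as a bitmask means that the attribute-equality test of Theorem 1 below and the attribute union of a join are each a single bitwise operation, which is the speedup the paper attributes to this representation.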
B. Proposed algorithm

In this section, some theorems for fast mining of CARs are derived; based on them, we propose an efficient algorithm for mining CARs.

Theorem 1: Given two nodes (att1 × values1, Obidset1, (c11, ..., c1k)) and (att2 × values2, Obidset2, (c21, ..., c2k)), if att1 = att2 and values1 ≠ values2, then Obidset1 ∩ Obidset2 = ∅.

Proof: Since att1 = att2 and values1 ≠ values2, there exist val1 ∈ values1 and val2 ∈ values2 such that val1 and val2 have the same attribute but different values. A record that contains val1 therefore cannot contain val2, so for every OID ∈ Obidset1 it can be inferred that OID ∉ Obidset2. Thus Obidset1 ∩ Obidset2 = ∅.

In this theorem, the itemset is split into the form att × values for ease of use. Theorem 1 implies that if two itemsets X and Y have the same attributes but different values, they need not be combined into the itemset XY, because Sup(XY) = 0. For example, consider the two nodes 1 × a1, 127(2, 1) and 1 × a2, 38(0, 2), in which Obidset({(A, a1)}) = 127 and Obidset({(A, a2)}) = 38: Obidset({(A, a1), (A, a2)}) = Obidset({(A, a1)}) ∩ Obidset({(A, a2)}) = ∅. Similarly, Obidset({(A, a1), (B, b1)}) = 1 and Obidset({(A, a1), (B, b2)}) = 2, and it can be inferred that Obidset({(A, a1), (B, b1)}) ∩ Obidset({(A, a1), (B, b2)}) = ∅, because the two itemsets have the same attributes AB but different values.

Theorem 2: Given two nodes (itemset1, Obidset1, (c11, ..., c1k)) and (itemset2, Obidset2, (c21, ..., c2k)), if itemset1 ⊂ itemset2 and |Obidset1| = |Obidset2|, then ∀i ∈ [1, k]: c1i = c2i.

Proof: Since itemset1 ⊂ itemset2, every record containing itemset2 also contains itemset1, and therefore Obidset2 ⊆ Obidset1. Together with |Obidset1| = |Obidset2|, this implies Obidset1 = Obidset2, and thus c1i = c2i for all i ∈ [1, k].

The proposed algorithm, CAR-Miner, works as follows. It considers each node Ii together with all the nodes Ij that follow it (Lines 4 and 5) to generate a candidate child node O. For each pair (Ii, Ij), the algorithm checks whether Ii.att ≠ Ij.att (Line 6, using Theorem 1). If the attributes differ, it computes the three elements att, values, and Obidset of the new node (Lines 7-9). Line 10 checks whether the number of object identifiers of Ii equals that of O (by Theorem 2); if it does, the algorithm copies all information from node Ii to node O (Lines 11-12). Similarly, if the test in Line 10 fails, the algorithm compares Ij with O, and if their numbers of object identifiers are equal (Line 13), it copies the information from node Ij to node O (Lines 14-15). Otherwise, the algorithm computes O.count from O.Obidset and then O.pos (Lines 17-18). After computing all of the information of node O, the algorithm adds O to Pi (initialized empty in Line 4) if O.count[O.pos] ≥ minSup (Lines 19-20). Finally, CAR-Miner is called recursively with the new set Pi as its input parameter (Line 21). The procedure ENUMERATE-CAR(l, minConf) generates a rule from node l: it first computes the confidence of the rule (Line 22) and, if the confidence satisfies minConf (Line 23), adds the rule to the set of CARs (Line 24).

Figure 1. The proposed algorithm for mining CARs
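The sketch below condenses this mining loop into Python. It is our reading of the Figure 1 pseudocode rather than a transcription: it applies Theorem 1 to skip joins of same-attribute nodes, Theorem 2 to copy class counts when Obidset sizes match, and the minSup filter before recursing. It reuses DATASET and MECRNode from the earlier sketches; CLASSES and BY_OID are our own helpers.

```python
CLASSES = ["y", "n"]                            # class order used in the counts
BY_OID = {oid: cls for oid, _, cls in DATASET}  # OID -> class label

def car_miner(level, min_sup, min_conf, rules):
    """Extend each node Ii with every node Ij after it, pruning via the theorems."""
    for i, li in enumerate(level):
        enumerate_car(li, min_conf, rules)        # emit Ii's rule (Lines 22-24)
        children = []                             # Pi, initially empty
        for lj in level[i + 1:]:
            if li.att == lj.att:                  # Theorem 1: same attributes but
                continue                          #   different values -> empty join
            obidset = li.obidset & lj.obidset
            values = tuple(dict.fromkeys(li.values + lj.values))
            node = MECRNode(li.att | lj.att, values, obidset, None)
            if len(obidset) == len(li.obidset):   # Theorem 2: copy counts from Ii
                node.count, node.pos = list(li.count), li.pos
            elif len(obidset) == len(lj.obidset): # Theorem 2: copy counts from Ij
                node.count, node.pos = list(lj.count), lj.pos
            else:                                 # otherwise count class by class
                node.count = [sum(BY_OID[o] == c for o in obidset) for c in CLASSES]
                node.finalize()
            if node.count[node.pos] >= min_sup:   # support check before adding to Pi
                children.append(node)
        car_miner(children, min_sup, min_conf, rules)

def enumerate_car(node, min_conf, rules):
    """ENUMERATE-CAR: emit the node's rule if its confidence reaches min_conf."""
    if not node.obidset:
        return
    conf = node.count[node.pos] / len(node.obidset)
    if conf >= min_conf:
        rules.append((node.values, CLASSES[node.pos], node.count[node.pos], conf))
```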
C. An example

Consider the dataset in Table 1 with minSup = 1 and minConf = 60%. The first level of the MECR-tree contains the nodes 1 × a1, 127(2, 1); 1 × a2, 38(0, 2); 1 × a3, 456(2, 1); 2 × b1, 15(1, 1); 2 × b2, 238(0, 3); 2 × b3, 467(3, 0); 4 × c1, 12346(3, 2); and 4 × c2, 578(1, 2). The processing of node Ii = 1 × a2, 38(0, 2) proceeds as follows:

• With node Ij = 2 × b2, 238(0, 3): because their attributes are different, the three elements are computed as O.att = Ii.att ∪ Ij.att = 1 ∪ 2 = 3, i.e., 11 in bit presentation; O.itemset = Ii.values ∪ Ij.values = a2 ∪ b2 = a2b2; and O.Obidset = Ii.Obidset ∩ Ij.Obidset = {3, 8} ∩ {2, 3, 8} = {3, 8}. Because |Ii.Obidset| = |O.Obidset|, the algorithm copies all information of Ii to O, i.e., O.count = Ii.count = (0, 2) and O.pos = 2. Because O.count[O.pos] = 2 ≥ minSup, O is added to Pi ⇒ Pi = {3 × a2b2, 38(0, 2)}.

• With node Ij = 2 × b3, 467(3, 0): because their attributes are different, the three elements are computed as O.att = 1 ∪ 2 = 3 (11 in bit presentation), O.itemset = a2b3, and O.Obidset = {3, 8} ∩ {4, 6, 7} = ∅. Because O.count[O.pos] = 0 < minSup, O is not added to Pi.

• With node Ij = 4 × c1, 12346(3, 2): because their attributes are different, the three elements are computed as O.att = 1 ∪ 4 = 5 (101 in bit presentation), O.itemset = a2c1, and O.Obidset = {3, 8} ∩ {1, 2, 3, 4, 6} = {3}. Neither Obidset size matches, so the algorithm computes the remaining information: O.count = (0, 1) and O.pos = 2. Because O.count[O.pos] = 1 ≥ minSup, O is added to Pi ⇒ Pi = {3 × a2b2, 38(0, 2); 5 × a2c1, 3(0, 1)}.

• With node Ij = 4 × c2, 578(1, 2): because their attributes are different, the three elements are computed as O.att = 1 ∪ 4 = 5 (101 in bit presentation), O.itemset = a2c2, and O.Obidset = {3, 8} ∩ {5, 7, 8} = {8}. The algorithm computes O.count = (0, 1) and O.pos = 2. Because O.count[O.pos] = 1 ≥ minSup, O is added to Pi ⇒ Pi = {3 × a2b2, 38(0, 2); 5 × a2c1, 3(0, 1); 5 × a2c2, 8(0, 1)}.

After Pi is created, CAR-Miner is called recursively with parameters Pi, minSup, and minConf to create all child nodes of Pi. Rules are generated in the same traversal step (Line 3) by calling the procedure ENUMERATE-CAR(Ii, minConf). For example, when traversing node Ii = 1 × a2, 38(0, 2), the procedure computes the confidence of the candidate rule as conf = Ii.count[Ii.pos] / |Ii.Obidset| = 2/2 = 1. Because conf ≥ minConf (60%), the rule {(A, a2)} → n, with (support, confidence) = (2, 1), is added to the rule set CARs. The meaning of this rule is "If A = a2 then class = n" (support = 2 and confidence = 100%).
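To close the loop on the example, this last sketch (again ours, under the same assumptions as above) builds the first level of the MECR-tree from Table 1 and runs the CAR-Miner sketch with minSup = 1 and minConf = 60%, reproducing the rule {(A, a2)} → n derived above.

```python
from collections import defaultdict

def build_first_level(dataset):
    """Group OIDs by (attribute, value) to form the single-item nodes."""
    groups = defaultdict(set)
    for oid, attrs, _ in dataset:
        for a, v in attrs.items():
            groups[(a, v)].add(oid)
    level = []
    for (a, v), obidset in sorted(groups.items()):
        node = MECRNode(ATTR_BIT[a], (v,), frozenset(obidset),
                        [sum(BY_OID[o] == c for o in obidset) for c in CLASSES])
        node.finalize()
        level.append(node)
    return level

rules = []
car_miner(build_first_level(DATASET), min_sup=1, min_conf=0.6, rules=rules)
for values, label, support, conf in rules:
    print(values, "->", label, f"(sup={support}, conf={conf:.0%})")
# The output includes ('a2',) -> n (sup=2, conf=100%), the rule derived above.
```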
IV. EXPERIMENTAL RESULTS

A. Characteristics of the experimental datasets

The algorithms used in the experiments were coded in C# 2008 and run on a personal computer with Windows 7, a Centrino 2 × 2.53 GHz CPU, and 4 GB of RAM. The experiments used datasets obtained from the UCI Machine Learning Repository (http://mlearn.ics.uci.edu). Table 2 shows the characteristics of the experimental datasets.

Table 2. The characteristics of the experimental datasets

  Dataset   #attrs   #classes   #distinct values   #objs
  Breast    12       2          737                699
  German    21       2          1077               1000
  Lymph     18       4          63                 148
  Led7      8        10         24                 3200
  Vehicle   19       4          1434               846

The experimental datasets have different features. The Breast, German, and Vehicle datasets have many attributes and distinct values but relatively few objects (records), whereas the Led7 dataset has only a few attributes and distinct values but many objects.

B. Numbers of rules of the experimental datasets

Table 3 shows the numbers of rules mined from the datasets in Table 2 for different minimum-support thresholds. We used minConf = 50% for all experiments.

Table 3. Number of rules for different minSups

  Dataset   minSup (%)   #rules
  Breast    1            5586
            0.5          9647
            0.3          33183
            0.1          496808
  German    4            41056
            3            77362
            2            182576
            1            752643
  Lymph     4            222044
            3            360375
            2            1346318
            1            4039186
  Led7      1            503
            0.5          886
            0.3          908
            0.1          1045
  Vehicle   0.8          825
            0.6          2240
            0.4          4487
            0.2          126221

The results in Table 3 show that some datasets produce very large numbers of rules. For example, the Lymph dataset has 4,039,186 rules with minSup = 1%, and the German dataset has 752,643 rules with minSup = 1%.

C. Execution time

Experiments were then conducted to compare the execution times of CAR-Miner and ECR-CARM [16]. The results are shown in Table 4.

Table 4. The execution time (in seconds) for different minSups

  Dataset   minSup (%)   ECR-CARM   CAR-Miner
  Breast    1            0.06       0.059
            0.5          0.088      0.079
            0.3          0.249      0.151
            0.1          17.136     1.517
  German    4            0.71       0.644
            3            1.069      0.988
            2            1.929      1.75
            1            5.751      4.823
  Lymph     4            1.568      1.28
            3            2.467      1.84
            2            7.815      5.603
            1            23.397     14.75
  Led7      1            0.064      0.058
            0.5          0.065      0.063
            0.3          0.066      0.064
            0.1          0.067      0.065
  Vehicle   0.8          0.16       0.144
            0.6          0.243      0.242
            0.4          0.345      0.31
            0.2          2.091      1.059

The results in Table 4 show that CAR-Miner is more efficient than ECR-CARM in all of the experiments. For example, on the Breast dataset with minSup = 0.1%, the mining time of CAR-Miner is 1.517 seconds, while that of ECR-CARM is 17.136 seconds.

V. CONCLUSIONS AND FUTURE WORK

This paper proposed a new algorithm for mining CARs using a tree structure. Each node in the tree contains information that allows the support of a candidate rule to be computed quickly; in particular, using Obidsets, the supports of itemsets can be obtained efficiently. Some theorems were also developed, and based on them the information of many nodes in the tree need not be computed at all. Mining itemsets from incremental databases has been developed in recent years [2-3, 5] and saves considerable time and memory compared with re-mining the integrated database. In the future, we will therefore study how to apply this incremental approach to mining CARs.

REFERENCES

[1] G. Giuffrida, W.W. Chu, D.M. Hanssens, "Mining classification rules from datasets with large number of many-valued attributes", The 7th International Conference on Extending Database Technology (EDBT'00), Munich, Germany, 2000, 335-349.
[2] T.P. Hong, C.Y. Wang, "An efficient and effective association-rule maintenance algorithm for record modification", Expert Systems with Applications, 37(1), 2010, 618-626.
[3] T.P. Hong, C.W. Lin, Y.L. Wu, "Maintenance of fast updated frequent pattern trees for record deletion", Computational Statistics and Data Analysis, 53(7), 2009, 2485-2499.
[4] W. Li, J. Han, J. Pei, "CMAR: Accurate and efficient classification based on multiple class-association rules", The 1st IEEE International Conference on Data Mining, San Jose, California, USA, 2001, 369-376.
[5] C.W. Lin, T.P. Hong, W.H. Lu, "The Pre-FUFP algorithm for incremental mining", Expert Systems with Applications, 36(5), 2009, 9498-9505.
[6] B. Liu, W. Hsu, Y. Ma, "Integrating classification and association rule mining", The 4th International Conference on Knowledge Discovery and Data Mining, New York, USA, 1998, 80-86.
[7] B. Liu, Y. Ma, C.K. Wong, "Improving an association rule based classifier", The 4th European Conference on Principles of Data Mining and Knowledge Discovery, Lyon, France, 2000, 80-86.
[8] L.T.T. Nguyen, B. Vo, T.P. Hong, H.C. Thanh, "Classification based on association rules: A lattice-based approach", Expert Systems with Applications, 39(13), 2012, 11357-11366.
[9] J.R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1992.
[10] F. Thabtah, P. Cowling, Y. Peng, "MMAC: A new multi-class, multi-label associative classification approach", The 4th IEEE International Conference on Data Mining, Brighton, UK, 2004, 217-224.
[11] F. Thabtah, P. Cowling, Y. Peng, "MCAR: Multi-class classification based on association rule", The 3rd ACS/IEEE International Conference on Computer Systems and Applications, Tunis, Tunisia, 2005, 33-39.
[12] R. Thonangi, V. Pudi, "ACME: An associative classifier based on maximum entropy principle", The 16th International Conference on Algorithmic Learning Theory, LNAI 3734, Singapore, 2005, 122-134.
[13] M.R. Tolun, S.M. Abu-Soud, "ILA: An inductive learning algorithm for production rule discovery", Expert Systems with Applications, 14(3), 1998, 361-370.
[14] M.R. Tolun, H. Sever, M. Uludag, S.M. Abu-Soud, "ILA-2: An inductive learning algorithm for knowledge discovery", Cybernetics and Systems, 30(7), 1999, 609-628.
[15] A. Veloso, W. Meira Jr., M.J. Zaki, "Lazy associative classification", The 2006 IEEE International Conference on Data Mining (ICDM'06), Hong Kong, China, 2006, 645-654.
[16] B. Vo, B. Le, "A novel classification algorithm based on association rules mining", The 2008 Pacific Rim Knowledge Acquisition Workshop (held with PRICAI'08), LNAI 5465, Ha Noi, Viet Nam, 2008, 61-75.
[17] X. Yin, J. Han, "CPAR: Classification based on predictive association rules", SIAM International Conference on Data Mining (SDM'03), San Francisco, CA, USA, 2003, 331-335.
