A CLASSIFICATION ALGORITHM BASED ON ASSOCIATION RULE MINING


2012 International Conference on Computer Science and Service System

A Classification Algorithm Based on Association Rule Mining

Yang Junrui, Xu Lisha, He Hongde
College of Computer Science and Technology, Xi'an University of Science and Technology, Xi'an, China
E-mail: Yangjunrui66@sina.com; beizhan09@163.com; laohu0526@126.com

978-0-7695-4719-0/12 $26.00 © 2012 IEEE. DOI 10.1109/CSSS.2012.511

Abstract—The main differences among associative classification algorithms lie in how they mine frequent item sets, analyze the exported rules, and use the rules for classification. This paper presents an associative classification algorithm based on the Trie-tree, named CARPT, which directly removes frequent items that cannot generate frequent rules by adding counts of the class labels. The database is compressed into a two-dimensional array in vertical data format, which significantly reduces the number of database scans and makes it convenient to count the support of candidate sets, so time and space are saved effectively. Experimental results show that the algorithm is feasible and effective.

Keywords—data mining; classification; associative classification algorithm; Trie-tree

I. INTRODUCTION

Classification rule mining and association rule mining are two important areas of data mining [1]. The classic associative classification algorithm based on class association rules, CBA [2], which integrated these two mining techniques, was presented by Bing Liu of the National University of Singapore at the International Conference on Knowledge Discovery in Databases (KDD) held in New York in 1998, opening the prelude to associative classification. The good classification accuracy of associative classification has since been confirmed by numerous studies and experiments over the past ten years.

The earliest associative classification algorithm, CBA, generates classification association rules with an iterative method similar to the Apriori algorithm [3]. To generate and test longer item sets, the database must be scanned many times, so the number of rules grows exponentially and considerable system resources are consumed. For rules with the same support and confidence, CBA sorts and selects among them randomly, which reduces classification accuracy in many cases. The IRARC algorithm presented in [4] introduces the importance of attributes from rough set theory and improves on the randomness of CBA's rule selection.

In [5], the CMAR algorithm was built on CBA using a variant of the FP-Growth method [6]. CMAR finds frequent patterns and generates classification association rules simultaneously, uses a weighted χ² measure of rule strength to classify a new instance, and thereby overcomes the bias of relying on a single rule. It greatly improves efficiency by using the CR-tree, a highly compressed tree, to store, retrieve, and prune the classification rules. However, CMAR does not take full advantage of the characteristics of classification, and its FP-tree contains many redundant nodes.

The Trie, also known as a dictionary tree or word search tree, is a variant of the hash tree. This tree structure is typically (but not exclusively) used to store large numbers of strings. Its core idea is to use the common prefixes of strings to reduce query cost and improve efficiency. This paper presents a classification algorithm based on a Trie-tree of association rules, named CARPT, which effectively reduces the number of scans during the association rule mining stage by changing the data storage structure and storage manner. It directly removes frequent items that cannot generate frequent rules by adding counts of the class labels during Trie-tree construction, so time and space are saved effectively. The algorithm also draws on the pruning idea of the CDDP algorithm [7] to reduce the number of candidate frequent item sets once again.

II. THEORIES AND DEFINITIONS

A. Association Rule Mining

Association rule mining in a transaction database can be described as follows. Let I = {i1, i2, ..., im} be a collection of items and D = {t1, t2, ..., tn} a transaction database consisting of a series of transactions, each with a unique identifier TID; each transaction ti (i = 1, 2, ..., n) corresponds to a subset of I. Each ik (k = 1, 2, ..., m) is an "attribute–value" pair, known as a data item (item). A collection of items from I is called an item set for short; an item set containing k items is a k-itemset.

Definition 1. An association rule expresses a relationship between item sets, written X => Y, where X and Y are item sets; X is called the rule antecedent and Y the rule consequent.

Definition 2. Let I1 ⊆ I. The support of itemset I1 in data set D is the percentage of transactions in D that contain I1, namely Support(I1) = |{t ∈ D : I1 ⊆ t}| / |D|.

Definition 3. An association rule I1 => I2 defined on I and D must meet a certain degree of confidence. The confidence of a rule is the ratio of the number of transactions containing both I1 and I2 to the number of transactions containing I1, namely Confidence(I1 => I2) = Support(I1 ∪ I2) / Support(I1), where I1, I2 ⊆ I and I1 ∩ I2 = ∅.

Definition 4. The frequent item sets defined on I and D are all item sets whose support satisfies the user-specified minimum support (Minsupport), that is, the non-empty subsets of I with support greater than or equal to Minsupport.

Theorem 1. For a given database D, let minsup stand for the minimum support and let i be a frequent item. If the rule R: i → c is not frequent for any class label c, then no frequent rule in D includes the frequent item i.

Proof. Assume R′: I → c is any frequent rule of D. If I contains only one item, then R′ is a single-item rule; since R: i → c is not frequent for any class label, R′ cannot include item i. If I contains several items and item i is included in I, then for R′: I → c there must exist a sub-rule R: i → c. Since R is a sub-rule of R′, we have R′.count ≤ R.count, hence R′.sup ≤ R.sup < minsup, which contradicts the assumption that R′ is a frequent rule. Thus R′ cannot include item i, and the theorem is established.

B. Associative Classification

Definition 5. Let C = {c1, c2, ..., cm}, where each ci (i = 1, 2, ..., m) is a value of the category attribute; these values are called class labels.

Definition 6. Mining association rules whose consequents are class labels, using an association rule mining algorithm, is known as associative classification.

Associative classification is essentially classification based on association rules: it reflects both the applied character of knowledge (classification or prediction) and the inherent associative character of knowledge [8]. Associative classification in data mining consists of the following four steps:
1) Attributes may be discrete or continuous; a continuous attribute is discretized first.
2) Mine all possible rules (PR) that are frequent and accurate using a standard association rule mining algorithm such as Apriori or FP-Growth; the frequent rule item sets that meet the minimum confidence constitute the set of class association rules (CARs).
3) Construct a classifier based on the mined class association rules.
4) Classify data of unknown category using the classifier.

C. Trie-tree

Amir et al. used a Trie-tree to mine association rules in [9]. A Trie-tree can be defined as follows. Let S = {s1, s2, ..., sn} be a collection of strings defined on a character set Σ. Every non-terminal node except the root is represented by a character of Σ, and each leaf node corresponds to the string formed by concatenating the characters on the path from the root to that leaf. In most references the characters in each node are known as buckets; for example, in a node ABCD, A is a bucket and B, C, D are each buckets, and the path from each bucket back to the root node stands for a frequent itemset.

Property of the Trie-tree: if a sub-tree takes a non-frequent bucket as its root, then all buckets of that sub-tree are non-frequent.

Proof. By the Apriori property, any superset of a non-frequent item set is non-frequent, and the itemset represented by a bucket is exactly a superset of the itemset represented by its parent bucket on the path in the Trie-tree.

The associative classification algorithm CARPT proposed in this paper is based on the Trie-tree. It reduces the number of scans during the association rule mining stage by changing the data storage structure and storage manner, and removes frequent items that cannot generate frequent rules directly by adding class-label counts during Trie-tree construction, so as to improve the efficiency of the algorithm. So, how to construct a Trie-tree?
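Before the formal construction, the class-label-counting idea can be illustrated with a minimal Python sketch. This is not the authors' implementation; the data layout, variable names, and pruning test are assumptions, applied to the example training set given in TABLE I below (minimum support 2). An item is kept only if it is frequent and at least one rule item → class can be frequent, per Theorem 1; the kept items are then stored as a vertical bitmap.

```python
from collections import Counter, defaultdict

# Training data from TABLE I: (items, class label) per transaction.
D = [({"a", "c", "f", "i"}, "A"),
     ({"a", "d", "f", "j"}, "B"),
     ({"b", "e", "g", "k"}, "A"),
     ({"a", "d", "h", "k"}, "C"),
     ({"a", "d", "f", "k"}, "C")]
MINSUP = 2

# One scan: count each item's overall support and its support per class.
support = Counter()
class_count = defaultdict(Counter)
for items, label in D:
    for item in items:
        support[item] += 1
        class_count[item][label] += 1

# Theorem 1 pruning: keep an item only if it is frequent AND some rule
# item -> class can be frequent (some class count reaches MINSUP).
F = sorted(i for i in support
           if support[i] >= MINSUP
           and max(class_count[i].values()) >= MINSUP)
print(F)  # ['a', 'd', 'k'] -- f occurs 3 times but is pruned

# Vertical bitmap (two-dimensional array): one bit column per kept item,
# one row position per transaction.
bitmap = {i: [1 if i in items else 0 for items, _ in D] for i in F}

# Support of a candidate itemset is the popcount of the AND of columns,
# with no further scan of D.
print(sum(x & y for x, y in zip(bitmap["a"], bitmap["d"])))  # 3
```

The vertical layout is what makes candidate counting cheap: each longer itemset's support comes from bitwise AND of the columns already in memory, instead of another pass over the database.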
First, find all the frequent 1-itemsets as the first layer of buckets in the Trie-tree, and arrange them in a certain order. Let the set of frequent 1-itemsets be I = {i1, i2, ..., in} with order i1, i2, ..., in, so that in is the rightmost bucket of the first layer of the Trie-tree; from the following property, the frequent item in, being last in the order, has no child node.

Property: in this order, there cannot be a frequent itemset containing two or more items that takes in as a prefix; more generally, when p > q, the frequent item iq cannot take ip as a prefix.

Accordingly, the construction process of the Trie-tree can be described simply as follows. Initialize the Trie-tree so that it contains only the bucket in of the first layer. Add the second first-layer bucket in-1 to the Trie-tree, appending the sub-tree rooted at in after in-1, and at the same time cut off all non-frequent non-empty subsets of the sub-tree rooted at in-1. Similarly add the third first-layer bucket in-2, and so on, until the nth first-layer bucket i1 is added; the result is a Trie-tree containing all the frequent items.

III. ALGORITHM CARPT

A description of the data structures involved in CARPT has been given above; we now introduce the general process of the algorithm proposed in this paper, named the classification algorithm based on a Trie-tree of association rules (CARPT). Preprocessing, namely discretization of continuous attributes and determination of the frequent items, should be completed before the algorithm starts. The training dataset D in TABLE I is used as an example, with minimum support = 2 and minimum confidence = 60%.

TABLE I. THE TRAINING DATASET D
TID | Items      | Class
1   | a, c, f, i | A
2   | a, d, f, j | B
3   | b, e, g, k | A
4   | a, d, h, k | C
5   | a, d, f, k | C

Scan the database D once and count the support of each item; the frequent 1-itemsets that meet the minimum support threshold are F = {a, d, f, k}. Database D can then be described as the two-dimensional array shown in TABLE II, in which the horizontal positions indicate the items and class attributes and the vertical positions indicate the transaction numbers.

TABLE II. VERTICAL BITMAP OF THE TWO-DIMENSIONAL ARRAY FOR DATABASE D
TID | a d f k | A B C
1   | 1 0 1 0 | 1 0 0
2   | 1 1 1 0 | 0 1 0
3   | 0 0 0 1 | 1 0 0
4   | 1 1 0 1 | 0 0 1
5   | 1 1 1 1 | 0 0 1

According to the definition and construction method of the Trie-tree, and drawing on the pruning ideology of the CDDP algorithm, we obtain from TABLE II the Trie-tree shown in Figure 1.

Figure 1. Trie-tree.

The next step is to export from the Trie-tree the association rules that meet the given minimum confidence and take class labels as rule consequents. In fact, although f is in the frequent itemset F, careful study shows that no rule containing f appears in the class association rules finally mined: every rule containing f fails to meet the minimum confidence. So item f can be removed directly. According to Theorem 1, the frequent item f, which cannot generate frequent rules, is removed directly when the database is transformed into the vertical bitmap of the two-dimensional array; this improves the construction of the Trie-tree and reduces the number of its nodes.

Scan the database D, recording the support of each item together with the support of the class labels to which the item corresponds. The results are: a:4 (A:1, B:1, C:2); d:3 (A:0, B:1, C:2); f:3 (A:1, B:1, C:1); k:3 (A:1, B:0, C:2). Selecting the items whose support and corresponding class-label support both reach the given minimum support gives F = {a, d, k}. The resulting improved vertical bitmap of the two-dimensional array is shown in TABLE III.

TABLE III. THE IMPROVED VERTICAL BITMAP OF THE TWO-DIMENSIONAL ARRAY
TID | a d k | A B C
1   | 1 0 0 | 1 0 0
2   | 1 1 0 | 0 1 0
3   | 0 0 1 | 1 0 0
4   | 1 1 1 | 0 0 1
5   | 1 1 1 | 0 0 1

The reconstructed Trie-tree is shown in Figure 2.

Figure 2. Trie-tree after improvement.

Comparing Figure 1 and Figure 2, the Trie-tree in Figure 1 has 11 nodes, while the one in Figure 2 has fewer. The number of Trie-tree nodes is thus reduced after adding the class-label counts; storage space is effectively saved and the generation efficiency of the Trie-tree is improved. In addition, according to Theorem 1, the cropped Trie-tree contains only items that can generate frequent rules. Therefore, if a class label is itself non-frequent, it will not be included in the tree, since it cannot generate frequent rules; this case has no universal significance and has little effect on classification accuracy, so it can be ignored.

IV. ALGORITHM TESTING AND ANALYSIS

To test the performance of CARPT, we compared it with CBA and CMAR. The experiments use datasets from the UCI machine learning repository [10]; before the experiments started, the datasets were discretized with Weka. Let minsupport = 1%, support error threshold = 0.01, minconfidence = 50%, and confidence error threshold = 20%. The accuracy results are shown in Figure 3.

Figure 3. Comparison of classification accuracy of CBA, CMAR, and CARPT on the Auto, Hypo, Iono, Sick, Sonar, and Vehicle datasets.

In addition, we tested the memory usage of CMAR and CARPT; the results are shown in Figure 4, with the datasets arranged from left to right by size (Auto, Sonar, Iono, Vehicle, Sick, Hypo).

Figure 4. Comparison of memory usage (MB) of CMAR and CARPT.

From the results above, we can easily see that the classification accuracy of CARPT is improved, and the efficiency of the algorithm is also improved, after adding the class-label counts, using the Trie-tree storage structure, and applying the dynamic pruning strategy. Compared with CMAR, CARPT effectively reduces memory usage, and the effect on large datasets is relatively significant.

V. CONCLUSIONS

This paper presents a classification algorithm of association rules based on the Trie-tree, named CARPT. The algorithm removes frequent items that cannot generate frequent rules, improving efficiency, by adding the support counts of the class labels; it reduces the number of database scans by using a two-dimensional array in vertical data format to compress database storage; and it adds a pruning strategy to the Trie-tree construction process. All of these save time and space effectively. The experimental results show that the algorithm is feasible and effective.

REFERENCES
[1] Fan M, Meng X. Data Mining Concepts and Techniques. Beijing: Mechanical Industry Press, 2001.
[2] Liu B, Hsu W, Ma Y. Integrating classification and association rule mining. In: Proc. of KDD, New York, 1998, pp. 80-86.
[3] Agrawal R, Srikant R. Fast algorithms for mining association rules. In: VLDB'94, Santiago, Chile, Sept. 1994, pp. 487-499.
[4] Hu W, Li M. Classification algorithm of associative based on the importance of attribute. Computer Engineering and Design, 2008.5.
[5] Li W, Han J, Pei J. CMAR: Accurate and efficient classification based on multiple class-association rules. In: ICDM'01, San Jose, CA, 2001, pp. 369-376.
[6] Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In: SIGMOD'00, Dallas, TX, May 2000, pp. 1-12.
[7] Qin Ch. Association rules mining algorithm based on Trie. Journal of Beijing Jiaotong University, June 2011.
[8] Zhang J. Associated with the classification algorithm and its system implementation. Journal of Nanjing Normal University, 2008.
[9] Amir A, Feldman R, Kashi R. A new and versatile method for association generation. In: Principles of Data Mining and Knowledge Discovery, Proc. of the First European Symposium (PKDD'97), Trondheim, Norway, 1997, pp. 221-231.
[10] Merz C J, Murphy P. UCI repository of machine learning databases. http://www.cs.uci.edu/mlearn/MLRepository.html, 1996.
