International Journal of Advanced Computer Science, Vol. 1, No. 4, pp. 142-145, Oct. 2011

An Efficient Algorithm for Association Rule Mining

Maryam Shekofteh

Manuscript Received: 9 Sep. 2011; Revised: 11 Oct. 2011; Accepted: 25 Oct. 2011; Published: 30 Oct. 2011

Keywords: Data mining, Association rule mining, Frequent closed itemset

Abstract: Efficient frequent itemset mining is the most important problem in association rule mining. To date, a number of algorithms have been developed in the field, but the large number of frequent itemsets in a database results in redundant rules. To overcome this problem, recent studies deal with frequent closed itemset mining, since the set of frequent closed itemsets is significantly smaller than the set of all frequent itemsets and has similar strength. In this paper, a new algorithm called FC-Close is introduced for frequent closed itemset mining. The algorithm employs a pruning technique to improve the efficiency of frequent closed itemset mining. The test results show that FC-Close is more efficient than the existing FP-close algorithm.

This work was supported by Islamic Azad University, Sarvestan Branch, Shiraz, Iran. Maryam Shekofteh is with the Department of Computer Engineering, Sarvestan Branch, Shiraz, Iran (Shekofteh-m@iau.sarv.ac.ir).

International Journal Publishers Group (IJPG) ©

1. Introduction

Association Rule Mining (ARM) is one of the most important data mining techniques. ARM aims at extracting hidden relations and interesting associations between the items in a transactional database. It is highly useful in market basket analysis for stores and business centers. For example, mining the customer database of a department store may reveal that those who buy milk also buy butter on 60% of occasions, and that this pattern is observed in 80% of transactions. In this example, the former probability is called the confidence percentage, and the percentage of transactions that the rule covers is termed the support percentage. To find rules, the user sets minimum values for support and confidence, called minimum support (min-sup) and minimum confidence (min-conf) respectively [1].

The main step in association rule mining is mining the frequent itemsets. In effect, with the frequent itemsets in hand, generating association rules is straightforward. Frequent itemset mining often generates a very large number of frequent itemsets and rules, which reduces the efficiency and power of mining. To overcome this problem, condensed representations have been used for frequent itemsets in recent years [3, 7]. A popular condensed representation is the set of frequent closed itemsets. Compared with the frequent itemsets, the frequent closed itemsets form a much smaller set with similar power. In addition, this representation decreases redundant rules and increases mining efficiency.

Many algorithms have been presented for mining frequent closed itemsets. In this paper a new algorithm called FC-Close is introduced for frequent closed itemset mining. This algorithm, which is developed based on FP-growth [5], employs a pruning technique to improve its efficiency.

The rest of the paper is structured as follows: Section 2 introduces frequent closed itemset mining and related concepts. Section 3 sketches out the topics and structures underlying the new algorithm. Section 4 describes our developed algorithm. The evaluation of findings is presented in Section 5, and Section 6 is devoted to the conclusion.

2. Problem Development

Let D be a transactional database. Each transactional database includes a set of transactions. Each transaction t is represented by ⟨TID, x⟩, in which x is a set of items and TID is the unique identifier of the transaction. Further, let I = {i1, i2, …, in} be the complete set of distinct items in D. Each non-empty subset y of I is termed an itemset, and if it includes k items it is called a k-itemset. The number of transactions in D that include itemset y is called the support of y, denoted sup(y), and is usually expressed as a percentage. Given a minimum support min-sup, an itemset y is a frequent itemset if sup(y) ≥ min-sup.

Definition 1 (Closed Itemset): An itemset y is a closed itemset if there is no proper superset y' of y such that sup(y) = sup(y').

3. Related Literature

A. FP-tree and FP-growth Method

As FC-Close is an extended version of FP-growth, an introduction to FP-growth and the FP-tree structure is needed. FP-growth uses a structure called the FP-tree. The FP-tree is a compact data structure that stores all the necessary information about the frequent itemsets in a database. Each branch of the FP-tree represents one frequent itemset, and the nodes along a branch are arranged in descending order of item count.

Each node in the FP-tree has three fields: item-name, count, and node-link. Item-name holds the name of the item that the node represents. Count gives the number of transactions covered by the path reaching the node, and node-link points to the next node in the FP-tree that carries the same item; if there is no such node, node-link is null. The FP-tree also has an associated header table. The single items of the database are stored in this table in descending order. Each entry of the header table includes two fields: the item name and the head of node-link, which refers to the first node in the FP-tree with that item name.

Compared with Apriori [1] and its variants, which require many passes over the database, FP-growth needs just two passes to mine all frequent itemsets. In the first pass, the support of each item is calculated and the frequent single items (frequent itemsets of length 1) are sorted in descending order of support. In the second pass, an FP-tree that holds all the frequency information of the database is created: each transaction of the ordered database is read and added to the FP-tree, one transaction at a time. (To add a new transaction to the FP-tree, if the transaction shares a prefix with previously added transactions, no new nodes are created for the items inside the prefix and only the count fields of the existing nodes are increased.) Therefore, mining the database reduces to mining the FP-tree. Figure 1 shows the FP-tree built in the second pass with a minimum support of 20%.

TABLE 1. A SAMPLE DATABASE

tid    items
1      abcefo
2      acg
3      ei
4      acdeg
5      acegl
6      ej
7      abcefp
8      acd
9      acegm
10     acegn

Fig. 1. Structure of FP-tree

When an item i is added to an itemset y, giving z = y ∪ {i}, the path from the parent of node i up to the root of the FP-tree of y is called a prefix path of z.

Let us review FP-growth further. Once the FP-tree is created, the frequent patterns are mined from it with the FP-growth algorithm, which works by recursive elimination. In this algorithm the database is repeatedly restricted to the current itemsets; the database restricted to an itemset y is named the conditional pattern base of y and is denoted Ty. To create the conditional pattern base of an itemset, all of its prefix paths (beginning from the root) are written out. After the conditional pattern base of an itemset is created, its conditional FP-tree is built. To build the conditional FP-tree, we follow the steps taken in building the initial FP-tree; in this stage, however, instead of the whole database we use the conditional pattern base of that itemset. That is, the support counts of each item are summed over all conditional patterns of the itemset, and if the total meets the threshold, the item is added to the header table and the FP-tree. Mining is conducted recursively on conditional FP-trees until a tree is empty or consists of one single branch, at which point all of its frequent patterns are extracted.

B. CFI-tree

In the FP-close algorithm, the CFI-tree is introduced as a special structure for storing closed frequent itemsets. A CFI-tree resembles an FP-tree. It has a root node labeled root. Each node below the root has four fields: item-name, count, node-level, and node-link. All nodes with the same item name are linked together; node-link refers to the next node with the same item name. A header table is created for the items in the CFI-tree, where the order of items in the table is the same as the order of items in the FP-tree first built from the database. Each entry in the header table includes two fields: the item name and the head of node-link, which links to the first node with that item name in the CFI-tree. The level field is used for subset testing. The count field is needed because, when an itemset y is compared with a set z in the tree, the test must confirm that it is not the case that y is a subset of z and y and z have the same count.

The arrangement of a frequent itemset in the CFI-tree is similar to the arrangement of a transaction in the FP-tree. However, when an itemset shares a prefix with previously inserted itemsets, the counts of the shared nodes are not incremented; instead, each is updated to the maximum of the counts. In effect, in FP-close a newly discovered itemset is inserted into the CFI-tree unless it is a subset of an itemset already in the tree with the same count. Figure 2 displays the CFI-tree of the database in Table 1 when the minimum support equals 20%. In this figure, a node labeled x, c, l represents item x with count c and level l. More details on the CFI-tree are available in [4].

Fig. 2. Structure of CFI-tree

4. FC-Close Algorithm

The FC-Close algorithm employs the same FP-tree and CFI-tree structures for mining frequent closed itemsets. Here, however, the search space is reduced by an optimization technique called pruning. The smaller search space and the smaller number of trees imply less time than similar algorithms such as FP-close. Let us review the pruning technique.

A. Optimal Pruning Technique in FC-Close Algorithm

Suppose y is an itemset. One optimization compares the support of y with that of y ∪ {i}, where i is a member of the conditional pattern base of y. If the two supports are equal, every transaction containing y also contains i. This guarantees that each frequent itemset z that contains y but not i has a frequent superset z ∪ {i} with the same support. By the definition of closed frequent itemsets, the itemsets that contain y but not i need not be counted. It is therefore possible to move i into y and delete i from the conditional pattern base of y.

A second optimization compares the itemsets y ∪ {i} and z ∪ {i}, where z consists of the items of y without an item j that was already added to y. If the supports of y ∪ {i} and z ∪ {i} are equal, every frequent itemset containing z ∪ {i} is guaranteed to have a frequent superset containing y ∪ {i} with the same count (the same support). By the definition of closed frequent itemsets, the search of the branch containing z ∪ {i} can then be skipped. Figure 3 shows the pseudo-code of the function that performs the pruning technique:

Pruning (current itemset: y) {
  For each item i in y's conditional pattern base {
    Newitem = y ∪ {i}
    If (support(Newitem) == support(y)) {
      Move i to y
      Remove i from y's conditional pattern base }
    Newitem = y ∪ {i}
    If (support(Newitem) == support(z ∪ {i}))
      Stop the search of the branch containing the itemset (z ∪ {i})
  }
}

Fig. 3. Pseudo-Code of Pruning Function

B. Mining Frequent Itemsets using FC-Close

In this article, a new algorithm called FC-Close is introduced using the pruning technique. It is developed from the FP-growth method. Like FP-growth, FC-Close is recursive. In the first call, an FP-tree is built from the first database pass. A linked list y holds the items that form the conditional pattern base of the current call. If there is just one single path in the FP-tree, each frequent itemset x generated from this single path is combined with the frequent itemset y, and it is then checked whether the resulting itemset is a closed frequent itemset. If the itemset y ∪ x is a closed frequent itemset, y ∪ x is inserted into the CFI-tree. If the FP-tree is not a single-path tree, then for each item in the header table, the item is appended to y. The if-closed function is then called to check y against the itemsets already stored in the CFI-tree; if y passes the check as a closed frequent itemset, y is inserted into the CFI-tree. Next, the FP-tree of y is constructed and the pruning function is called. Then FC-Close is called recursively. Figure 4 displays the FC-Close pseudo-code:

FC-Close(T)
Input T: FP-tree
Global: y: a linked list of items
        CFI-tree: CFI-tree
Output: the CFI-tree which contains CFI
Method:
(1)  If T only contains a single path p {
(2)    Generate all frequent itemsets from p
(3)    For each x in frequent itemsets
(4)      If not if-closed(y ∪ x) {
(5)        Insert y ∪ x into CFI-tree } }
(6)  Else for each i in header-table of T {
(7)    Append i to y
(8)    If not if-closed(y)
(9)    {
(10)     Insert y into CFI-tree
(11)     Construct the y's FP-tree Ty
(12)     Pruning(y)
(13)     FC-Close(Ty) }
(14)   Remove i from y }

Fig. 4. FC-Close Algorithm Pseudo-Code

5. Results

In this section, our developed algorithm, FC-Close, is compared with the existing FP-close algorithm. To do so, a computer with the following specifications is employed: a 3 GHz Pentium processor, 1 GB of RAM, and a 200 GB disk, running Windows XP 2003. All code is implemented in C++. The tests are run on two databases: a dense database called Chess and the sparse database T40.I10.D100K. The specifications of these databases are shown in Table 2. Both databases are taken from [11].

TABLE 2. SPECIFICATIONS OF TESTED DATABASES

Dataset        #transactions    Avg transaction size
Chess          3196             3553
T40I10D100k    100000           3954

Figures 5 and 6 contrast the execution time of our developed algorithm, FC-Close, with that of the existing FP-close on the Chess and T40.I10.D100K databases.

Fig. 5. Comparison of execution time on Chess database

Fig. 6. Comparison of execution time on T40.I10.D100K database

As shown in Figures 5 and 6, FC-Close has higher efficiency than FP-close.

6. Conclusion

In this article, FC-Close is introduced as an effective algorithm for mining closed frequent itemsets. The algorithm reduces the search space and the FP-tree size using a pruning technique. The experiments show that FC-Close has higher efficiency than the FP-close algorithm.

References

[1] R. Agrawal & R. Srikant, “Fast algorithms for mining association rules,” (1994) Proceedings of the VLDB, Santiago de Chile.
[2] C.-C. Chang & C.-Y. Lin, “Perfect hashing schemes for mining association rules,” (2005) Oxford University Press on behalf of the British Computer Society, vol. 48.
[3] B. Goethals, “Survey on frequent pattern mining,” (2004) Department of Computer Science, University of Helsinki.
[4] G. Grahne & J. Zhu, “Efficiently using prefix-trees in mining frequent itemsets,” (2003) IEEE ICDM Workshop on Frequent Itemset Mining Implementations.
[5] J. Han, J. Pei, Y. Yin, & R. Mao, “Mining frequent pattern without candidate generation,” (2003) Data Mining and Knowledge Discovery.
[6] J. S. Park, M.-S. Chen, & P. S. Yu, “An effective hash based algorithm for mining association rules,” (1995) ACM SIGMOD International Conference on Management of Data, vol. 24, pp. 175-186.
[7] N. Pasquier, Y. Bastide, R. Taouil, & L. Lakhal, “Discovering frequent closed itemsets for association rules,” (1999) Proc. Int'l Conf. Database Theory, pp. 398-416.
[8] J. Pei, J. Han, & R. Mao, “CLOSET: An efficient algorithm for mining frequent closed itemsets,” (2000) ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 21-30.
[9] J. Wang, J. Han, & J. Pei, “CLOSET+: Searching for the best strategies for mining frequent closed itemsets,” (2003) Proc. Int'l Conf. Knowledge Discovery and Data Mining, pp. 236-245.
[10] M. J. Zaki & C. Hsiao, “Charm: An efficient algorithm for closed itemset mining,” (2002) Proc. SIAM Int'l Conf. Data Mining, pp. 457-473.
[11] http://fimi.cs.helsinki.fi, 2003.
[12] http://www.cs.bme.hu/~bodon, 2005.
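As an illustration of the definitions of support, confidence, and closed itemsets given under Problem Development, the following is a minimal brute-force sketch in Python run on the ten-transaction sample database. It is not the FC-Close algorithm and makes no attempt at efficiency. The function names are hypothetical, and a min-sup of 20% is taken to mean a count of at least 2 out of 10 transactions.

```python
from itertools import combinations

# Sample database of Table 1. The fifth transaction is printed as "glace" in
# the source; only its set of items {a, c, e, g, l} matters here.
DATABASE = [frozenset(t) for t in (
    "abcefo", "acg", "ei", "acdeg", "acegl",
    "ej", "abcefp", "acd", "acegm", "acegn",
)]


def support_count(itemset, database):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for transaction in database if itemset <= transaction)


def frequent_itemsets(database, min_count):
    """Map each frequent itemset to its support count (brute force)."""
    frequent_items = sorted(
        item for item in set().union(*database)
        if support_count(frozenset([item]), database) >= min_count
    )
    result = {}
    for size in range(1, len(frequent_items) + 1):
        for combo in combinations(frequent_items, size):
            itemset = frozenset(combo)
            count = support_count(itemset, database)
            if count >= min_count:
                result[itemset] = count
    return result


def closed_itemsets(frequent):
    """Keep itemsets that have no proper superset with the same support."""
    return {
        y: count for y, count in frequent.items()
        if not any(y < z and count == other for z, other in frequent.items())
    }


def confidence(antecedent, consequent, database):
    """Confidence of the rule antecedent -> consequent."""
    both = support_count(antecedent | consequent, database)
    return both / support_count(antecedent, database)


MIN_COUNT = 2  # assumed reading of min-sup = 20% on 10 transactions
frequent = frequent_itemsets(DATABASE, MIN_COUNT)
closed = closed_itemsets(frequent)
for itemset, count in sorted(closed.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
```

On this database the sketch finds 43 frequent itemsets but only 7 closed ones, which is the size reduction that motivates mining closed itemsets.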
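The two-pass FP-tree construction and the conditional pattern base described for FP-growth can be sketched as follows. This is an illustrative Python sketch, not the paper's C++ implementation. The names `FPNode`, `build_fp_tree`, and `conditional_pattern_base` are hypothetical, the `parent` and `children` attributes are additions beyond the three node fields named in the text, and ties between equally frequent items are broken alphabetically, which is an arbitrary choice.

```python
from collections import Counter


class FPNode:
    """FP-tree node: the three fields from the text plus tree links."""

    def __init__(self, item, parent):
        self.item = item        # item-name
        self.count = 0          # transactions sharing the path to this node
        self.node_link = None   # next node carrying the same item, or None
        self.parent = parent    # extra: used to walk prefix paths
        self.children = {}      # extra: child lookup by item name


def build_fp_tree(transactions, min_count):
    """Two passes: count items, then insert each ordered transaction."""
    # Pass 1: support of every single item.
    counts = Counter(item for t in transactions for item in set(t))
    # Descending support; ties broken alphabetically (an assumption).
    order = sorted((i for i in counts if counts[i] >= min_count),
                   key=lambda i: (-counts[i], i))
    rank = {item: pos for pos, item in enumerate(order)}

    root = FPNode(None, None)
    header = {item: None for item in order}  # item -> head of node-link chain
    tails = {}                               # item -> last node in its chain

    # Pass 2: insert transactions, sharing common prefixes.
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                if header[item] is None:
                    header[item] = child
                else:
                    tails[item].node_link = child
                tails[item] = child
            child.count += 1    # shared prefix: only the count grows
            node = child
    return root, header


def conditional_pattern_base(item, header):
    """Prefix paths of `item`, each paired with the count of the item's node."""
    base = []
    node = header[item]
    while node is not None:
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((tuple(reversed(path)), node.count))
        node = node.node_link
    return base


TABLE_1 = ["abcefo", "acg", "ei", "acdeg", "acegl",
           "ej", "abcefp", "acd", "acegm", "acegn"]
root, header = build_fp_tree(TABLE_1, min_count=2)
print(conditional_pattern_base("g", header))
```

Following the node-links from the header table is what lets the conditional pattern base be collected without rescanning the database.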
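The first pruning check, which moves an item i into y when y and y ∪ {i} have equal support, can be sketched in Python as below. This is only an illustration of that one check. The second check, which compares against z ∪ {i}, is not covered. The function name `prune` and the representation of the conditional pattern base as a list of (prefix path, count) pairs are assumptions made for this example.

```python
from collections import Counter


def prune(y, y_support, pattern_base):
    """Merge into y every item i with support(y ∪ {i}) == support(y).

    `pattern_base` is y's conditional pattern base as a list of
    (prefix_path, count) pairs. Returns the extended itemset and the
    pattern base with the merged items removed.
    """
    # support(y ∪ {i}) is the summed count of the prefix paths containing i.
    item_support = Counter()
    for path, count in pattern_base:
        for item in path:
            item_support[item] += count

    merged = {i for i, s in item_support.items() if s == y_support}
    new_y = frozenset(y) | merged

    new_base = []
    for path, count in pattern_base:
        remaining = tuple(i for i in path if i not in merged)
        if remaining:               # drop paths that became empty
            new_base.append((remaining, count))
    return new_y, new_base


# y = {g} has support 5 in the sample database. Its conditional pattern base,
# under the assumed item order a, c, e, g, ..., is:
base_g = [(("a", "c"), 1), (("a", "c", "e"), 4)]
new_y, new_base = prune({"g"}, 5, base_g)
print(sorted(new_y), new_base)
```

In this example a and c occur in every transaction that contains g, so both are merged and only e remains to be explored in the conditional pattern base.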
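The CFI-tree insertion rule (a newly found itemset is stored unless it is a subset of a stored itemset with the same count) can be illustrated with a flat stand-in that uses a dictionary rather than a tree. This hypothetical `ClosedItemsetStore` class has no levels, node-links, or header table, so it shows only the rule, not the CFI-tree structure of [4]. Like the tree-based check, it assumes supersets are discovered before their equal-count subsets.

```python
class ClosedItemsetStore:
    """Flat stand-in for the CFI-tree: same insertion rule, no tree."""

    def __init__(self):
        self.itemsets = {}  # frozenset of items -> count

    def is_subsumed(self, itemset, count):
        """True if a stored superset (or equal set) has the same count."""
        itemset = frozenset(itemset)
        return any(itemset <= stored and count == stored_count
                   for stored, stored_count in self.itemsets.items())

    def insert_if_closed(self, itemset, count):
        """Insert unless subsumed; report whether the itemset was stored."""
        if self.is_subsumed(itemset, count):
            return False
        self.itemsets[frozenset(itemset)] = count
        return True


store = ClosedItemsetStore()
print(store.insert_if_closed("aceg", 4))  # stored
print(store.insert_if_closed("aeg", 4))   # rejected: subset of aceg, same count
print(store.insert_if_closed("ac", 8))    # stored: subset of aceg, different count
```

The linear scan here is what the tree layout and the level field are meant to avoid, so this form is only suitable for small examples.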