1. Trang chủ
  2. » Luận Văn - Báo Cáo

MINING ASSOCIATION RULES WITH ADJUSTABLE INTERESTINGNESS

36 278 0

Đang tải... (xem toàn văn)

Tài liệu hạn chế xem trước, để xem đầy đủ mời bạn chọn Tải xuống

THÔNG TIN TÀI LIỆU

Thông tin cơ bản

Định dạng
Số trang 36
Dung lượng 251,21 KB

Nội dung

MINING ASSOCIATION RULES WITH ADJUSTABLE INTERESTINGNESS BY NGUYEN THANH TRUNG SUPERVISED BY DR. HA QUANG THUY A THESIS SUBMITTED THE DEGREE OF BACHELOR OF SCIENCE AT THE FACULTY OF TECHNOLOGY VIETNAM NATIONAL UNIVERSITY, HANOI JUNE, 2003 i ACKNOWLEDGEMENTS This thesis for bachelor’s degree has been accomplished for three months. During this time, many people have made substantial contributions in one way or another that I would like to mention herein. First and foremost, I would especially like to thank my research advisor, Dr. Ha Quang Thuy for his invaluable guidance and tremendous motivation that he pro- vided at every step of this work. His enthusiastic support and untiring interest in the subject is deeply appreciated. I have gain immensely from his deep technical in- sight and thoroughness in problem solving. Some portions of this thesis have been previously published in the Conference of Junior Scientists 2002 of Vietnam National University, Hanoi, and I owe thanks to Dr. Do Van Thanh, M.Sc. Pham Tho Hoan, B.Sc. Phan Xuan Hieu for their valu- able contributions as the co-authors of that paper. My thanks also go to all of my lecturers at Faculty of Technology of Vietnam Na- tional University Hanoi who provided me with indispensable scientific knowledge throughout four school years. Special thanks to the following individuals, and many others who are not mentioned by name, for their teaching: M.Sc. Le Quang Hieu, M.Sc. Nguyen Quang Vinh, M.Sc. Nguyen Dinh Viet, M.Sc. Pham Hong Thai, Dr. Nguyen Tue, M.Sc. Nguyen Nam Hai, M.Sc. Dao Kien Quoc, M.Sc. Le Anh Cuong, Asoc.Prof. Trinh Nhat Tien, Dr. Dinh Manh Tuong, M.Sc. Vu Ba Duy, Asoc.Prof. Nguyen Quoc Toan, M.Sc. Ngo Le Minh, Asoc.Prof. Ngo Quoc Tao. Without the knowledge they equipped me, my thesis would never take shape. I am particularly grateful to my family for providing me with a source of strength and encouragement, and giving me the best possible education, and imbibing in me a thirst for learning. Last but not the least my girlfriend Nguyen Thi Thu Thuy who sacrificed time and energy so that this work could be completed. I appreciate it, and hope that the effort has been worthwhile. ii ABSTRACT Over the last several years, the problem of efficiently generating large numbers of association rules has been an active research topic in the data mining community. Many different algorithms have been developed with promising results. There are two current approaches to the association rule mining problem. The first is to mine the frequent itemsets regardless of their coefficients. The second is to assign weights to the items to reflect their importance to the users. However, they both rely on the using of the minimum support which may confuse us. Practically, we may want to mine the best rules to our knowledge instead of those which satisfy a certain threshold, especially if this threshold is an equation. To overcome this prob- lem, we introduce the concept of adjustable interestingness and propose a novel ap- proach in mining association rules based on adjustable interestingness. Our algo- rithm only works with the most interesting rules, thus reducing significantly search space by skipping many uninteresting itemsets and pruning those that cannot gen- erate interesting itemsets at the earlier stage. Therefore, the total time needed for the mining is substantially decreased. iii TABLE OF CONTENTS Acknowledgements i Abstract ii Table of contents iii List of tables and figures iv CHAPTER 1: Introduction 1 1.1. What is data mining? 1 1.2. Data mining versus query tools 2 1.3. Mining association rules 3 1.4. Outline of the thesis 5 CHAPTER 2: Mining association rules with weighted items 6 2.1. Introduction 6 2.2. Problem definition 7 CHAPTER 3: Mining association rules with adjustable interestingness 10 3.1. Interestingness and interesting itemsets 10 3.2. Interestingness constraints 11 3.3. Motivation behind interesting itemsets and adjustable interestingness 12 CHAPTER 4: Algorithm for mining association rules with adjustable interestingness (MARAI) 14 4.1. Motivation 14 4.2. Preliminaries 15 4.3. Basic properties of itemset-tidset pairs 18 4.4. MARAI: Algorithm design and implementation 20 4.5. Experimental Evaluation 25 CHAPTER 5: Conclusion 28 References a Appendix b iv LIST OF TABLES AND FIGURES Table 1. Database of a stationery store 8 Table 2. Transactions of a stationery store 9 Table 3. Itemsets sorted into descending order of their interestingness 11 Table 4. Itemsets sorted into descending order of the interestingness 17 Table 5. All interesting itemsets 18 Table 6. Database characteristics 25 Figure 1. Example database and frequent itemsets 4 Figure 2. Example database 15 Figure 3. The MARAI algorithm 22 Figure 4. Search process using adjustable interestingness 23 Figure 5. Performance of the MARAI algorithm on Cosmetic 26 Figure 6. Performance of the MARAI algorithm on Census 27 1 CHAPTER 1 INTRODUCTION In this chapter, we introduce the concept of data mining, and explain why it is re- garded as such important developments. As companies is the background of mining association rules. 1.1. What is data mining? There is confusion about the exact meaning between the terms ‘data mining’ and ‘knowledge discovery in databases (KDD)’. At the first international KDD confer- ence in Montreal in 1995, it was proposed that the term ‘KDD’ be used to describe the whole process of extraction of knowledge from data. An official definition of KDD is: ‘the non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data’ [2]. The knowledge which is discovered must be new, not obvious, and human must be able to use it for a particular purpose. It was also proposed that the term ‘data mining’ should be used exclusively for the discovery stage of the KDD process. The whole KDD steps include selection, preprocessing, transformation, data mining and the interpretation or evaluation. Data mining has been focused on as it is the most significant and most time-consuming among KDD steps. The sudden rise of interest in data mining can partly be explained by the following factors [2]: 1. In the 1980s, all major organizations built infrastructural databases, containing data about their clients, competitors, and products. These databases form a potential gold-mine; they contain gigabytes of data with much ‘hidden’ information that cannot easily be traced using SQL (Structure Query Language). Data mining algo- 2 rithms can find interesting regularities in databases, whereas, SQL is just a query language; it only helps to find data under constraints of what we already know. 2. As the use of networks continues to grow, it will become increasingly easy to connect databases. Thus, connecting a client’s file to a file with demographic data may lead to unexpected views on the spending patterns of certain population groups. 3. Over the past few years, machine-learning techniques have expanded enor- mously. Neural networks, genetic algorithms and other simple, generally applicable learning techniques often makes it easier to find interesting connections in data- bases. 4. The client/sever revolution gives the individual knowledge worker access to cen- tral information systems, from a terminal on his or her desk. 1.2. Data mining versus query tools What is the difference between data mining and a normal query environment? What can a data mining tool do that SQL cannot? It is significant to realize that data mining tools are complementary to query tools. A data mining tool does not replace a query tool but give a lot of additional possi- bilities [2]. Suppose that we have a large file containing millions of records that de- scribe customers’ purchases in a supermarket. There is a wealth of potentially use- ful knowledge which can be found by trigger normal queries, such as ‘Who bought butter and bread last week?’ , ‘Is the profit of this month more than that of last month?’ and so on. There is, however, knowledge hidden in the databases that is much harder to find using SQL. Examples would be the answers to questions such as ‘What products were often purchased together?’, or ‘What are the subsequent purchases after buying a gas cooker?’ . Of course, these questions could be an- swered using SQL but proceeding in such a way could take days or months to solve the problem, while a data mining algorithm could find the answers automatically in 3 a much shorter time, sometimes even in minutes or a couple of hours. It is said that if we know exactly what we are looking for, use SQL; but if we know only vaguely what we are looking for, turn to data mining. 1.3. Mining association rules There are various kinds of methods to mine the information from the database, such as mining association rules, multi-level data generalization and summarization, classification, and clustering [4]. The most common type is mining association rules. The problem of mining association rules in databases was first introduced in 1993 by Agrawal [1]. An example of such a rule might be that “90% of customers pur- chase bread and butter also purchase milk and coffee”. Since its introduction, As- sociation Rules Mining (ARM) [1] has become one of the core data mining tasks. ARM is an undirected or unsupervised data mining technique, which work on mas- sive data, and it produces clear and understandable results. ARM is aimed at find- ing regularities in data The following is a formal statement of the problem [1]: Let } , , , { 21 m i i i I = be a set of literals, called items. A set of items is also called an itemset. An itemset with k items is called a k-itemset. Let D be a set of transactions, where each transaction T is a set of items such that I T ⊆ . Associated with each transaction is a unique identifier, called its TID . We say that a transaction T contains X , a set of some items in I , if T X ⊆ . The support of of an itemset X , denoted ) , ( D X σ , is the number of examples in D where it occurs as a subset. An itemset is frequent or large if its support is more than a user-specified minimum support (min_sup) value. An association rule is an implication of the form Y X ⇒ where I X ⊆ , I Y ⊆ and φ = ∩ Y X . X is called the antecedent of the rule, and Y is called the consequence of the rule. The rule Y X ⇒ has support s in the transaction set D if s% of transac- tions in D contain both X and Y . That is, the support of the association rule 4 Y X ⇒ is the probability that Y X ∪ occurs in the set of transactions in the data- base D ; it is denote by Y) support(X ∪ . The rule Y X ⇒ holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y . The confidence of the association rule Y X ⇒ is the probability that a transaction contains Y given that the transaction contains X , or it may be given methamati- cally as ) ( / ) ( X support Y X support ∪ . Example 1.1. Consider a set of itemsets } F E, D, C, B, A, { = I . Let D be a set of four transactions as following: Transaction identification Items bought 10 A, B, C 20 A, C 30 A, D 40 B, E, F Frequent pattern Support {A} 75% {B} 50% {C} 50% {A, C} 50% Figure 1. Example database and frequent itemsets For rule C A ⇒ : support = support({A} ∪ {C}) = 50% confidence = support({A} ∪ {C}) / support({A}) = 66% The problem of discovering all association rules can be decomposed into two sub- problems [1]: 1. Find all acts of items (itemsets) that have transaction support above minimum support. The support for an item is the number of transactions that contain the item- set. Recall that an itemset is frequent or large if its support is more than a user- specified minimum support (min_sup) value. Min. support 50% Min. confidence 50% 5 Example 1.2. From the above database, we obtain four frequent itemsets {A}, {B}, {C} and {A, C} with supports of 75%, 50%, 50% and 50% respectively. 2. Use the large itemsets to generate the desired rules. Here is a straightforward al- gorithm for this task. For every large itemset l , find all non-empty subsets of l . For every such subset a , output a rule of the form ) ( a l a − ⇒ if the ratio of support( l ) to support( a ) is at least minconf. We need to consider subsets of l to generate rules with multiple consequents. Example 1.3. From the frequent itemset {A, C} found in example 1.2, we can gen- erate two rules whose confidences are greater than or equal to minconf value. Itemset {A, C}   %100 %50 %50 })support({C {C})}suuport({A confidence %66 %75 %50 })support({A {C})}suuport({A confidence AC rule CA rule == ∪ = == ∪ = ⇒ ⇒ As the problem of generating rules from the itemsets in step 2 is straightforward [1], we will not mention it over again in this thesis. 1.4. Outline of the thesis The remainder of this thesis is as follows. In chapter 2, we state the definition of mining association rules with weighted items. The main aim of this chapter is to provide a background for weight based problems we base our approach on. In chapter 3, we describe the main idea of the thesis. A new term, adjustable interest- ingness, is also introduced here. After the extensive discussion of mining associa- tion rules with adjustable interestingness in chapter 3, we devote chapter 4 to the algorithm for it. Experiments on real databases are also described. Finally, we con- clude the thesis with a summary and a discussion of future work. [...]... interesting than itemset {1, 4}, or the interestingness of itemset {1, 2, 5} is greater than that of itemset {1, 4} 9 CHAPTER 3 MINING ASSOCIATION RULES WITH ADJUSTABLE INTERESTINGNESS In this chapter, we design a new concept, adjustable interestingness Furthermore, a novel approach in mining association rules based on adjustable interestingness is introduced 3.1 Interestingness and interesting itemsets... thesis A new term, adjustable interestingness, is also introduced here After the extensive discussion of mining association rules with adjustable interestingness in chapter 3, we devote chapter 4 to the algorithm for it Experiments on real databases are also described Finally, we conclude the thesis with a summary and a discussion of future work 5 CHAPTER 2 MINING ASSOCIATION RULES WITH WEIGHTED ITEMS... reflected, and we can mine the rules with interestingness For example, we can add the weights to the sales transactions, where the items are under promotion, or with more profits The unweighted association rules would be the same if the database did not change, thus it cannot provide a flexible way for the users to adjust the priority of the rules Therefore, the mining association rules for weighted items... users Some previous works focused on mining frequent itemsets with weighted items [5] and different supports [6] The association rules, mentioned in previous chapter, are called the ‘unweighted’ association rules [6] as the items are treated uniformly Example 2.1 The following rule is the unweighted binary association rule from [1]: (Bread = Yes) => (Ham = Yes) with support = 60% & confidence = 80%... There may be some itemsets with smaller supports than minsup value, however, they can generate more useful rules By setting the minimum support for each item, we generate closed sets using a triple minsup-itemset-tidset and then restrict the number of itemsets to be found, thus the search space is fairly reduced 13 CHAPTER 4 ALGORITHM FOR MINING ASSOCIATION RULES WITH ADJUSTABLE INTERESTINGNESS (MARAI)... itemset Xi nor itemset Xj can be eliminated, both of which lead to different closures, then can generate itemsets with different interestingness 4.4 MARAI: Algorithm design and implementation In this section, we now present MARAI, an algorithm for mining association rules with adjustable interestingness The pseudo-code for MARAI appears in Figure 3 The algorithm start by setting the min_int value to... WITH WEIGHTED ITEMS In the last section, we discussed about mining association rule for unweighted case In the following, we introduce the conceptual framework of weight and apply it to mining association rules The concept of weight will be used in the coming chapters 2.1 Introduction There have been two approaches to the association rule mining problem The first one is to mine the frequent itemsets... the new algorithm, called MARAI algorithm, is proposed to solve above problem The framework of our proposed algorithm for mining association rules with adjustable interestingness is similar to the CHARM algorithm, but the detailed steps contain some significant differences To begin with, we also mine only the closed sets [7] Closed sets are lossless in the sense that they uniquely determine the set of... comprehensible than “Please list itemsets whose interestingness are greater or equal to 0.5” 12 Furthermore, it is impractical to generate entire set of interesting itemsets Our purpose is to mine only most interesting ones Hence, we design a new concept, adjustable interestingness, in this thesis Related work Our past work [5] addressed the problem of mining association rules with different supports, provided that... acronyms for Weight, Support and Interestingness, respectively Table 4 Itemsets sorted into descending order of the interestingness An itemset is interesting if its interestingness is greater than or equal to a userspecified minimum interestingness (min_int) value, i.e., if ω ( X ) ≥ min_int An interesting itemset is called closed if there exists no proper superset Y ⊃ X with σ ( X ) = σ (Y ) The term . 3 MINING ASSOCIATION RULES WITH ADJUST- ABLE INTERESTINGNESS In this chapter, we design a new concept, adjustable interestingness. Furthermore, a novel approach in mining association rules. 1 1.1. What is data mining? 1 1.2. Data mining versus query tools 2 1.3. Mining association rules 3 1.4. Outline of the thesis 5 CHAPTER 2: Mining association rules with weighted items 6. 3.3. Motivation behind interesting itemsets and adjustable interestingness 12 CHAPTER 4: Algorithm for mining association rules with adjustable interestingness (MARAI) 14 4.1. Motivation 14

Ngày đăng: 31/07/2014, 02:43

TỪ KHÓA LIÊN QUAN

TÀI LIỆU CÙNG NGƯỜI DÙNG

TÀI LIỆU LIÊN QUAN

w