Data Mining Association Analysis: Basic Concepts and Algorithms
Lecture notes for the association analysis chapter of Introduction to Data Mining, by Tan, Steinbach, Kumar (© Tan, Steinbach, Kumar, 4/18/2004)

Association Rule Mining

Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:

  TID  Items
  1    Bread, Milk
  2    Bread, Diaper, Beer, Eggs
  3    Milk, Diaper, Beer, Coke
  4    Bread, Milk, Diaper, Beer
  5    Bread, Milk, Diaper, Coke

Examples of association rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}

Definition: Frequent Itemset

Itemset
– A collection of one or more items, e.g. {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

Support count (σ)
– The number of transactions that contain an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2

Support (s)
– The fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent itemset
– An itemset whose support is greater than or equal to a minsup threshold

Definition: Association Rule

Association rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule evaluation metrics
– Support (s): the fraction of transactions that contain both X and Y; it determines how often a rule is applicable to a given data set
– Confidence (c): how frequently items in Y appear in transactions that contain X

Example, for the rule {Milk, Diaper} → {Beer}:

  s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
  c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup or minconf thresholds
Computationally prohibitive!
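To make the brute-force evaluation concrete, here is a minimal Python sketch (not part of the original slides); the transaction table and the rule {Milk, Diaper} → {Beer} are taken from the example above.

```python
# Brute-force support/confidence evaluation over the market-basket example.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    """Support: fraction of transactions containing the itemset."""
    return sigma(itemset) / len(transactions)

def confidence(X, Y):
    """Confidence of the rule X -> Y: sigma(X union Y) / sigma(X)."""
    return sigma(X | Y) / sigma(X)

X, Y = {"Milk", "Diaper"}, {"Beer"}
print(support(X | Y))    # 0.4
print(confidence(X, Y))  # 0.666... (= 2/3)
```

A full brute-force miner would call these helpers for every possible rule, which is exactly why the approach does not scale.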
Mining Association Rules

Example rules derived from the transactions above:

  {Milk, Diaper} → {Beer}   (s = 0.4, c = 0.67)
  {Milk, Beer} → {Diaper}   (s = 0.4, c = 1.0)
  {Diaper, Beer} → {Milk}   (s = 0.4, c = 0.67)
  {Beer} → {Milk, Diaper}   (s = 0.4, c = 0.67)
  {Diaper} → {Milk, Beer}   (s = 0.4, c = 0.5)
  {Milk} → {Diaper, Beer}   (s = 0.4, c = 0.5)

Observations:
– All the above rules are binary partitions of the same itemset, {Milk, Diaper, Beer}
– Rules originating from the same itemset have identical support but can have different confidence
– Thus we may decouple the support and confidence requirements

Two-step approach:
1. Frequent itemset generation: generate all itemsets whose support ≥ minsup
2. Rule generation: generate high-confidence rules from each frequent itemset; each rule is a binary partitioning of a frequent itemset
Frequent itemset generation is still computationally expensive.

Frequent Itemset Generation

[Figure: the itemset lattice for d = 5 items A–E, from the null set at the top down through every 1-, 2-, 3- and 4-itemset to ABCDE at the bottom.]

Given d items, there are 2^d possible candidate itemsets.

Property under Row/Column Scaling

Grade-gender example (Mosteller, 1968), where the Male column is scaled by 2x and the Female column by 10x:

          Male  Female                  Male  Female
  High      2      3  |   5     High      4     30  |  34
  Low       1      4  |   5     Low       2     40  |  42
            3      7  |  10               6     70  |  76

Mosteller: the underlying association should be independent of the relative number of male and female students in the samples.

Property under Inversion Operation

[Figure: binary transaction vectors over items A–F for transactions 1…N, shown in panels (a)–(c); inversion flips each vector's 0s and 1s, exchanging presence and absence of an item.]

Example: φ-Coefficient

The φ-coefficient is analogous to the correlation coefficient for continuous variables.

          Y    Y̅                      Y    Y̅
  X      60   10  |  70       X      20   10  |  30
  X̅      10   20  |  30       X̅      10   60  |  70
         70   30  | 100              30   70  | 100

  φ = (0.6 − 0.7 × 0.7) / √(0.7 × 0.3 × 0.7 × 0.3) = 0.5238
  φ = (0.2 − 0.3 × 0.3) / √(0.7 × 0.3 × 0.7 × 0.3) = 0.5238

The φ-coefficient is the same for both tables.

Property under Null Addition

Adding k transactions that contain neither A nor B changes only the count s:

         A    A̅               A    A̅
  B      p    r          B    p    r
  B̅      q    s          B̅    q    s + k

Invariant measures: support, cosine, Jaccard, etc.
Non-invariant measures: correlation, Gini, mutual information, odds ratio, etc.
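As an illustration of null addition (my own sketch, not from the slides), the Python snippet below reuses the second table from the φ-coefficient example (p = 20, q = 10, r = 10, s = 60) and adds k = 900 null transactions: cosine and Jaccard are unchanged, while the φ-coefficient is not.

```python
import math

def measures(p, q, r, s):
    """p = f(A,B), q = f(A,~B), r = f(~A,B), s = f(~A,~B)."""
    n = p + q + r + s
    cosine  = p / math.sqrt((p + q) * (p + r))          # null-invariant
    jaccard = p / (p + q + r)                           # null-invariant
    phi = (n * p - (p + q) * (p + r)) / math.sqrt(      # not null-invariant
        (p + q) * (p + r) * (q + s) * (r + s))
    return cosine, jaccard, phi

print(measures(20, 10, 10, 60))        # (0.667, 0.5, 0.5238)
print(measures(20, 10, 10, 60 + 900))  # (0.667, 0.5, 0.6564) -- phi moved
```

Note that the φ value 0.5238 for the unmodified table matches the worked example above.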
Different Measures have Different Properties

  Symbol  Measure                  Symbol  Measure
  φ       Correlation              V       Conviction
  λ       Lambda                   I       Interest
  α       Odds ratio               IS      IS (cosine)
  Q       Yule's Q                 PS      Piatetsky-Shapiro's
  Y       Yule's Y                 F       Certainty factor
  κ       Cohen's                  AV      Added value
  M       Mutual Information       S       Collective strength
  J       J-Measure                ζ       Jaccard
  G       Gini Index               K       Klosgen's K
  s       Support
  c       Confidence
  L       Laplace

[Table: for each measure, its range (e.g. −1…1 or 0…1) and Yes/No entries for properties P1–P3 and O1–O4; no single measure satisfies all of the properties.]

Support-based Pruning

Most association rule mining algorithms use the support measure to prune rules and itemsets.

Study the effect of support pruning on the correlation of itemsets:
– Generate 10,000 random contingency tables
– Compute support and pairwise correlation for each table
– Apply support-based pruning and examine the tables that are removed

Effect of Support-based Pruning

[Figure: histogram of the correlation values of all item pairs, binned from −1 to 1.]

[Figure: correlation histograms of the item pairs removed at the thresholds support < 0.01, support < 0.03, and support < 0.05.]

Support-based pruning eliminates mostly negatively correlated itemsets.

Effect of Support-based Pruning

Investigate how support-based pruning affects other measures:
– Generate 10,000 contingency tables
– Rank each table according to the different measures
– Compute the pairwise correlation between the measures

Effect of Support-based Pruning

Without support pruning (all pairs):

[Figure: heat map of the pairwise correlations among 21 measures (Conviction, Odds ratio, Col Strength, Correlation, Interest, PS, CF, Yule Y, Reliability, Kappa, Klosgen, Yule Q, Confidence, Laplace, IS, Support, Jaccard, Lambda, Gini, J-measure, Mutual Info), with red cells marking measure pairs whose correlation exceeds 0.85, alongside a scatter plot of the Correlation measure against the Jaccard measure.]

40.14% of measure pairs have correlation greater than 0.85.

Effect of Support-based Pruning

0.5% ≤ support ≤ 50% (i.e. 0.005 ≤ s ≤ 0.5)
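The pruning experiment described above can be reproduced in outline with a short Python sketch. This is a hedged illustration: the way the random 2×2 tables are drawn, the table size n = 1000, and the 5% support threshold are my own assumptions, not the authors' exact setup.

```python
import math
import random

def random_table(n=1000):
    """Draw a random 2x2 contingency table (f11, f10, f01, f00) summing to n."""
    cuts = sorted(random.randint(0, n) for _ in range(3))
    return cuts[0], cuts[1] - cuts[0], cuts[2] - cuts[1], n - cuts[2]

def phi(f11, f10, f01, f00):
    """phi-coefficient (correlation) of a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    denom = math.sqrt((f11 + f10) * (f11 + f01) * (f10 + f00) * (f01 + f00))
    return (n * f11 - (f11 + f10) * (f11 + f01)) / denom if denom else 0.0

random.seed(0)
tables = [random_table() for _ in range(10000)]
# Tables pruned by a 5% support threshold, where support(A,B) = f11 / n:
pruned = [phi(*t) for t in tables if t[0] / sum(t) < 0.05]
print(sum(1 for c in pruned if c < 0) / len(pruned))
# Under this setup, most pruned tables have negative correlation,
# consistent with the histograms described above.
```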