Data Mining
Association Analysis: Basic Concepts and Algorithms

Lecture Notes for "Introduction to Data Mining" by Tan, Steinbach, Kumar
© Tan, Steinbach, Kumar

Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Example of Association Rules (market-basket transactions):
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

Definition: Frequent Itemset

Itemset
– A collection of one or more items
  • Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2

Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

Definition: Association Rule

Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
– Support (s)
  • Fraction of transactions that contain both X and Y
– Confidence (c)
  • Measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}
  s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
  c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!

Mining Association Rules

Example of rules:
  {Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements

Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation
   – Generate all itemsets whose support ≥ minsup
2. Rule Generation
   – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive.

Frequent Itemset Generation

[Figure: itemset lattice over items A–E, from the null itemset at the top down to ABCDE at the bottom]

Given d items, there are 2^d possible candidate itemsets.

Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
– Match each transaction against every candidate
– Complexity ~ O(NMw), where N is the number of transactions, M the number of candidate itemsets, and w the maximum transaction width
  ⇒ Expensive, since M = 2^d !!!
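The brute-force approach is easy to see in code. Below is a minimal sketch in Python; the market-basket table itself did not survive in these notes, so the five transactions are an assumption (they are the standard example used with this chapter and reproduce σ({Milk, Bread, Diaper}) = 2 and the s = 0.4, c = 0.67 values quoted above).

```python
from itertools import combinations

# Assumed transactions (the original table was lost in extraction).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item in X."""
    return sum(1 for t in transactions if itemset <= t)

def brute_force_frequent_itemsets(transactions, minsup):
    """Enumerate every candidate itemset in the lattice and keep those with
    support >= minsup -- the O(NMw) approach criticized on the slide."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            s = support_count(set(candidate), transactions) / n
            if s >= minsup:
                frequent[frozenset(candidate)] = s
    return frequent

def rule_metrics(lhs, rhs, transactions):
    """Support and confidence of the rule lhs -> rhs."""
    n = len(transactions)
    s = support_count(lhs | rhs, transactions) / n
    c = support_count(lhs | rhs, transactions) / support_count(lhs, transactions)
    return s, c

print(brute_force_frequent_itemsets(transactions, minsup=0.6))
# {Milk, Diaper} -> {Beer}: s = 0.4, c = 0.67, as in the example above
print(rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions))
```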
Computational Complexity

Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

  R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1

  (where C(n, r) denotes the binomial coefficient)
– If d = 6, R = 3^6 − 2^7 + 1 = 602 rules

Property under Row/Column Scaling

Grade-Gender Example (Mosteller, 1968):

            Male   Female
  High        2       3      5
  Low         1       4      5
              3       7     10

            Male   Female
  High        4      30     34     (Male column ×2, Female column ×10)
  Low         2      40     42
              6      70     76

Mosteller: the underlying association should be independent of the relative number of male and female students in the samples.

Property under Inversion Operation

[Figure: bit vectors (a), (b), (c) for items A–F over transactions 1…N, illustrating how inverting 0s and 1s affects a measure]

Example: φ-Coefficient

The φ-coefficient is analogous to the correlation coefficient for continuous variables.

            Y      Ȳ
  X        60     10     70
  X̄        10     20     30
           70     30    100

  φ = (0.6 − 0.7 × 0.7) / √(0.7 × 0.3 × 0.7 × 0.3) = 0.5238

            Y      Ȳ
  X        20     10     30
  X̄        10     60     70
           30     70    100

  φ = (0.2 − 0.3 × 0.3) / √(0.7 × 0.3 × 0.7 × 0.3) = 0.5238

The φ-coefficient is the same for both tables.

Property under Null Addition

            A     Ā                        A     Ā
  B         p     r                B       p     r
  B̄         q     s                B̄       q     s + k

Invariant measures:
– support, cosine, Jaccard, etc.
Non-invariant measures:
– correlation, Gini, mutual information, odds ratio, etc.

Different Measures have Different Properties

  Symbol  Measure               Range              P1    P2   P3   O1    O2   O3    O3'  O4
  Φ       Correlation           −1 … 0 … 1         Yes   Yes  Yes  Yes   No   Yes   Yes  No
  λ       Lambda                0 … 1              Yes   No   No   Yes   No   No*   Yes  No
  α       Odds ratio            0 … 1 … ∞          Yes*  Yes  Yes  Yes   Yes  Yes*  Yes  No
  Q       Yule's Q              −1 … 0 … 1         Yes   Yes  Yes  Yes   Yes  Yes   Yes  No
  Y       Yule's Y              −1 … 0 … 1         Yes   Yes  Yes  Yes   Yes  Yes   Yes  No
  κ       Cohen's               −1 … 0 … 1         Yes   Yes  Yes  Yes   No   No    Yes  No
  M       Mutual Information    0 … 1              Yes   Yes  Yes  Yes   No   No*   Yes  No
  J       J-Measure             0 … 1              Yes   No   No   No    No   No    No   No
  G       Gini Index            0 … 1              Yes   No   No   No    No   No*   Yes  No
  s       Support               0 … 1              No    Yes  No   Yes   No   No    No   No
  c       Confidence            0 … 1              No    Yes  No   Yes   No   No    No   Yes
  L       Laplace               0 … 1              No    Yes  No   Yes   No   No    No   No
  V       Conviction            0.5 … 1 … ∞        No    Yes  No   Yes** No   No    Yes  No
  I       Interest              0 … 1 … ∞          Yes*  Yes  Yes  Yes   No   No    No   No
  IS      IS (cosine)           0 … 1              No    Yes  Yes  Yes   No   No    No   Yes
  PS      Piatetsky-Shapiro's   −0.25 … 0 … 0.25   Yes   Yes  Yes  Yes   No   Yes   Yes  No
  F       Certainty factor      −1 … 0 … 1         Yes   Yes  Yes  No    No   No    Yes  No
  AV      Added value           −0.5 … 0 … 1       Yes   Yes  Yes  No    No   No    No   No
  S       Collective strength   0 … 1 … ∞          No    Yes  Yes  Yes   No   Yes*  Yes  No
  ζ       Jaccard               0 … 1              No    Yes  Yes  Yes   No   No    No   Yes
  K       Klosgen's             —                  Yes   Yes  Yes  No    No   No    No   No

Support-based Pruning

– Most association rule mining algorithms use the support measure to prune rules and itemsets
– Study the effect of support pruning on the correlation of itemsets (see the simulation sketch below):
  • Generate 10,000 random contingency tables
  • Compute support and pairwise correlation for each table
  • Apply support-based pruning and examine the tables that are removed

Effect of Support-based Pruning

[Figure: histogram of correlation over all item pairs]

Effect of Support-based Pruning

[Figure: correlation histograms of the item pairs removed at support thresholds < 0.01, < 0.03, and < 0.05]

Support-based pruning eliminates mostly negatively correlated itemsets.
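A minimal simulation of this study, sketched in Python. The slides do not say how the 10,000 random contingency tables were generated, so the uniform random-split scheme in random_table below is an assumption; phi_coefficient is the 2×2 correlation (φ) defined above.

```python
import random
import math

def phi_coefficient(f11, f10, f01, f00):
    """phi of a 2x2 contingency table, the binary analogue of Pearson
    correlation: (P(A,B) - P(A)P(B)) / sqrt(P(A)P(~A)P(B)P(~B))."""
    n = f11 + f10 + f01 + f00
    pa, pb, pab = (f11 + f10) / n, (f11 + f01) / n, f11 / n
    denom = math.sqrt(pa * (1 - pa) * pb * (1 - pb))
    return (pab - pa * pb) / denom if denom > 0 else 0.0

def random_table(n=1000):
    """Random 2x2 table with n transactions (an assumed generation scheme)."""
    cuts = sorted(random.randint(0, n) for _ in range(3))
    return cuts[0], cuts[1] - cuts[0], cuts[2] - cuts[1], n - cuts[2]

random.seed(0)
tables = [random_table() for _ in range(10000)]

# Apply support-based pruning: look at the tables that would be removed
# because support(A, B) = f11 / n falls below the threshold.
for threshold in (0.01, 0.03, 0.05):
    removed = [phi_coefficient(*t) for t in tables
               if t[0] / sum(t) < threshold]
    avg = sum(removed) / len(removed)
    print(f"support < {threshold}: {len(removed)} tables removed, "
          f"mean phi = {avg:.3f}")
# Under this generation scheme the removed tables are dominated by
# uncorrelated or negatively correlated pairs, matching the histograms
# summarized above.
```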
Effect of Support-based Pruning

Investigate how support-based pruning affects the other measures.

Steps:
– Generate 10,000 contingency tables
– Rank each table according to the different measures
– Compute the pair-wise correlation between the measures (a sketch of this procedure follows at the end of this section)

Without Support Pruning (All Pairs):

[Figure: matrix of pairwise correlations among the 21 measures (Conviction, Odds ratio, Collective Strength, Correlation, Interest, PS, CF, Yule Y, Reliability, Kappa, Klosgen, Yule Q, Confidence, Laplace, IS, Support, Jaccard, Lambda, Gini, J-measure, Mutual Info), plus a scatter plot between the Correlation and Jaccard measures]

– Red cells indicate correlation between the pair of measures > 0.85
– 40.14% of the pairs have correlation > 0.85

Effect of Support-based Pruning

– 0.5% ≤ support ≤ 50%
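The ranking-and-correlation procedure can be sketched as follows. Only a handful of the measures from the table above are implemented, and using Pearson correlation between the rank vectors is an assumption about how the pair-wise correlation between measures is computed; `tables` is a list of (f11, f10, f01, f00) tuples such as those produced in the earlier sketch.

```python
import math

def measures(f11, f10, f01, f00):
    """A few interestingness measures from the table above, evaluated on a
    2x2 contingency table (f11 = both items present, f00 = both absent)."""
    n = f11 + f10 + f01 + f00
    pa, pb, pab = (f11 + f10) / n, (f11 + f01) / n, f11 / n
    return {
        "support": pab,
        "confidence": pab / pa if pa else 0.0,
        "interest": pab / (pa * pb) if pa * pb else 0.0,
        "cosine(IS)": pab / math.sqrt(pa * pb) if pa * pb else 0.0,
        "correlation": (pab - pa * pb)
            / math.sqrt(pa * (1 - pa) * pb * (1 - pb))
            if 0 < pa < 1 and 0 < pb < 1 else 0.0,
    }

def rank(values):
    """Rank position of each table under one measure (ties broken by order)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def measure_correlations(tables):
    """Rank every table under every measure, then correlate the rankings
    pairwise -- the procedure described in the steps above."""
    scored = [measures(*t) for t in tables]
    names = list(scored[0])
    ranked = {m: rank([s[m] for s in scored]) for m in names}
    return {(a, b): pearson(ranked[a], ranked[b])
            for a in names for b in names if a < b}

# Example usage (assuming `tables` from the previous sketch):
#   corr = measure_correlations(tables)
#   print(sorted(corr.items(), key=lambda kv: -kv[1])[:5])
```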