1. Association Rules in Data Mining

In data mining, the goal of association rules (AR) is to discover relationships among items in large volumes of data. The basic ideas are summarized below.

Let T be a set of transactions t1, t2, …, tn: T = {t1, t2, …, tn}. T is called the transaction database.

Each transaction ti contains a set of items I, called an itemset: I = {i1, i2, …, im}. An itemset with k items is called a k-itemset.

The goal of association rule mining is to find associations or correlations among items. These rules have the form X => Y.

In basket analysis, the rule X => Y can be read as: people who buy the items in set X also tend to buy the items in set Y (X and Y are both itemsets). For example, if X = {Apple, Banana}, Y = {Cherry, Durian}, and the rule X => Y holds, we can say that people who buy Apple and Banana also tend to buy Cherry and Durian.

From a statistical point of view, X is treated as the independent variable and Y as the dependent variable.
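As a minimal illustration of the rule X => Y from the basket example, the sketch below counts, over a small made-up transaction database, how many baskets contain X and how many of those also contain Y. The five transactions and the helper names (`fraction_containing`, `co_purchase_rate`) are assumptions introduced here for illustration only; they do not come from the source.

```python
from typing import FrozenSet, List

# Hypothetical toy transaction database T = {t1, ..., t5} (illustrative data only).
transactions: List[FrozenSet[str]] = [
    frozenset({"Apple", "Banana", "Cherry", "Durian"}),
    frozenset({"Apple", "Banana", "Cherry", "Durian"}),
    frozenset({"Apple", "Banana", "Cherry", "Durian"}),
    frozenset({"Apple", "Banana", "Cherry"}),
    frozenset({"Banana", "Durian"}),
]


def fraction_containing(db: List[FrozenSet[str]], itemset: FrozenSet[str]) -> float:
    """Fraction of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)


def co_purchase_rate(db: List[FrozenSet[str]], x: FrozenSet[str], y: FrozenSet[str]) -> float:
    """Among the transactions that contain X, the fraction that also contain Y."""
    with_x = [t for t in db if x <= t]
    if not with_x:
        raise ValueError("no transaction contains X")
    return sum(1 for t in with_x if y <= t) / len(with_x)


X = frozenset({"Apple", "Banana"})
Y = frozenset({"Cherry", "Durian"})

# Share of all baskets that contain both X and Y.
print(fraction_containing(transactions, X | Y))
# Share of the baskets containing X that also contain Y.
print(co_purchase_rate(transactions, X, Y))
```

With this toy data, X appears in four of the five baskets and three of those also contain Y, so the rule X => Y holds for 3 of the 4 baskets that contain X.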
Association Rules
Hawaii International Conference on System Sciences (HICSS-40)
January 2007
David L Olson
Yanhong Li

Fuzzy Association Rules
• Association rules mining provides information to assess significant correlations in large databases
• IF X THEN Y
  – Initial data mining analysis
  – Not predictive
• SUPPORT: degree to which relationship appears in data
• CONFIDENCE: probability that if X, then Y

Association Rule Algorithms
• APriori
  • Agrawal et al., 1993; Agrawal & Srikant, 1994
    – Find correlations among transactions, binary values
• Weighted association rules
  • Cai et al., 1998; Lu et al., 2001
• Cardinal data
  • Srikant & Agrawal, 1996
    – Partitions attribute domain, combines adjacent partitions until binary

Fuzzy Analysis
Deal with vagueness & uncertainty
• Fuzzy Set Theory – Zadeh [1965]
• Probability Theory – Pearl [1988]
• Rough Set Theory – Pawlak [1982]
• Set Pair Theory – Zhao [2000]

Fuzzy Association Rules
• Most based on APriori algorithm
• Treat all attributes as uniform
• Can increase number of rules by decreasing minimum support, decreasing minimum confidence
  – Generates many uninteresting rules
  – Software takes a lot longer

Gyenesei (2000)
• Studied weighted quantitative association rules in fuzzy domain
  – With & without normalization
  – NONNORMALIZED
    • Used product operator to define combined weight and fuzzy value
    • If weight small, support level small, tends to have data overflow
  – NORMALIZED
    • Used geometric mean of item weights as combined weight
    • Support then very small

Algorithm
• Get membership functions, minimum support, minimum confidence
• Assign weight to each fuzzy membership for each attribute (categorical)
• Calculate support for each fuzzy region
• If support > minimum, OK
• If confidence > minimum, OK
• If both OK, generate rules

Demo Model: Loan App
Case           10
Age            20      26      46      31      28      21      46      25      38      27
Income         52623   23047   56810   38388   80019   74561   65341   46504   65735   26047
Risk           -38954  -23636  45669   -7968   -35125  -47592  58119   -30022  30571   -6
Credit Result  Red     Green   Green   Amber   Green   Green   Green   Green   Green   Red

Fuzzified Age
[Figure 2: The membership functions of attribute Age. Vertical axis: membership value; horizontal axis: Age, with marks at 25, 35, 40, 50 and 100; fuzzy sets Young, Middle, Old.]

Fuzzify Age
Case    10
Age     20  26  46  31  28  21  46  25  38  27
Young   1.000  0.9  0.4  0.7  1  0.8
Middle  0.1  0.4  0.6  0.3  0.4  0.2
Old     0  0.6  0  0.6  0

Calculate Support for Each Pair of Fuzzy Categories
• Membership value
  – Identify weights for each attribute
  – Identify highest fuzzy membership category for each case
    • Membership value = minimum weight associated with highest fuzzy membership category
• Support
  – Average membership value for all cases

Support by Single Item
Category        Region  Weight  Sup(Rjk)
Age Young       R11     0.45    0.261
Age Middle      R12     0.45    0.135
Age Old         R13     0.45    0.059
Income High     R21     0.55    0.000
Income Middle   R22     0.55    0.490
Income Low      R23     0.55    0.060
Risk High       R31     0.70    0.320
Risk Middle     R32     0.70    0.146
Risk Low        R33     0.70    0.233
Credit Good     R41     0.80    0.576
Credit Bad      R42     0.80    0.244

Support
• If support for pair of categories is above minimum support, retain
• Identifies all pairs of fuzzy categories with sufficiently strong relationship
• For outcomes, R51 (On Time) strong, R52 (Default) not

Support by Pair: minsup 0.25
Pair     Support
R11R22   0.235
R11R31   0.207
R11R41   0.212
R11R51   0.230
R22R31   0.237
R22R41   0.419
R22R51   0.449
R31R41   0.266
R31R51   0.264
R41R51   0.560

Support by Triplet: minsup 0.25
Triplet     Support
R22R41R51   0.417
R22R31R41   0.198
R22R31R51   0.196
R31R41R51   0.264

Quartets
• None qualify, so algorithm stops

Confidence
• Identify direction
• For those training set cases involving the pair of attributes, what proportion came out as predicted?
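The confidence step can be sketched numerically. The slides do not state a formula, so the code below assumes the common definition conf(X→Y) = sup(X ∪ Y) / sup(X), applied to the support figures from the single-item, pair and triplet tables. Under that assumption the results agree with the deck's listed confidence values to within about 0.001 (the inputs are themselves rounded to three decimals). The names `SUPPORT`, `region_set` and `confidence` are hypothetical helpers, not part of the original presentation.

```python
from typing import Dict, FrozenSet, Iterable


def region_set(*regions: str) -> FrozenSet[str]:
    """Order-independent key for a combination of fuzzy regions."""
    return frozenset(regions)


# Support values copied from the slide tables (three-decimal figures).
SUPPORT: Dict[FrozenSet[str], float] = {
    region_set("R22"): 0.490,
    region_set("R31"): 0.320,
    region_set("R41"): 0.576,
    region_set("R22", "R41"): 0.419,
    region_set("R22", "R51"): 0.449,
    region_set("R31", "R41"): 0.266,
    region_set("R31", "R51"): 0.264,
    region_set("R41", "R51"): 0.560,
    region_set("R22", "R41", "R51"): 0.417,
    region_set("R31", "R41", "R51"): 0.264,
}


def confidence(antecedent: Iterable[str], consequent: Iterable[str]) -> float:
    """Assumed definition: sup(antecedent and consequent together) / sup(antecedent)."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return SUPPORT[both] / SUPPORT[a]


MINCONF = 0.9  # minimum confidence used in the slides

candidates = [
    (["R22"], ["R41"]),
    (["R22"], ["R51"]),
    (["R41"], ["R51"]),
    (["R22", "R41"], ["R51"]),
    (["R31", "R41"], ["R51"]),
]

for lhs, rhs in candidates:
    c = confidence(lhs, rhs)
    verdict = "keep" if c >= MINCONF else "drop"
    print(f"{''.join(lhs)} -> {''.join(rhs)}: {c:.3f} ({verdict})")
```

Storing each combination as a `frozenset` means the lookup does not depend on the order in which regions are written, so R41R22 and R22R41 resolve to the same support entry.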
Confidence Values: Pairs
Minimum confidence 0.9
Rule          Confidence
R22→R41       0.855
R41→R22       0.727
R22→R51       0.916
R51→R22       0.697
R31→R41       0.831
R41→R31       0.462
R31→R51       0.825
R51→R31       0.410
R41→R51       0.972

R41R22→R51    0.995
R41R51→R22    0.744
R22R51→R41    0.928
R31R41→R51    0.993
R31R51→R41    1.000
R51R41→R31    0.472

Rules
• IF Income is Middle THEN Outcome is On-Time
  – R22→R51 support 0.490 confidence 0.916
• IF Credit is Good THEN Outcome is On-Time
  – R41→R51 support 0.576 confidence 0.972
• IF Income is Middle AND Credit is Good THEN Outcome is On-Time
  – R22R41→R51 support 0.419 confidence 0.995
• IF Risk is High AND Credit is Good THEN Outcome is On-Time
  – R31R41→R51 support 0.266 confidence 0.993

Rules vs Support

Rules vs Confidence

Higher order combinations
• Try triplets
  – If ambitious, sets of 4, and beyond
    • Here, none
• Problems:
  – Computational complexity explodes
  – Doesn't guarantee total coverage
    • That also would explode complexity
    • Can control by lowering minsup, minconf

Simulation Testing
• Selected 550 cases
  – Held out 100
• Randomly assigned weights to each fuzzy region of each attribute
  – minsup {0.35, 0.45, 0.55, 0.65}
  – minconf {0.7, 0.8, 0.9}

Simulation Results