VIETNAM NATIONAL UNIVERSITY, HANOI
FACULTY OF TECHNOLOGY

PHAN XUAN HIEU

PARALLEL MINING FOR FUZZY ASSOCIATION RULES

Major: Information Technology
Code: 1.01.10

MASTER THESIS

RESEARCH ADVISOR: Dr. HA QUANG THUY

Hanoi - 2003

Table of contents

List of figures
List of tables
Notations & Abbreviations
Acknowledgements
Abstract
Chapter 1 Introduction to Data mining
  1.1 Data mining
    1.1.1 Data mining: Motivation
    1.1.2 Data mining: Definition
    1.1.3 Main steps in Knowledge discovery in databases (KDD)
  1.1 Major approaches and techniques in Data mining
    1.2.1 Major approaches and techniques in Data mining
    1.2.2 Kinds of data that could be mined
  1.2 Applications of Data mining
    1.2.1 Applications of Data mining
    1.2.2 Classification of Data mining systems
  1.3 Focused issues in Data mining
Chapter 2 Association rules
  2.1 Association rules: Motivation
  2.2 Association rules mining - Problem statement
  2.3 Main research trends in Association rules mining
Chapter 3 Fuzzy association rules mining
  3.1 Quantitative association rules
    3.1.1 Association rules with quantitative and categorical attributes
    3.1.2 Methods of data discretization
  3.2 Fuzzy association rules
    3.2.1 Data discretization based on fuzzy set
    3.2.2 Fuzzy association rules
    3.2.3 Algorithm for fuzzy association rules mining
    3.2.4 Relation between fuzzy association rules and quantitative ones
    3.2.5 Experiments and conclusions
Chapter 4 Parallel mining of fuzzy association rules
  4.1 Several previously proposed parallel algorithms
  4.2 A new parallel algorithm for fuzzy association rules mining
    4.2.1 Our approach
    4.2.2 The new algorithm
    4.2.3 Proof of correctness and computational complexity
  4.3 Experiments and conclusions
Conclusion
  Achievements throughout the dissertation
  Future work
Reference
Appendix

List of figures

Figure 1 - The volume of data strongly increases in the past two decades
Figure 2 - Steps in KDD process
Figure 3 - Illustration of an association rule
Figure 4 - "Sharp boundary problem" in data discretization
Figure 5 - Membership functions of fuzzy sets associated with "Age" attribute
Figure 6 - Membership functions of "Cholesterol_Low" and "Cholesterol_High"
Figure 7 - The processing time increases dramatically as decreasing the fminsup
Figure 8 - Number of itemsets and rules strongly increase as reducing the fminsup
Figure 9 - The number of rules enlarges remarkably as decreasing the fminsup
Figure 10 - Processing time increases largely as slightly increasing number of attrs
Figure 11 - Processing time increases linearly as increasing the number of records
Figure 12 - Optional choices for T-norm operator
Figure 13 - The mining results reflect the changing of threshold values
Figure 14 - Count distribution algorithm on a 3-processor parallel system
Figure 15 - Data distribution algorithm on a 3-processor parallel system
Figure 16 - The rule generating time largely reduces as increasing the minconf
Figure 17 - The number of rules largely reduces when increasing the minconf
Figure 18 - The illustration for division algorithm
Figure 19 - Processing time largely reduces as increasing the number of processes
Figure 20 - Mining time largely depends on number of processes (logical, physical)
Figure 21 - The main interface window of FuzzyARM tool
Figure 22 - The sub-window for adding new fuzzy sets
Figure 23 - The window for viewing mining results

List of tables

Table 1 - An example of transactional databases
Table 2 - Frequent itemsets in sample database in table with support = %
Table 3 - Association rules generated from frequent itemset ACW
Table 4 - Diagnostic database of heart disease on 17 patients
Table 5 - Data discretization for attributes having finite values
Table 6 - Data discretization for "Serum cholesterol" attribute
Table 7 - Data discretization for "Age" attribute
Table 8 - The diagnostic database of heart disease on 13 patients
Table 9 - Notations used in fuzzy association rules mining algorithm
Table 10 - The algorithm for mining fuzzy association rules
Table 11 - T_F: Values of records at attributes after fuzzifying
Table 12 - C1: set of candidate 1-itemsets
Table 13 - F2: set of frequent 2-itemsets
Table 14 - Fuzzy association rules generated from database in table
Table 15 - The sequential algorithm for generating association rules
Table 16 - Fuzzy attributes received after being fuzzified the database in table
Table 17 - Fuzzy attributes dividing algorithm among processors

Notations & Abbreviations

Abbreviations:

Word or phrase                        Abbreviation
Knowledge Discovery in Databases      KDD

Keywords: Data mining, association rules, binary association rules, quantitative association rules, fuzzy association rules, parallel algorithm

Chapter 1 Introduction to Data mining

1.1 Data mining

1.1.1 Data mining: Motivation

The past two decades have seen a dramatic increase in the amount of information or data being stored in electronic devices (e.g. hard disks, CD-ROMs, etc.).
This accumulation of data has taken place at an explosive rate. It has been estimated that the amount of information in the world doubles every two years and that the size and number of databases are increasing even faster. Figure 1 illustrates the data explosion [3].

Figure 1 - The volume of data strongly increases in the past two decades

We are drowning in data, but starving for useful knowledge. The vast amount of accumulated data is actually a valuable resource because information is the vital factor for business operations, and decision-makers could make the most of the data to gain precious insight into the business before making decisions. Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most significant information in their data collections (databases, data warehouses, data repositories). The automated, prospective analyses offered by data mining go beyond the normal analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. This is where data mining and knowledge discovery in databases (KDD) demonstrate their obvious benefits for today's competitive business environment. Nowadays, data mining and KDD play a key role in computer science and knowledge engineering.

Data mining was initially applied only in commerce (retail) and finance (stock market). However, it is now widely and successfully applied in other fields such as bioinformatics, medical treatment, telecommunications, education, etc.

1.1.2 Data mining: Definition

Before discussing some definitions of data mining, I would like to clarify the terminology so that readers can avoid unnecessary confusion. As mentioned before, we can roughly understand data mining as a process of extracting nontrivial, implicit,
previously unknown, and potentially useful knowledge from huge sets of data. Thus, we should name this process knowledge discovery in databases (KDD) instead of data mining. However, most researchers agree that the two terms (data mining and KDD) are similar and can be used interchangeably. They explain this "humorous misnomer" by noting that the core motivation of KDD is useful knowledge, but the main object they have to deal with during the mining process is data. Thus, in a sense, data mining and KDD imply the same meaning. However, in several materials, data mining is sometimes referred to as one step in the whole KDD process [3] [43].

There are numerous definitions of data mining and they are all descriptive. I restate here some of them that are widely accepted.

Definition one: W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus [43]: "Knowledge discovery in databases, also known as data mining, is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."

Definition two: M. Holsheimer and A. Siebes (1994): "Data mining is the search for relationships and global patterns that exist in large databases but are 'hidden' among the vast amount of data, such as a relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, of the real world registered by the database."

1.1.3 Main steps in Knowledge discovery in databases (KDD)

The whole KDD process is usually decomposed into the following steps [3] [14] [23]:

• Data selection: selecting or segmenting the necessary data that needs to be mined from large data sets (databases, data warehouses, data repositories) according to some criteria.
• Data preprocessing: this is the data cleaning and reconfiguration stage where some techniques are applied to deal with incomplete,
noisy, and inconsistent data. This step also tries to reduce data by using aggregate and group functions, data compression methods, histograms, sampling, etc. Furthermore, discretization techniques (binning, histograms, cluster analysis, entropy-based discretization, segmentation) can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into separate intervals. After this step, data is clean, complete, uniform, reduced, and discretized.
• Data transformation: in this step, data are transformed or consolidated into forms appropriate for mining. Data transformation can involve data smoothing and normalization. After this step, data are ready for the mining step.
• Data mining: this is considered to be the most important step in the KDD process. It applies data mining techniques (chiefly borrowed from machine learning and other fields) to discover and extract useful patterns or relationships from data.
• Knowledge representation and evaluation: the patterns identified by the system in the previous step are interpreted into knowledge that can then be used to support human decision-making (e.g. prediction and classification tasks, summarizing the contents of a database, or explaining observed phenomena). Knowledge representation also converts patterns into user-readable expressions such as trees, graphs, charts & tables, rules, etc.

Figure 2 - Steps in KDD process

1.1 Major approaches and techniques in Data mining

1.2.1 Major approaches and techniques in Data mining

Data mining consists of many approaches. They can be classified according to functionality, kind of knowledge, type of data to be mined, or other appropriate criteria [14]. I describe the major approaches below:

• Classification & prediction: this method tries to arrange a given object into an appropriate class among the others. The number of classes and their names are
definitely known. For example, we can classify or anticipate geographic regions according to weather and climate data. This approach normally uses typical techniques and concepts in machine learning such as decision tree, artificial neural network, k-min, support vector machine, etc. Classification is also called supervised learning.
• Association rules: this is a relatively simple form of rule, e.g. "80 percent of men that purchase beer also purchase dry beef". Association rules are now successfully applied in supermarkets (retail), medicine, bioinformatics, finance & stock market, etc.
• Sequential/temporal patterns mining: this method is somewhat similar to association rules except that the data and mining results (a kind of rule) always contain a temporal attribute to exhibit the order or sequence in which events or objects affect each other. This approach plays a key role in finance and the stock market thanks to its capability of prediction.
• Clustering & segmentation: this method tries to arrange a given object into a suitable category (also known as a cluster). The number of clusters may be dynamic and their labels (names) are unknown. Clustering and segmentation are also called unsupervised learning.
• Concept description & summarization: the main objective of this method is to describe or summarize an object so that the obtained information is compact and condensed. Document or text summarization is a typical example.

1.2.2 Kinds of data that could be mined

Data mining can work on various kinds of data. The most typical data types are as follows:

{1, 4, , , 9}
Processor P2: $I_{F2}$ = {1, 5, , 7, 9}
Processor P3: $I_{F3}$ = {2, , 9}
Processor P4: $I_{F4}$ = {2, 9}
Processor P5: $I_{F5}$ = {3, 4, , 9}
Processor P6: $I_{F6}$ = {3, 5, , 9}

We divide $I_F$ based on the first two attributes, Age and Cholesterol. The initial fuzzy attributes are now distributed among the processors and each processor receives fuzzy attributes. This division is "ideal" because the number of
processors (6) is equal to the product of the number of fuzzy sets associated with attribute Age (3) and the number of fuzzy sets associated with attribute Cholesterol (2) (i.e. 6 = 3*2).

The optimal division is one where we can disperse the fuzzy attributes equally to all processors in the system. When an optimal division cannot be obtained, we use "the most reasonable" one. This means that several processors are idle while others work hard. I would like to present an algorithm for dividing fuzzy attributes: it first tries to find the optimal solution and, if there is none, returns "the most reasonable" one. The division algorithm is formally described below.

Given a database $D$, where $I = \{i_1, i_2, \ldots, i_n\}$ is the set of $n$ attributes and $T = \{t_1, t_2, \ldots, t_m\}$ is the set of $m$ records. After being fuzzified, $D$, $I$, and $T$ are converted into $D_F$, $I_F$, and $T_F$ respectively.

$$I_F = \{[i_1, f_{i_1}^{1}], \ldots, [i_1, f_{i_1}^{k_1}], [i_2, f_{i_2}^{1}], \ldots, [i_2, f_{i_2}^{k_2}], \ldots, [i_n, f_{i_n}^{1}], \ldots, [i_n, f_{i_n}^{k_n}]\}$$

where $f_{i_j}^{u}$ and $k_j$ are the $u$-th fuzzy set and the number of fuzzy sets associated with attribute $i_j$. For example, for the database in table we have $I$ = {Age, SerumCholesterol, BloodSugar, HeartDisease} and after converting we receive $I_F$ as shown in table 16. In this case $k_1$, $k_2$, $k_3$, and $k_4$ are the numbers of fuzzy sets associated with the original attributes in $I$.

Let $FN = \{k_1\} \cup \{k_2\} \cup \ldots \cup \{k_n\} = \{s_1, s_2, \ldots, s_v\}$ ($v \le n$, as there may be pairs such as

[20] Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo (1994), "Efficient Algorithms for Discovering Association Rules", In KDD-1994: AAAI Workshop on Knowledge Discovery in Databases, pages 181-192, Seattle, Washington.
[21] L. A. Zadeh (1965), "Fuzzy sets", Information and Control, pages 338-353.
[22] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo (1994), "Finding Interesting Rules from Large Sets of Discovered Association Rules", In Proc. 3rd International Conference on Information and Knowledge Management, pages 401-408, Gaithersburg, Maryland.
[23] Manoel Mendonca (2000),
Mining Software Engineering Data: A Survey, University of Maryland, Department of Computer Science, A. V. Williams Building #3225, College Park, MD 20742.
[24] Mohammed J. Zaki and Ching-Jui Hsiao (1999), CHARM: An Efficient Algorithm for Closed Association Rules Mining, RPI Technical Report 99.
[25] Mohammed J. Zaki and Mitsunori Ogihara (1998), "Theoretical Foundations of Association Rules", In 3rd ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.
[26] Mohammed Zaki, Srinivasan Parthasarathy, and Mitsunori Ogihara, "Parallel Data Mining for Association Rules on Shared-Memory Systems", In Knowledge and Information Systems, Number 1, pages 1-29.
[27] MPI (1995), MPI: A Message-Passing Interface Standard, Message Passing Interface Forum.
[28] MPI (1997), MPI-2: Extensions to the Message-Passing Interface, Message Passing Interface Forum.
[29] MPI (1997), MPI-2 Journal of Development, Message Passing Interface Forum.
[30] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal (1999), Discovering Frequent Closed Itemsets for Association Rules, Laboratoire d'Informatique, Université Blaise Pascal - Clermont-Ferrand II, Complexe Scientifique des Cézeaux.
[31] Osmar R. Zaiane, Mohammad El-Hajj, and Paul Lu (2002), Fast Parallel Association Rule Mining Without Candidacy Generation, University of Alberta, Edmonton, Alberta, Canada.
[32] Qin Ding and William Perrizo (2001), Using Active Networks in Parallel Mining of Association Rules, Computer Science Department, North Dakota State University, Fargo, ND 58105-5164.
[33] R. Agrawal and P. Yu (1998), "Online Generation of Association Rules", In IEEE International Conference on Data Mining.
[34] Rakesh Agrawal and John Shafer (1996), Parallel mining of association rules: Design, implementation and experience, Research Report RJ 10004, IBM Almaden Research Center, San Jose, California.
[35] Rakesh Agrawal and Ramakrishnan Srikant (1994), "Fast
Algorithms for Mining Association Rules", In Proc. of the 20th International Conference on Very Large Databases, Santiago, Chile.
[36] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami (1993), "Mining association rules between sets of items in large databases", In Proc. of the ACM SIGMOD Conference on Management of Data, pages 207-216, Washington, DC.
[37] Ramakrishnan Srikant and Rakesh Agrawal (1995), "Mining Generalized Association Rules", In Proc. of the 21st International Conference on Very Large Databases, Zurich, Switzerland.
[38] Ramakrishnan Srikant and Rakesh Agrawal (1996), "Mining Quantitative Association Rules in Large Relational Tables", In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, Montreal, Canada.
[39] R. J. Miller and Y. Yang (1998), Association Rules over Interval Data, Department of Computer & Information Science, Ohio State University, USA.
[40] RS/6000 SP (1997), Practical MPI Programming, Yukiya Aoyama & Jun Nakano, International Technical Support Organization, www.redbooks.ibm.com.
[41] T. Murai and Y. Sato (2000), "Association Rules from a Point of View of Modal Logic and Rough Sets", In Proceedings of the Fourth Asian Fuzzy Symposium, Tsukuba, Japan, pp. 427-432.
[42] Tran Vu Ha, Phan Xuan Hieu, Bui Quang Minh, and Ha Quang Thuy (2002), "A Model for Parallel Association Rules Mining from The Point of View of Rough Set", In Proc. of International Conference on East-Asian Language Processing and Internet Information Technology, Hanoi.
[43] Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy (1996), Advances in Knowledge Discovery and Data Mining, AAAI Press / The MIT Press.
[44] Wei Wang, Jiong Yang, and Philip S. Yu (2001), Efficient Mining of Weighted Association Rules (WAR), IBM Watson Research Center.
[45] Neil MacDonald, Elspeth Minty, Tim Harding, Simon Brown (1997), Writing Message-Passing Parallel Programs with MPI, Edinburgh Parallel Computing Centre, The University of
Edinburgh.
[46] Zijian Zheng, Ron Kohavi, and Llew Mason (2001), Real World Performance of Association Rule Algorithms, Blue Martini Software, 2600 Campus Drive, San Mateo, CA 94403, USA.
[47] Zimmermann H. (1991), Fuzzy Set Theory and Its Applications, Kluwer Academic Publishers.

Appendix

The FuzzyARM (Fuzzy Association Rules Mining) tool was intentionally developed for experiments. I describe several characteristics as follows:

• It was written in MS Visual C++ and can run on all MS Win-32 platforms
• Hardware: IBM PC Pentium IV 1.5 GHz, 512 MB RAM
• Testing data: heart and diabetes disease data, vehicle and auto data, and other synthetic datasets

Some screenshots of the FuzzyARM (Fuzzy Association Rules Mining) tool are displayed below:

Figure 21 - The main interface window of FuzzyARM tool

Figure 22 - The sub-window for adding new fuzzy sets

Figure 23 - The window for viewing mining results
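The results window lists mined fuzzy association rules together with their measures. As a rough illustration only, the sketch below shows one common way the fuzzy support and confidence of a rule could be computed from fuzzified records. It assumes a T-norm (product or minimum) is used to combine membership degrees within a record. The function names, the record layout, and the sample membership values are hypothetical and are not taken from the FuzzyARM source code.

```python
from functools import reduce
from typing import Callable, Dict, List, Sequence

# A fuzzified record maps a fuzzy attribute name (e.g. "Age_Old")
# to its membership degree in [0, 1]. This layout is an assumption.
Record = Dict[str, float]
TNorm = Callable[[float, float], float]


def t_norm_product(a: float, b: float) -> float:
    """Algebraic product T-norm."""
    return a * b


def t_norm_min(a: float, b: float) -> float:
    """Minimum T-norm."""
    return min(a, b)


def fuzzy_support(records: List[Record],
                  itemset: Sequence[str],
                  t_norm: TNorm = t_norm_product) -> float:
    """Average, over all records, of the T-norm of the membership
    degrees of the fuzzy attributes in `itemset`."""
    if not records:
        return 0.0
    total = 0.0
    for record in records:
        degrees = [record.get(attr, 0.0) for attr in itemset]
        # 1.0 is the neutral element for both T-norms on [0, 1].
        total += reduce(t_norm, degrees, 1.0)
    return total / len(records)


def fuzzy_confidence(records: List[Record],
                     antecedent: Sequence[str],
                     consequent: Sequence[str],
                     t_norm: TNorm = t_norm_product) -> float:
    """Support of (antecedent + consequent) divided by the support
    of the antecedent; 0.0 when the antecedent has no support."""
    support_x = fuzzy_support(records, antecedent, t_norm)
    if support_x == 0.0:
        return 0.0
    support_xy = fuzzy_support(
        records, list(antecedent) + list(consequent), t_norm)
    return support_xy / support_x


# Hypothetical fuzzified records, for illustration only.
records: List[Record] = [
    {"Age_Old": 1.0, "Cholesterol_High": 0.5},
    {"Age_Old": 0.5, "Cholesterol_High": 1.0},
    {"Age_Old": 0.0, "Cholesterol_High": 1.0},
    {"Age_Old": 0.5, "Cholesterol_High": 0.0},
]

print(fuzzy_support(records, ["Age_Old", "Cholesterol_High"]))       # 0.25
print(fuzzy_confidence(records, ["Age_Old"], ["Cholesterol_High"]))  # 0.5
```

A rule would be kept only if both values reach the user-chosen minimum support and minimum confidence thresholds. Passing `t_norm_min` instead of the default switches the T-norm without changing the rest of the computation.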