Tạp chí Tin học và Điều khiển học (Journal of Computer Science and Cybernetics), Vol. 25, No. 1 (2009), 69-78

AN ALGORITHM FOR FINDING A REDUCT IN INCOMPLETE DECISION TABLES

HOÀNG THỊ LAN GIAO
Faculty of Information Technology, College of Sciences, Hue University

Abstract. The aim of this work is to generalize the concept of knowledge reduction to the class of incomplete decision tables. By improving the rough-entropy-based method of J. Liang and Z. Xu [3], we establish a heuristic algorithm for finding a reduct of an incomplete decision table. The time complexity of this algorithm is $O(kn^2 \log n)$, where $k$ is the number of condition attributes and $n$ is the number of objects in the data table.

1. INTRODUCTION

In large data models, data can easily be missing for many different reasons. Knowledge reduction becomes difficult in incomplete information systems in general, and in decision tables in particular, where the derived decision rules will also be incomplete. Missing information usually falls into one of three cases [9]: nothing is known about the data (it is unknown whether a person has a telephone number), the value does not exist (the person holds no position), or the value exists but is unknown (a date of birth). In this paper we study only incomplete decision tables whose missing attribute values exist but are unknown. An algorithm that finds a reduct of a table with missing data of this kind, based on the concept of rough entropy, is also proposed.

2. BASIC CONCEPTS

2.1. Incomplete information systems

Let $\mathcal{A} = (U, A)$ be an information system, where $U$ is a finite, nonempty set of objects and $A$ is a finite, nonempty set of attributes. For each $u \in U$ and $a \in A$ we denote by $u(a)$ the value of attribute $a$ for object $u$. If $X \subseteq A$ is a set of attributes, we denote by $u(X)$ the tuple of values $u(a)$ with $a \in X$. Thus, for two objects $u, v \in U$, we say $u(X) = v(X)$ if $u(a) = v(a)$ for every attribute $a \in X$. The set of all values of attribute $a$ is denoted $V_a$.

A decision table is an information system whose attribute set $A$ consists of two disjoint sets $C$ and $D$, where $C$ is called the set of condition attributes and $D$ the set of decision attributes.

The information system $\mathcal{A}$ is called incomplete if there exist an attribute $a \in A$ and an object $u \in U$ for which the value $u(a)$ is missing; in other words, $V_a$ contains a null value. In tables we denote this value by the character "*". The notion of an incomplete decision table is defined in the same way.

2.2. Indiscernibility relations on incomplete information systems

Let $\mathcal{A} = (U, A)$ be an information system. For each attribute set $B \subseteq A$ we define a binary relation $IND(B)$ on $U$ by

$$IND(B) = \{(u, v) \in U \times U \mid u(B) = v(B)\}.$$

$IND(B)$ is called the $B$-indiscernibility relation. It is easy to verify that this is an equivalence relation on $U$. If $(u, v) \in IND(B)$, the two objects $u$ and $v$ cannot be distinguished by the attributes in $B$. The equivalence class containing $u$ is denoted $[u]_B$. The relation $IND(B)$ is completely determined by the equivalence classes $[u]_B$, $u \in U$. The quotient set of $IND(B)$ is denoted $U/B$, that is, $U/B = \{[u]_B \mid u \in U\}$.

For an incomplete system, we define a binary relation on $U$, denoted $SIM(B)$, for each $B \subseteq A$.

Definition 2.1 [3].

$$SIM(B) = \{(u, v) \in U \times U \mid \forall a \in B,\ u(a) = v(a) \text{ or } u(a) = * \text{ or } v(a) = *\}.$$

Clearly, $SIM(B) = \bigcap_{b \in B} SIM(\{b\})$.

We write $S_B(u) = \{v \in U \mid (u, v) \in SIM(B)\}$ and call it the tolerance class of $u$ under $SIM(B)$; $S_B(u)$ is the maximal set of objects $v$ that cannot be distinguished from $u$ by the attribute set $B$. The objects of $U$ are then classified by the relation $SIM(B)$, each class being a set $S_B(u)$, $u \in U$. The family of these classes is denoted $U/SIM(B)$. It is a cover of $U$ ($\bigcup_{u \in U} S_B(u) = U$), and every member of the family is nonempty because $SIM(B)$ is reflexive ($S_B(u) \supseteq \{u\}$). In general it is not a partition of $U$, and $SIM(B)$ is not an equivalence relation. However, when the system is complete, $SIM(B)$ coincides with $IND(B)$, and therefore $S_B(u) = [u]_B$ for $u \in U$, that is, $U/SIM(B) = U/B$.

Example 2.1. Consider the incomplete information system with three attributes $A = \{$Body temperature, Headache, Muscle pain$\}$ given in Table 1.

[Table 1: eight objects, 1 to 8, described by Body temperature (high, very high, normal, or *), Headache (yes, no, or *) and Muscle pain (yes, no, or *); the cell layout is not recoverable from the source.]

We have:
$S_A(1) = \{1, 3, 8\}$; $S_A(2) = \{2, 8\}$; $S_A(3) = \{1, 3\}$; $S_A(4) = \{4, 5, 8\}$; $S_A(5) = \{4, 5, 8\}$; $S_A(6) = \{6, 8\}$; $S_A(7) = \{7\}$; $S_A(8) = \{1, 2, 4, 5, 6, 8\}$.

$U/SIM(A) = \{\{1, 3, 8\}; \{2, 8\}; \{1, 3\}; \{4, 5, 8\}; \{4, 5, 8\}; \{6, 8\}; \{7\}; \{1, 2, 4, 5, 6, 8\}\}$.

With $B = \{$Body temperature, Muscle pain$\}$:
$S_B(1) = \{1, 3, 8\}$; $S_B(2) = \{2, 8\}$; $S_B(3) = \{1, 3, 6, 8\}$; $S_B(4) = \{4, 5, 8\}$; $S_B(5) = \{4, 5, 8\}$; $S_B(6) = \{3, 6, 8\}$; $S_B(7) = \{7, 8\}$; $S_B(8) = \{1, 2, 3, 4, 5, 6, 7, 8\}$.

Definition 2.2. A condition attribute $c \in C$ is called dispensable in the decision table $T$ if $U/SIM(C \setminus \{c\}) = U/SIM((C \setminus \{c\}) \cup D)$. Otherwise, $c$ is called indispensable. The incomplete decision table $T$ is called independent if every attribute $c \in C$ is indispensable. The set of all indispensable attributes of $T$ is called the core and is denoted $Core(C)$. From here on we also write Core for $Core(C)$.

Definition 2.3. A set of attributes $R \subseteq C$ is called a reduct of the condition attribute set $C$ if $T' = (U, R \cup D)$ is independent and $U/SIM(R) = U/SIM(R \cup D)$.

Clearly, $C$ may have several reducts. We denote by $RED(C)$ the set of all reducts of $C$ in $T$. An attribute is indispensable if and only if it belongs to every reduct of $C$.

The two definitions above extend the definitions of dispensable attribute and reduct for (complete) decision tables. Indeed, when $T$ is complete we have

$$U/SIM(C \setminus \{c\}) = U/SIM((C \setminus \{c\}) \cup D)$$
$$\Leftrightarrow\ U/(C \setminus \{c\}) = U/((C \setminus \{c\}) \cup D)$$
$$\Leftrightarrow\ \mathrm{Card}(\Pi(C \setminus \{c\})) = \mathrm{Card}(\Pi((C \setminus \{c\}) \cup D))$$
$$\Leftrightarrow\ c \text{ is a dispensable attribute (by [2]).}$$

3. ROUGH ENTROPY

The concept of rough entropy was introduced in [1, 6]. Here we recall the rough entropy of knowledge in an incomplete system as proposed by Jiye Liang and Zongben Xu.

Definition 3.1. Let $\mathcal{A} = (U, A)$ be an incomplete information system and $B \subseteq A$. The rough entropy of the knowledge $B$ is the value

$$E(B) = -\sum_{i=1}^{n} \frac{|S_B(x_i)|}{n} \log \frac{1}{|S_B(x_i)|},$$

where $\mathrm{Card}(U) = n$, $|S_B(x)|$ stands for $\mathrm{Card}(S_B(x))$ and $\log x$ stands for $\log_2 x$. Here $\frac{|S_B(x_i)|}{n}$ is the probability that an object of $U$ belongs to the class $S_B(x_i)$, and $\frac{1}{|S_B(x_i)|}$ is the probability that an object of the class $S_B(x_i)$ equals $x_i$.

From this definition we obtain the following properties of rough entropy in an incomplete information system $\mathcal{A} = (U, A)$ [3].

Property 3.1. Let $P, Q \subseteq A$. If there exists a bijection $h: U/SIM(P) \to U/SIM(Q)$ such that $|h(S_P(x_i))| = |S_P(x_i)|$, $i = 1, 2, \ldots, |U|$, then $E(P) = E(Q)$.

This means that the rough entropy of knowledge is invariant over isomorphic families of classes $U/SIM(P)$.

Property 3.2 (Monotonically decreasing). Let $B, C \subseteq A$. If $B \subseteq C$ (clearly $S_C(u) \subseteq S_B(u)$ for every $u \in U$, that is, $U/SIM(C)$ is finer than $U/SIM(B)$), then $E(C) \le E(B)$.

This property shows that the finer the family of tolerance classes $U/SIM(B)$, the smaller the rough entropy of the knowledge.

Property 3.3 (Equivalence). Let $C \subseteq A$. Then $U/SIM(C) = U/SIM(A)$ if and only if $E(C) = E(A)$.

Thus, for $C \subseteq A$, the classification induced by $SIM(C)$ is equivalent to the classification induced by $SIM(A)$ if and only if the entropies of the two attribute sets are equal.

Property 3.4 (Maximum). Let $C \subseteq A$. The maximum value of the rough entropy of the knowledge $C$ is $|U| \log |U|$, attained when $S_C(x) = U$ for all $x \in U$. In that case $U/SIM(C) = \{U\}$.

The entropy is maximal when every class of $U/SIM(C)$ equals $U$. This also means that the information obtained from the corresponding attribute set is the least reliable.

Property 3.5 (Minimum). Let $C \subseteq A$. The minimum value of the rough entropy of the knowledge $C$ is $0$, attained when $S_C(x) = \{x\}$ for all $x \in U$.

The entropy is minimal when each tolerance class contains exactly one element, so the information obtained from the corresponding attribute set is the most certain.

Proposition 3.1. Let $T = (U, C \cup D)$ be an incomplete decision table. Then a set $R \subseteq C$ is a reduct of the condition attribute set $C$ in $T$ if and only if $R$ is a minimal set satisfying $E(R) = E(R \cup D)$.

Proof. Put $T' = (U, R \cup D)$. Then

$R$ is a reduct
$\Leftrightarrow$ $T'$ is independent and $U/SIM(R) = U/SIM(R \cup D)$
$\Leftrightarrow$ $\forall c \in R$, $U/SIM(R \setminus \{c\}) \ne U/SIM((R \setminus \{c\}) \cup D)$, and $U/SIM(R) = U/SIM(R \cup D)$
$\Leftrightarrow$ $R$ is minimal and $E(R) = E(R \cup D)$ (by the equivalence property).

4. SIGNIFICANCE OF AN ATTRIBUTE

Definition 4.1. Let $T = (U, C \cup D)$ be an incomplete decision table. The significance of an attribute $c \in C$, denoted $sig_{C \setminus \{c\}}(c)$, is defined by

$$sig_{C \setminus \{c\}}(c) = E(C \setminus \{c\}) - E((C \setminus \{c\}) \cup D).$$

Proposition 4.1.
a) $0 \le sig_{C \setminus \{c\}}(c) \le |U| \log |U|$.
b) $Core(C) = \{c \in C \mid sig_{C \setminus \{c\}}(c) > 0\}$.

Proof. a) $0 \le E(C \setminus \{c\}) - E((C \setminus \{c\}) \cup D)$ because $C \setminus \{c\} \subseteq (C \setminus \{c\}) \cup D$ and rough entropy is monotonically decreasing. On the other hand, $E((C \setminus \{c\}) \cup D) \ge 0$ by the minimum property and $E(C \setminus \{c\}) \le |U| \log |U|$ by the maximum property. Hence

$$|U| \log |U| \ge E(C \setminus \{c\}) \ge E(C \setminus \{c\}) - E((C \setminus \{c\}) \cup D) \ge 0,$$

that is, $0 \le sig_{C \setminus \{c\}}(c) \le |U| \log |U|$.

b) $c \in C$ is an indispensable attribute of $C$ if and only if $U/SIM(C \setminus \{c\}) \ne U/SIM((C \setminus \{c\}) \cup D)$, that is, $E(C \setminus \{c\}) - E((C \setminus \{c\}) \cup D) > 0$, or $sig_{C \setminus \{c\}}(c) > 0$. Therefore $Core(C) = \{c \in C \mid sig_{C \setminus \{c\}}(c) > 0\}$. $\blacksquare$

Definition 4.2. Let $T = (U, C \cup D)$ be an incomplete decision table, $R \subseteq C$ and $c \in C \setminus R$. The significance of the attribute $c$ with respect to $R$, denoted $sig_R(c)$, is defined by

$$sig_R(c) = E(R \cup D) - E(R \cup \{c\} \cup D).$$

5. AN ALGORITHM FOR FINDING A REDUCT

Based on the properties of rough entropy and on the significance of an attribute, this paper proposes a heuristic algorithm for finding a reduct of an incomplete decision table. The algorithm starts from the core (because every reduct contains the core) and adds attributes until a true reduct is obtained. At each step, the attribute chosen for addition is the one with the greatest significance. In detail, the algorithm is as follows.

Input: an incomplete decision table $T = (U, C \cup D)$.
Output: a reduct $R$ of $T$.
Method:
B1. Compute $Core(C) := \{c \in C \mid sig_{C \setminus \{c\}}(c) > 0\}$.
B2. $R := Core(C)$.
B3. Compute $E(R)$ and $E(R \cup D)$.
B4. While $E(R) \ne E(R \cup D)$:
      For each $c \in C \setminus R$, compute $sig_R(c)$.
      Choose $c$ such that $sig_R(c) = \max\{sig_R(c') \mid c' \in C \setminus R\}$.
      $R := R \cup \{c\}$.
      Compute $E(R)$ and $E(R \cup D)$.
B5. $R := R \setminus Core(C)$.
B6. For each $c \in R$:
      If $E((R \setminus \{c\}) \cup Core(C)) = E((R \setminus \{c\}) \cup Core(C) \cup D)$ then $R := R \setminus \{c\}$.
B7. $R := R \cup Core(C)$.

The computational complexity of the algorithm is determined by the while loop in B4, which runs at most $\mathrm{Card}(C)$ times because each iteration increases the number of selected attributes by 1. In each iteration we need to compute $sig_R(c)$ for every $c \in C \setminus R$, together with $E(R)$ and $E(R \cup D)$. Computing $sig_R(c)$ also reduces to computing rough entropies. Hence, if $n$ is the number of objects in the decision table and $k$ is the number of condition attributes, each iteration computes at most $k$ rough entropy values. From the rough entropy formula, the cost of one such computation is $O(n^2 \log n)$, because each time we find a class of $U/SIM(R)$ we must sort the objects of $U$, and sorting costs $O(n \log n)$. We can therefore conclude that the computational complexity of the algorithm is $O(kn^2 \log n)$.

Example 5.1. Consider the decision table of [11] given in Table 2.

[Table 2: objects $u_1$ to $u_8$ with columns $c_1$, $c_2$, $c_3$, $c_4$, $d$; the cell layout is not recoverable from the source.]

Table 2 stores information about eight cars, obtained through the condition attributes $C = \{c_1 (Weight), c_2 (Door), c_3 (Size), c_4 (Cylinder)\}$ and the decision attribute $D = \{d (Mileage)\}$.

In this table the core is $\{c_1\}$ and there are two reducts, $R_1 = \{c_1, c_3\}$ and $R_2 = \{c_1, c_4\}$. Suppose that for some reason the complete information above is not available and some attribute values are missing (Table 3).

[Table 3: the same objects and columns as Table 2, with missing values marked *; the cell layout is not recoverable from the source.]

Running the algorithm step by step on this table gives the following results.

B1. Finding the core.

Consider $c_1$:
$S_{C \setminus \{c_1\}}(u_1) = \{u_1, u_4\}$, $S_{(C \cup D) \setminus \{c_1\}}(u_1) = \{u_1\}$
$S_{C \setminus \{c_1\}}(u_2) = \{u_2, u_7\}$, $S_{(C \cup D) \setminus \{c_1\}}(u_2) = \{u_2, u_7\}$
$S_{C \setminus \{c_1\}}(u_3) = \{u_3, u_5, u_6\}$, $S_{(C \cup D) \setminus \{c_1\}}(u_3) = \{u_3, u_6\}$
$S_{C \setminus \{c_1\}}(u_4) = \{u_1, u_4\}$, $S_{(C \cup D) \setminus \{c_1\}}(u_4) = \{u_4\}$
$S_{C \setminus \{c_1\}}(u_5) = \{u_3, u_5, u_6\}$, $S_{(C \cup D) \setminus \{c_1\}}(u_5) = \{u_5, u_6\}$
$S_{C \setminus \{c_1\}}(u_6) = \{u_3, u_5, u_6\}$, $S_{(C \cup D) \setminus \{c_1\}}(u_6) = \{u_3, u_6\}$
$S_{C \setminus \{c_1\}}(u_7) = \{u_2, u_7, u_8\}$, $S_{(C \cup D) \setminus \{c_1\}}(u_7) = \{u_2, u_7, u_8\}$
$S_{C \setminus \{c_1\}}(u_8) = \{u_7, u_8\}$, $S_{(C \cup D) \setminus \{c_1\}}(u_8) = \{u_7, u_8\}$

$$sig_{C \setminus \{c_1\}}(c_1) = E(C \setminus \{c_1\}) - E((C \setminus \{c_1\}) \cup D)$$
$$= -\sum_{i=1}^{8} \frac{|S_{C \setminus \{c_1\}}(u_i)|}{n} \log \frac{1}{|S_{C \setminus \{c_1\}}(u_i)|} + \sum_{i=1}^{8} \frac{|S_{(C \setminus \{c_1\}) \cup D}(u_i)|}{n} \log \frac{1}{|S_{(C \setminus \{c_1\}) \cup D}(u_i)|}$$
$$= \frac{9}{8} \log 3 - \frac{1}{4} > 0.$$

Hence $c_1$ is an indispensable attribute.

Consider $c_2$:
$S_{C \setminus \{c_2\}}(u_1) = \{u_1, u_6\}$; $S_{C \setminus \{c_2\}}(u_2) = \{u_{?}, u_{?}, u_{?}\}$; S_{C\{
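The definitions of the tolerance class $S_B(u)$, the rough entropy $E(B)$ and the steps B1 to B7 of Section 5 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the author's implementation: the function names (`tolerance_class`, `rough_entropy`, `find_reduct`), the list-of-dicts table layout, the floating-point tolerance `EPS`, the guard that stops B4 when no attribute is left, and the four-object sample table are all hypothetical. The tolerance classes are found by naive pairwise comparison rather than by the sorting step mentioned in the complexity analysis.

```python
from math import isclose, log2

MISSING = "*"   # marker for a missing value, as in the paper
EPS = 1e-12     # assumed tolerance for comparing entropies


def tolerance_class(table, attrs, i):
    """Return S_B(u_i) as a set of row indices, with B = attrs."""
    u = table[i]
    return {
        j for j, v in enumerate(table)
        if all(u[a] == v[a] or MISSING in (u[a], v[a]) for a in attrs)
    }


def rough_entropy(table, attrs):
    """E(B) = -sum_i (|S_B(x_i)|/n) * log2(1/|S_B(x_i)|)."""
    n = len(table)
    sizes = (len(tolerance_class(table, attrs, i)) for i in range(n))
    # -(s/n) * log2(1/s) is the same as (s/n) * log2(s)
    return sum(s / n * log2(s) for s in sizes)


def find_reduct(table, cond, dec):
    """Heuristic reduct search following steps B1 to B7."""
    def E(attrs):
        return rough_entropy(table, attrs)

    def same(x, y):
        return isclose(x, y, abs_tol=EPS)

    # B1: core = attributes c with sig_{C\{c}}(c) > 0
    core = []
    for c in cond:
        rest = [a for a in cond if a != c]
        if E(rest) - E(rest + dec) > EPS:
            core.append(c)

    # B2 to B4: grow R from the core by greatest significance sig_R(c)
    reduct = list(core)
    while not same(E(reduct), E(reduct + dec)):
        candidates = [c for c in cond if c not in reduct]
        if not candidates:   # assumption: stop when no attribute is left
            break
        base = E(reduct + dec)
        best = max(candidates, key=lambda c: base - E(reduct + [c] + dec))
        reduct.append(best)

    # B5 to B7: drop redundant non-core attributes, then restore the core
    extra = [c for c in reduct if c not in core]
    for c in list(extra):
        trial = core + [a for a in extra if a != c]
        if same(E(trial), E(trial + dec)):
            extra.remove(c)
    return core + extra


# Hypothetical incomplete decision table (not taken from the paper).
table = [
    {"a": 0, "b": 0, "d": 0},
    {"a": 0, "b": 1, "d": 0},
    {"a": 1, "b": MISSING, "d": 1},
    {"a": 1, "b": 1, "d": 1},
]
print(find_reduct(table, ["a", "b"], ["d"]))
```

In this sample table attribute `a` alone determines `d`, so `a` forms the core and the loop in B4 never runs. Comparing entropies with a small tolerance instead of exact equality is a design choice made here to avoid floating-point noise.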