Overview chapter of the thesis: English-Vietnamese machine translation based on learning transfer rules from a bilingual corpus
CHAPTER 2: OVERVIEW
The history of machine translation (MT) spans more than 50 years with many ups and downs: from the early days, full of enthusiasm and hope, through periods of disappointment and stagnation, to the recovery and growth seen today [133]. Along the way, many strategies and approaches to translation have appeared, from primitive to sophisticated, and each strategy or approach has its own strengths and weaknesses.
In this chapter we first review the basic translation strategies and the main approaches in machine translation, together with the successes and limitations of those strategies and approaches. We also mention several translation systems related to our thesis and the translation methods used in those systems.
2.1 BASIC TRANSLATION STRATEGIES
Considering how the transfer between languages is carried out when a source language is translated into a target language, two main types are usually distinguished: direct transfer and indirect transfer. Within the indirect type, depending on the level and on the intermediate medium, two subtypes are distinguished: indirect via the syntactic structure of the sentence (syntax-based) and indirect via an intermediate language (interlingua-based). Between these two subtypes there is a third one whose level of indirection lies between them: indirect via the syntactic structure of the sentence plus shallow semantic analysis. In summary, the translation strategies in machine translation can be divided into four types (following [87], pp. 69-80): direct translation, syntactic-transfer translation, interlingual translation, and syntactic transfer combined with semantic analysis.
[...] a limited number of sentence patterns. This strategy works relatively well when translating between languages of the same typology, with a one-to-one correspondence in vocabulary and grammar, but it runs into difficulty when the two languages differ in typology, for example English (an inflectional language) and Vietnamese (an isolating language). The translation model of this strategy is shown in Figure 2.1 below:
[Figure 2.1 diagram: source-language sentence; simple morphological analysis; word-by-word translation (using a bilingual dictionary); simple word reordering (using word-order rules); target-language sentence]
Figure 2.1: Direct translation model
2.1.2 SYNTACTIC-TRANSFER TRANSLATION (Syntactic-transfer MT)
Under this strategy, the system translates by analyzing the source-language sentence (morphologically and syntactically) and then applying linguistic and lexical rules (called transfer rules) to map the grammatical information from the source language to the target language.
[Figure 2.2 diagram: source-language sentence; morphological and syntactic analysis of the source sentence (source grammar, source dictionary); source sentence structure; structural and lexical transfer (transfer rules, bilingual dictionary); target sentence structure; generation of the target sentence from the target structure (target grammar, target dictionary); target-language sentence]
Figure 2.2: Syntactic-transfer translation model
To recognize the structure of the input sentence, transfer systems use software components called syntactic parsers. A parser applies an algorithm that analyzes the sentence on the basis of a grammar of the language, or of statistics drawn from a (syntactically annotated) corpus. There are many parsing algorithms, among which the Earley algorithm [100] and the Tomita algorithm [141] are the most efficient and the most widely used. Likewise, there are many grammars for parsing, such as TG [100], LG [132], TAG [92], HPSG, UG, DCG and LFG, but most of them lead to essentially the same parse tree.
After the parse tree has been built, the system uses transfer rules to convert it into the parse tree of the target language (taking into account changes in word position in the target language) and produces output as shown in Figure 2.3. For example, in Vietnamese an adjective follows the noun it modifies, whereas in English the order is reversed. With this way of translating, we cannot resolve cases of semantic ambiguity, that is, words that occur in the same structure but have different meanings. For example, we cannot determine whether "bank" in the sentence "I enter the bank" means "ngân hàng" (a financial bank), "bờ sông" (a riverbank), or yet another sense.
[Figure 2.3 diagram: the parse tree of "I see a new computer" mapped to the parse tree of "Tôi thấy một máy tính mới"]
Figure 2.3: Transferring the source-language parse tree into the target-language parse tree
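The transfer step for this example sentence can be sketched in a few lines of Python. This is only an illustrative sketch, not the system described in the thesis: the tuple-based tree format, the names DICTIONARY, transfer, reorder_np and leaves, and the five dictionary entries are all assumptions made for the example. It applies one structural rule (English "adj noun" becomes Vietnamese "noun adj") and a word-for-word lexical lookup.

```python
# Toy bilingual dictionary (illustrative entries only, not from the thesis).
DICTIONARY = {
    "i": "tôi",
    "see": "thấy",
    "a": "một",
    "new": "mới",
    "computer": "máy tính",
}

# Source parse tree as nested tuples: (label, child, child, ...).
# A leaf is (part_of_speech, word).
SOURCE_TREE = (
    "S",
    ("NP", ("pro", "I")),
    ("VP",
        ("verb", "see"),
        ("NP", ("det", "a"), ("adj", "new"), ("noun", "computer"))),
)


def is_leaf(node):
    """A leaf has exactly one child and that child is a word string."""
    return len(node) == 2 and isinstance(node[1], str)


def reorder_np(children):
    """Structural transfer rule: 'adj noun' (English) -> 'noun adj' (Vietnamese)."""
    result = list(children)
    for i in range(len(result) - 1):
        if result[i][0] == "adj" and result[i + 1][0] == "noun":
            result[i], result[i + 1] = result[i + 1], result[i]
    return result


def transfer(node):
    """Convert a source tree into a target tree (lexical + structural transfer)."""
    if is_leaf(node):
        label, word = node
        # Lexical transfer: unknown words are copied unchanged.
        return (label, DICTIONARY.get(word.lower(), word))
    label, children = node[0], [transfer(child) for child in node[1:]]
    if label == "NP":
        children = reorder_np(children)
    return (label, *children)


def leaves(node):
    """Read the words off a tree from left to right."""
    if is_leaf(node):
        return [node[1]]
    return [word for child in node[1:] for word in leaves(child)]


TARGET_TREE = transfer(SOURCE_TREE)
print(" ".join(leaves(TARGET_TREE)))
```

Because the dictionary holds a single translation per word, this sketch has no means of choosing between several senses of a word such as "bank", which is the limitation the text points out.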
2.1.3 TRANSLATION VIA AN INTERMEDIATE LANGUAGE (Interlingual MT)
Under this strategy, the system translates via an intermediate language called an interlingua, as shown in Figure 2.4 below:
[Figure 2.4 diagram: only the label "target-language sentence" survives]
Figure 2.4: Interlingual translation model
An ideal interlingua must be a representation that is independent of every natural language and that can express every difference in meaning, down to the finest level, of every language in the translation system. For example, Vietnamese (like other Southeast Asian languages) distinguishes the words lúa, thóc, gạo and cơm, whereas English and French do not. Similarly, English distinguishes remember and miss, whereas Vietnamese uses only the word nhớ. Even for the action or state of wearing clothing alone there are many subtle differences: Vietnamese distinguishes mang, mặc, đội and đeo, English distinguishes only put on and wear, and Japanese distinguishes up to 8 different cases (one for each kind of item: hat, shirt, gloves, belt, glasses, ...). For this reason, building an interlingua powerful enough to represent all the information of every possible language, together with a suitable analyzer and generator, is an extremely complex task that has not been completed to this day.
In addition, interlingual systems are criticized for requiring more detailed analysis than is needed for any particular language pair. Consider, for example, the sentence "Washington announced that Bill Clinton will visit Vietnam in November". An interlingual system would analyze in detail that "Washington" is a metonymy referring to "a spokesperson for the US government". But why spend time on such a detailed analysis, when "Washington" is used and understood in exactly this way in every other language (Vietnamese, French, German, Russian, ...)?
A main advantage of interlingual systems over transfer systems is the number of translation modules they require. If N is the number of languages in the translation system, an interlingual system needs only 2*N translation modules, fewer than the N*(N-1) modules of a transfer system (following [84], p. 175).
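The two counts are easy to compare for a few values of N. The short sketch below is purely illustrative; the function names interlingua_modules and transfer_modules are assumptions introduced here, and the formulas are the ones quoted from [84].

```python
def interlingua_modules(n):
    """One analyzer and one generator per language: 2 * N modules."""
    return 2 * n


def transfer_modules(n):
    """One transfer module per ordered language pair: N * (N - 1) modules."""
    return n * (n - 1)


for n in (2, 3, 4, 10):
    print(n, interlingua_modules(n), transfer_modules(n))
```

For 10 languages this gives 20 modules against 90. The two counts are equal at N = 3 (6 each), so the saving only appears when more than three languages are involved.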
2.1.4 SYNTACTIC TRANSFER + SEMANTIC ANALYSIS
This strategy is a compromise between the level of syntactic analysis (syntactic parser) and the level of semantic analysis (semantic analyzer). If the system stops at syntactic analysis, it cannot resolve cases of semantic ambiguity that share the same syntactic structure. If, on the other hand, the system performs detailed semantic analysis for every sentence, as in the interlingual approach, this is very hard to carry out and is not always necessary. The balanced and optimal solution is therefore for the system to rely mainly on syntactic analysis and to perform semantic analysis only to the extent needed to resolve semantic ambiguity when it arises.
[Figure 2.5 diagram, surviving labels: interlingual translation; syntactic-transfer translation; direct translation]
Figure 2.5: Translation strategies in machine translation.
The triangle drawn in Figure 2.6 (this pyramid was first introduced by the GETA group in 1968) shows the following: the ascending left edge indicates the level of analysis of the source sentence, and the descending right edge indicates the level of synthesis of the target sentence. The deeper the analysis (and the harder it is), the shorter the transfer part (the horizontal edge) becomes, meaning that there is less transfer work and, at the same time, more work in synthesizing the target sentence.
[Figure 2.6 diagram, surviving labels: source language; target language]
Figure 2.6: Levels of analysis, transfer and synthesis in the translation strategies.
In addition, according to Figure 2.6, if we go from the ascending left edge (source language) to the descending right edge (target language) along a horizontal line (representing the transfer work), the level of source-language analysis equals the level of target-language synthesis. If instead we transfer along a line slanting downward, the transfer work is longer but the synthesis of the target sentence is shorter. Similarly, in the opposite case (slanting upward), the analysis work is shorter but the transfer and the generation of the target sentence are longer.
The analysis in Figure 2.5 equates the interlingua with the semantics of the sentence. According to Kevin Knight [98] (p. 2), however, the two should not be equated: the interlingua must be independent of the source and target languages, yet there are sentences whose meaning in the source sentence and meaning in the target sentence differ while still having the same representation in the interlingua.
2.2 CURRENT APPROACHES TO MACHINE TRANSLATION
2.2.1 RULE-BASED MACHINE TRANSLATION (RBMT: Rule-Based MT)
This is the traditional approach. It originates from the production-rule systems used in expert systems in the field of artificial intelligence (AI: Artificial Intelligence). In natural language processing systems, these production rules are usually built by hand by linguists.
For example, for syntactic parsing, grammar rules have been built such as: [...]
For the semantic processing component, semantic rules are also used, such as:
"If verb = eat → subject = animal & object = something edible".
The same holds for all the other tasks of the translation system: they all rely on rules that people devise and enter into the machine.
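A hand-written semantic rule of this kind can be pictured as a small table lookup. The sketch below is an assumption-laden illustration rather than the thesis's implementation: the semantic classes, the word list in SEMANTIC_CLASS, and the function satisfies_rule are all hypothetical, and only the rule for "eat" comes from the text.

```python
# Hypothetical semantic classes for a few words (illustrative only).
SEMANTIC_CLASS = {
    "chicken": "animal",
    "girl": "animal",
    "rice": "edible",
    "bread": "edible",
    "chair": "furniture",
}

# Hand-written rule from the text: eat -> subject is an animal, object is edible.
SEMANTIC_RULES = {
    "eat": {"subject": "animal", "object": "edible"},
}


def satisfies_rule(verb, subject, obj):
    """Check a (verb, subject, object) triple against the hand-written rules."""
    rule = SEMANTIC_RULES.get(verb)
    if rule is None:
        # No rule was written for this verb, so nothing can be rejected.
        return True
    return (SEMANTIC_CLASS.get(subject) == rule["subject"]
            and SEMANTIC_CLASS.get(obj) == rule["object"])


print(satisfies_rule("eat", "chicken", "rice"))   # subject animal, object edible
print(satisfies_rule("eat", "chair", "rice"))     # subject is not an animal
```

Every verb and every noun needs its own hand-made entry, which gives a feel for why building such rule sets takes so much effort.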
Building such a system of rules requires a very large effort and often still fails to cover every case. Within a limited domain, however, this method proves effective, and we have full control over the translation result (that is, every sentence that satisfies the rules that have been built will be parsed and translated well).
To cover all linguistic phenomena, it was thought that one only needed to keep adding rules. However (following [40], p. 286), "... even adding 1,000 or some 10,000 rules still does not cover everything ..."; on the contrary, it makes the system produce many parse trees for a single input source sentence. As a result, the system does not know which parse tree is the correct one. Moreover, once the number of rules grows, the rule designers themselves find it hard to keep track of the consistency of all the rules they have created, and there will certainly be redundant rules and rules that contradict one another.
Indeed, for syntactic parsing, suppose we have the grammar CFG = {N, Σ, P, S} with the following components:
. N: the non-terminal symbols, consisting of: S (Sentence), NP (Noun Phrase), VP (Verb Phrase), PP (Prepositional Phrase)
. Σ: the terminal categories, consisting of:
pro (pronoun) = {i, you, he, we, ...}
noun (noun) = {man, car, boy, boys, girl, chicken, chair, house, ...}
det (determiner) = {a, the, ...}
verb (verb) = {sit, sat, eat, borrow, help, ...}
prep (preposition) = {on, in, to, from, ...}
. P: the basic grammar rules, as follows:
S → NP VP;
NP → det noun; NP → det noun PP
VP → verb; VP → verb NP
PP → prep NP;
With the production rules above, the input sentence "I see the man in the car" is parsed into the parse tree shown in Figure 2.7, in which the prepositional phrase "in the car" modifies the noun "man" (meaning "the man who is in the car"). This is the correct parse tree.
Figure 2.7: Parsing result for the sentence "I see the man in the car"
However, if the input sentence is "I saw the man in a day", this parser produces exactly the same tree as above, so that the prepositional phrase "in a day" modifies the noun "man" instead of the verb "saw". This is the wrong parse tree. To fix the error, one adds the production rule VP → verb NP PP to the grammar above, and the new parser then produces one more parse tree, shown in Figure 2.8 below (in addition to the tree that resembles Figure 2.7). As a result, the translation system does not know which parse tree is the correct one. In practice, with a set of about 500 production rules, the number of parse trees generated for a sentence of average length (about 10 words) is on the order of several hundred.
Figure 2.8: Parsing result for the sentence "I saw the man in a day"
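The growth in the number of parse trees can be reproduced with a small parse counter. The following is a minimal sketch, not the parser used in the thesis. It makes two labeled assumptions: it adds a rule NP → pro (the listed rules contain no rule that covers the pronoun "I"), and it adds "see", "saw" and "day" to the word lists, which the original lists leave open. The names LEXICON, BASE_RULES, EXTRA_RULE and count_parses are introduced here for illustration.

```python
from functools import lru_cache

# Terminal categories. "see", "saw" and "day" are added for this example.
LEXICON = {
    "pro": {"i", "you", "he", "we"},
    "noun": {"man", "car", "boy", "boys", "girl", "chicken", "chair", "house", "day"},
    "det": {"a", "the"},
    "verb": {"sit", "sat", "eat", "borrow", "help", "see", "saw"},
    "prep": {"on", "in", "to", "from"},
}

# Grammar rules from the text, plus NP -> pro (an assumption, see lead-in).
BASE_RULES = (
    ("S", ("NP", "VP")),
    ("NP", ("pro",)),
    ("NP", ("det", "noun")),
    ("NP", ("det", "noun", "PP")),
    ("VP", ("verb",)),
    ("VP", ("verb", "NP")),
    ("PP", ("prep", "NP")),
)

# The rule added in the text to let the PP attach to the verb.
EXTRA_RULE = ("VP", ("verb", "NP", "PP"))


def count_parses(sentence, rules, start="S"):
    """Count the distinct parse trees of `sentence` under `rules`."""
    words = tuple(sentence.lower().split())

    @lru_cache(maxsize=None)
    def derive(symbol, i, j):
        # Number of trees with root `symbol` that cover words[i:j].
        if symbol in LEXICON:
            return int(j - i == 1 and words[i] in LEXICON[symbol])
        return sum(expand(rhs, i, j) for lhs, rhs in rules if lhs == symbol)

    @lru_cache(maxsize=None)
    def expand(rhs, i, j):
        # Number of ways the symbol sequence `rhs` can cover words[i:j].
        if not rhs:
            return int(i == j)
        first, rest = rhs[0], rhs[1:]
        total = 0
        # Each symbol must cover at least one word.
        for k in range(i + 1, j - len(rest) + 1):
            left = derive(first, i, k)
            if left:
                total += left * expand(rest, k, j)
        return total

    return derive(start, 0, len(words))


sentence = "I saw the man in a day"
print(count_parses(sentence, BASE_RULES))
print(count_parses(sentence, BASE_RULES + (EXTRA_RULE,)))
```

With the base rules the sentence has a single tree, in which "in a day" attaches to "man"; after the extra rule is added it has two, and nothing in the grammar says which one to prefer. The recursion assumes a grammar without left recursion, which holds for this toy rule set.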
To deal with this problem, the terminal categories have been subdivided into more detailed subcategories (categorical terminals) that also carry semantic information. This obviously multiplies the number of rules, and although it can disambiguate the cases it was designed for, it also gives rise to many unintended side effects.
In summary: with the RBMT approach, an initial system can be built easily, but later on, as the scale grows, such systems become hard to control and may even collapse under their own weight (following [114]). The strength of this approach is that it is grounded in linguistic theory, so it handles most of the core phenomena of language. It does not, however, handle the marginal phenomena (exceptional cases that do not follow the main rules).
2.2.2 STATISTICAL MACHINE TRANSLATION (SMT: Statistical-based MT)
Instead of building dictionaries and translation rules by hand as in RBMT systems, this kind of system relies on statistics to build the dictionaries and translation rules automatically. To do so, the machine needs a very large bilingual corpus. From it the computer gathers statistics and derives the translation probabilities of corresponding words, phrases or structures between the two languages, the probabilities of position shifts between the two languages, and the probability that a given word or phrase occurs in a particular context [97].
For instance, in the Vietnamese-English translation system [19], let v (Vietnamese) denote the source-language sentence, e (English) the target-language sentence, and (v, e) a pair of sentences that are translations of each other. The task of this system is: given any sentence v, find the most plausible sentence e (the sentence that is the closest correct translation of v into English). In other words, we look for the maximum of the probability P(v, e) (the probability that the two sentences v and e occur together). Since v and e depend on each other, the theory of conditional probability gives:
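P(v, e) = P(v) * P(e | v)    (2.1)

That is, the problem becomes:

Find argmax P(v, e) = argmax P(v) * P(e | v), taken over e    (2.2)

By Bayes' rule:

P(e | v) = P(e) * P(v | e) / P(v)    (2.4)

Since the denominator does not depend on e, we have:

e = argmax P(e) * P(v | e), taken over all candidate sentences e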
Here, P(e) is the language model of the target language; in this work P(e) is an N-gram model. P(v | e) is the translation model, and e is the best English sentence corresponding to the Vietnamese sentence v. How the parameters P(e), P(v | e) and e are computed is described in detail in [19].
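The decision rule, picking the English sentence e that maximizes P(e) * P(v | e), can be illustrated on a toy scale. The sketch below is not the system of [19]: all probabilities in LANGUAGE_MODEL and TRANSLATION_MODEL are invented for the example, the candidate list is fixed in advance rather than searched, and the names score and decode are assumptions. Log-probabilities are added instead of multiplying raw probabilities, which is a common way to avoid numerical underflow.

```python
import math

# Hypothetical language-model probabilities P(e) (invented numbers).
LANGUAGE_MODEL = {
    "i see a new computer": 0.020,
    "i see a computer new": 0.001,
    "i saw a new calculator": 0.005,
}

# Hypothetical translation-model probabilities P(v | e) (invented numbers).
SOURCE = "tôi thấy một máy tính mới"
TRANSLATION_MODEL = {
    (SOURCE, "i see a new computer"): 0.30,
    (SOURCE, "i see a computer new"): 0.40,
    (SOURCE, "i saw a new calculator"): 0.10,
}


def score(v, e):
    """Return log P(e) + log P(v | e), or -inf if either probability is zero."""
    p_e = LANGUAGE_MODEL.get(e, 0.0)
    p_v_given_e = TRANSLATION_MODEL.get((v, e), 0.0)
    if p_e == 0.0 or p_v_given_e == 0.0:
        return float("-inf")
    return math.log(p_e) + math.log(p_v_given_e)


def decode(v, candidates):
    """Pick the candidate e that maximizes P(e) * P(v | e)."""
    return max(candidates, key=lambda e: score(v, e))


best = decode(SOURCE, list(LANGUAGE_MODEL))
print(best)
```

In this toy table the word-by-word candidate "i see a computer new" has the highest translation-model probability, but the language model penalizes its word order, so the product favors "i see a new computer". This is the division of labor between the two models that the text describes.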
The SMT approach does not require deep linguistic analysis: it carries out analysis, transfer and sentence generation fully automatically, based on statistics gathered from the bilingual training corpus. For this reason the system sometimes produces unpredictable results. Current advances in hardware (memory and computing speed) and in software, together with new search, sorting and replacement algorithms, make this way of translating increasingly effective. In addition, because language is dynamic and keeps changing, its vocabulary and grammar change as well. This gives the approach an advantage over approaches that must rely on a fixed vocabulary or fixed linguistic rules.
In summary: the statistical approach is a methodological breakthrough in machine translation, but the practical results of these systems are currently still low (around 40%). Researchers are therefore working to improve it by adding linguistic knowledge. In addition, the intermediate results of statistical machine translation are huge statistical tables, which linguists find hard to follow, interpret or intervene in.