
Recognition and disambiguation of Vietnamese entities on the Web


VIETNAM NATIONAL UNIVERSITY, HANOI

RECOGNITION AND DISAMBIGUATION OF VIETNAMESE ENTITIES ON THE WEB
(Final report of a scientific research project at the VNU Hanoi level)

Project code: QC.07.06
Principal investigator: MSc Nguyen Cam Tu

Hanoi - 2008

TABLE OF CONTENTS

EXPLANATION OF ABBREVIATIONS
LIST OF PARTICIPANTS
LIST OF TABLES
LIST OF FIGURES
SUMMARY OF THE MAIN RESEARCH RESULTS OF THE PROJECT
5.1 Scientific results
5.2 Practical results
5.3 Training results
FINAL REPORT
6.1 Problem statement
6.2 Overview of the research problems
6.3 Research objectives and content of the project
6.3.1 Research objectives
6.3.2 Research content
6.4 Location, time and research methods
6.5 Research results
6.5.1 Publications related to the project
6.5.2 Training results of the project
6.5.3 Application results of the project
6.6 Discussion
6.7 Conclusion
6.8 References
APPENDIX

EXPLANATION OF ABBREVIATIONS

LDA: Latent Dirichlet Allocation
HAC: Hierarchical Agglomerative Clustering

LIST OF PARTICIPANTS

No. Full name; academic title or degree; affiliation
1. Nguyen Cam Tu; ThS (MSc); Faculty of IT, College of Technology
2. Ha Quang Thuy; PGS TS (Assoc. Prof., PhD); Faculty of IT, College of Technology
3. Phan Xuan Hieu; TS (PhD); Tohoku University
4. Nguyen Viet Cuong; ThS (MSc); Japan Advanced Institute of Science and Technology
5. Nguyen Thi Thuy Linh; HVCH (master's student); Faculty of IT, College of Technology
6. Nguyen Thu Trang; HVCH (master's student); Faculty of IT, College of Technology
7. Nguyen Thi Huong Thao; HVCH (master's student); Faculty of IT, College of Technology
8. Do Thi Minh Viet; ThS (MSc); Faculty of IT, College of Technology
9. Truong Thi Thu Hien; ThS (MSc); Faculty of IT, College of Technology
10. Le Hong Hai; ThS (MSc); Faculty of IT, College of Technology
11. Luu Ngoc Tuan; HVCH (master's student); Faculty of IT, College of Technology
12. Le Dieu Thu; CN (BSc); class K49CA, College of Technology
13. Tran Thi Ngan; CN (BSc); class K50CA, College of Technology

LIST OF TABLES

Table. List of entities in the experimental corpus

LIST OF FIGURES

Figure. Entity disambiguation using the vector space method
Figure. Model for recognizing and disambiguating Vietnamese entities
Figure. Term part and topic part of an input document (here the number of occurrences of a topic is proportional to its probability in the returned topic distribution)
Figure. Experimental results on the first entity set without topic analysis (Lambda = 0.2)
Figure. Experimental results on the first entity set with topic analysis
Figure. Disambiguation results without hidden topic analysis on the set built from two base entities, the footballers Hong Son and Van Quyen
Figure. Disambiguation results with hidden topic analysis on the set built from two base entities, the footballers Hong Son and Van Quyen (Lambda = 0.3)

SUMMARY OF THE MAIN RESEARCH RESULTS OF THE PROJECT

5.1 Scientific results

- One paper being prepared for submission to the journal ACM TALIP:
• Cam-Tu Nguyen, Xuan-Hieu Phan, Thu-Trang Nguyen, Susumu Horiguchi, Quang-Thuy Ha (2008). Web Search Clustering and Labeling with Hidden Topics. ACM Transactions on Asian Language Information Processing (to be submitted).
- One paper submitted to an international conference:
• Dieu-Thu Le, Cam-Tu Nguyen, Xuan-Hieu Phan, Quang-Thuy Ha, and Susumu Horiguchi (2008). Matching and Ranking with Hidden Topics towards Online Contextual Advertising. The 2008 IEEE/WIC/ACM International Conference on Web Intelligence, Sydney, Australia, December 2008 (accepted).
- One paper presented at a national workshop, with the full text submitted:
• Le Dieu Thu, Tran Thi Ngan, Nguyen Cam Tu, Nguyen Thu Trang. Building an ontology to support semantic search in the medical domain. 11th National Workshop "Selected Issues in Information and Communication Technology", Hue, 12-14/6/2008.
- Several reports on entity extraction and entity disambiguation presented at the satellite laboratory "Knowledge Technology and Human-Machine Interaction".

5.2 Practical results

- A corpus for Vietnamese entity extraction and entity disambiguation.
- A module for Vietnamese entity extraction and disambiguation, written in Java and tested on the corpus above.

5.3 Training results
(Content of dissertations, theses, graduation theses and student research work tied to the research content of the project)

- A master's thesis by PhD student Nguyen Cam Tu, titled "Hidden Topic Discovery Toward Classification and Clustering", successfully defended in May 2008.
- A bachelor's thesis by student Le Dieu Thu, titled "On the Analysis of Large-Scale Dataset towards Contextual Online Advertising", successfully defended in June 2008.

FINAL REPORT

6.1 Problem statement

Entity recognition and disambiguation (also called coreference resolution or entity disambiguation) is the process of determining whether mentions of entities that share a name actually refer to the same real-world entity (Kibble and Deemter, 2000). Consider the following passage:

John Smith was considered and appointed chairman of the board. In the past, Mr. Smith was seen as a perfect choice. However, John, his good friend, was not considered for the position.

Entity recognition and disambiguation aims to decide whether John Smith and "Mr. Smith" are the same person, and whether John (in the third sentence) refers to that same person. The problem is often extended to resolve references such as "he" or even "his best friend", but we do not study those cases here. Solving this problem well contributes significantly to the quality of systems that search for, extract and process references to "people of interest" in news stories (BNN 2001), and of Topic Detection and Tracking (Allan 2002).

Cross-document entity recognition and disambiguation checks whether mentions of the same name in different documents refer to the same entity. This problem is even harder than coreference resolution within a single document, because the documents usually come from many different sources and are written by many authors with different conventions and writing styles (Bagga and Baldwin, 1998), or even in different languages.

6.2 Overview of the research problems

Bagga and Baldwin (1998) present an algorithm for cross-document entity recognition and disambiguation that uses the vector space model. Many current studies build on their results. Some systems, such as NetOwl from ISOQuest and Textract from IBM, can already determine that several names refer to the same entity, but they cannot distinguish different entities that share one name.

TIPSTER Phase III was the first to identify cross-document coreference analysis as a research area, because it is a central tool for multi-document summarization and information fusion (Bagga and Baldwin, 1998). The Sixth Message Understanding Conference (MUC-6) also judged cross-document coreference to be a promising problem, but left it out of the conference's scope because it was considered too hard.

Vietnamese texts on the Web are an extremely rich and useful resource; regrettably, exploitation of this resource is still very limited. In this study we aim to build a module that recognizes and disambiguates Vietnamese entities in the documents returned by the Google search engine. Our research concentrates on vector space models, statistical methods, semi-supervised learning and clustering. The results will serve as a basis for further research on this problem and for more effective exploitation of Vietnamese documents on the Web.

6.3 Research objectives and content of the project

6.3.1 Research objectives

The project aims to strengthen the research and deployment capacity of the "Data Mining and Applications" research group at the College of Technology, according to the following criteria:
• Study and propose a model for entity disambiguation on the Web.
• Build the essential utilities that make it possible to deploy typical applications of entity-oriented search and information extraction on the Internet.
• Train high-quality researchers within the scope of the project.

6.3.2 Research content

6.3.2.1 Methods for entity recognition and disambiguation on the Internet

One of the first studies of cross-document entity disambiguation is that of [Bagga, Baldwin, 1998]. Their method is summarized in the following figure.

[Figure. Entity disambiguation using the vector space method. Components shown: the University of Pennsylvania's coreference system, the coreference chains for each document (doc.01 ... doc.38), the SentenceExtractor, the VSM-Disambiguate module, and the resulting cross-document coreference chains.]

The input of the system is a set of documents that contain ambiguous entities. First, the documents are passed through a within-document coreference resolution system. The output of this step is a set of coreference chains, one for each document. The Sentence Extractor module then finds, in each document, the sentences that contain the coreferring mentions and generates a corresponding summary. Each summary is represented as a vector in which every element is the weight of the corresponding term (the term weight can be computed with schemes such as TF or TF-IDF). The authors then use the cosine measure to compute the similarity between pairs of vectors (that is, pairs of summaries). If two summaries have a similarity greater than a given threshold, they are considered to refer to the same entity.
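The vector space step described above can be sketched as follows. This is a minimal illustration, not the project's Java module or Bagga and Baldwin's implementation: the function names, the smoothed IDF variant, the transitive (single-link) grouping and the toy threshold of 0.3 are all assumptions made for the example, and the summaries are assumed to be already tokenized.

```python
import math
from collections import Counter


def tf_idf_vectors(summaries):
    """Turn tokenized summaries into sparse TF-IDF vectors (term -> weight)."""
    n_docs = len(summaries)
    doc_freq = Counter()
    for tokens in summaries:
        doc_freq.update(set(tokens))  # count each term once per summary
    vectors = []
    for tokens in summaries:
        term_freq = Counter(tokens)
        # Assumed smoothed IDF: ln((1 + N) / (1 + df)) + 1, so terms that
        # occur in every summary still keep a small positive weight.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_summaries(summaries, threshold):
    """Group summaries whose pairwise cosine similarity exceeds the threshold.

    Links are treated as transitive (an assumption), using a small union-find.
    Returns a sorted list of groups, each a list of summary indices.
    """
    vectors = tf_idf_vectors(summaries)
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    # Hypothetical summaries for mentions of the same name.
    toy = [
        ["hong", "son", "footballer", "the", "cong", "club", "goal"],
        ["hong", "son", "striker", "footballer", "goal", "match"],
        ["hong", "son", "singer", "album", "concert"],
    ]
    print(group_summaries(toy, threshold=0.3))
```

On this toy input the two football-related summaries fall into one group and the third summary stays alone. The threshold has to be tuned on real data; a value that is too low merges different people who share a name, while a value that is too high splits one person across several groups.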
Building on the research results of project QC.06.07, this study aims to build a module that recognizes and disambiguates Vietnamese entities in the documents returned by the Google search engine. Our research concentrates on vector space models, statistical methods, semi-supervised learning and clustering. The results will serve as a basis for further research on this problem and for more effective exploitation of Vietnamese documents on the Web.

Processing flow:
1. Google query
2. Documents returned by Google
3. Word segmentation and entity extraction (module built within project QC.06.07)
4. Documents with entity names identified
5. Entity recognition and disambiguation
6. Summary text for each entity

Expected structure of the project's final report (detailed by chapter):
- Introduction: presents the objectives and research content of the project, an outline of how the project was carried out, and the main results.
- Chapter 1: Basic principles of Vietnamese grammar. Provides the most basic concepts and principles of Vietnamese grammar, which are the basis for the later solutions.
- Chapter 2: Vector space models. Introduces vector space models for the problem of recognizing and disambiguating Vietnamese entities.
- Chapter 3: Machine learning and semi-supervised learning models for coreference analysis. Presents supervised and semi-supervised learning models for coreference analysis across documents from many sources.
- Chapter 4: A proposed model for recognizing and disambiguating Vietnamese entities. Proposes a model that fits the characteristics of Vietnamese and builds on the models presented in the earlier chapters.
- Chapter 5: Evaluation and conclusion. Gives an overall evaluation of the project, specific conclusions and directions for further research.

11. Multidisciplinary and interdisciplinary nature of the project:
- The project relates to several other disciplines. The main ones are listed below:
o Linguistics
o Machine learning and artificial intelligence
o Probability and statistics
o Discrete mathematics
- The multidisciplinary nature shows in the integration of knowledge from these disciplines to solve the problems within the scope of the project.

12. Methodology and scientific methods used in the project:
- Collect and survey related material from the Internet and from partner organizations in linguistics and natural language processing.
- Combine technological and theoretical research. Organize seminars and take part in conferences and workshops related to natural language processing.

13. Equipment of the unit to be used:
- The existing equipment of the Department of Information Systems, Faculty of IT.

14. Capacity for international cooperation
- Past and ongoing cooperation (organization and subject of cooperation):
o Japan Advanced Institute of Science and Technology (JAIST): cooperation on natural language processing, text mining and Web mining.
o Tohoku University, Japan: cooperation on natural language processing and text/Web mining.

15. Research activities of the project:
- Theoretical research
- Surveys and investigation
- Compiling documents
- Writing scientific reports
- Scientific workshops
- Training
- Other activities
- Product fabrication
- Trial runs and testing
- Product finalization

16. Expected results:

16.1 Scientific results
- Expected contributions of the project:
o Science and technology: a basis for research related to Vietnamese Web mining.
o Methodology: a suitable model that partly solves the problem above.
- Number of papers, books and scientific reports expected to be published (specify):
Papers in international journals:
Papers at international conferences:
Papers in domestic journals: 0-1
Papers at national conferences:
Monographs:

16.2 Application results
- Technology product:
o A trial module for recognizing and disambiguating Vietnamese entities on the Web.

16.4 Results on strengthening the capacity of the unit
To strengthen the scientific research capacity of the students, PhD students and staff of the Department of Information Systems.

17. Total funding requested: twenty-five million dong (25,000,000 VND)

18. CONTENT AND SCHEDULE OF THE PROJECT (TASKS TO CARRY OUT, DEADLINES AND PRODUCTS TO BE DELIVERED)

No. Research activity; period (from month to month); scientific product
1. Collect materials and write the literature overview; build the detailed research outline; 3/2007 to 4/2007; overview report
2. Surveys, experiments and data collection:
- Build the experimental dataset; 4/2007 to 10/2007; standard training dataset
- Test the system; 5/2007 to 9/2007; experimental system
- Process the results; 8/2007 to 9/2007; evaluation report
3. Write the topical reports (scheduled in the periods 4/2007 to 6/2007, 5/2007 to 9/2007 and 6/2007 to 12/2007; products: topical reports, including at least one national-level paper):
- Basic principles of Vietnamese grammar
- Vector space models
- Principles for building the experimental dataset
- Semi-supervised learning models for cross-document entity recognition and disambiguation
- Supervised learning models for cross-document entity recognition and disambiguation
- A model for recognizing and disambiguating Vietnamese entities
4. Mid-term workshop; 8/2007 to 9/2007; workshop report
5. Supplement data, experiments and applications; continue developing applications on the language models covered in the topical reports; 9/2007 to 12/2007; a system handling foundational Vietnamese processing problems
6. Summarize the data; 9/2007 to 11/2007; standard training dataset
7. Write the final report; 12/2007 to 2/2008; final project report
8. Final workshop; 12/2007 to 2/2008; workshop materials
9. Submit the products; 2/2008 to 3/2008; papers, reports, software
10. Project acceptance; 3/2008 to 4/2008; acceptance result

19. BUDGET ALLOCATION

No. Item; funding (VND)
1. Build the detailed outline; 2,000,000
2. Collect materials and write the project overview; 3,500,000
- Collect materials (purchase, rental); 2,000,000
- Translate reference materials (number of pages x price)
- Write the overview; 1,500,000
3. Surveys, experiments and collection of research data; 12,500,000
- Travel and mission expenses
- Hired labour; 9,000,000
- Professional activities; 3,500,000
4. Rental and purchase of equipment and materials
- Equipment rental
- Equipment purchase
- Purchase of materials
5. Scientific report writing and acceptance; 3,000,000
- Report writing; 1,000,000
- Workshop; 500,000
- Acceptance; 1,500,000
6. Other expenses; 4,000,000
- Stationery; 750,000
- Printing and photocopying; 750,000
- Management fee; 2,500,000
Total funding; 25,000,000

Approved by the College of Technology: signed on behalf of the Rector by the Head of the Office of Graduate Training and Scientific Research.

VIETNAM NATIONAL UNIVERSITY, HANOI
COLLEGE OF TECHNOLOGY

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

No.: .../HD-KHCN
Hanoi, 2007

CONTRACT FOR CARRYING OUT A SCIENTIFIC RESEARCH PROJECT AT THE VIETNAM NATIONAL UNIVERSITY, HANOI LEVEL, 2007

- Pursuant to the Regulations on the organization and operation of Vietnam National University, Hanoi, issued with Decision No. 600/TCCB dated 1 October 2001 of Vietnam National University, Hanoi, which define the powers of the rectors of member universities;
- Pursuant to Official Letter No. 1424 dated 19/.../2007 of Vietnam National University, Hanoi, on assigning the tasks and targets of the 2007 science, technology and environment plan;
- Pursuant to Decision No. 305/QD-NCKH dated 8/5/2007 of the Rector of the College of Technology, VNU Hanoi, on assigning to principal investigators the 2007 VNU Hanoi-level research projects managed by the College;
- Pursuant to the approved research outline of the project,

We are:

The assigning party (Party A): College of Technology, VNU Hanoi
Represented by: Assoc. Prof. Dr. Nguyen Ngoc Binh
Position: Vice Rector

The receiving party (Party B):
Mr/Ms: PhD student Nguyen Cam Tu
Unit: Faculty of Information Technology, College of Technology

The two parties sign this contract to carry out a VNU Hanoi-level scientific research project:
Project title: "Recognition and disambiguation of Vietnamese entities on the Web"
Code: QC.07.06
under the following agreed terms:

Article 1: Party B is responsible for organizing and carrying out the research content of the project according to the schedule registered in the research outline, and for fully performing the tasks assigned by the Rector of the College of Technology in Decision No. 305/QD-NCKH of 8/5/2007.

Article 2: Party B reports the results of the project and delivers its products to Party A, in accordance with the current regulations of Vietnam National University, Hanoi and of the College of Technology, before 15/5/2008. The products are:
- A suitable model for recognizing and disambiguating Vietnamese entities on the Web
- 01 paper in a domestic journal
- 01 paper at an international conference
- 01 paper at a national conference
- A trial module for recognizing and disambiguating Vietnamese entities on the Web
- An overview of the project with an electronic file (one version in Vietnamese and one in English, "Highlight"; each about 400 words on one A4 page, Times New Roman, 13 pt, single line spacing; content: a summary of the objectives, methods and research content, the results obtained, and an assessment of the scientific and technological significance and impact of the results and of the project itself).

Article 3: The total approved funding of the project is 25,000,000 dong (in words: twenty-five million dong). The detailed costs follow the budget estimate (on the form of the Finance and Accounting Office, TV-KT).

Article 4: Party B is responsible for using the allocated funds for the right purposes and under the current financial regulations, for settling accounts with the Finance and Accounting Office, and for completing the project acceptance before 15/7/2008.

Article 5: Party A holds the intellectual property rights to the scientific results of the project. Every publication related to the scientific content of the project must acknowledge the source of research funding by project code, as follows:
- For papers and scientific reports: "This work was partly funded by the project with code QC.07.06, Vietnam National University, Hanoi"
- For theses and graduation theses: "This thesis was carried out within the project with code QC.07.06, Vietnam National University, Hanoi"
- For papers and reports published in international journals and conference proceedings (in English): "This work is (partly) supported by the research project No QC.07.06 granted by Vietnam National University, Hanoi"

Article 6: Both parties undertake to comply with the terms of this contract. During its execution, the two parties are responsible for promptly informing each other of any problems that arise, discussing them together and actively seeking solutions.

Article 7: This contract is made in several copies: each party keeps one, two are sent to the TV-KT office, and one is filed at the HC-QT office.

REPRESENTATIVE OF PARTY A
Vice Rector
Assoc. Prof. Dr. Nguyen Ngoc Binh

REPRESENTATIVE OF PARTY B
PhD student Nguyen Cam Tu

SUMMARY

Project Title: Vietnamese Named Entity Resolution and Tracking crossover Web Documents
Code Number: QC.07.06
Implementing Institution: College of Technology, VNUH
Cooperating Institution: Horiguchi Lab, Tohoku University
Duration: From 6/2007 to 6/2008

Objectives
This project aims to strengthen the research and deployment capacity of the Knowledge Discovery Group at the College of Technology, focusing on the following issues:
• Study and propose a suitable model for entity resolution and tracking across Web documents
• Build utilities for useful real-world applications such as entity search and information extraction on the Internet
• Strengthen the research skills of young researchers within the scope of this project

Main Content
• Study approaches to the task of entity resolution and tracking across Web documents
• Propose a framework for this task in Vietnamese
• Collect a medium-sized data collection for experiments
• Conduct experiments and draw conclusions about this task in Vietnamese

Obtained Results
In relation to the research issues of this project, we have obtained the following results:
• One journal paper entitled "Web Search Clustering and Labeling with Hidden Topics", to be submitted to the ACM Transactions on Asian Language Information Processing
• One paper titled "Matching and Ranking with Hidden Topics towards Online Contextual Advertising", submitted to the 2008 IEEE/WIC/ACM International Conference on Web Intelligence (accepted)
• One paper, "Ontology Construction towards Entity Search in Medical Domain in Vietnamese", submitted to a national conference
• One master's thesis titled "Hidden Topic Discovery Toward Classification and Clustering"
• One bachelor's thesis titled "On the Analysis of Large-Scale Dataset towards Contextual Online Advertising"
• One corpus for experiments
• One module for entity resolution and tracking across Vietnamese Web documents

REGISTRATION FORM FOR THE RESEARCH RESULTS OF SCIENCE AND TECHNOLOGY PROJECTS

Project title: Recognition and disambiguation of Vietnamese entities on the Web
Code: QC.07.06
Managing agency: Vietnam National University, Hanoi
Address: 144 Xuan Thuy Street, Cau Giay, Hanoi
Telephone: 7548664
Host agency:
Address: 144 Xuan Thuy Street, Cau Giay, Hanoi
Telephone: 7548664
Total actual expenditure: 25,000,000 dong (twenty-five million dong)
Of which, from the State budget: 25,000,000 dong (twenty-five million dong)
Research period: start 06/2007, end 06/2008

Collaborating researchers:
- Principal investigator: Nguyen Cam Tu
- Participants: Ha Quang Thuy, Phan Xuan Hieu, Nguyen Viet Cuong, Nguyen Thi Thuy Linh, Nguyen Thu Trang, Nguyen Thi Huong Thao, Do Thi Minh Viet, Truong Thi Thu Hien, Le Hong Hai, Luu Ngoc Tuan

Summary of research results:
- One paper being prepared for submission to the journal ACM TALIP: Cam-Tu Nguyen, Xuan-Hieu Phan, Thu-Trang Nguyen, Susumu Horiguchi, Quang-Thuy Ha (2008). Web Search Clustering and Labeling with Hidden Topics. ACM Transactions on Asian Language Information Processing (to be submitted).
- One paper submitted to an international conference: Dieu-Thu Le, Cam-Tu Nguyen, Xuan-Hieu Phan, Quang-Thuy Ha, and Susumu Horiguchi (2008). Matching and Ranking with Hidden Topics towards Online Contextual Advertising. The 2008 IEEE/WIC/ACM International Conference on Web Intelligence, Sydney, Australia, December 2008.
- One paper presented at a national workshop, with the full text submitted: Le Dieu Thu, Tran Thi Ngan, Nguyen Cam Tu, Nguyen Thu Trang. Building an ontology to support semantic search in the medical domain. 11th National Workshop "Selected Issues in Information and Communication Technology", Hue, 12-14/6/2008.
- Several reports on entity extraction and entity disambiguation presented at the satellite laboratory "Knowledge Technology and Human-Machine Interaction".
- A master's thesis by PhD student Nguyen Cam Tu, titled "Hidden Topic Discovery Toward Classification and Clustering", successfully defended in May 2008.
- A bachelor's thesis by student Le Dieu Thu, successfully defended in June 2008.
- A corpus for Vietnamese entity extraction and entity disambiguation.
- A module for Vietnamese entity extraction and disambiguation, written in Java and tested on the corpus above.

Recommendations on the scale and target users of the research results:
The results can be applied to any research and deployment work on clustering and information extraction for Vietnamese on the Internet, one of the topical research subjects in Vietnam in recent years.

Uploaded: 18/03/2021, 16:31
