Document information
VIETNAM NATIONAL UNIVERSITY, HANOI (ĐẠI HỌC QUỐC GIA HÀ NỘI)

PLAGIARISM DETECTION BETWEEN VIETNAMESE DOCUMENTS
(PHÁT HIỆN SAO CHÉP GIỮA CÁC VĂN BẢN TIẾNG VIỆT)

Project code: QC.08.17
Principal investigator: Phạm Bảo Sơn

Hanoi, 2009

TABLE OF CONTENTS

FINAL REPORT
1. Introduction
2. Challenges
3. Overview of the research problem
   3.1. Common methods
   3.2. Background knowledge
4. Building a Vietnamese text corpus
5. Method for detecting copied documents in a large database
   5.1. Model for detecting near-duplicate documents in a large database
   5.2. Feature selection
   5.3. Computing the fingerprint of a document
   5.4. Assigning a document to a cluster
6. Experiments
   6.1. Experimental design and evaluation method
   6.2. Results
7. Conclusion
References

List of project participants (academic title, degree, affiliation)

Project leader:
• Dr. Phạm Bảo Sơn

Participants:

No.  Full name              Degree     Affiliation
1    Bùi Thế Duy            PhD        College of Technology (Trường ĐHCN)
2    Lê Anh Cường           PhD        College of Technology (Trường ĐHCN)
3    Trương Công Thành      Bachelor   College of Technology (Trường ĐHCN)
4    Nguyễn Quốc Đạt        Bachelor   College of Technology (Trường ĐHCN)
5    Nguyễn Quốc Đại        Bachelor   College of Technology (Trường ĐHCN)
6    Trần Bình Giang        Bachelor   College of Technology (Trường ĐHCN)

List of tables

Table 1. F-measure results for the first approach (using only the default features (syllables), without the features produced by Vietnamese-specific processing)
Table 2. F-measure results for the proposed model for detecting duplicated Vietnamese documents

List of figures

Figure 1. Model of the method for detecting copied Vietnamese documents
Figure 2. Chart of the experimental results for the first approach
Figure 3. Chart of the results (F-measure) obtained with the proposed model

OVERVIEW

Objective

Plagiarism detection is one of the most important problems affecting our lives, and it is actively studied by many research groups around the world. Tackling this task can bring many benefits to society, especially to academia, since a large amount of research and study material is widely published on the Internet. Moreover, solving the plagiarism detection problem contributes greatly to the performance of search engines. This project aims to build an effective method for plagiarism detection. Because plagiarism detection for Vietnamese has so far received very little study, the project pays particular attention to Vietnamese.

Research

• Study common plagiarism detection methods used around the world, such as DSC, I-Match, LSH, and Charikar's Simhash.
• Build a corpus of Vietnamese documents.
• Construct a framework for Vietnamese plagiarism detection in a very large database.

Result

Publications in international conferences published by IEEE CS:
• Cong Thanh Truong, The Duy Bui, Son Bao Pham, "Near-duplicates detection for Vietnamese Documents in Large Database", 7th IEEE International Conference on Advanced Language Processing and Web Information Technology, China, 2008.
• Dai Quoc Nguyen, Dat Quoc Nguyen, Son Bao Pham, The Duy Bui, "A Fast Template-based Approach to Automatically Identify Primary Text Content of a Web Page", in The 1st IEEE International Conference on Knowledge and Systems Engineering, Hanoi, Vietnam, 2009.

Application

The Vietnamese plagiarism detection framework is currently applied in the Xalo search engine of Tinhvan Media Company.

Academic result

Undergraduate theses:
• Cong Thanh Truong, "Near-duplicated Detection for Vietnamese Documents in Large Database", undergraduate thesis, College of Technology, 2008.
• Trần Bình Giang, "Vietnamese Blog Profiling", undergraduate thesis, College of Technology, 2009.
• Phạm Đức Đăng, "Vietnamese Word Segmentation method using Part-Of-Speech", undergraduate thesis, College of Technology, 2009.

Scientific contribution

Enhanced knowledge and skills in Natural Language Processing for the members of the laboratory.

SUMMARY OF THE PROJECT'S RESEARCH RESULTS

Scientific results (contributions of the project, published scientific works)

Papers published at specialized international conferences (published by IEEE CS):
• Cong Thanh Truong, The Duy Bui, Son Bao Pham, "Near-duplicates detection for Vietnamese Documents in Large Database", 7th IEEE International Conference on Advanced Language Processing and Web Information Technology, China, 2008.
  The paper applies the method for detecting copied Vietnamese documents in a large database to the detection of news items with nearly identical content, which improves the effectiveness of information retrieval systems. The method reduces the time needed to search content on the Internet and saves storage resources.
• Dai Quoc Nguyen, Dat Quoc Nguyen, Son Bao Pham, The Duy Bui, "A Fast Template-based Approach to Automatically Identify Primary Text Content of a Web Page", in The 1st IEEE International Conference on Knowledge and Systems Engineering, Hanoi, Vietnam, 2009.
  The paper applies the method for detecting copied Vietnamese documents to quickly find the template of a website in order to identify the main content of its pages.

Practical results (technology products, applicability in practice)

The proposed method for detecting copied (that is, duplicated) Vietnamese documents has been applied in the Xalo.vn information retrieval system of the Tinh Vân company.

Training results (number of undergraduate, master's and doctoral students who took part in the project; number of theses completed and defended)

Undergraduate theses in information technology:
• Cong Thanh Truong, "Near-duplicated Detection for Vietnamese Documents in Large Database", undergraduate thesis, College of Technology, 2008.
• Trần Bình Giang, "Vietnamese Blog Profiling", undergraduate thesis, College of Technology, 2009.
• Phạm Đức Đăng, "Phương pháp phân đoạn từ tiếng Việt sử dụng gán nhãn từ loại" (A Vietnamese word segmentation method using part-of-speech tagging), undergraduate thesis, College of Technology, 2009.

Results in strengthening scientific capacity (improved staff qualifications; equipment or software built, delivered and put into use at the unit)

Improved professional competence of the laboratory staff in natural language processing and artificial intelligence.

FINAL REPORT

1. Introduction

Determining how similar documents are to each other is an important problem that affects many areas of life. Deciding whether two or more documents are similar is currently an active research topic. A solution can be applied in many parts of society; one application is the detection of plagiarism, now that research material is widely and commonly published on the Internet. With the very rapid growth of the Internet and of search technology, solving the document similarity problem also has a positive impact on building search engines and on the efficiency of the whole search system.

In information retrieval systems, one of the primary goals is to present the most relevant pages to the user as quickly as possible. To achieve this, the search system must detect duplicate and near-duplicate pages, because they slow down the search and increase its memory cost. Pages with exactly duplicated content can be detected easily with a checksum; detecting near-duplicate content, however, is much more complex. A simple approach is to compare documents pair by pair to measure their similarity, but with the extremely large number of documents handled by a search engine this is not feasible because the complexity is too high. Several algorithms address this problem, such as Nearest Neighbor Search [3], Locality Sensitive Hashing [1], DSC and DSC-SS [4], Simhash of Charikar [2], or
I-Match [5].

On the Vietnamese Internet, statistics from 25 popular news sites such as Vietnamnet.com, Dantri.com and Ngoisao.net show that about [illegible]0% of news articles are duplicates or near-duplicates every day. Detecting these articles therefore plays an important role not only for search systems but also for research in language processing, such as document clustering, topic detection and content tracking, as well as for many other fields.

Around the world, many groups have focused on measuring the similarity between documents, and the results are widely used in information retrieval systems and in multi-document summarization. For Vietnamese, however, research and applications in this area are still very new. This project therefore concentrates on studying the problem and building an application for the Vietnamese language.

2008 International Conference on Advanced Language Processing and Web Information Technology (ALPIT 2008)

In the proceedings, the paper below is listed under Track 1: LPT (Language Processing Technology), starting on page 70.

Near-Duplicates Detection for Vietnamese Documents in Large Database

Cong Thanh Truong, The Duy Bui, Bao Son Pham
Vietnam National University, Hanoi

Abstract: Near-duplicate documents exacerbate the problem of information overload. Research in detecting near-duplicates has attracted a lot of attention from both industry and academia. In this paper, we focus on addressing this problem for Vietnamese documents, which, to the best of our knowledge, has not been done before. Most of the current algorithms have been designed for English and are not directly applicable to Vietnamese, a monosyllabic language. We propose to combine Charikar's algorithm [2] with a "weighting scheme" and Vietnamese-specific
features to address the language intricacy. Experimental results indicate that our scheme is effective for detecting near-duplicates in a corpus of Vietnamese documents.

Keywords: Charikar, LSH, near-duplicate, clustering

I. INTRODUCTION

One of the problems that web search engines have to deal with is how to detect duplicate and near-duplicate web pages. These pages either slow down the search or increase the cost of maintaining the index, and redundant results annoy users. A solution is therefore required that groups all near-duplicate documents together or removes them.

Duplicates can be easily detected by calculating checksums; the problem becomes complicated, however, for near-duplicated documents. The simplest solution is to compare each pair of documents: if the distance between them is small enough, they are considered near-duplicates. Again, the problem arises with a large database, where comparing every pair would not be practical. Several algorithms have been proposed to resolve the problem, such as Nearest Neighbor Search [3], Locality Sensitive Hashing [1], DSC and DSC-SS [4], Simhash of Charikar [2] and I-Match [5].

Statistics from twenty-five popular online Vietnamese newspapers reveal that many duplicate and near-duplicate articles are published every day. Detecting near-duplicate articles is of great importance in tasks such as topic detection and tracking.

Algorithms designed for English documents are not directly applicable to Vietnamese news articles, for two reasons:
• Syllables are separated by spaces, and there is no separate word delimiter in Vietnamese (for example, "học sinh" is one word made of two syllables).
• Vietnamese has a flexible word order, and words can be arranged differently while keeping the same meaning.

In this paper we propose an efficient and effective method for detecting near-duplicate Vietnamese documents. The improvement is obtained by introducing a "term weighting scheme". An optimization of the clustering step is also proposed to further reduce the number of documents that have to be compared and to improve the near-duplicate measurement.

In the remainder of this paper, Section II covers related work. We present some background in Section III and our approach in Section IV. Section V describes our experiments and evaluation results, followed by some conclusions.

II. RELATED WORK

A variety of techniques has been developed to detect plagiarism and duplicated database records. Brin et al. have proposed a prototype system called COPS (COpy Protection System) to protect the intellectual property of digital documents. Shivakumar et al. have developed SCAM (Stanford Copy Analysis Mechanism).

In the I-Match approach [5], an inverse document frequency (idf) weight is determined for each term in the collection. The idf of a term is defined as log(N/n), where N is the number of documents in the collection and n is the number of documents containing the given term. The overall runtime of the I-Match approach is O(d log d) in the worst case, where all documents are duplicates of each other, and O(d) otherwise, where d is the number of documents in the collection.

LSH [2] is an algorithm for solving the (approximate or exact) Near Neighbor Search problem in high-dimensional spaces. The main idea behind LSH is to reduce the number of dimensions and to use a hash function to reduce the runtime.

Simhash projects each feature into a b-dimensional space by randomly choosing b entries from {-1, 1}. This projection is the same for all documents. For each document, a b-dimensional vector is created by adding the projections of all the features in its feature sequence.

The Merge/Purge problem was proposed by Hernandez et al. to identify duplicate records coming from different source databases [9, 10, 15]. All records from the different databases are sorted on important discriminating key attributes. Each time the records are sorted on a certain key attribute, records within a small neighborhood are compared with each other and near-duplicate records are identified.

In Vietnam, although local search engines are developing rapidly, near-duplicate detection is still a difficult problem.

III. BACKGROUND

A. Similarity metrics

While it is unclear at which point a document is no longer a duplicate of another, researchers have examined several metrics for determining the similarity between two documents. Firstly, if a document contains roughly the same semantic content as another document, then it is considered a duplicate of that document.

The second most common approach to determining the similarity of two documents is the cosine similarity measure. This approach represents each document as a vector in an n-dimensional space. The similarity between two documents is then defined as the cosine of the angle between the two corresponding vectors:

\[ \cos(D_1, D_2) = \frac{D_1 \cdot D_2}{\lVert D_1 \rVert \, \lVert D_2 \rVert} \]

Thus, as the cosine for two documents approaches 1, they become more similar in relation to the features being compared. Researchers have to decide which terms should be used as the basis for comparison, and which threshold t determines whether two documents are duplicates. Two documents are called near-duplicates if their similarity measure is bigger than the set threshold t.

B. Charikar's algorithm

Each document is represented by a set of features and their corresponding weights. A hash function called Simhash is used to create the page fingerprint. Each feature is projected into an f-dimensional space by randomly choosing f entries from {-1, 1}. This projection is the same for all documents. For each document, an f-dimensional vector is created by adding the projections of all the features in its feature sequence. The final vector for the document is created by setting every positive entry in the vector to 1 and every non-positive entry to 0. This is the result of the random projection for each document. It has the property that the cosine similarity of two documents is proportional to the number of bits in which the two corresponding projections agree. Thus, the similarity of two documents is the number of bits on which their projections agree.

After hashing, Hamming distance is used to compute the distance between documents, and the distance threshold is then increased step by step to choose the most suitable value k. The algorithm works very well when language specifics are left aside; its quality, however, depends on feature selection and on how the corresponding weights are calculated for each language. In addition, when the data set is very large, runtime is also a big problem: with the documents of a large corpus we cannot simply compare every pair, so a solution to this issue is needed in specific situations.

IV. OUR APPROACH

A. Weighting scheme for Vietnamese documents

Terms play different roles in Vietnamese text, so giving the same weight to all terms is not appropriate: the weight of a feature depends on its context. For example, the same token "kiến" in the following two sentences has completely
different meanings. In the sentence "Con kiến đang bò trên cành cây" (the ant is crawling on a tree branch), "kiến" (ant) is the subject and a noun. In the other sentence, "Tôi muốn kiến nghị lên bà chủ tịch" (I want to petition my idea to the chairwoman), the token "kiến" is a part of the word "kiến nghị" (petition), and "kiến nghị" is a verb.

When selecting features, we calculate weights using the inverse document frequency (idf) weight and the order of tokens in the document. Feature selection can be done at token level, word level or n-gram level. In order to do feature selection at word level, unlike in English, we need to perform word segmentation. We then collect the frequency of words in the corpus and remove words whose frequency is too high or too low; the former may be stop words, and the latter are very rarely used.

Words that are of the same grammatical type and have the same frequency in the corpus still differ in importance, so we also take into account the position of a word and its number of syllables: longer words are usually more important than shorter words. The weight w(t) of a feature t is computed from the following quantities (the formula itself is not legible in the source):

- F(t) is the frequency of feature t in the corpus.
- F_max denotes the frequency of the most frequently used feature in the corpus.
- F_min denotes the frequency of the least frequently used feature in the corpus.
- P(t) is the position value of feature t in the document. It takes one of four values (not legible in the source), depending on whether t is in the title, in the first paragraph, in the last paragraph, or in other paragraphs. If t is in multiple areas, then P(t) is the sum of the corresponding position values.
- ||t|| is the number of syllables in the feature.

B. Near-duplicate detection framework

The architecture of the framework is described in Figure 1. Each document fetched by the crawlers is represented by a DocID and its fingerprint. A fingerprint F is created for each document identifier. We use the Hamming distance to measure the distance between a pair of fingerprints.

We divide the documents into multiple clusters, so that each document in a cluster has a fingerprint similar to those of the other documents in the cluster. When a new document is processed, if the Hamming distance between its fingerprint and the first fingerprint of a cluster is smaller than or equal to a constant k, it is assigned to that cluster. If this distance is bigger than k for every cluster, the document forms a new cluster.

This method still has a problem. Suppose, for example, that documents A and B are in the same cluster and that a document C is within k bits of B but further than k bits from A, the first fingerprint of the cluster. Document C would then not be assigned to the cluster of A and B. To solve this problem we propose an optimization: a document is in a cluster if and only if the Hamming distance from the document to half of the cluster's documents is smaller than k.

Input:
- A document D
- Argument k: the maximum distance between documents in a cluster

Output:
- The document fingerprint
- The cluster to which the document belongs

Preprocess:
- Fetch the new document D.
- Sort the articles by date and time in descending order, because articles with close publication dates have a higher probability of being near-duplicates. This reduces the number of candidates and speeds up the duplicate detection process.

Processes:
Step 1: Remove HTML tags and components.
Step 2: Perform word segmentation and compute statistics.
Step 3: Remove words whose frequency is too high or too low.
Step 4: Calculate the weights of the words.
Step 5: Sort the words in descending order of their weights.
Step 6: Use n-grams to create the feature list T of D.
Step 7: Create a fingerprint f_i for each feature t_i (i = 1, ..., N).
Step 8: Create the fingerprint F of document D with the Simhash function.
Step 9: Calculate the Hamming distance from the document to each cluster.
Step 10: If a cluster satisfies the condition, assign the document to that cluster.
Step 11: Finish.
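As an illustration of Steps 7 to 10, the following Python code is a minimal sketch of a weighted Simhash fingerprint and of the "half of the cluster" assignment rule. It is not the authors' implementation. Several details are assumptions made only for the example: the function names are hypothetical, MD5 stands in for the unspecified feature hash, the feature weights are passed in by the caller because the paper's weight formula is not reproduced here, and "within k bits" is read as a Hamming distance of at most k.

```python
import hashlib
from typing import Dict, List


def feature_hash(feature: str, bits: int = 64) -> int:
    """Hash a feature to a `bits`-bit integer (MD5 is an assumption of this sketch)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(weights: Dict[str, float], bits: int = 64) -> int:
    """Weighted Simhash over a {feature: weight} mapping."""
    v = [0.0] * bits
    for feature, weight in weights.items():
        h = feature_hash(feature, bits)
        for i in range(bits):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:  # positive component -> bit 1, otherwise bit 0
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def assign_cluster(fp: int, clusters: List[List[int]], k: int) -> int:
    """Add fp to the first cluster in which at least half of the members
    are within k bits; otherwise start a new cluster. Returns the cluster index."""
    for index, members in enumerate(clusters):
        close = sum(1 for m in members if hamming(fp, m) <= k)
        if 2 * close >= len(members):
            members.append(fp)
            return index
    clusters.append([fp])
    return len(clusters) - 1


# Usage: fingerprint two toy documents and cluster them with k = 3.
clusters: List[List[int]] = []
doc_a = {"kiến nghị": 2.0, "chủ tịch": 1.5, "hà nội": 1.0}
doc_b = {"kiến nghị": 2.0, "chủ tịch": 1.5, "hà nội": 0.9}
for doc in (doc_a, doc_b):
    assign_cluster(simhash(doc), clusters, k=3)
```

Checking every member of every cluster keeps the sketch short, but it scales with the number of stored fingerprints. Sorting articles by publication date, as in the preprocessing step, is one way to limit the candidates that have to be compared.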
Figure 1. Our framework

C. Fingerprint computation

Each document has a feature set, and each feature has a corresponding weight. We use Simhash to generate an f-bit fingerprint as follows. We initialize an f-dimensional vector V, each dimension of which is set to zero. Each feature is hashed into an f-bit hash value. For each bit position i of the f bits: if the i-th bit of the hash value is 1, the i-th component of V is incremented by the weight of that feature; if the i-th bit of the hash value is 0, the i-th component of V is decremented by the weight of that feature. When all features have been processed, some components of V are positive while others are negative. The signs of the components determine the corresponding bits of the final fingerprint of the document.

Then the Hamming distance from the fingerprint to each cluster is calculated. A document belongs to a cluster if and only if the distance from the document to half of the documents in the cluster is smaller than k.

D. Word segmentation

Word segmentation is always an important problem in processing Vietnamese documents. The precision of the word segmentation phase plays an important role in the performance of the whole system. We tackle this problem with an algorithm that finds the shortest paths in a weighted graph. Its output is the list of segmentations of the syllable sequence in which the number of words is the smallest. The algorithm works as follows:

Step 1: Build a weighted graph G for each sentence of length n. The graph G has n+1 vertices, so that vertex j corresponds to syllable j in the sentence. There is an edge from vertex i to vertex j if and only if the chain of syllables from i+1 to j makes a word in the given Vietnamese dictionary.

Step 2: Use the Dijkstra algorithm to find all shortest paths in G. Each path is a potential solution for segmenting the sentence into words.

Step 3: Assign parts of speech. For each potential word segmentation solution, assign a part of speech (POS) to each word in the sentence. For each POS option we calculate the probability of the sentence as follows:

\[ P(\text{sentence}) = \prod_{i} P(T_i) \cdot \prod_{i} P(T_i, T_{i+1}, T_{i+2}) \]

where P(T_i) is the probability that part of speech T_i appears for the i-th word, and P(T_i, T_{i+1}, T_{i+2}) is the probability that these parts of speech stand next to each other in the corpus.

Step 4: Choose the solution that has the maximum probability.

Figure 2. Word segmentation

V. EXPERIMENTS AND EVALUATION

We have carried out experiments to evaluate the efficiency of our framework. The experiments are performed for two approaches. The first approach uses a basic token set that does not contain any features generated by Vietnamese processing tools (segmentation and part-of-speech tagging); in the basic feature set, features are simply tokens delimited by white space, together with their corresponding frequencies. The second approach uses the full feature set, including our weighting scheme.

We have also carried out experiments to find the best values for the following parameters: F, the length of the fingerprint, and k, the maximum distance at which two documents are considered a pair of near-duplicates. We varied k from 1 to 10 and F among 32, 64 and 128, and for each setting we sampled an equal number of pairs of fingerprints that are at a Hamming distance smaller than k.

We use the precision, recall and F-measure metrics to evaluate how well an approach can detect near-duplicated documents. The higher the F-measure an approach obtains, the better the approach.

\[ \text{precision} = \frac{|\text{retrieved dup pairs} \cap \text{correct dup pairs}|}{|\text{retrieved dup pairs}|} \]

\[ \text{recall} = \frac{|\text{retrieved dup pairs} \cap \text{correct dup pairs}|}{|\text{correct dup pairs}|} \]

\[ \text{F-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]

A. Experiment corpus

With the development of many Vietnamese electronic newspapers, readers are provided with numerous sources of documents. These sources, however, do not always provide fresh news: news on different websites often refers to one original article. 10000 articles have been collected from the 25 most famous Vietnamese electronic newspapers using our Vietnamese search engine. We have processed these articles to create our corpus with the following steps:

• Classify articles that have close publication dates into a group. In each group, sort the articles in decreasing order of size. We manually annotate the articles that are in the same group and of approximately the same size, to mark near-duplicated articles.
• From the 10000 articles, we randomly select 100 non-duplicate articles and then permute their paragraphs to create new articles. The newly created articles and their originals are marked as near-duplicated.
• Compute statistics of words, tokens and 2-grams in the corpus. These statistics are used to calculate the weights of features in our framework. There are about [illegible] tokens, 2,300,000 words and 197,200,000 2-grams.

B. Results

Figure 3. F-measure chart using the basic feature set (curves for F = 32, F = 64 and F = 128; F-measure in percent on the vertical axis).

When we apply our framework, we obtain the results below.

Table 2. F-measure scores for different values of F and k using our weighting scheme and Vietnamese-specific features. (The table columns are F = 32, 64 and 128. The cell values are not recoverable from the source; the legible entries include 83.56 and 78.32.)

As can be seen from Table 1, with the basic feature set, which does not contain any features generated by Vietnamese processing tools, the highest F-measure is 54.2%.

REFERENCES

(Only the following entries are legible in the source; their reference numbers are not recoverable.)

- S. Brin, J. Davis, H. Garcia-Molina. Copy detection mechanisms for digital documents. In ACM SIGMOD Annual Conference, San Francisco, CA, May 1995.
- N. Shivakumar, H. Garcia-Molina. SCAM: A copy detection mechanism for digital documents. In Proceedings of the 2nd International Conference in Theory and Practice of Digital Libraries, Austin, Texas, June 1995.
- [...] Malcolm. A theoretical basis to the automated detection of copying between texts, and its practical implementation in the Ferret plagiarism and collusion detector. In Plagiarism: Prevention, Practice and Policies Conference, June 2004.
- [...] comparison of approaches. Proceedings [...]
nf i/ic (>th I.tniiỉiitiỊỉt' l< i\ in it\ s jn i I v a n i j l i ‘ >n ( o í; /i’i v « u ' I k l ■( 2'»» x h c s v 2U \ i \ i i - < 'p in r k il A lg o r it h m s lo i N e a r N e iíĩh h o r P i o N o m c 'u k 11S) D in h í) ic n llo a n i; k i c m l*( ) S - L iụ u c i Im I iiỊili.sli • \ ’ ic in a n iO M ' H ilm iiU iil ( I« p ti> U V n k N h iip l i i n k l i n i ! a n d I MMi; 'a u llc l l c \ t s D ĨIl I I )| I \ c n M ic ln n c 11 insl.itiD M , i n j H c \o m i lo i th e h u tu rc K Ỉ V I 15] In í)ic tu > n a r> lo o k -u p vvith o n c c ir o r 12 I I I P h u o n u a n d 11 I \ ‘ m h \ M u x im u m r n l i «*p y A p p to d t h 1«» S c n ic n c c M o u iìd a rx D c t c c im n o i V ic in a m c s c Í t M s II I I In lc r n iiIio iK il C o n lc r c iu c OI1 UcNLMrch lnnt>\vUHHi in d V i M im V III N In «*!' u t‘ A liio r ith m s ( 11 Iv d I W ỌC.0 17 HI System p u u cs I 10 lan 44 iC N c m h la n c c R in t lc r A W ch In 116 ] A docum cius" l’m m V ie ln a m N u iio iu l l im c iM l} I líUUM No III T c \i ilo c u m c n is ( o n lc ie r K e Tlìis work has bccn lìnancialK supporlcd h\ llic Research granl "Plagiarism Detcction lịr Victiiiimc.sc s L a rtie * Shori Ị1 l l c m t / e N' S c a la h lc d o c u n ic n t liim c r p r in tin u l S I N I X W 'o rk-.sh o p (in I : lc c i i »H IIC C’i» m m c rc c l lí % W c ihunk I inh V a n M c d iu C o m p u iụ lor lc lú n g us II.SC ih cir dalahasc as \ve ll as lo r su p p o rlinu us lin n n a u llv ị}\ A ^ 'u k in llu a n iỉ X u c q i C h c n e I 1M ll S l- N IX C o n lc io u c c \';k C v II'ii c d ito is 12 \\c h Nhun Bai m c th o d |l| to r S c a lc P v a lu a t io n o t' A lg o r it h m s III P ro c c c d in ẽ s o t' th c ^ th approprialc valucs ol I and k so \vc can appl) ihc rcsull to Vietname.se searcli cngines Wc w ill continue improve om írameuork by using spcciíìc Victnamcsc languagc processinc mcthods combining \viih hasli íunclions V II «»l a n n u a l in ic r n a t io n a l A C M S Í G I R c o n lc rc n c e u n 
'R e s e a rc h an d d e v e lo p m e n t in m t o r m a t io n r c tr ic v a l A C M Press 0 1’ro i' our c o n ia in m c iu i m p r o \ c J NidbilitN in P ro c e e d in g s o í th c I t h In te rn a tio n a l c o n lc r c n t c [ I Ị M c n z in g e r Charikar algorithm combincd \viih uũghtinu schcmc is elTcctivc and c llìcic n l lo dclcct nciir duplicale III Vietnamest* arliclcs I; \perimcntiil result prove thai, thc ol an d ( II(I4 | in (ìn WorŨ W idc Wcb A C M Press 2007 C O N C L USION l*-m easurc C ln » \ \d h u r y c t vil S lu d ic s o ! I - M a i c l i s ig n a tu ic s \ i a lc M c o n r a n d o m i/a t ió n A< »1 |9 ] D a ta h a s c s " highest O n ih c rc N c m h la n c c In S I - O S S e q u c n i.c s V I A | l l ) ' Dctecling VI P la e ia r is m IS e a s \ b u i u lso cas> to d e ie c i P la g ia rs C r n s s -D is c ip lin a n H a g ia r is m I a h n c a lio n a n d Id ls it ic a im n ol S o n B a o P h a m KSE2009 notiíication KSE2009 Sat j u| 2009 at 10:51 AM To Son Bao Pham < s b p h a m @ g m a il.c o m > We are pleased to iníorm you that y o u r sub m ission to K S E -2 0 has been accepted as a full paper for conference Please revise y o u r p a p e r to in c o rp o te re v ie v v e r c o m m e n ts fo r c a m e r a - r e a d y v e r s io n s u b m is s io n A d d re s s in g reviewer concerns in y o u r c a m e -re a d y p a per is o f pa rticuỉar signiíicance since the Program C om m ittee may revisit your papers to e n sure that th e se co n c e rn s have been adeqũately addressed Thepage lim it for full p a per is F u rth e r C M R subm ission and registration instructions will be sent later We look forward to see in g you in H anoi in O ctober Best regards Ngoe Thanh N guyen, The D uy Bui, E dw ard S zczerbicki Paper 72 Title: A Fast T e m p la te -b a s e d A p p ro a c h to A utom atically Identiíy Pnm ary Text C ontenl of a W ob Page .re vie w PAPER: 72 TITLE: A Fast T e m p ỉa te -b a s e d A p p ro a c h to A u tom atically Identity Pnm 
ary Text C ontent of a WeD Page OVERALL R ATIN G : (a cce p t) REVỈEWER'S C O N F ID E N C E (m edium ) REVIEWER'S C O N P ID E N C E (m edium ) ORIGINALITY: (M o d e te ly O ngm ai) SIGNIFICANCE: (V ery S igm íican t) PRESENTATION A N D R E A D A B IL IT Y : (Very Good) RELEVANCE FO R T H E C O N F E R E N C E : (V ery R elevant) TECHNICAL Q U A LIT Y : (S ee m s S oun d) ' RECOMMEND AS S H O R T P A P E R /P O S T E R : (yes) REVIEVV The authors in troduce d a fast a lgo rithm for de tecting mam context blocks in web pages a u to m a tic a lly This see m s to be a considerabỉe im provem ent of a prior vvork called C ontentE xtra ctor a lgo rithm The p a per IS readable in general The paper is m o stly o f e x p e rim e n ta l cha racte r and is not so cleai how the presenl algorilhm can w ork w ell It w o u ld be a lso mce if com pariso ns with different types of related algorithm s are m ade re vie w PAPER' 79 riTLE: A Fast T e m p la te -b a s e d A p p ro a c h to A utom atically Ident.tv P ' " - y Com em o ' a Veb P a g '; OVERALL R ATIN G : (stro n g acce pt) rEVIẼWER'S C O N F ID E N C E : (h ig h) rEVIẼVVER^S C O N F ID E N C E : (h ig h) ORIGINALITY: (M o d e te ly O rig in al) SIGNIFICANCE: (V ery S ig n iíic a n t) PRÉSENTATION A N D R E Ă D A B IL IT Y : (V ery G ood) RELEVANCE FO R T H E C O N F E R E N C E : (V ery R elevant) TECHNICAL Q U A LIT Y : (T e c h n ic a lly S oun d) RECOMMEND A S S H O R T P A P E R /P O S T E R : (no) R E V IE V V - .As a w eb p a g e c o n ta in s n o t o n ly in fo rm a tiv e c o n te n t but a ls o n o n -m fo rm a tiv e c o n te n t s u c h a s a d v e rtis e m e n t, navigation lin k s , e t c it is im p o rta n t fo r a s e a r c h e n g in e to e x lr a c t ju s t the in to rm a tiv e p art of the w e b p a g e s li vvants to search The au th o rs p ropose F a stC o n te n tE xtra cto r as an extension of C ontentE xtra ctor an earlier effective lechnique used to ex tra c t th e in ío rm a tiv e c o n te n t from 
a w ebsite The idea IS based on detecting the web block lemplate that is com m on a m ong the site's pages U sing this tem plate inío rm ative blocks can be extracted quickly The idea is somevvhat novel and the e x p e rim e n t w ith real w ebsites show s a signiíicant im provem ent over ContentExtractor T h e p a p e r is w e lỉ-p re s e n te d Onelhing not very c le a r is the a lgo rithm used to detect the tem plate The paper does not mdicate clearly w hether Ihis is done m anually o r by an a lgo rithm I think the authors mean an algonthm but m ore details vvould be appreciated revievv PAPER: 72 TITLE A Fast T e m p la te -b a s e d A p p ro a c h to AulurTialically Identiíy Pnm ary Text C ontent of a W eb Page OVERALL RATING: (stro n g acce pt) REVIEWER'S C O N F ID E N C E (h ig h) REVIEWER'S C O N F ID E N C E : (h ig h) ORIGINALITY: (V ery O rig m al) SIGNIFICANCE: (V ery s ìg n iíic a n t) PRESENTATION A N D R E A D A B IL IT Y : (E xcellent) RELEVANCE FO R TH E C O N F E R E N C E : (Very R elevant) TECHNICAL Q U A LIT Y : (S ee m s S oun d) _ RECOMMEND AS S H O R T P A P E R 'P O S T E R (no) R E V IE W The paper presents an a p p ro a c h to iden tify prim ary text content Irom web pages A very well w ritten pa per clear, co n cise , w ith good exam ples The authors should be v e ry plea sed w ith the outcom e of this paper The approa ch is w ell th o u g h t out and provides substantial beneíits a s s ta te d T h e e v a lu a tio n w h ic h IS th e n e x l m o s t important a s p e c t IS v e ry w e ll c o n d n c te d a n d e x p la in e d Beware of the very fe w g m m a tic a l errors (e g by a traversing path) and lypographical e rro rs (e g g u a n tie s ) A Fast T e m p la te -b a s e d A p p r o a c h to A u tom atically Identify Prim ary Text Contcnt o f a W eb Page D at Q u o c N g iụ e n D a i Ọ u o c N g m e n Son B a o P h a m T h e Du> B u i llu m a n M a c h in c In tc c lio n l.a b o to rx C o l l c g c o l l c c h n o lo g ) V ic ln a m 
National University, Hanoi

Abstract: Search engines have become an indispensable tool for browsing information on the Internet. The user, however, is often annoyed by redundant results from irrelevant web pages. One reason is that search engines also look at non-informative blocks of web pages such as advertisements, navigation links, etc. In this paper, we propose a fast algorithm called FastContentExtractor to automatically detect main content blocks in a web page by improving the ContentExtractor algorithm. By automatically identifying and storing templates representing the structure of content blocks in a website, content blocks of a new web page from the website can be extracted quickly. The hierarchical order of the output blocks is also maintained, which guarantees that the extracted content blocks are in the same order as the original ones.

Keywords: data mining, template detection, web mining

Nowadays, search engines have become an indispensable tool for browsing information on the Internet. While there are many useful search engines available, the users are still annoyed by redundant results from irrelevant web pages. One of the reasons is that web pages often contain non-informative blocks such as advertisements, links, etc. A search engine which lacks effective content block detection capacity often searches in non-informative blocks and therefore produces redundant results.

A block in a web page is often defined as a part of the web page surrounded by an open tag such as <TABLE> [...] and a matching close tag [11]. Detecting which blocks are primary text content blocks not only induces efficiency in storage for a search engine but also improves search efficiency in order to increase users' satisfaction. Manually marking content blocks is not a feasible solution for a search engine. In this paper we consider the task of automatically detecting content blocks in a web page.

Web pages of the same website usually have similar structures. Furthermore, non-content blocks are often situated in fixed positions. Utilizing those observations, content blocks in a web page can be automatically detected. At present, several methods have been proposed to tackle this problem, including ContentExtractor by Debnath et al. [11][12], the noise elimination method by Yi et al. [...] and InfoDiscoverer by Lin and Ho [13]. Among them, ContentExtractor appears to be the most effective algorithm to extract primary content blocks [...]. For a web page, ContentExtractor finds content blocks by comparing each of its blocks with all blocks of [...] web pages from the same website. The main disadvantage of this algorithm is that it is quite slow when the number of input web pages is large. Moreover, the ContentExtractor algorithm does not preserve the hierarchical order of output blocks, because the extracted content blocks may not appear in the same order as the original ones. This might prevent the search engine from correctly searching for an exact phrase when the phrase spans two consecutive blocks.

In this paper we propose FastContentExtractor, a fast algorithm to automatically detect content blocks in web pages by improving ContentExtractor. Instead of storing all input web pages of a website, we automatically create a template to store information about content blocks and possibly wrongly detected blocks for later retrieval. Each block in a web page can be identified, although not always uniquely, by a traversal path in a hierarchical tree of blocks which represents the web page. A template consists of a set of absolute paths of content blocks and of non-content blocks having the same paths as those of content blocks. By storing the absolute paths, the hierarchical order of the output blocks is maintained, which guarantees that the extracted content blocks are in the same order as the original ones. A template for a website is stored; each newly crawled web page is compared with the template to find its primary content [...]. The number of extracted blocks, and thus of comparisons, in FastContentExtractor is much smaller than in ContentExtractor, which makes FastContentExtractor faster than ContentExtractor.

The rest of the paper is organized as follows. We summarize related materials and methods in Section II. In Section III we describe our approach. Some experiments are presented in Section IV in order to show the performance of our approach.

II.

Several methods have been proposed to detect content blocks or non-content blocks in web pages automatically. Yi et al. [...] have proposed a new tree structure called the Site Style Tree (SST), built from the DOM trees of different web pages from the same website; an SST is created for each website. Yi et al. also presented formulas for calculating the importance of each node in the SST, used to eliminate noisy information in web pages, which helps [...] primary content. The problem of this approach appears when the number of input web pages is large [...].
[...] visually separating web pages into blocks based on vertical and horizontal lines, they calculated the block frequency for each block. If the block frequency value of a block is high, it is a template block, which is then labeled for building the template model. Mehta and Madaan [10] presented an approach using a regex-based template. By segmenting web pages based on the template, they could detect important sections. Vieira et al. [...] used tree mapping together with the RTDM-TD Algorithm and the Retrieve Template Algorithm for detecting the template.

Lin and Ho [13] introduced a method to identify content blocks by partitioning a web page into blocks based on the <TABLE> tag. Entropy values of the terms appearing in each block are calculated and used for determining content blocks.

ContentExtractor [11][12] appears to be the most effective algorithm to identify primary informative content blocks. The input of this algorithm is a set of web pages that are assumed to have a similar structure. First, the algorithm partitions each page into atomic blocks. An atomic block is a block that does not contain any other block. In the next step, for an atomic block B, the algorithm calculates the number of web pages that contain a block similar to B. If block B occurs many times over different web pages, block B is considered a non-content block and it is removed. Otherwise, block B is considered a primary content block. Figure 1 shows a block, with its corresponding tag, of a web page. This block contains four atomic sub-blocks with corresponding tags (see the source code in Figure 2). ContentExtractor then partitions the block into five blocks
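The ContentExtractor steps described above (partition pages into atomic blocks, count the pages containing a block similar to B, discard blocks that occur on many pages) can be illustrated with a minimal sketch. This is not the authors' or Debnath et al.'s implementation. It assumes each page has already been partitioned into atomic blocks given as strings, it treats two blocks as "similar" when they are equal after whitespace and case normalization, and the `max_fraction` threshold and the names `normalize` and `find_content_blocks` are hypothetical.

```python
from collections import Counter


def normalize(block):
    """Assumed similarity key: collapse whitespace and ignore case."""
    return " ".join(block.split()).lower()


def find_content_blocks(pages, max_fraction=0.5):
    """Keep blocks that appear on at most `max_fraction` of the pages.

    pages: list of pages, each page a list of atomic block strings.
    Returns one list of kept blocks per page, in the original order.
    """
    # Count, for every distinct block, how many pages contain it.
    page_counts = Counter()
    for blocks in pages:
        for key in {normalize(b) for b in blocks}:
            page_counts[key] += 1

    threshold = max_fraction * len(pages)
    # Blocks seen on many pages are treated as non-content and dropped.
    return [
        [b for b in blocks if page_counts[normalize(b)] <= threshold]
        for blocks in pages
    ]


if __name__ == "__main__":
    pages = [
        ["Home | About", "Story A"],
        ["home  |  about", "Story B"],
        ["Home | About", "Story C"],
    ]
    print(find_content_blocks(pages))
```

Because every block is checked against counts gathered over all input pages, the cost grows with the number of pages, which matches the slowness the paper attributes to ContentExtractor on large inputs.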