Plagiarism Detection among Vietnamese Documents


VIETNAM NATIONAL UNIVERSITY, HANOI

PLAGIARISM DETECTION AMONG VIETNAMESE DOCUMENTS

Project code: QC 08.17
Principal investigator: Phạm Bảo Sơn

Hanoi, 2009

TABLE OF CONTENTS

FINAL REPORT
1. Introduction
2. Challenges
3. Overview of the research problem
3.1. Common methods
3.2. Background knowledge
4. Building a corpus of Vietnamese documents
5. A method for detecting copied documents in a large database
5.1. A model for detecting near-duplicate documents in a large database
5.2. Feature selection
5.3. Computing the fingerprint of a document
5.4. Determining the cluster of a document
6. Experiments
6.1. Experimental setup and evaluation method
6.2. Results
7. Conclusion
References

Project participants (academic title, degree, affiliation)

Project lead:
• Dr. Phạm Bảo Sơn

Team members (name, degree, affiliation):
• Bùi Thế Duy, PhD, College of Technology
• Lê Anh Cường, PhD, College of Technology
• Trương Công Thành, Bachelor, College of Technology
• Nguyễn Quốc Đạt, Bachelor, College of Technology
• Nguyễn Quốc Đại, Bachelor, College of Technology
• Trần Bình Giang, Bachelor, College of Technology

List of tables
Table 1. F-measure results for the first approach (default syllable features only, without features obtained from Vietnamese language processing)
Table 2. F-measure results for the proposed model for detecting duplicated Vietnamese documents

List of figures
Figure 1. Model of the method for detecting copied Vietnamese documents
Figure 2. Chart of the experimental results for the first approach
Figure 3. Chart of the results (F-measure) for the proposed model

OVERVIEW

Objective
Plagiarism detection is an important problem that is actively studied by many research groups around the world. Tackling it benefits society, and academia in particular, since research results and study materials are now widely published on the Internet. Solving the problem also contributes greatly to search engine performance. This project aims to build an effective method for plagiarism detection. Because plagiarism detection for Vietnamese has so far received very little study, the project pays particular attention to Vietnamese.

Research
• Survey common plagiarism detection methods such as DSC, I-Match, LSH and Charikar's Simhash.
• Build a corpus of Vietnamese documents.
• Construct a framework for Vietnamese plagiarism detection in a very large database.

Result
Publications in international conferences published by IEEE CS:
• Cong Thanh Truong, The Duy Bui, Son Bao Pham, "Near-duplicates Detection for Vietnamese Documents in Large Database", 7th IEEE International Conference on Advanced Language Processing and Web Information Technology, China, 2008.
• Dai Quoc Nguyen, Dat Quoc Nguyen, Son Bao Pham, The Duy Bui, "A Fast Template-based Approach to Automatically Identify Primary Text Content of a Web Page", in The 1st IEEE International Conference on Knowledge and Systems Engineering, Hanoi, Vietnam, 2009.

Application
The Vietnamese plagiarism detection framework is currently applied in the Xalo search engine of Tinhvan Media Company.

Academic result
Undergraduate theses:
• Cong Thanh Truong, "Near-duplicated Detection for Vietnamese Documents in Large Database", undergraduate thesis, College of Technology, 2008.
• Trần Bình Giang, "Vietnamese Blog Profiling", undergraduate thesis, College of Technology, 2009.
• Phạm Đức Đăng, "Vietnamese Word Segmentation Method Using Part-of-Speech", undergraduate thesis, College of Technology, 2009.

Scientific contribution
Enhanced knowledge and skills in natural language processing for members of the laboratory.

SUMMARY OF THE MAIN RESEARCH RESULTS OF THE PROJECT

Scientific results (contributions of the project, published scientific works)
Papers published at specialized international conferences (published by IEEE CS):
• Cong Thanh Truong, The Duy Bui, Son Bao Pham, "Near-duplicates Detection for Vietnamese Documents in Large Database", 7th IEEE International Conference on Advanced Language Processing and Web Information Technology, China, 2008. The paper applies the copy detection method for Vietnamese documents to a large database in order to detect news items with nearly identical content, which improves the effectiveness of an information retrieval system. The method reduces the time needed to search content on the Internet and saves storage.
• Dai Quoc Nguyen, Dat Quoc Nguyen, Son Bao Pham, The Duy Bui, "A Fast Template-based Approach to Automatically Identify Primary Text Content of a Web Page", in The 1st IEEE International Conference on Knowledge and Systems Engineering, Hanoi, Vietnam, 2009. The paper applies the copy detection method for Vietnamese documents to quickly find the template of a website and thereby identify the main content of a page.

Practical results (technology products, applicability)
The proposed method for detecting copied, or duplicated, Vietnamese documents has been applied in the Xalo.vn search system of the Tinh Vân company.

Training results (number of undergraduate, master's and doctoral students working on the project; number of theses completed and defended)
Undergraduate theses in information technology:
• Cong Thanh Truong, "Near-duplicated Detection for Vietnamese Documents in Large Database", undergraduate thesis, College of Technology, 2008.
• Trần Bình Giang, "Vietnamese Blog Profiling", undergraduate thesis, College of Technology, 2009.
• Phạm Đức Đăng, "A Vietnamese word segmentation method using part-of-speech tagging", undergraduate thesis, College of Technology, 2009.

Results in strengthening scientific capacity (staff qualifications, equipment or software built and handed over for use at the unit)
Improved professional competence of the laboratory staff in natural language processing and artificial intelligence.

FINAL REPORT

1. Introduction

Determining how similar documents are is an important problem that affects many areas of life, and deciding whether two or more documents are similar is an active research topic. Solutions have many applications in society. One of them is detecting plagiarism, now that research documents are published widely on the Internet.

With the very rapid growth of the Internet and of search technology, solving the document similarity problem also matters for building search engines and for improving the efficiency of the whole search system. In an information retrieval system, one of the primary goals is to present the most relevant pages to the user as quickly as possible. To achieve this, the system has to detect duplicate and near-duplicate pages, because they slow down the search and increase its memory cost.

Pages whose content is exactly identical can be detected easily with a checksum, but detecting near-duplicate content is much harder. A simple approach is to compare every pair of documents, but with the extremely large number of documents handled by a search engine this is not feasible, because the complexity is too high. Several algorithms address the problem, such as Nearest Neighbor Search [3], Locality Sensitive Hashing [1], DSS, DSC-SS [4], the Simhash of Charikar [2] and I-Match [5].

On the Vietnamese Internet, statistics from the 25 most popular news sites, such as Vietnamnet.com, Dantri.com and Ngoisao.net, show that about ?0% of news items (the leading digit is illegible in the source) are duplicates or near-duplicates every day. Detecting these items is therefore important not only for search systems but also for research in language processing, such as document clustering, topic detection, content tracking and many other areas.

Many groups around the world have studied how to determine the similarity between documents, and there are many widely used applications in information retrieval systems and multi-document summarization. For Vietnamese, however, research and applications in this area are still very new. This project therefore concentrates on studying the problem and building applications for the Vietnamese language.
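The contrast drawn in the introduction, between exact duplicates that a checksum catches and near-duplicates that it misses, can be illustrated with a minimal sketch. The choice of SHA-256, the function names and the three sample sentences are assumptions made for illustration only; they do not come from the report.

```python
import hashlib


def checksum(text):
    """Hex digest of a document's UTF-8 bytes (SHA-256 is an arbitrary choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def exact_duplicate_groups(docs):
    """Group document ids whose contents are byte-for-byte identical."""
    by_digest = {}
    for doc_id, text in docs.items():
        by_digest.setdefault(checksum(text), []).append(doc_id)
    return [ids for ids in by_digest.values() if len(ids) > 1]


def pairwise_comparisons(n):
    """Number of document pairs a naive all-pairs comparison must examine."""
    return n * (n - 1) // 2


# Hypothetical sample documents: "c" differs from "a" by a single extra word.
docs = {
    "a": "Giá xăng tăng từ ngày mai.",
    "b": "Giá xăng tăng từ ngày mai.",
    "c": "Giá xăng sẽ tăng từ ngày mai.",
}

groups = exact_duplicate_groups(docs)      # only "a" and "b" are grouped
pairs_for_corpus = pairwise_comparisons(10000)
```

The checksum groups the two identical documents but treats the near-duplicate as unrelated, and an all-pairs comparison over 10,000 articles would already need 49,995,000 comparisons, which is why fingerprinting methods are attractive.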
International Conference on Advanced Language Processing and Web Information Technology (ALPIT 2008)

Near-Duplicates Detection for Vietnamese Documents in Large Database

Cong Thanh Truong, The Duy Bui, Bao Son Pham
Vietnam National University, Hanoi

Abstract: Near-duplicate documents exacerbate the problem of information overload. Research in detecting near-duplicates has attracted a lot of attention from both industry and academia. In this paper, we focus on addressing this problem for Vietnamese documents, which, to the best of our knowledge, has not been done before. Most of the current algorithms have been designed for English and are not directly applicable to Vietnamese, a monosyllabic language. We propose to combine Charikar's algorithm [2] with a "weighting scheme" and Vietnamese-specific features to address the language intricacy. Experimental results indicate that our scheme is effective for detecting near-duplicates in a corpus of Vietnamese documents.

Keywords: Charikar, LSH, near-duplicate, clustering

I. INTRODUCTION

One of the problems that web search engines have to deal with is how to detect duplicate and near-duplicate web pages. These pages either slow down the searching process or increase its cost, because of the redundant results and the increasing space needed to store the index [...]. A solution is to remove all near-duplicate documents or to group them together [...]. Identification of near-duplicates also matters for [...] tasks such as topic detection and tracking.

Duplicates can be detected easily by calculating checksums when the documents are extracted. However, the problem becomes complicated with near-duplicated documents. The simple solution is to compare a pair of documents: if the distance between them is small enough, they are near-duplicates. Again, a problem arises with a large database, where comparing every pair would not be practical. Several algorithms, such as Nearest Neighbor Search [3], Locality Sensitive Hashing [1], DSS, DSC-SS [4], the Simhash of Charikar [2] and I-Match [5], have been proposed to resolve the problem.

In Vietnam, such duplicate problems are present as well. Statistics from twenty-five popular online Vietnamese newspapers reveal that [...] of the articles are duplicates every day. Detecting, tracking and eliminating duplicate articles is therefore of great importance [...]. Vietnamese documents differ from English documents in several ways:
• Syllables are separated by spaces [...] in Vietnamese [...].
• Vietnamese has a flexible word order, and words can be arranged differently while keeping the same meaning [...].

In this paper, we propose an efficient and effective [...] for Vietnamese documents by introducing a "term weighting scheme" which [...]. An optimization algorithm that improves the near-duplicate measurement and further reduces the number of documents [...] is also proposed.

The remainder of this paper is organized as follows. Section II covers related work. We present some background in Section III and our approach in Section IV. Section V describes our experiments and some evaluation results.

II. RELATED WORK

A variety of techniques has been developed to identify plagiarism, other forms of text duplication, and duplicate database records.
Brin et al. have proposed a prototype system called COPS (Copy Protection System) to identify [...] intellectual property [...]. Shivakumar et al. have developed SCAM (Stanford Copy Analysis Mechanism) [...].

I-Match [5] determines which terms should be used as the basis for comparison. An inverse document frequency (idf) weight is determined for each term in the collection. The idf of a term is defined as $\log(N/n)$, where $N$ is the number of documents in the collection and $n$ is the number of documents containing the given term. The overall runtime of the I-Match approach is $O(d \log d)$ in the worst case, where all documents are duplicates of each other, and $O(d)$ otherwise, where $d$ is the number of documents in the collection.

LSH [2] is an algorithm for solving the (approximate or exact) Near Neighbor Search problem in high-dimensional spaces. The main idea behind LSH is to reduce the number of dimensions and to use a hash function to reduce runtime.

Simhash projects each feature into a b-dimensional space by randomly choosing b entries from {-1, 1}. This projection is the same for all documents. For each document, a b-dimensional vector is created by adding the projections of all the features in its feature sequence.

The Merge/Purge problem was proposed by Hernandez et al. to identify duplicate records from different source databases [9, 10, 15]. All records from the different databases are sorted on important discriminating key attributes. Each time the records are sorted on a certain key attribute, records within a small neighborhood are compared with each other and near-duplicate records are identified.

In Vietnam, although local search engines develop rapidly, near-duplicate detection is still a difficult problem.

III. BACKGROUND

A. Similarity metrics

While it is unclear at which point a document is no longer a duplicate of another, researchers have examined several metrics for determining the similarity between two documents. Firstly, if a document contains roughly the same semantic content as another document, then it is a [...].

The second most common approach to determining the similarity of two documents is the cosine similarity measure. This approach represents each document as a vector in an n-dimensional space. The similarity between two documents is then defined as the cosine of the angle between the two corresponding vectors:

$$\cos(D_1, D_2) = \frac{D_1 \cdot D_2}{\lVert D_1 \rVert \, \lVert D_2 \rVert}$$

Thus, as the cosine measure of two documents approaches 1, they become more similar with respect to the features being compared. Researchers use a threshold t to decide whether documents are duplicates: two documents are called near-duplicates if their similarity measure is bigger than the set threshold t.

B. Charikar

Each document is represented by a set of features and their corresponding weights. A hash function called Simhash is used to create the page fingerprint. Each feature is projected into an l-dimensional space by randomly choosing l entries from {-1, 1}. This projection is the same for all documents. For each document, an l-dimensional vector is created by adding the projections of all the features in its feature sequence. The final vector for the document is created by setting every positive entry in the vector to 1 and every non-positive entry to 0; this is the result of the random projection for the document. It has the property that the cosine similarity of two documents is proportional to the number of bits in which the two corresponding projections agree. Thus, the similarity of two documents is the number of bits on which their projections agree. After hashing, Hamming distance is used to compute the distance among documents, and the distance is then increased to choose the most suitable value of k.

The algorithm works very well when language-specific issues are set aside. In practice it depends on feature selection and on how the corresponding weights are calculated for each language. Moreover, when the data is very large, runtime is also a big problem: with such a corpus we cannot simply compare [...], and a solution is needed to resolve this issue in specific situations.

IV. OUR APPROACH

A. A weighting scheme for Vietnamese documents

Terms play different roles in a text [...], and the weight of a feature depends on its context. For example, the same token "kiến" has completely different meanings in the following two sentences. In the sentence "kiến đang bò trên cành cây" (the ant is crawling on the tree branch), "kiến" (ant) is the subject and a noun. In the sentence "tôi muốn kiến nghị lên bà chủ tịch" (I want to petition my idea to the chairwoman), the token "kiến" is part of the word "kiến nghị" (to petition), which is a verb.
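The "kiến" example can be made concrete with a small sketch. Token-level features give both sentences the same feature "kiến", whereas word-level features separate "kiến" from "kiến nghị". The segmenter below minimises the number of words with a simple dynamic program; it is only an illustration of that fewest-words objective, not the paper's graph-based segmentation with part-of-speech scoring, and the two-entry dictionary and the `max_len` parameter are hypothetical.

```python
def syllable_tokens(sentence):
    """Token-level features: Vietnamese syllables are delimited by spaces."""
    return sentence.split()


def segment_fewest_words(sentence, dictionary, max_len=4):
    """Split a sentence into the fewest words.

    A multi-syllable word must appear in `dictionary`; any single syllable
    is always accepted, so a segmentation always exists.
    """
    syl = sentence.split()
    n = len(syl)
    best = [None] * (n + 1)  # best[i] = fewest-word split of syl[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = " ".join(syl[j:i])
            if i - j == 1 or word in dictionary:
                candidate = best[j] + [word]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]


# Hypothetical mini-lexicon, just large enough for the two example sentences.
TOY_DICTIONARY = {"kiến nghị", "chủ tịch"}

s1 = "kiến đang bò trên cành cây"
s2 = "tôi muốn kiến nghị lên bà chủ tịch"

words1 = segment_fewest_words(s1, TOY_DICTIONARY)
words2 = segment_fewest_words(s2, TOY_DICTIONARY)
```

With syllable tokens the two sentences share the feature "kiến"; after segmentation the second sentence contributes "kiến nghị" instead, so the two uses no longer collide when features are weighted.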
When selecting features, we calculate weights using the inverse document frequency (idf) weight and the order of tokens in the document. Feature selection can be done at token level, word level or n-gram level [...].

In order to do feature selection at word level, unlike in English, we need to perform word segmentation. We then collect the frequency of words in the corpus and remove words whose frequency is too high or too low [...] stop words [...]. [For words that] are of the same grammatical type and have the [same] frequency in the corpus, longer words are usually more important than shorter words.

We define the weight of a feature t as follows:

[equation illegible in the source]

where:
• $F_t$ is the frequency of feature t in the corpus.
• $F_{max}$ denotes the frequency of the most frequently used feature in the corpus.
• $F_{min}$ denotes the frequency of the least frequently used feature in the corpus.
• $P(t)$ is the position of feature t in the document. It takes one value for each area in which t occurs: the title, the first paragraph, the other paragraphs, or the last paragraph (the individual values are illegible in the source). If t is in multiple areas, then $P(t)$ is the sum of the position values.
• $\lVert t \rVert$ is the number of syllables in the feature.

B. Near-duplicate detection framework

Our framework is described in Figure 1. Each document fetched by the crawlers is represented by its identifier (DocID) and its fingerprint. A fingerprint F is created for each document identifier. We use the Hamming distance to measure the distance between a pair of fingerprints, and we divide the documents into multiple clusters. Each cluster [...] the fingerprint of the first document in the cluster. When a new document is processed, if the Hamming distance between its fingerprint and a cluster is smaller than or equal to a constant k, the document is assigned to that cluster. If the distance is bigger than k, the document forms a new cluster. Documents belonging to [...] have a higher probability of duplicating an article, and hence [...].

This method still has a problem. Suppose, for example, that documents A and B have a Hamming distance of [...] bits and are in the same cluster, while document C has a Hamming distance of [...]. To solve this problem, we propose an optimization: a document D is in a cluster if and only if the Hamming distance from D to half of the cluster's documents is smaller than k.

Input:
- A document D
- Argument k: the maximum distance between documents in a cluster

Output:
- The document fingerprint
- The cluster to which the document belongs

Preprocess:
- Fetch the new document D.
- Sort the articles by date and time in descending order, because articles with close publication dates are more likely to be near-duplicates. This reduces the number of candidates and speeds up the duplicate detection process.

Processes:
Step 1: Remove HTML tags and [...].
Step 2: Perform word segmentation and [...] statistics.
Step 3: Remove words whose frequency is too high or too low.
Step 4: Calculate the weights of the words.
Step 5: Sort the words in descending order of their weights.
Step 6: Use N-grams to create the features t_i.
Step 7: Create a fingerprint f_i for each feature t_1, t_2, ..., t_N.
Step 8: Create the fingerprint F for document D with the Simhash function.
Step 9: Calculate the Hamming distance from the document to each cluster.
Step 10: If there is a cluster that satisfies the condition, assign the document to that cluster.
Step 11: Finish.
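Steps 7 to 10 can be sketched as follows, assuming the features and their weights have already been computed. This is a minimal illustration rather than the authors' implementation: hashing features with SHA-256, the function names, and the use of "less than or equal to k" (the text uses both "smaller than" and "smaller than or equal to") are all assumptions.

```python
import hashlib


def feature_hash(feature, bits):
    """Stable `bits`-bit hash of a feature string (SHA-256 is an arbitrary choice)."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(weighted_features, bits=64):
    """Charikar-style fingerprint computed from a {feature: weight} mapping."""
    v = [0.0] * bits
    for feature, weight in weighted_features.items():
        h = feature_hash(feature, bits)
        for i in range(bits):
            # A set bit adds the feature's weight, a clear bit subtracts it.
            v[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:  # positive component -> bit 1, otherwise bit 0
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def assign(fingerprint, clusters, k):
    """Add the fingerprint to the first cluster in which at least half of the
    members lie within Hamming distance k; otherwise open a new cluster.
    Returns the index of the cluster that received the fingerprint."""
    for index, members in enumerate(clusters):
        close = sum(1 for m in members if hamming(fingerprint, m) <= k)
        if 2 * close >= len(members):
            members.append(fingerprint)
            return index
    clusters.append([fingerprint])
    return len(clusters) - 1
```

A short usage example with hand-picked 4-bit fingerprints: starting from `clusters = []`, calling `assign` with `0b0000`, `0b0001`, `0b1111` and `0b0011` at `k = 1` places the first, second and fourth fingerprints in cluster 0 and the third in a new cluster 1. The last case shows the "half of the cluster" rule at work, since `0b0011` is within distance 1 of only one of the two existing members.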
J u p l i c j t i o n o i 11) ir t ic lc a n d lìc n c c Figure Our Framework c Weighied graph ouiput is lisi ot solution \vord segmenlulion OI a sequencc salistv: number ol \vords in a sequcnce is tho smallcsl Fingerprint com putation Each document has a teature set Eacli tèaturc has thcir corresponding vvcighl and vvc use Simhash lo gcncralc un Ibil lingerprint ai* íịllovvs We inil an l-dimensiunal vecior V each o f vvhole dimensional is initial /ero A teature is liashed into an l-bil hasli valuc For eacli oí hil in ỉ-bii: ii thc i-lh bit of liash value is I ihc i-ih componenl ol V IS incremenled hy ihc wcighl oi ihal lealure ilih e i-ili hu ol the hasli value is Ihe i-th eomponent o f V is dccrcmcntcd by the weiglu o llh a l ỉcalure When all 1‘eulures hiivc hccn piutcssed Sonic componenls ol V arc po.Mli\c \vhile u i Iktn are negative The signs o f componenis dcterminc the corrcsponding hil> ol the linal íingcrprim loi ihc ilociimciil Then, tỉie lingcrprinl is calculaled Hamming distance to eacli clusler A docunieni bclong a clusicr il and onl\ il distancc liom llie document lo a liall ol duaimenls in clustcr arc smaller k Step 2: l :sc Diịksira algorithni to lìnd all shortest paths in Ci I ;ich palh is pntcntiỉil solulion Ibr scunicntinc ihc scntcncc inlo uords Step A.ssign puri ol speccli: r.uch potcntial \vord scgnicniation solulion assign part ol speeeh (l'()S ) lo cacli \vord III ihe N d ilcn cc l or cuch P O S opiion \vc culculatc llie prohiihiliiy ol'the sciưence iis lolloHs h senlencc) = n P (T , ) * Ỉ1 P (T , Tl +, T , , ) P (T,): probabilil) appcanni> part ol specch T ol uord i-lh I’í I , I I J : pm biibililN part 1)1' spcccli I I I Slaml conimuoiish in corpus : C lio o s c p ro b iib iliụ solulittn S le p \v h icli havc the in a \iim im D VVord segm entation \ EX P ER IM EN TS AND EVALUATION Word segmetilation is alxvavs an imporiani prohlcm in Processing Viclnanicsc documuiLs IVccision 1)1 ihc vuncl s c g m c n u ilm n p lia s e p la y s UI1 in ip o 
[...] role in the performance of the whole system. We use an algorithm to find the shortest path in a weighted graph for tackling this problem as follows:

Step 1: Build a weighted graph G for each sentence of length n. The graph G has n+1 vertices, so that vertex j corresponds to syllable j in the sentence. [...] if the chain of syllables from i to j makes a word in the given Vietnamese dictionary; otherwise [...]

Figure: Word segmentation

[...] fingerprints that are at a Hamming distance smaller than [...]

We have carried out experiments to evaluate the efficiency of our framework. The experiments are performed for two approaches: the first with the basic token set, which does not contain any features generated by Vietnamese segmentation and part-of-speech processing, and the second with the full feature set, including our weighting scheme. In the basic token set, features are simply tokens delimited by white spaces [...]

We have also carried out our experiments to find the best values for the following parameters: F, the length of the fingerprint, and k, the maximum distance at which two documents are a pair of near duplicates. F is varied among 32, 64 and 128, and k from 1 to 10 [...]

We use precision, recall and F-measure as metrics to evaluate near-duplicate document detection [...]:

$$R = \frac{|(\text{retrieved dup pairs}) \cap (\text{correct dup pairs})|}{|\text{correct dup pairs}|}$$

The higher the F-measure score an approach can obtain, the better the approach.

A. Experiment corpus

With the development of many Vietnamese electronic newspapers, readers are provided with numerous sources of documents. These sources, however, do not always provide fresh news: news on different websites often refers to an original article. 10000 articles have been collected from the 25 most famous Vietnamese electronic newspapers using our Vietnamese search engine. We have processed these articles to create our corpus with the following steps:

• Classify articles which have close publication dates into a group. In each group, sort the articles in decreasing order of size. We manually annotate the articles in the same group with approximately the same size to mark near-duplicate articles. From the 10000 articles we randomly select 100 non-duplicate articles and then permute their paragraphs to create new articles. The newly created articles and their originals are marked as near duplicates.
• Compute statistics of words, tokens and [...] in the corpus. These statistics will be used to calculate the weights of features in our framework. There are about [...] tokens, 2,300,000 words and 197,200,000 2-grams.

B. Result

Figure: F-measure chart using the basic feature set (series F = 32, F = 64, F = 128)

As can be seen from Table 1, with the basic feature set, which does not contain any features generated by Vietnamese processing tools, the highest F-measure is 54.2% [...]

When we apply our framework, we obtain the result below.

Table: F-measure scores for different values of F and k using our weighting scheme and Vietnamese-specific features [table values illegible in the source]
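The fingerprinting step used by the framework (a Charikar-style fingerprint of length F, compared by Hamming distance against the threshold k) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of MD5 as the per-feature hash, the example weights, and the choice of "distance <= k" as the near-duplicate test are all assumptions made for the sketch.

```python
import hashlib


def simhash(weighted_features, num_bits=64):
    """Charikar-style fingerprint of a {feature: weight} mapping (hypothetical helper)."""
    totals = [0.0] * num_bits
    for feature, weight in weighted_features.items():
        # Stable hash of the feature, truncated to num_bits bits (MD5 is an assumption).
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & ((1 << num_bits) - 1)
        for bit in range(num_bits):
            # Each bit of the feature hash votes +weight or -weight.
            totals[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(fp_a, fp_b, k):
    """Assumed decision rule: near duplicates differ in at most k bits."""
    return hamming_distance(fp_a, fp_b) <= k


if __name__ == "__main__":
    doc_a = {"tin": 2.0, "tức": 1.0, "việt nam": 3.0}
    doc_b = {"việt nam": 3.0, "tin": 2.0, "tức": 1.0}  # same features, different order
    print(is_near_duplicate(simhash(doc_a), simhash(doc_b), k=3))
```

Because the fingerprint depends only on the bag of weighted features, an article whose paragraphs have been permuted keeps the same fingerprint under this sketch, which matches how the permuted articles in the corpus are labelled.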
CONCLUSION

Experimental results prove that the Charikar algorithm combined with our weighting scheme is effective and efficient for detecting near duplicates in Vietnamese articles. [...] appropriate values of F and k, so we can apply the result to Vietnamese search engines. We will continue to improve our framework by using specific Vietnamese language processing methods in combination with hash functions.

This work has been financially supported by the research grant "Plagiarism Detection for Vietnamese documents" from Vietnam National University, Hanoi, No. QC.08.17. We thank Tinh Van Media Company for letting us use their database as well as for supporting us financially.

Son Bao Pham
KSE2009 notification
KSE2009
Sat, Jul [...] 2009 at 10:51 AM
To: Son Bao Pham <sbpham@gmail.com>

We are pleased to inform you that your submission to KSE-2009 has been accepted as a full paper for the conference. Please revise your paper to incorporate reviewer comments for camera-ready version submission. Addressing reviewer concerns in your camera-ready paper is of particular significance since the Program Committee may revisit your papers to ensure that these concerns have been adequately addressed. The page limit for a full paper is [...]. Further CMR submission and registration instructions will be sent later. We look forward to seeing you in Hanoi in October.

Best regards,
Ngoc Thanh Nguyen, The Duy Bui, Edward Szczerbicki

Paper 72
Title: A Fast Template-based Approach to Automatically Identify Primary Text Content of a Web Page

review
PAPER: 72
TITLE: A Fast Template-based Approach to Automatically Identify Primary Text Content of a Web Page
OVERALL RATING: (accept)
REVIEWER'S CONFIDENCE: (medium)
ORIGINALITY: (Moderately Original)
SIGNIFICANCE: (Very Significant)
PRESENTATION AND READABILITY: (Very Good)
RELEVANCE FOR THE CONFERENCE: (Very Relevant)
TECHNICAL QUALITY: (Seems Sound)
RECOMMEND AS SHORT PAPER/POSTER: (yes)
REVIEW
The authors introduced a fast algorithm for detecting main context blocks in web pages automatically. This seems to be a considerable improvement of a prior work called ContentExtractor algorithm. The paper is readable in general. The paper is mostly of experimental character and it is not so clear how the present algorithm can work well. It would be also nice if comparisons with different types of related algorithms are made.

review
PAPER: 72
TITLE: A Fast Template-based Approach to Automatically Identify Primary Text Content of a Web Page
OVERALL RATING: (strong accept)
REVIEWER'S CONFIDENCE: (high)
ORIGINALITY: (Moderately Original)
SIGNIFICANCE: (Very Significant)
PRESENTATION AND READABILITY: (Very Good)
RELEVANCE FOR THE CONFERENCE: (Very Relevant)
TECHNICAL QUALITY: (Technically Sound)
RECOMMEND AS SHORT PAPER/POSTER: (no)
REVIEW
As a web page contains not only informative content but also non-informative content such as advertisement, navigation links, etc., it is important for a search engine to extract just the informative part of the web pages it wants to search. The authors propose FastContentExtractor as an extension of ContentExtractor, an earlier effective technique used to extract the informative content from a website. The idea is based on detecting the web block template that is common among the site's pages. Using this template, informative blocks can be extracted quickly. The idea is somewhat novel and the experiment with real websites shows a significant improvement over ContentExtractor. The paper is well-presented. One thing not very clear is the algorithm used to detect the template. The paper does not indicate clearly whether this is done manually or by an algorithm. I think the authors mean an algorithm but more details would be appreciated.

review
PAPER: 72
TITLE: A Fast Template-based Approach to Automatically Identify Primary Text Content of a Web Page
OVERALL RATING: (strong accept)
REVIEWER'S CONFIDENCE: (high)
ORIGINALITY: (Very Original)
SIGNIFICANCE: (Very Significant)
PRESENTATION AND READABILITY: (Excellent)
RELEVANCE FOR THE CONFERENCE: (Very Relevant)
TECHNICAL QUALITY: (Seems Sound)
RECOMMEND AS SHORT PAPER/POSTER: (no)
REVIEW
The paper presents an approach to identify primary text content from web pages. A very well written paper, clear, concise, with good examples. The authors should be very pleased with the outcome of this paper. The approach is well thought out and provides substantial benefits as stated. The evaluation, which is the next most important aspect, is very well conducted and explained. Beware of the very few grammatical errors (e.g. by a traversing path) and typographical errors (e.g. guanties).

A Fast Template-based Approach to Automatically Identify Primary Text Content of a Web Page

Dat Quoc Nguyen, Dai Quoc Nguyen, Son Bao Pham, The Duy Bui
Human Machine Interaction Laboratory, College of Technology, Vietnam
National University, Hanoi

Abstract: Search engines have become an indispensable tool for browsing information on the Internet. The user, however, is often annoyed by redundant results from irrelevant web pages. One reason is that search engines also look at non-informative blocks of web pages such as advertisements, navigation links, etc. In this paper, we propose a fast algorithm called FastContentExtractor to automatically detect main content blocks in a web page by improving the ContentExtractor algorithm. By automatically identifying and storing templates representing the structure of content blocks in a website, content blocks of a new web page from the website can be extracted quickly. The hierarchical order of the output blocks is also maintained, which guarantees that the extracted content blocks are in the same order as the original ones.

Keywords: data mining, template detection, web mining

Nowadays search engines have become an indispensable tool for browsing information on the Internet. While there are many useful search engines available, users are still annoyed by redundant results from irrelevant web pages. One of the reasons is that web pages often contain non-informative blocks such as advertisements, links, etc. A search engine which lacks effective content block detection capacity often searches in non-informative blocks and therefore produces redundant results.

A block in a web page is often defined as a part of a web page surrounded by an open tag such as <TABLE> [...] and a matching close tag [11][12]. Detecting which blocks are primary text content blocks not only brings efficiency in storage for a search engine but also improves search efficiency, increasing users' satisfaction. Manually marking content blocks is not a feasible solution for a search engine. In this paper we consider the task of automatically detecting content blocks in a web page.

Web pages of the same website usually have similar structures. Furthermore, non-content blocks are often situated in fixed positions. Utilizing those observations, content blocks in a web page can be automatically detected. At present, several methods have been proposed to tackle this problem, including ContentExtractor by Debnath et al. [11][12], the noise elimination method by Yi et al. [...], and InfoDiscoverer by Lin and Ho [13]. Among them, ContentExtractor appears to be the most effective algorithm to extract primary content blocks. For a web page, ContentExtractor finds content blocks by comparing each of its blocks with all blocks of web pages from the same website. The main disadvantage of this algorithm is that it is quite slow when the number of input web pages is large. Moreover, the ContentExtractor algorithm does not preserve the hierarchical order of the output blocks: the extracted content blocks may not appear in the same order as the original ones. This might prevent the search engine from correctly searching for an exact phrase when the phrase spans two consecutive blocks.

In this paper we propose FastContentExtractor, a fast algorithm to automatically detect content blocks in web pages by improving ContentExtractor. Instead of storing all input web pages of a website, we automatically create a template to store information about content blocks and possibly wrongly detected blocks for later retrieval. Each block in a web page can be identified, although not always uniquely, by a traversal path in a hierarchical tree of blocks which represents the web page. A template contains the absolute paths of content blocks and of non-content blocks having the same paths as those of content blocks. By storing the absolute paths, the hierarchical order of the output blocks is maintained, which guarantees that the extracted content blocks are in the same order as the original ones. A template for each website is stored; each newly crawled web page is compared with the template to find its primary content blocks. The number of extracted blocks and comparisons in FastContentExtractor is much smaller than in ContentExtractor, which makes FastContentExtractor faster than ContentExtractor.

The rest of the paper is organized as follows. We summarize related materials and methods in Section II. In Section III we describe our approach. Some experiments are presented in Section IV in order to show the performance of our approach.

Several methods have been proposed to detect content blocks or non-content blocks in web pages automatically. Yi et al. [...] have proposed a new tree structure, called the Site Style Tree (SST), [...] from the DOM trees of different web pages in a website. An SST is [...] for each website based on pages from the same website. Yi et al. also presented formulas for calculating the importance of each node in the SST, which helps to eliminate noisy information and [...] primary content. The problem of this approach appears when the number of input web pages is large [...]

[...] By visually separating web pages into blocks based on vertical and horizontal lines, they calculated the block frequency for each block. If the block frequency value of a block is high, it is a template block, which is then labeled for building a template model. Mehta and Madaan [10] presented an approach using a regex-based template. By segmenting web pages based on the template, they could detect important sections. Vieira et al. [...] used tree mapping together with the RTDM-TD Algorithm and the Retrieve Template Algorithm for detecting the template. Lin and Ho [13] introduced a method to identify content blocks by partitioning a web page into blocks based on the <TABLE> tag. Entropy values of the terms appearing in each block are calculated and used for determining content blocks.

ContentExtractor [11][12] appears to be the most effective algorithm to identify primary informative content blocks. The input of this algorithm is a set of web pages that are assumed to have similar structure. First, the algorithm partitions each page into atomic blocks. An atomic block is a block that does not contain any other block. In the next step, for an atomic block B, the algorithm calculates the number of web pages that contain a block similar to B. If block B occurs many times over different web pages, block B is considered a non-content block and it is removed. Otherwise, block B is considered a primary content block. Figure 1 shows a block with a corresponding [...] tag of a web page. This block contains four atomic sub-blocks with corresponding tags (see the source code in Figure 2). ContentExtractor then partitions the block into five blocks [...]
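The block-frequency idea behind ContentExtractor described above can be illustrated with a short sketch. It is deliberately simplified and is not the published algorithm: blocks are plain strings, "similar" is reduced to exact equality, and the `max_page_fraction` threshold and the function name are hypothetical choices made only for this example.

```python
from collections import Counter


def find_content_blocks(pages, max_page_fraction=0.5):
    """Keep blocks that appear on at most max_page_fraction of the pages.

    pages: list of pages, each given as a list of atomic-block strings.
    Exact string equality stands in for the block-similarity test (assumption).
    """
    page_count = Counter()
    for blocks in pages:
        # Count each distinct block once per page, not once per occurrence.
        page_count.update(set(blocks))
    limit = max_page_fraction * len(pages)
    return [[b for b in blocks if page_count[b] <= limit] for blocks in pages]


if __name__ == "__main__":
    site = [
        ["nav", "story A", "footer"],
        ["nav", "story B", "footer"],
        ["nav", "story C", "footer"],
    ]
    print(find_content_blocks(site))
```

In this toy input the navigation and footer blocks occur on every page and are dropped, while each story block occurs on a single page and is kept. Every page must be compared against the block counts of the whole input set, which is the cost that grows with the number of input pages.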


Date posted: 18/03/2021, 17:38
