VIETNAM NATIONAL UNIVERSITY, HA NOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
A UNIFIED PLAGIARISM DETECTION FRAMEWORK FOR VIETNAMESE DOCUMENTS
By NGUYEN XUAN TOI
Supervised by Dr. PHAM BAO SON
A thesis submitted in partial fulfillment for the degree of Master of Information Technology
In the
Faculty of Information Technology, University of Engineering and Technology
HA NOI - 2010
Table of Contents

List of Figures
List of Tables
1 Introduction
2 Literature Review
   2.1 Concept of chunk in Vietnamese text
   2.2 Strategy of chunk selection
   2.3 Comparison methods
   2.4 Some Plagiarism Detection Systems
3 System architecture
   3.1 Parsing module
   3.2 Comparing module
   3.3 Resulting showing module
4 Experimental results
   4.1 Experiment with Vietnamese corpus
      4.1.1 Data collection
      4.1.2 Objective
      4.1.3 Implementation
      4.1.4 Result
   4.2 Experiment with PAN corpus
      4.2.1 Data collection
      4.2.2 Objective
      4.2.3 Implementation
      4.2.4 Result
   4.3 Experiment with corpus of P. Clough and M. Stevenson
Chapter 1
Introduction
Recently, with advances in technology, the Internet and digital libraries have provided users easier on-line access to digitized news, articles, magazines, books and other information of interest. Word processors have also become more sophisticated and faster. In this environment, users may cut and paste or modify pre-existing documents from many different sources and redistribute the information illegally much more easily. Regarding this matter, Tripolitis (2002) comments: "When you read the information you have collected, understand its meaning, express your own point of view on the subject based on this information and clearly reference all your resources, then you will not be in danger of being accused of plagiarism."

Table 1.1 shows an example of plagiarized documents. The netnews in the left column was published on the dantri website at 2:25 PM, 25/08/2009 and the netnews in the right column was published on the VOV website at 5:12 PM, 25/08/2009. We find that the netnews on the VOV website is cut and pasted from the netnews on the dantri website: both contain nearly the same content and differ only in a few words. In Table 1.2, the two netnews also share the same content; this is another example of plagiarism, in which the netnews on the laodong website is cut and pasted from the hanoimoi website. In fact, two authors cannot independently write two netnews with so many similar words about one event. The problem is even worse when entire paragraphs are copied.
From: http://dantri.com.vn/c76/s82-346059/nha-dau-tu-nuoc-ngoai-thich-linh-vuc-an-uong.htm ; Thứ Ba, 25/08/2009 - 2:25 PM

(Dantri) - Trong 8 tháng đầu năm 2009, các nhà đầu tư nước ngoài đã đăng ký đầu tư vào Việt Nam 10,453 tỷ USD, giải ngân vốn FDI đạt 6,5 tỷ USD, bằng 91,5% so với cùng kỳ năm 2008 và đạt hơn 72% kế hoạch năm. Trong con số này, vốn từ nước ngoài khoảng 5,5 tỷ USD. Về cấp giấy chứng nhận đầu tư, trong 8 tháng cả nước có 504 dự án mới được cấp giấy chứng nhận đầu tư với tổng vốn đăng ký 5,625 tỷ USD, bằng 10,8% so với cùng kỳ. Tuy vốn đăng ký cấp mới giảm nhưng lượng vốn tăng thêm của các dự án đã đầu tư giai đoạn trước lại tăng hơn so với cùng kỳ năm 2008. Cụ thể có 149 dự án đăng ký tăng vốn đầu tư với tổng vốn tăng thêm là 4,828 tỷ USD, tăng 3,8% so cùng kỳ.

From: http://vovnews.vn/Home/8-thang-thu-hut-duoc-hon-104-ty-USD-von-FDI/20098/120134.vov ; 5:12 PM, 25/08/2009

(VOV) - Sáng 25/8, Cục trưởng Cục Đầu tư nước ngoài Phan Hữu Thắng cho biết, trong 8 tháng qua, tính cả vốn cấp mới và tăng thêm, các nhà đầu tư nước ngoài đã đăng ký đầu tư vào Việt Nam 10,453 tỷ USD, giải ngân vốn FDI đạt 6,5 tỷ USD. Theo báo cáo của Cục Đầu tư nước ngoài, về cấp giấy chứng nhận đầu tư, trong 8 tháng qua cả nước có 504 dự án được cấp mới giấy chứng nhận đầu tư với tổng vốn đăng ký 5,625 tỷ USD, bằng 10,8% so với cùng kỳ. Tuy vốn đăng ký cấp mới giảm nhưng lượng vốn tăng thêm của các dự án đã đầu tư giai đoạn trước lại tăng hơn so với cùng kỳ năm 2008. Trong 8 tháng đầu năm, có 149 dự án đăng ký tăng vốn đầu tư với tổng vốn tăng thêm là 4,828 tỷ USD, tăng 3,8% so cùng kỳ. "Điều này thể hiện niềm tin của các nhà đầu tư vào khả năng phục hồi và tiềm năng phát triển của nền kinh tế Việt Nam", Cục Đầu tư nước ngoài nhận định.

TABLE 1.1: Example of plagiarized netnews (two Vietnamese news reports on the same FDI figures)
So what is plagiarism? "Plagiarism, broadly defined, is the use of words or ideas of another without giving proper credit" [Guiliano, 2000]. "There is general agreement that a word-for-word copy of an entire document is plagiarism" [Noynaert, 2009]. "When the work of someone else is reproduced without acknowledging the source, this is known as plagiarism" [Clough et al., 2002]. "Unacknowledged copying of documents or programs" [Joy and Luck, 1999]. Pinto [2009] added that plagiarism can also happen when one document or source is translated into another language and then reused.
Hence, there is a need for techniques to protect original documents.
Today, there exist some techniques to address this issue. We can divide them into two categories: copy prevention and copy detection [Shivakumar and Molina, 1995, Brin et al., 1995, Si et al., 1997]. Copy prevention schemes may rely on physical isolation or software isolation of documents. For example, we can place documents on a private CD system where users can only look for and view the documents but cannot add or delete any data. We can also use software to create documents which users cannot copy, paste or print; for example, Acrobat software can create PDF documents which users can only view, without being able to copy and paste or to insert and delete. However, with new technology it is not difficult to break this protection, so users can still copy and distribute the original documents. In addition, users can view a document and rewrite it in their own words, without needing to copy and paste. In some cases we cannot apply this approach at all, because we want to provide users with access to the documents over the internet.

The other approach can be referred to as signature-based schemes. A "signature" is added to the document, and this signature can be used to show whether documents are original or not. For example, one popular approach used in word documents is watermarks: when we create a new document, we insert watermarks into it [Brassil et al., 1995]. However, signature schemes have some weaknesses. First, the "signature" can easily be removed, which causes documents to become untraceable. Second, we cannot find documents which are only partially copied. Third, users can view the text and rewrite it in their own documents.
A challenging question is how to give users access to many digital libraries and different sources while protecting our original documents at the same time. To address these difficulties, there is a better approach: building a plagiarism detection system. Researchers have divided plagiarism detection into two categories: the first class is external plagiarism detection, and the second class is intrinsic plagiarism detection.

For external plagiarism detection, original documents are registered and stored in a repository. Subsequent documents are compared against the pre-registered documents.
For intrinsic plagiarism detection, we can identify whether a document is plagiarized or new by detecting writing-style breaches [Eissen and Stein, 2006, Feiguina and Hirst, 2007, Stein et al., 2008, Potthast et al., 2009]. Normally, each person has his own writing style, characterized for example by average sentence length, average paragraph length, sentence structure, richness of vocabulary, etc. So we can identify plagiarized portions of text by the difference between their style and the style of the rest of the document.
Plagiarism detection is a very important matter. It concerns many people, not only in education but also in many other fields, for example music, software, and so on. Many methods have therefore been invented to address this problem in different fields. Especially in education, a series of tools have been created to identify whether or not a student copies all or parts of an assignment from another student. However, it is very difficult to decide which is the best algorithm or the best tool. Each of them has its own advantages and disadvantages: an approach may be effective in one domain but not in another. All methods usually measure text similarity. Some important methods are presented in the later sections of this thesis.
So how do we find the most effective method in a new domain automatically? In this thesis, we propose a unified plagiarism detection framework for Vietnamese documents. This framework can automatically identify which approach is the most effective in a new domain, and it can check whether a document is copied or not. The framework is an external plagiarism detection system. Besides, the framework can identify which parameters are effective with each approach (e.g. whether chunks are 1-gram, 2-gram, or 3-gram, and so on). This matters because word segmentation in Vietnamese documents and English documents is different, so in this thesis we also want to test whether word segmentation is important in detecting plagiarism for Vietnamese documents. Three important methods are included in the framework: the Overlap, Cosine and GST methods.
This thesis consists of five chapters. In Chapter 2 we review the related work. We describe and discuss the three methods chosen to compare similar documents which are included in our framework. In particular, we discuss some strategies for choosing chunks, and we describe some existing tools and why we do not use them. In Chapter 3 we introduce the architecture and describe the function of the modules in the framework. Chapter 4 presents the process of collecting input data for the system and the steps and results of each experiment. Finally, Chapter 5 concludes the thesis.
From: http://www.laodong.com.vn/Home/Vu-New-Century-Nhieu-bi-cao-thua-nhan-pham-toi-do-ham-choi/20098/152629.laodong ; Cập nhật: 8:29 AM, 25/08/2009

(LD) - Bị cáo Nguyễn Đại Dương, trú tại phường Đội Cấn, quận Ba Đình, TP. Hà Nội, chủ vũ trường New Century - bị truy tố về tội "Kinh doanh trái phép" theo điểm c, khoản 2, Điều 159 Bộ luật Hình sự. 7 khách tới chơi tại vũ trường là: Trần Thị Thanh, Lê Anh Tuấn, Trương Thị Thu Hiền, Đào Phương Trí, Nguyễn Tuấn Trung, Lê Quốc Vượng, Lê Thị Kim Anh bị truy tố về tội "mua bán trái phép chất ma túy" hoặc "tàng trữ trái phép chất ma túy". Vụ án được phát hiện vào lúc 1h sáng 28.4.2007, lực lượng công an đã ập vào vũ trường New Century, tạm giữ hơn 1.000 đối tượng, thu được số lượng lớn ma túy tổng hợp. Kết quả test nhanh cho thấy, hơn 200 người có phản ứng dương tính với ma túy. Ngay khi bước vào phiên xét xử, Nguyễn Đại Dương đã đề nghị tòa cần phải triệu tập anh Lý Hồng Linh - nhân viên phục vụ bàn tầng 1 của vũ trường New Century ra để được đối chất, vì theo bị cáo Dương thì anh này đã có những lời khai bất lợi cho các bị cáo do bị mớm cung. Tuy nhiên, đại diện VKSND cho rằng không cần thiết phải có mặt anh Linh tại phiên xét xử, vì anh này đã có đầy đủ lời khai trước CQĐT.

From: http://www.hanoimoi.com.vn/Xet-xu-so-tham-vu-vu-truong-New-Century/3122406.epi ; Cập nhật: 06:38 AM, 25/08/2009

(HNM) - Theo đại diện VKS giữ quyền công tố tại phiên tòa, bị cáo Nguyễn Đại Dương, trú tại phường Đội Cấn, quận Ba Đình, TP. Hà Nội bị truy tố tội "Kinh doanh trái phép" theo điểm c khoản 2, Điều 159 - Bộ luật Hình sự. 7 "quý khách" tới chơi tại vũ trường là Trần Thị Thanh, Lê Anh Tuấn, Trương Thị Thu Hiền, Đào Phương Trí, Nguyễn Tuấn Trung, Lê Quốc Vượng, Lê Thị Kim Anh bị truy tố về tội "Mua bán trái phép chất ma túy" hoặc "Tàng trữ trái phép chất ma túy". Trước đó, khoảng 1h sáng 28-4-2007, vũ trường New Century, một điểm chơi đêm đình đám nhất Hà Nội bị lực lượng CA bất ngờ đột kích, tạm giữ hơn 1.000 đối tượng, thu được số lượng lớn ma túy tổng hợp. Kết quả test nhanh cho thấy, hơn 200 người có phản ứng dương tính với ma túy. Hôm qua 24-8, ngay từ đầu phiên tòa, Nguyễn Đại Dương đã có yêu cầu triệu tập thêm anh Lý Hồng Linh, nhân viên phục vụ bàn tầng 1 của vũ trường New Century ra làm nhân chứng. Theo bị cáo Dương, anh này đã bị mớm cung. Xét thấy tại cơ quan điều tra đã có lời khai của Lý Hồng Linh nên VKS cho rằng không cần thiết phải có mặt anh Linh tại phiên xét xử. Phần xét hỏi nhóm khách có hành vi mua bán, tàng trữ ma túy trong đêm vũ trường New Century bị đột kích đã mở màn cho phần xét hỏi. 7 bị cáo đều biện minh cho tội trạng bằng lý do tuổi trẻ, ham chơi. Riêng Nguyễn Đại Dương cho rằng mình vô tội (?). Phiên tòa sẽ kết thúc vào ngày 28 tới.

TABLE 1.2: Example of plagiarized netnews (two Vietnamese news reports on the same court case)
Chapter 2
Literature Review
In this chapter, we discuss the existing related literature and background research. The concept of a chunk in Vietnamese text and some strategies for selecting chunks are also presented. Some popular approaches which measure the similarity between two documents are introduced and analyzed. They are (1) the Overlap method, which is based on set theory, (2) the Cosine method, which is based on the vector space model, and (3) GST - Greedy String Tiling, which is based on substring matching. These methods have been used successfully in many fields, for example plagiarism and copy detection, information retrieval, and tracking similarity between files. Finally, we present and assess some existing tools which are related to our framework.
2.1 Concept of chunk in Vietnamese text
In the Overlap and Cosine methods (introduced later), each document is split into chunks before it is compared with other documents. A chunk is one or more successive syllables. In English, a word is a syllable, but in Vietnamese a word may be one or more successive syllables. So a chunk may be a syllable, two successive syllables, three successive syllables, etc., or a word or a sentence in a Vietnamese document. For example, in the following sentence: "Xử lý ngôn ngữ tự nhiên là một lĩnh vực rất khó", if a chunk is a word in Vietnamese then "xử lý", "ngôn ngữ tự nhiên", "là", "một", "lĩnh vực", "rất", "khó" are the chunks of the above sentence. We find that chunks represent the content of the document, and we can determine whether a document is plagiarized or not by comparing chunks.
2.2 Strategy of chunk selection
To determine how documents are split into chunks, we first consider the way of choosing the chunks. In designing a chunking process, there are three factors that need to be taken into consideration. The first is the size of the chunk, which is the number of syllables in a chunk. The second is the number of chunks which are used to build the set of chunks of one document. The third is the choice of the algorithm used to select syllables from the document. There are several strategies for selecting chunks [Schleimer, Wilkerson, and Aiken, 2003, Heintze, 1996], but we only discuss three popular strategies, sketched in code after this list:

- (A) One chunk equals n successive syllables, overlapping n-1 syllables. In this strategy, every substring of size n in the document is selected: the k-th chunk and the (k+1)-th chunk overlap in n-1 syllables. For example, we have a document ABCDEFGH where letters represent syllables. If we select n = 3 then the chunks of this document are: ABC, BCD, CDE, DEF, EFG and FGH. This strategy produces the largest number of chunks and is expensive for document storage. However, it could be expected to be the most effective strategy because it uses every substring of the document.

- (B) One chunk equals n successive syllables, with no overlapping syllables. This strategy is similar to strategy (A) but does not select overlapping substrings; that means the k-th chunk and the (k+1)-th chunk do not overlap. For example, with the same document and n = 3, the chunks are ABC, DEF and GH.

- (C) One chunk equals n successive syllables starting at the k-th syllable of a sentence. The chunks are created by selecting n successive syllables beginning at the k-th syllable of each sentence of the document.
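As referenced above, the three strategies can be sketched in Python as follows. This is a minimal illustration under the assumption that a document is given as a list of syllables; the function names are ours, not part of any existing tool:

    # Strategy (A): chunks of n successive syllables, overlapping n-1 syllables.
    def chunks_overlapping(syllables, n):
        return [syllables[i:i + n] for i in range(len(syllables) - n + 1)]

    # Strategy (B): chunks of n successive syllables with no overlap.
    def chunks_non_overlapping(syllables, n):
        return [syllables[i:i + n] for i in range(0, len(syllables), n)]

    # Strategy (C): one chunk of n successive syllables starting at the k-th
    # syllable of a sentence (k counted from 1).
    def chunk_at(sentence_syllables, k, n):
        return sentence_syllables[k - 1:k - 1 + n]

    doc = list("ABCDEFGH")                 # letters stand for syllables
    print(chunks_overlapping(doc, 3))      # ABC, BCD, CDE, DEF, EFG, FGH
    print(chunks_non_overlapping(doc, 3))  # ABC, DEF, GH
    print(chunk_at(doc, 3, 3))             # CDE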
2.3 Comparison methods
There are several methods which have been used for plagiarism and copy detection. In this chapter, we discuss and describe three important methods which have been used successfully in several fields by other researchers: Overlap, Cosine and Greedy String Tiling (GST). Each method has its own advantages and disadvantages, and a method may be effective in one domain but not work in another.
2.3.1 Overlap method
The first way of measuring similarity which we want to present is Overlap. Although this method is quite simple, it is useful. This measure is widely used in IR systems [Salton, 1992]: when the user gives a query to the system, the system searches its database to return documents related to the query. Similarly, we need to establish a metric that measures the overlap between an incoming document and a pre-registered document.

Let A, B refer to generic documents (registered or new). Suppose A is split into m chunks. We denote them by $t_{A,1}(w_{A,1}, s_{A,1}), t_{A,2}(w_{A,2}, s_{A,2}), \ldots, t_{A,m}(w_{A,m}, s_{A,m})$, where:

- $t_{A,i}$ : the $i$-th chunk in document A
- $w_{A,i}$ : the string in chunk $t_{A,i}$
- $s_{A,i}$ : the occurrence frequency of $w_{A,i}$ in document A
Similarly, document B is split into n chunks, denoted $t_{B,1}(w_{B,1}, s_{B,1}), t_{B,2}(w_{B,2}, s_{B,2}), \ldots, t_{B,n}(w_{B,n}, s_{B,n})$.

Suppose there are k similar chunks in both documents A and B. We denote them by $t_{O,1}(w_{O,1}, s'_{A,1}, s'_{B,1}), t_{O,2}(w_{O,2}, s'_{A,2}, s'_{B,2}), \ldots, t_{O,k}(w_{O,k}, s'_{A,k}, s'_{B,k})$, where, for i from 1 to k:

- $t_{O,i}$ : the $i$-th similar chunk in both documents A and B
- $w_{O,i}$ : the string in chunk $t_{O,i}$
- $s'_{A,i}$ : the occurrence frequency of $w_{O,i}$ in document A
- $s'_{B,i}$ : the occurrence frequency of $w_{O,i}$ in document B
Let $S_i = \min(s'_{A,i}, s'_{B,i})$ for all i from 1 to k. Let Overlap(A, B) denote the Overlap value between document A and document B, computed as follows:

$$\mathrm{Overlap}(A, B) = \frac{\sum_{i=1}^{k} S_i}{\sum_{i=1}^{m} s_{A,i}}$$

Similarly, let Overlap(B, A) denote the Overlap value between document B and document A, computed as follows:

$$\mathrm{Overlap}(B, A) = \frac{\sum_{i=1}^{k} S_i}{\sum_{i=1}^{n} s_{B,i}}$$

We find that the Overlap measure between document A and B is the quotient of the number of shared chunks (in both documents A and B) and the total number of chunks of document A. Similarly, the Overlap measure between document B and A is the quotient of the number of shared chunks and the total number of chunks of document B. After Overlap(A, B) and Overlap(B, A) are computed, we denote by S(A, B) the Overlap measure between documents A and B, computed as follows:
$$S(A, B) = \max\left(\mathrm{Overlap}(A, B),\ \mathrm{Overlap}(B, A)\right)$$
This measure ranges from 0 to 1. It indicates the proportion of shared chunks in the smaller document. Clearly, this method is simple and it is not normalized with respect to the sizes of the two documents. In some cases the Overlap measure is effective. For example, consider two sets, one containing 20 elements and the other 200 elements: if the intersection is 10 elements, this would account for 50% similarity.
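A minimal Python sketch of this Overlap measure (an illustrative helper of ours, assuming each document is already given as a list of chunk strings):

    from collections import Counter

    def overlap_similarity(chunks_a, chunks_b):
        # Frequencies s_{A,i}, s_{B,i} of every chunk string.
        freq_a, freq_b = Counter(chunks_a), Counter(chunks_b)
        # Sum of S_i = min(s'_{A,i}, s'_{B,i}) over chunks shared by A and B.
        shared = sum(min(freq_a[c], freq_b[c]) for c in freq_a.keys() & freq_b.keys())
        overlap_ab = shared / sum(freq_a.values())  # divided by total chunks of A
        overlap_ba = shared / sum(freq_b.values())  # divided by total chunks of B
        return max(overlap_ab, overlap_ba)          # S(A, B)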
But if we use the Cosine coefficient on the same two sets, the similarity measure may be very low.

2.3.2 Cosine similarity measure
Another popular similarity measure is the Cosine measure. Let us denote the set of chunks for document A as a vector with length n, $V_A = [w_{A,1}, w_{A,2}, w_{A,3}, \ldots, w_{A,n}]$, and their associated occurrence frequencies as another vector, $W_A = [s_{A,1}, s_{A,2}, s_{A,3}, \ldots, s_{A,n}]$, where $w_{A,i}$ and $s_{A,i}$ represent the string in the $i$-th chunk and its occurrence frequency in document A respectively. Similarly, we denote the set of chunks and their associated occurrence frequencies in document B as vectors $V_B = [w_{B,1}, w_{B,2}, w_{B,3}, \ldots, w_{B,m}]$ and $W_B = [s_{B,1}, s_{B,2}, s_{B,3}, \ldots, s_{B,m}]$ respectively. The similarity between $V_A$ and $V_B$ is determined in two steps:

Step 1: Since $V_A$ and $V_B$ might contain different numbers of elements, they must be normalized to the same length. A reference vector R is first generated as the union of the elements in both $V_A$ and $V_B$, i.e., $R = (V_A \cup V_B) = [w_{R,1}, w_{R,2}, \ldots, w_{R,k}]$ with $k \le m + n$. Let us denote the corresponding normalized vector of $V_A$ as $X_A = [x_{A,1}, x_{A,2}, x_{A,3}, \ldots, x_{A,k}]$, where $x_{A,i}$ is defined as follows:

$$x_{A,i} = \begin{cases} 0 & \text{if } w_{R,i} \notin V_A \\ s_{A,j} & \text{if } w_{R,i} = w_{A,j} \in V_A \end{cases}$$

Similarly, the normalized vector $X_B = [x_{B,1}, x_{B,2}, x_{B,3}, \ldots, x_{B,k}]$ of $V_B$ can be computed.
Step 2: the similarity $S(V_A, V_B)$ between $V_A$ and $V_B$ can be simply computed as follows:

$$S(V_A, V_B) = \frac{\sum_{i=1}^{k} x_{A,i} \, x_{B,i}}{\sqrt{\sum_{i=1}^{k} x_{A,i}^2} \cdot \sqrt{\sum_{i=1}^{k} x_{B,i}^2}}$$

Essentially, $S(V_A, V_B)$ is the cosine of the angle between the two chunk vectors in the k-dimensional space. If the angle is small, the value of $S(V_A, V_B)$ is large and the similarity measure of the two documents is higher.
This measure is simple, and it proves robust across collections. Because it is normalized, it can easily be used in document classification across queries. Besides, it has performed well with respect to other approaches. There are some problems with a vector-space model, including: terms are assumed to be independent, there is no theoretical basis for the assumption of a term space, and there is a lack of justification for the choice of term weights [Griswold, 1993].
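A corresponding Python sketch of the Cosine measure over chunk frequency vectors (illustrative only; the reference vector R is implicit in the union of chunk keys):

    import math
    from collections import Counter

    def cosine_similarity(chunks_a, chunks_b):
        xa, xb = Counter(chunks_a), Counter(chunks_b)
        # Chunks outside the intersection contribute 0 to the dot product, so
        # iterating over shared keys is equivalent to using the full normalized
        # vectors X_A and X_B over the reference vector R.
        dot = sum(xa[c] * xb[c] for c in xa.keys() & xb.keys())
        norm_a = math.sqrt(sum(v * v for v in xa.values()))
        norm_b = math.sqrt(sum(v * v for v in xb.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0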
2.3.3 Greedy String Tiling
The third method discussed and described in this chapter is Greedy String Tiling (GST) [Wise, 1993]. This algorithm aims at detecting the longest possible common strings between two documents (called the text and the pattern). We can measure the similarity of these two documents based on these longest common strings. For example, suppose we have the following two strings:
T string:

"Thanh tra Sở Y tế tiếp tục kiểm tra tại cơ sở Hồng Hoa Trường Lạc cho thấy, sản phẩm trân châu của công ty có công bố chất lượng. Tuy nhiên, cơ sở sản xuất chưa được cấp chứng nhận đủ điều kiện ATVSTP, công nhân trực tiếp sản xuất chưa tuân thủ đầy đủ quy định về ATVSTP. Thanh tra Sở Y tế Hà Nội đã tạm đình chỉ hoạt động sản xuất cơ sở này."

P string:

"Thanh tra Sở Y tế tiếp tục kiểm tra tại cơ sở sản xuất này, phát hiện chưa được cấp chứng nhận đủ điều kiện ATVSTP nên đã tạm đình chỉ hoạt động."
The longest possible common strings between the two documents are: "Thanh tra Sở Y tế tiếp tục kiểm tra tại cơ sở"; "sản xuất"; "chưa được cấp chứng nhận đủ điều kiện"; "đã tạm đình chỉ hoạt động".
Before presenting GST in detail, some terms are defined.

Maximal-match is the number of syllables of a longest possible common substring starting at position p of the pattern and position t of the text. For example, at start position 1 of the above P string and start position 1 of the above T string, the maximal-match is 12 (the number of syllables of the substring "Thanh tra Sở Y tế tiếp tục kiểm tra tại cơ sở").

Tiles are the substrings which are the longest common substrings between the text and the pattern. For example, the tiles of the above T string and P string are "Thanh tra Sở Y tế tiếp tục kiểm tra tại cơ sở"; "sản xuất"; "chưa được cấp chứng nhận đủ điều kiện"; "đã tạm đình chỉ hoạt động".

Minimum-match-length is an integer threshold below which maximal-matches are ignored. It means that the length of each tile is always greater than or equal to the minimum-match-length. For example, if the minimum-match-length is 3 then all two-syllable tiles are ignored.
A pseudo-code of the algorithm, as described by Prechelt et al. [2000], is given in Table 2.1:
According to the above pseudo-code, we find that the algorithm is separated into two phases. The first phase runs from line 7 to line 22 (called scanpattern in [Wise, 1993]). This phase searches for all maximal-matches. First, it sets the maxmatch variable equal to the Minimum-match-length value (line 5). In this phase, only unmarked tokens are processed. If a match is found between token $A_a$ of string A and token $B_b$ of string B, then this match is extended as far as possible; the extension finishes when a mismatch or a marked token is found (line 12). The length of this match, j, is compared with maxmatch. If they are equal, the match is added to the set of matches; if j is greater than maxmatch, the set of matches is replaced by this match alone and maxmatch is set to j.
 1  Greedy-String-Tiling(String A, String B) {
 2      tiles = {};
 3      do
 4      {
 5          maxmatch = Minimum-match-length;
 6          matches = {};
 7          Forall unmarked tokens A_a in A
 8          {
 9              Forall unmarked tokens B_b in B
10              {
11                  j = 0;
12                  while (A_{a+j} == B_{b+j} and Unmarked(A_{a+j}) and Unmarked(B_{b+j}))
13                      j = j + 1;
14                  if (j == maxmatch)
15                      matches = matches ⊕ match(a, b, j);
16                  else if (j > maxmatch)
17                  {
18                      matches = { match(a, b, j) };
19                      maxmatch = j;
20                  }
21              }
22          }
23          Forall match(a, b, maxmatch) in matches
24          {
25              For j = 0 to (maxmatch - 1)
26              {
27                  mark_token(A_{a+j});
28                  mark_token(B_{b+j});
29              }
30              tiles = tiles ∪ { match(a, b, maxmatch) };
31          }
32      } while (maxmatch > Minimum-match-length);
33      return tiles;
34  }

TABLE 2.1: The Greedy String Tiling algorithm for finding maximal matches between string A and string B
The second phase runs from line 23 to line 31 (called markarrays in [Wise, 1993]). This phase stores the tiles (line 30) which were obtained in the previous phase and marks the tokens belonging to these tiles (lines 25-29). When a token is marked, it cannot be used again. The algorithm finishes when there are no more matches longer than or equal to the minimum-match-length.
While the two above algorithms (Overlap and Cosine) use fixed-length chunks, this algorithm uses variable-length chunks. We hope that using this kind of chunk may be better than fixed-length chunks in some domains.
Once we have the set of tiles produced by GST, the similarity Sim(A, B) between the two documents can be quantified by the following two formulas:

$$\mathrm{coverage}(tiles) = \sum_{match \in tiles} \mathrm{len}(match)$$

$$\mathrm{Sim}(A, B) = \frac{2 \times \mathrm{coverage}(tiles)}{\|A\| + \|B\|}$$

where:

- A, B: the two input strings
- tiles: the set of tiles between string A and string B
- len(match): the length of a match in the set of tiles
- $\|A\|$, $\|B\|$: the lengths of string A and string B respectively
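The algorithm and the similarity formula can be sketched in Python as follows. This is a simplified quadratic version without the Karp-Rabin speed-up; the token lists and the mml parameter (assumed to be at least 1) are assumptions of this sketch:

    def gst(a, b, mml):
        # Greedy String Tiling over token lists a, b; returns tiles (pos_a, pos_b, length).
        marked_a, marked_b = [False] * len(a), [False] * len(b)
        tiles = []
        while True:
            maxmatch, matches = mml, []
            for i in range(len(a)):                     # phase 1: scan for maximal matches
                for j in range(len(b)):
                    k = 0
                    while (i + k < len(a) and j + k < len(b)
                           and a[i + k] == b[j + k]
                           and not marked_a[i + k] and not marked_b[j + k]):
                        k += 1
                    if k == maxmatch:
                        matches.append((i, j, k))
                    elif k > maxmatch:
                        matches, maxmatch = [(i, j, k)], k
            for i, j, k in matches:                     # phase 2: mark tokens, store tiles
                if not any(marked_a[i:i + k]) and not any(marked_b[j:j + k]):
                    for d in range(k):
                        marked_a[i + d] = marked_b[j + d] = True
                    tiles.append((i, j, k))
            if maxmatch <= mml:                         # no match longer than the MML left
                return tiles

    def gst_similarity(a, b, mml):
        coverage = sum(length for _, _, length in gst(a, b, mml))
        return 2.0 * coverage / (len(a) + len(b))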
2.4 Some Plagiarism Detection Systems
There are several existing systems which can detect similar documents, for example CHECK, COPS, SCAM, YAP3, etc. Recently, there has also been a plagiarism detection competition; we introduce and discuss this competition later in this section.
The CHECK system [Si, Leong, and Lau, 1997] only works with LaTeX documents. These documents are hierarchically structured, and it is easy to find their sections, subsections, etc. by their specially formatted keywords. This hierarchical structure is used to represent each document as a document tree. Each document can be viewed at multiple abstraction levels, which include the document itself, its sections, subsections, and paragraphs, with paragraphs representing the lowest level of abstraction and resembling the leaf nodes of the document tree. The system uses the Cosine measure and compares documents level by level: if two documents are similar at the root level, they continue to be compared at the lower levels of the document tree, and so on.
The COPS system [Brin et al., 1995] works with text documents. In COPS, registered documents are broken up into sentences or sequences of sentences, and are stored in a database. When we have a new document, it is broken up in the same way and compared against the registered documents in the database.
The SCAM system [Shivakumar and Molina, 1995] is based on the word occurrence frequencies of documents. It computes a frequency vector of the words occurring in the new document, and then compares this vector against the registered vectors in the database. In their experiments, the authors used 1233 netnews articles and presented results comparing the SCAM system against the COPS system. When two different sentences have the same semantics and differ in only a few words, plagiarism detection systems using sentence chunks usually do not produce good results; in this case, the result shown by SCAM is better because it uses word chunks. However, SCAM has more false positives than systems which use sentence chunks. False positives are pairs of documents that are reported to be similar but in fact are not.
A famous tool which uses the RKR-GST (Running-Karp-Rabin Greedy-String-Tiling) algorithm is YAP3 [Wise, 1996], created by Michael J. Wise. YAP3 is the third version of the YAP (Yet Another Plague) tool, a system for detecting plagiarism in computer programs; the third version has been developed for use with the English language. YAP3 works in two phases: the first phase generates a token sequence from the source text, and the second phase uses the RKR-GST algorithm to compare each token string with all other token strings. This tool is useful for finding plagiarism in computer programs.
Trang 21PAN’09 workshop! This competition divides plagiarism detection tasks into two kinds external and intrinsic plagiarism detection as we demented above The cor-
pus of this competition includes 20.611 suspicious documents and 20.612 source
documents Candidates of this competition have given a lot of approaches to solve plagiarism detection However, most of approaches are given to solve plagiarism
in the corpus of competition In each suspicious document in the corpus, there are some sentences or some paragraphs which are copied and pasted from one or some
source document Candidates have to find which sentence or paragraph is copied and from what source the documents are
Another popular tool is WCopyfind. If the input of this tool is a set of documents, then the output is the number of shared matches (one match is one or more words) between every pair of documents, and the user can view these shared matches. However, the tool only finds and shows shared matches between pairs of documents; it does not decide whether they constitute plagiarism or not.
The systems introduced above work only with English documents or computer programs. Each system often uses only one kind of chunk (sentence chunk or word chunk) and one comparison method. In this thesis, we design a unified plagiarism detection framework for Vietnamese documents. In our framework, we use several kinds of chunks and several comparison methods. For each different domain, the framework shows which methods and parameters are the most effective. After choosing the most effective method and parameters for a domain, users can compare a new document with all other documents in the database to detect plagiarism.
In this chapter, we discussed the existing related literature and background research. The three popular methods which are included in our framework were presented, and some well-known tools were introduced as well. We explained why we do not use those tools and instead propose our own framework. In the next chapter, we will present this framework in detail.
Chapter 3
System architecture
We have built a system called UPDFVD (A Unified Plagiarism Detection Framework for Vietnamese Documents) to test our ideas. The high-level architecture of the system is presented in Figure 3.1.
The inputs of the system are a set of trained documents and a list of pairs of similar documents. The trained documents are a subset of a new domain; within this set, all pairs of similar documents are stored in a List of similarity documents. The outputs of the system are the method or methods, together with their parameters, that are most effective in this domain.
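A minimal end-to-end sketch of how these inputs and outputs fit together (the names and the shape of the configuration triples are illustrative assumptions of this sketch; the real modules are described in the following sections):

    def run_framework(documents, gold_pairs, configurations, evaluate):
        # configurations: list of (name, parse, similarity) triples.
        # evaluate(scores, gold_pairs) -> (f_measure, threshold), e.g. the
        # threshold sweep described in Section 3.2.
        best = None
        for name, parse, similarity in configurations:
            parsed = [parse(doc) for doc in documents]
            scores = {(i, j): similarity(parsed[i], parsed[j])
                      for i in range(len(parsed)) for j in range(i + 1, len(parsed))}
            f, threshold = evaluate(scores, gold_pairs)
            if best is None or f > best[0]:
                best = (f, threshold, name)
        return best  # the most effective configuration for this domain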
3.1 Parsing module
The current version of the system only works with Unicode plain text. The three steps of this module are as follows:

- Step 1: In this step, the module gets the text from all files in the set of trained documents. The text of each file is considered as a unique string, so each document in the set of trained documents is also expressed as a unique string.

- Step 2: This is the pre-processing step. Before the strings are split into chunks or passed as input to the GST method, all punctuation marks (commas, semi-colons, and so on) are removed from the strings, and all characters of the strings are converted to lowercase.

- Step 3: In this step, all chunks of each document and the set of tiles of each document pair are generated. As mentioned in the above chapters, a chunk may be a word, some successive syllables, a sentence and so on.
After parsing, if the comparison method is the Overlap or Cosine method, this module returns the list of chunks of each document in the set of trained documents, together with their occurrence frequencies. For example, if document D is the input of the Parsing module, then the output of this module is a list of chunks formatted $t_i(w_i, s_i)$, where:

- $w_i$ : the string of the $i$-th chunk
- $s_i$ : the occurrence frequency of the $i$-th chunk in document D
If the comparison method is the GST method, this module returns the set of tiles of each pair of documents. For example, if documents P and T are the input documents of the Parsing module, then the output of the module is a set of tiles formatted $t_i(p, t, len)$, where:

- $t_i$ : the $i$-th tile
- $p$ : the position of the $i$-th tile in document P
- $t$ : the position of the $i$-th tile in document T
- $len$ : the length of the $i$-th tile
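A minimal sketch of these parsing steps in Python (hypothetical helper code; the chunking here uses n successive syllables, but any strategy from Chapter 2 could be substituted):

    import re
    from collections import Counter

    def preprocess(text):
        # Step 2: remove punctuation and convert to lowercase.
        return re.sub(r"[^\w\s]", " ", text).lower()

    def parse_document(text, n=2):
        # Step 3: produce the chunks t_i(w_i, s_i) with their occurrence frequencies.
        syllables = preprocess(text).split()
        chunks = [" ".join(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]
        return Counter(chunks)  # maps each chunk string w_i to its frequency s_i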
Now we discuss the length of a chunk and the minimum-match-length value. As introduced above, a chunk may be a word, some successive syllables, a sentence, a paragraph and so on. The bigger the chunk, the lower the probability of matching unrelated documents. For instance, consider two paragraphs that share 7 out of 8 identical sentences: with paragraph chunking, no match will be detected, while with sentence chunking, the pair will be detected as matching. However, the smaller the chunking unit, the higher the probability of matching unrelated documents. For example, when the chunking unit is a word, two documents may share a lot of words although they are unrelated documents. In this case, we say that the system has false positives: the system reports that two documents are similar although in fact they are unrelated.

Similarly, the value of the minimum-match-length (a parameter of the GST method) is very important. If the value is too large then the set of tiles is empty, but if the value is too low then the set of tiles is large even though the pair of documents may not be similar.
3.2 Comparing module
This module compares and computes the degree of similarity between all document pairs in the trained document set, using the sets of chunks or sets of tiles generated by the above module. The process of this module is divided into two steps as follows:

- Step 1: After all documents are parsed into lists of chunks (or sets of tiles), each document is compared with the remaining documents to compute the similarity measures between them. For example, if the trained document set contains N documents $D_1, D_2, \ldots, D_N$, then document $D_1$ is compared with the N-1 remaining documents $D_2, D_3, \ldots, D_N$, document $D_2$ is compared with the N-2 documents $D_3, D_4, \ldots, D_N$, and so on. This means that for each comparison method with each particular parameter setting there are (N*(N-1))/2 compared pairs, as sketched below.

- Step 2: After the similarity measures of all document pairs are computed and combined with the List of similarity document pairs (input data), the F-measure values (discussed in the next paragraphs) are computed by this module.
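The pairwise comparison of Step 1 can be sketched as follows (illustrative only; similarity() stands for any of the three methods):

    from itertools import combinations

    def compare_all(documents, similarity):
        # Compare every pair among N documents: N*(N-1)/2 comparisons in total.
        scores = {}
        for (i, doc_i), (j, doc_j) in combinations(enumerate(documents), 2):
            scores[(i, j)] = similarity(doc_i, doc_j)
        return scores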
To evaluate the effectiveness of each experimental comparison method, we use the recall and precision metrics as follows:

Recall metric: the recall metric measures the ability to retrieve a piece of information from the candidate information. It is defined as the percentage of documents identified as plagiarized with respect to the actual total number of plagiarized documents.

Precision metric: the precision metric represents the ability to retrieve a piece of information correctly. Here, it is defined as the percentage of correctly identified plagiarized documents with respect to the total number of documents reported as plagiarized. Denoting by P the set of document pairs reported as plagiarized and by R the set of actually plagiarized document pairs, precision and recall are computed as follows:

$$\mathrm{Precision} = \frac{|P \cap R|}{|P|} \qquad \mathrm{Recall} = \frac{|P \cap R|}{|R|}$$

We use the F-measure, defined as follows:

$$F\text{-}measure = \frac{2 \times (\mathrm{Precision} \times \mathrm{Recall})}{(\mathrm{Precision} + \mathrm{Recall})}$$
Given the similarity measures introduced above, there is a question of how to choose the threshold. The threshold is the value such that if the similarity measure of a document pair is larger than this value, we say that the two documents are similar. Choosing the threshold is very important because it determines the effectiveness of the system; with each kind of data or different domain, the threshold value may be different. Thus, for each method with its parameters we can find the best F-measure value by changing the threshold value from 0 to 1 in steps of 0.01.

For example, with a threshold value of 0.5, suppose there are n document pairs whose similarity measure values are larger than or equal to 0.5, while in fact there are m similar document pairs (m is the number of similar document pairs in the List of similarity documents). Supposing the n and m document pairs share k document pairs, the F-measure value is computed as follows:

$$\mathrm{Precision} = \frac{k}{n} \qquad \mathrm{Recall} = \frac{k}{m}$$

$$F\text{-}measure = \frac{2 \times (\mathrm{Precision} \times \mathrm{Recall})}{(\mathrm{Precision} + \mathrm{Recall})}$$
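The threshold search can be sketched as follows (assuming scores maps document pairs to similarity values and gold is the set of pairs from the List of similarity documents; the names are ours):

    def best_threshold(scores, gold):
        # Sweep the threshold from 0 to 1 in steps of 0.01, keeping the best F-measure.
        best_f, best_t = 0.0, 0.0
        for step in range(101):
            t = step / 100.0
            predicted = {pair for pair, s in scores.items() if s >= t}
            k = len(predicted & gold)
            if k == 0:
                continue
            precision, recall = k / len(predicted), k / len(gold)
            f = 2 * precision * recall / (precision + recall)
            if f > best_f:
                best_f, best_t = f, t
        return best_f, best_t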
FIGURE 3.2: Resulting showing module (relation between F-measure and threshold)
3.3 Resulting showing module
The outputs of the Comparing module are the F-measure values and threshold values of each method with its different parameters. This module shows these values as a graph. To visualize the results we use ZedGraph for creating 2D line and bar graphs of arbitrary datasets. ZedGraph is an open-source C# graph-plotting library distributed under the GNU Lesser General Public License. Figure 3.2 is an example of this module: the graph shows the results of three methods with their parameters. From this graph, users can easily choose the most effective method and the most efficient parameters for the domain. All of the F-measure values and threshold values computed by the above module are shown in the graph, so users can choose the method which has the highest F-measure value.
Chapter 4
Experimental results
In this chapter, our framework is tested and evaluated. We present and discuss several experiments to study the effectiveness of the framework. To illustrate that our framework can automatically identify which method is the most effective in a new domain, we use three different corpora in our experiments. Different data sets have different definitions of what plagiarism is, and this definition is implicitly encoded in the corpus. The first corpus consists of Vietnamese documents. We also try our framework on English corpora to support our conjecture that different domains may need different methods and corresponding parameters. The second corpus is a subset of the PAN corpus¹, and the third corpus is the corpus of Paul Clough². We describe the corpora in the following sections. For each experiment, the collection of data and the result of each compared method are displayed and discussed, and the purpose of the experiment is introduced. As mentioned above, we use three comparison methods (Cosine, Overlap and GST) in our framework, and in all of our experiments we use all of the methods with several different parameters.
¹ http://www.uni-weimar.de/cms-medien/webis/research/corpora/pan-pc-09.html
² http://ir.shef.ac.uk/cloughie/resources/corpus-final09.zip
4.1 Experiment with Vietnamese corpus
4.1.1 Data collection
In this experiment, we use over 800 netnews articles as the testing document set. These netnews articles were published on some popular Vietnamese websites, such as vnexpress.net, dantri.com.vn, laodong.com.vn, tienphong.vn, tuoitre.vn, hanoimoi.com.vn, etc., over 14 consecutive days. During this period, a large number of netnews articles on one website were copied from or overlapped with those on other websites. Therefore, the chosen document set is a good test for the system.

The document set is classified into five groups: economics, sports, law, medicine and mixed netnews. For each netnews group, we would like to test which method is the most effective.
4.1.2 Objective
All documents in this corpus are netnews articles collected from several websites on consecutive days, where some document pairs share a lot of similar words even though they are different documents. As mentioned above, the documents are classified into five netnews groups. Except for the mixed netnews group, within each remaining netnews group a document pair may share a lot of words although the documents are not related. The documents in the mixed netnews group belong to many different fields, for example education, culture, etc., so a pair of documents in this group may share very few similar words.

Given these characteristics of each netnews group, the corpus is a good device to test the capability of our framework to automatically identify which method is the most effective in a new domain. Our other purpose is to test the effect of the order of syllables and of word segmentation. In this corpus, two documents are considered plagiarized when one copies content from the other about the same event or problem; the creation of the list of plagiarized documents was done manually.
4.1.3 Implementation
In this experiment, for the Overlap and Cosine methods we used four kinds of chunks: one-syllable chunks (1-gram), two-successive-syllable chunks (2-gram), three-successive-syllable chunks (3-gram), and word chunks. We used the vnTokenizer tool for word segmentation; the vnTokenizer tool is distributed under the GNU General Public License.

For each kind of chunk, we distinguish two cases: using and not using the frequency of a chunk in a document. For the GST method, the MML values are 1, 2 and 3. In each netnews group, with the Overlap and Cosine methods, the number of comparison runs for each document pair is 16:

(number of methods: Overlap and Cosine) * (kinds of chunks) * (frequency or no frequency) = 2 * 4 * 2 = 16.

In each comparison run, we compare each document with all of the remaining documents in the netnews group to get the similarity measures, as sketched below.
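The full set of configurations tested per group can be enumerated as in this sketch (the names are illustrative):

    from itertools import product

    methods = ["Overlap", "Cosine"]
    chunk_kinds = ["1-gram", "2-gram", "3-gram", "word"]
    use_frequency = [True, False]

    # 2 methods * 4 chunk kinds * 2 frequency settings = 16 configurations,
    # plus the GST method with MML values 1, 2 and 3.
    configs = list(product(methods, chunk_kinds, use_frequency))
    configs += [("GST", "MML=%d" % mml, True) for mml in (1, 2, 3)]
    for method, chunk, freq in configs:
        print(method, chunk, freq)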
4.1.4 Result
Table 4.1 shows the most effective methods and their best parameters in each group. In this table, we find that in the law netnews group the most effective method is the GST method with MML = 2, while in the remaining netnews groups the most effective method is the Cosine method with 2-gram chunks. From Table 4.1, the average F-measure value is 88,76%, and the average precision and recall values are 90,2% and 87,4%. We find that the best chunk is the 2-gram (or MML = 2) for all three methods in all five netnews groups. In the next paragraphs, we present and discuss the results of all methods in each netnews group.
Order | Group     | Method | Chunk  | F     | P     | R     | Thres
1     | Economics | Cosine | 2-gram | 83,8% | 84,1% | 83,5% | 0.24
2     | Law       | GST    | MML=2  | 97,0% | 97,6% | 96,5% | 0.26
3     | Sport     | Cosine | 2-gram | 81,7% | 86,8% | 77,1% | 0.20
4     | Medicine  | Cosine | 2-gram | 91,7% | 93,3% | 90,1% | 0.20
5     | Mixed     | Cosine | 2-gram | 89,6% | 89,3% | 89,9% | 0.22
Table 4.1: The most effective method in each group

+ The detailed results of all methods in the economics netnews group

In Table 4.2, we present the results of the comparing methods with different parameters in the economics netnews group. From this table, we find that the best F-measure value is 83,8%, obtained with the Cosine method with 2-gram chunks, using frequency, and a threshold value of 0.24. The worst F-measure value is 47,7%, obtained with the GST method with MML = 1. In this netnews group, the Cosine method is more effective than the Overlap and GST methods. The top three results include two results of the Cosine method with different parameters; the remaining result is that of the GST method with MML = 3. In this table, the best result of the Overlap method is in fifth position. With both the Cosine and Overlap methods, the 2-gram chunk is the best chunk, but with the GST method the F-measure value is largest when the MML value is three. All three methods are ineffective with the 1-gram chunk or an MML value of one. Figure 4.1 presents the F-measure values and the best parameters of each method. In this netnews group, the most effective parameters of the Cosine and Overlap methods are the 2-gram chunk with frequency, while for the GST method the MML value is 3. The Cosine method gives its highest F-measure value at the threshold 0.24, and for the Overlap method this threshold is 0.14. The best threshold of the GST method is very low, at 0.09.
+ The detailed results of all methods in the law netnews group
Order | Method  | Chunk  | Freq? | F     | P     | R     | Thres
1     | Cosine  | 2-gram | 1     | 83,8% | 84,1% | 83,5% | 0.24
2     | GST     | MML=3  | 1     | 80,3% | 74,5% | 87,0% | 0.09
3     | Cosine  | 2-gram | 0     | 80,2% | 84,3% | 76,5% | 0.11
4     | Cosine  | 1-gram | 1     | 80,0% | 76,4% | 84,0% | 0.56
5     | Overlap | 2-gram | 1     | 78,1% | 70,4% | 87,7% | 0.14
6     | Overlap | 3-gram | 1     | 76,8% | 76,0% | 77,6% | 0.05
7     | Cosine  | 3-gram | 0     | 76,4% | 80,6% | 72,6% | 0.04
8     | GST     | MML=2  | 1     | 76,0% | 83,0% | 70,2% | 0.26
9     | Cosine  | word   | 1     | 75,3% | 75,8% | 74,7% | 0.52
10    | Cosine  | 3-gram | 1     | 74,9% | 75,0% | 74,7% | 0.06
11    | Overlap | 3-gram | 0     | 74,8% | 76,8% | 73,0% | 0.05
12    | Overlap | 2-gram | 0     | 74,2% | 68,6% | 80,8% | 0.13
13    | Cosine  | word   | 0     | 72,7% | 67,3% | 79,2% | 0.64
14    | Overlap | word   | 1     | 66,8% | 62,3% | 71,9% | 0.49
15    | Cosine  | 1-gram | 0     | 64,4% | 74,6% | 56,6% | 0.44
16    | Overlap | word   | 0     | 64,3% | 60,5% | 68,7% | 0.67
17    | Overlap | 1-gram | 1     | 63,6% | 58,7% | 69,4% | 0.52
18    | Overlap | 1-gram | 0     | 48,4% | 43,1% | 55,2% | 0.54
19    | GST     | MML=1  | 1     | 47,7% | 53,5% | 43,1% | 0.58
Table 4.2: Results of comparing methods in the economics netnews group

In Table 4.3 we present the results of the comparing methods in the law netnews group. The best F-measure is 97,2%, obtained with the GST method with an MML value of two. The worst F-measure is 69,6%, obtained with the Overlap method with the word chunk. The best F-measure values of the Overlap and Cosine methods are approximately equal (94,2% and 94,9%). The top three results include two results of the GST method with different parameters. Clearly, in this netnews group, the GST method is more effective than the Cosine and Overlap methods. Figure 4.2 presents the best parameters of each method in the law netnews group.
FIGURE 4.1: The best parameters of each method in economics netnews group
Order | Method  | Chunk  | Freq? | F     | P     | R     | Thres
1     | GST     | MML=2  | 1     | 97,2% | 97,6% | 96,8% | 0.26
2     | GST     | MML=3  | 1     | 95,7% | 95,9% | 95,6% | 0.12
3     | Cosine  | 2-gram | 1     | 94,9% | 96,6% | 93,2% | 0.20
4     | Overlap | 2-gram | 1     | 94,2% | 95,2% | 93,2% | 0.15
5     | Cosine  | 1-gram | 1     | 93,0% | 94,3% | 91,8% | 0.50
6     | Cosine  | 3-gram | 1     | 92,5% | 92,9% | 92,1% | 0.07
7     | Cosine  | 2-gram | 0     | 91,5% | 90,8% | 92,4% | 0.10
8     | Overlap | 2-gram | 0     | 90,5% | 93,1% | 87,9% | 0.14
9     | Overlap | 3-gram | 1     | 90,3% | 93,1% | 87,6% | 0.06
10    | Cosine  | 1-gram | 0     | 89,6% | 93,6% | 85,9% | 0.43
11    | Cosine  | word   | 0     | 89,2% | 92,7% | 85,9% | 0.35
12    | Overlap | 3-gram | 0     | 87,3% | 91,9% | 83,2% | 0.06
13    | Cosine  | 3-gram | 0     | 85,6% | 90,5% | 81,2% | 0.05
14    | Cosine  | word   | 1     | 85,5% | 89,1% | 82,1% | 0.47
15    | GST     | MML=1  | 1     | 83,8% | 81,6% | 86,2% | 0.57
16    | Overlap | 1-gram | 1     | 76,4% | 73,9% | 79,1% | 0.51
17    | Overlap | word   | 1     | 71,3% | 89,0% | 59,4% | 0.60
18    | Overlap | 1-gram | 0     | 70,1% | 67,0% | 73,5% | 0.54
19    | Overlap | word   | 0     | 69,6% | 78,3% | 62,6% | 0.48
Table 4.3: Results of comparing methods in the law netnews group
Both the Cosine method and the Overlap method are most effective with the 2-gram chunk, using frequency. The GST method has its highest F-measure value when the MML value is two. We find that all of the best F-measure values of the three methods are very high in this netnews group, and the best F-measure value of the Cosine method is higher than that of the Overlap method. In this netnews group, we find that both the Cosine and Overlap methods are not effective with the word chunk.

FIGURE 4.3: The best parameters of each method in sport netnews group
+ The detailed results of all methods in the sport netnews group
In Table 4.4, we present the results of the three comparing methods with different parameters in the sport netnews group. In this group, the best F-measure is 81,7%, obtained with the Cosine method with 2-gram chunks, using frequency and a threshold of 0.20. The worst F-measure value is 40,3%, obtained with the Overlap method with 1-gram chunks and no frequency. Table 4.4 shows that the Cosine method is more effective than the GST and Overlap methods; the top three results include two results of the Cosine method. In this netnews group, the Overlap method is not effective: its best F-measure value is only 71,4%, compared with 81,7% for the Cosine method and 79,4% for the GST method. Figure 4.3 presents the best parameters of each method in this netnews group.
Order | Method  | Chunk  | Freq? | F     | P     | R     | Thres
1     | Cosine  | 2-gram | 1     | 81,7% | 86,8% | 77,1% | 0.20
2     | Cosine  | 1-gram | 1     | 80,7% | 83,3% | 78,2% | 0.58
3     | GST     | MML=2  | 1     | 79,4% | 71,4% | 89,4% | 0.23
4     | Cosine  | 2-gram | 0     | 77,4% | 74,6% | 80,4% | 0.10
5     | GST     | MML=3  | 1     | 77,0% | 68,8% | 87,7% | 0.09
6     | Cosine  | 3-gram | 1     | 75,7% | 73,3% | 78,2% | 0.05
7     | Cosine  | word   | 0     | 74,3% | 74,3% | 74,3% | 0.38
8     | Cosine  | 1-gram | 0     | 72,5% | 70,7% | 74,3% | 0.43
9     | Overlap | 2-gram | 1     | 71,4% | 73,1% | 69,8% | 0.15
10    | Cosine  | word   | 1     | 70,1% | 68,8% | 71,5% | 0.56
11    | Cosine  | 3-gram | 0     | 69,5% | 77,4% | 63,1% | 0.04
12    | Overlap | 3-gram | 0     | 67,9% | 65,1% | 70,9% | 0.05
13    | Overlap | 2-gram | 0     | 67,3% | 74,3% | 61,5% | 0.15
14    | GST     | MML=1  | 1     | 66,9% | 67,8% | 65,9% | 0.60
15    | Overlap | 3-gram | 1     | 66,7% | 66,3% | 67,0% | 0.05
16    | Overlap | 1-gram | 1     | 46,6% | 56,3% | 39,7% | 0.58
17    | Overlap | word   | 1     | 45,3% | 63,6% | 35,2% | 0.56
18    | Overlap | word   | 0     | 43,9% | 48,3% | 40,2% | 0.51
19    | Overlap | 1-gram | 0     | 40,3% | 37,5% | 43,6% | 0.56
Table 4.4: Results of comparing methods in the sport netnews group

The best F-measure value of the Overlap method is much lower than the values of the Cosine and GST methods.
The Overlap method does not work well in this group.
+ The detailed results of all methods in the medicine netnews group
In Table 4.5, we present the results of the comparative methods with different parameters in the medicine netnews group. Figure 4.4 presents the relationship between the F-measure values and the thresholds of each method in the best case of this group, including the best F-measure value of the Overlap method.

Order | Method  | Chunk  | Freq? | F     | P     | R     | Thres
1     | Cosine  | 2-gram | 1     | 91,7% | 93,3% | 90,1% | 0.20
2     | GST     | MML=2  | 1     | 83,6% | 89,5% | 78,4% | 0.20
3     | GST     | MML=3  | 1     | 81,8% | 84,3% | 79,4% | 0.07
4     | Cosine  | 3-gram | 1     | 79,2% | 74,2% | 85,0% | 0.05
5     | Overlap | 2-gram | 1     | 78,7% | 77,8% | 79,7% | 0.14
6     | Cosine  | 1-gram | 1     | 77,2% | 73,0% | 82,0% | 0.52
7     | Cosine  | 2-gram | 0     | 76,7% | 71,6% | 82,7% | 0.10
8     | Overlap | 3-gram | 1     | 75,8% | 75,9% | 75,6% | 0.05
9     | Cosine  | 3-gram | 0     | 75,1% | 72,6% | 77,7% | 0.04
10    | Overlap | 3-gram | 0     | 74,5% | 71,7% | 77,5% | 0.05
11    | Overlap | 2-gram | 0     | 73,8% | 72,9% | 74,7% | 0.13
12    | Cosine  | word   | 1     | 72,2% | 75,2% | 69,4% | 0.53
13    | GST     | MML=1  | 1     | 68,7% | 63,4% | 74,9% | 0.36
14    | Overlap | word   | 1     | 68,5% | 66,3% | 70,9% | 0.42
15    | Cosine  | 1-gram | 0     | 68,3% | 66,9% | 69,8% | 0.42
16    | Overlap | 1-gram | 1     | 65,3% | 60,7% | 70,7% | 0.47
17    | Cosine  | word   | 0     | 64,4% | 57,2% | 73,7% | 0.34
18    | Overlap | word   | 0     | 59,2% | 57,6% | 60,8% | 0.42
19    | Overlap | 1-gram | 0     | 58,3% | 60,1% | 56,5% | 0.51

Table 4.5: Results of comparing methods in the medicine netnews group
FIGURE 4.4: The best parameters of each method in medicine netnews group
Both the Cosine method and the Overlap method are most effective when we use the 2-gram chunk with frequency; the GST method is most effective with an MML value of two. Table 4.5 shows the detailed results of the three methods with their different parameters. The highest F-measure value is 91,7% and the lowest F-measure value is 58,3%. In this medicine netnews group, the most effective method is the Cosine method with 2-gram chunks, using frequency, with a threshold value of 0.20. The GST method is more effective than the Overlap method. The F-measure values of the Overlap method show that this method does not suit this netnews group; its best F-measure value is 78,7%, with 2-gram chunks and frequency.
In this netnews group, although the top three results include two results of the GST method (with MML = 2 and MML = 3) while the remaining one belongs to the Cosine method, the Cosine method is still the most effective method.
+ The detailed results of all methods in the mixed netnews group

The documents in this group belong to many different fields, for example sports, law, education, and culture. In Table 4.6, we present the results of the comparing methods in the mixed netnews group.

Order | Method  | Chunk  | Freq? | F     | P     | R     | Thres
1     | Cosine  | 2-gram | 1     | 89,6% | 89,3% | 89,9% | 0.22
2     | Cosine  | 1-gram | 1     | 86,4% | 90,3% | 82,8% | 0.56
3     | GST     | MML=2  | 1     | 85,7% | 90,2% | 81,6% | 0.18
4     | GST     | MML=3  | 1     | 82,6% | 79,4% | 86,0% | 0.10
5     | Cosine  | 2-gram | 0     | 82,4% | 82,7% | 82,2% | 0.11
6     | GST     | MML=1  | 1     | 81,4% | 83,3% | 79,6% | 0.38
7     | Overlap | 2-gram | 1     | 81,3% | 77,6% | 85,4% | 0.15
8     | Cosine  | 3-gram | 1     | 79,7% | 80,5% | 79,0% | 0.07
9     | Overlap | 2-gram | 0     | 77,4% | 83,7% | 72,0% | 0.15
10    | Cosine  | 3-gram | 0     | 76,7% | 73,0% | 80,9% | 0.04
11    | Cosine  | word   | 1     | 76,7% | 84,6% | 70,1% | 0.55
12    | Cosine  | 1-gram | 0     | 76,4% | 79,9% | 73,2% | 0.46
13    | Overlap | 3-gram | 0     | 74,8% | 80,3% | 70,1% | 0.06
14    | Overlap | 3-gram | 1     | 74,8% | 70,0% | 80,3% | 0.05
15    | Cosine  | word   | 0     | 74,5% | 71,7% | 77,5% | 0.50
16    | Overlap | word   | 1     | 69,2% | 76,7% | 63,1% | 0.48
17    | Overlap | word   | 0     | 61,1% | 63,9% | 58,6% | 0.47
18    | Overlap | 1-gram | 1     | 55,8% | 53,2% | 58,6% | 0.55
19    | Overlap | 1-gram | 0     | 45,6% | 61,3% | 36,3% | 0.62
Table 4.6: Results of comparing methods in the mixed netnews group

In this mixed netnews group, as in several groups above, the most effective method is the Cosine method with 2-gram chunks, using frequency, with a threshold value of 0.22. The top three results include two results of the Cosine method and one remaining result of the GST method. The worst result is that of the Overlap method with 1-gram chunks and no frequency.
FIGURE 4.5: The best parameters of each method in mixed netnews group
Table 4.6 shows that the results of the Overlap method in this group are not bad. In the three cases of the GST method, the results do not differ much from each other. From Table 4.6, using the word chunk is not effective in this group, and the Cosine method is the most effective method because it has the top two results in this table and these values are quite high.
Figure 4.5 presents the relationship between the F-measure values and the thresholds of each method in the best case of the mixed netnews group. Both the Cosine method and