VIETNAM NATIONAL UNIVERSITY, HA NOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
A UNIFIED PLAGIARISM DETECTION FRAMEWORK FOR VIETNAMESE DOCUMENTS
By NGUYEN XUAN TOI
Supervised by Dr. PHAM BAO SON
A thesis submitted in partial fulfillment for the degree of Master of Information Technology
In the
Faculty of Information Technology, University of Engineering and Technology
HA NOI - 2010
Table of Contents

List of Figures
List of Tables
1 Introduction
2 Literature Review
   2.1 Concept of chunk in Vietnamese text
   2.2 Strategy of chunk selection
   2.3 Comparison methods
   2.4 Some Plagiarism Detection Systems
3 System architecture
   3.1 Parsing module
   3.2 Comparing module
   3.3 Resulting showing module
4 Experimental results
   4.1 Experiment with Vietnamese corpus
      4.1.1 Data collection
      4.1.2 Objective
      4.1.3 Implementation
      4.1.4 Result
   4.2 Experiment with PAN corpus
      4.2.1 Data collection
      4.2.2 Objective
      4.2.3 Implementation
      4.2.4 Result
   4.3 Experiment with corpus of P. Clough and M. Stevenson
Chapter 1
Introduction
Recently, with advances in technology, the Internet and digital libraries have provided users easier on-line access to digitized news, articles, magazines, books and other information of interest. Word processors have also become more sophisticated and faster. In this environment, users may cut and paste or modify pre-existing documents from many different sources and redistribute the information illegally much more easily. Regarding this matter, Tripolitis (2002) comments: "When you read the information you have collected, understand its meaning, express your own point of view on the subject based on this information and clearly reference all your resources, then you will not be in danger of being accused of plagiarism."

Table 1.1 shows an example of plagiarized documents. The netnews in the left column was published on the dantri website at 2:25 PM, 25/08/2009 and the netnews in the right column was published on the VOV website at 5:12 PM, 25/08/2009. We find that the netnews on the VOV website is cut and pasted from the netnews on the dantri website: both contain nearly the same content and differ only in a few words. In Table 1.2, the two netnews also share the same content; this is another example of plagiarism, in which the netnews on the laodong website is cut and pasted from the hanoimoi website. In fact, two authors cannot independently write two netnews with so many similar words about one event. The problem is even worse when entire paragraphs are copied.
From: http://dantri.com.vn/c76/s82-346059/nha-dau-tu-nuoc-ngoai-thich-linh-vuc-an-uong.htm ; Thứ Ba, 25/08/2009 - 2:25 PM

(Dantri) - Trong 8 tháng đầu năm 2009, các nhà đầu tư nước ngoài đã đăng ký đầu tư vào Việt Nam 10,453 tỷ USD, giải ngân vốn FDI đạt 6,5 tỷ USD, bằng 91,5% so với cùng kỳ năm 2008 và đạt hơn 72% kế hoạch năm. Trong con số này, vốn từ nước ngoài khoảng 5,5 tỷ USD. Về cấp giấy chứng nhận đầu tư, trong 8 tháng cả nước có 504 dự án mới được cấp giấy chứng nhận đầu tư với tổng vốn đăng ký 5,625 tỷ USD, bằng 10,8% so với cùng kỳ. Tuy vốn đăng ký cấp mới giảm nhưng lượng vốn tăng thêm của các dự án đã đầu tư giai đoạn trước lại tăng hơn so với cùng kỳ năm 2008. Cụ thể có 149 dự án đăng ký tăng vốn đầu tư với tổng vốn tăng thêm là 4,828 tỷ USD, tăng 3,8% so cùng kỳ.

From: http://vovnews.vn/Home/8-thang-thu-hut-duoc-hon-104-ty-USD-von-FDI/20098/120134.vov ; 5:12 PM, 25/08/2009

(VOV) - Sáng 25/8, Cục trưởng Cục Đầu tư nước ngoài Phan Hữu Thắng cho biết, trong 8 tháng qua, tính cả vốn cấp mới và tăng thêm, các nhà đầu tư nước ngoài đã đăng ký đầu tư vào Việt Nam 10,453 tỷ USD, giải ngân vốn FDI đạt 6,5 tỷ USD. Theo báo cáo của Cục Đầu tư nước ngoài, về cấp giấy chứng nhận đầu tư, trong 8 tháng qua cả nước có 504 dự án được cấp mới giấy chứng nhận đầu tư với tổng vốn đăng ký 5,625 tỷ USD, bằng 10,8% so với cùng kỳ. Tuy vốn đăng ký cấp mới giảm nhưng lượng vốn tăng thêm của các dự án đã đầu tư giai đoạn trước lại tăng hơn so với cùng kỳ năm 2008. Trong 8 tháng đầu năm, có 149 dự án đăng ký tăng vốn đầu tư với tổng vốn tăng thêm là 4,828 tỷ USD, tăng 3,8% so cùng kỳ. "Điều này thể hiện niềm tin của các nhà đầu tư vào khả năng phục hồi và tiềm năng phát triển của nền kinh tế Việt Nam", Cục Đầu tư nước ngoài nhận định.

TABLE 1.1: Example of plagiarized netnews (two Vietnamese news reports on the same FDI figures)
So what is plagiarism? "Plagiarism, broadly defined, is the use of words or ideas of another without giving proper credit" [Guiliano, 2000]. "There is general agreement that a word-for-word copy of an entire document is plagiarism" [Noynaert, 2009]. "When the work of someone else is reproduced without acknowledging the source, this is known as plagiarism" [Clough et al., 2002]. "Unacknowledged copying of documents or programs" [Joy and Luck, 1999]. Pinto [2009] added that plagiarism can also happen when one document or source is translated into another language and then reused.
Hence, there is a need for techniques to protect original documents.
Today, there exist some techniques to address this issue. We can divide them into two categories: copy prevention and copy detection [Shivakumar and Molina, 1995, Brin et al., 1995, Si et al., 1997]. Copy prevention schemes may rely on physical isolation or software isolation of documents. For example, we can place documents on a private CD system where users can only look for and view the documents but cannot add or delete any data. We can also use software to create documents which users cannot copy, paste or print; for example, Acrobat software can create PDF documents which users can only view, without being able to copy and paste or to insert and delete. However, with new technology it is not difficult to break this protection, so users can still copy and distribute the original documents. In addition, users can view a document and rewrite it in their own words, without needing to copy and paste. In some cases we cannot apply this approach at all, because we want to provide users with access to the documents over the internet.

The other approach can be referred to as signature-based schemes. A "signature" is added to the document, and this signature can be used to show whether documents are original or not. For example, one popular approach used in word documents is watermarks: when we create a new document, we insert watermarks into it [Brassil et al., 1995]. However, signature schemes have some weaknesses. First, the "signature" can easily be removed, which causes documents to become untraceable. Second, we cannot find documents which are only partially copied. Third, users can view the text and rewrite it in their own documents.
A challenging question is how to give users access to many digital libraries and different sources while protecting our original documents at the same time. To address these difficulties, there is a better approach: building a plagiarism detection system. Researchers have divided plagiarism detection into two categories: the first class is external plagiarism detection, and the second class is intrinsic plagiarism detection.

For external plagiarism detection, original documents are registered and stored in a repository. Subsequent documents are compared against the pre-registered documents.
For intrinsic plagiarism detection, we can identify whether a document is plagiarized or new by detecting writing-style breaches [Eissen and Stein, 2006, Feiguina and Hirst, 2007, Stein et al., 2008, Potthast et al., 2009]. Normally, each person has his own writing style, characterized for example by average sentence length, average paragraph length, sentence structure, richness of vocabulary, etc. So we can identify plagiarized portions of text by the difference between their style and the style of the rest of the document.
Plagiarism detection is a very important matter. It concerns many people, not only in education but also in many other fields, for example music, software, and so on. Many methods have therefore been invented to address this problem in different fields. Especially in education, a series of tools have been created to identify whether or not a student copies all or parts of an assignment from another student. However, it is very difficult to decide which is the best algorithm or the best tool. Each of them has its own advantages and disadvantages: an approach may be effective in one domain but not in another. All methods usually measure text similarity. Some important methods are presented in the later sections of this thesis.
So how do we find the most effective method in a new domain automatically? In this thesis, we propose a unified plagiarism detection framework for Vietnamese documents. This framework can automatically identify which approach is the most effective in a new domain, and it can check whether a document is copied or not. The framework is an external plagiarism detection system. Besides, the framework can identify which parameters are effective with each approach (e.g. whether chunks are 1-gram, 2-gram, or 3-gram, and so on). This matters because word segmentation in Vietnamese documents and English documents is different, so in this thesis we also want to test whether word segmentation is important in detecting plagiarism for Vietnamese documents. Three important methods are included in the framework: the Overlap, Cosine and GST methods.
This thesis consists of five chapters. In Chapter 2 we review the related work. We describe and discuss the three methods chosen to compare similar documents which are included in our framework. In particular, we discuss some strategies for choosing chunks, and we describe some existing tools and why we do not use them. In Chapter 3 we introduce the architecture and describe the function of the modules in the framework. Chapter 4 presents the process of collecting input data for the system and the steps and results of each experiment. Finally, Chapter 5 concludes the thesis.
From: http://www.laodong.com.vn/Home/Vu-New-Century-Nhieu-bi-cao-thua-nhan-pham-toi-do-ham-choi/20098/152629.laodong ; Cập nhật: 8:29 AM, 25/08/2009

(LD) - Bị cáo Nguyễn Đại Dương, trú tại phường Đội Cấn, quận Ba Đình, TP. Hà Nội, chủ vũ trường New Century - bị truy tố về tội "Kinh doanh trái phép" theo điểm c, khoản 2, Điều 159 Bộ luật Hình sự. 7 khách tới chơi tại vũ trường là: Trần Thị Thanh, Lê Anh Tuấn, Trương Thị Thu Hiền, Đào Phương Trí, Nguyễn Tuấn Trung, Lê Quốc Vượng, Lê Thị Kim Anh bị truy tố về tội "mua bán trái phép chất ma túy" hoặc "tàng trữ trái phép chất ma túy". Vụ án được phát hiện vào lúc 1h sáng 28.4.2007, lực lượng công an đã ập vào vũ trường New Century, tạm giữ hơn 1.000 đối tượng, thu được số lượng lớn ma túy tổng hợp. Kết quả test nhanh cho thấy, hơn 200 người có phản ứng dương tính với ma túy. Ngay khi bước vào phiên xét xử, Nguyễn Đại Dương đã đề nghị tòa cần phải triệu tập anh Lý Hồng Linh - nhân viên phục vụ bàn tầng 1 của vũ trường New Century ra để được đối chất, vì theo bị cáo Dương thì anh này đã có những lời khai bất lợi cho các bị cáo do bị mớm cung. Tuy nhiên, đại diện VKSND cho rằng không cần thiết phải có mặt anh Linh tại phiên xét xử, vì anh này đã có đầy đủ lời khai trước CQĐT.

From: http://www.hanoimoi.com.vn/Xet-xu-so-tham-vu-vu-truong-New-Century/3122406.epi ; Cập nhật: 06:38 AM, 25/08/2009

(HNM) - Theo đại diện VKS giữ quyền công tố tại phiên tòa, bị cáo Nguyễn Đại Dương, trú tại phường Đội Cấn, quận Ba Đình, TP. Hà Nội bị truy tố tội "Kinh doanh trái phép" theo điểm c khoản 2, Điều 159 - Bộ luật Hình sự. 7 "quý khách" tới chơi tại vũ trường là Trần Thị Thanh, Lê Anh Tuấn, Trương Thị Thu Hiền, Đào Phương Trí, Nguyễn Tuấn Trung, Lê Quốc Vượng, Lê Thị Kim Anh bị truy tố về tội "Mua bán trái phép chất ma túy" hoặc "Tàng trữ trái phép chất ma túy". Trước đó, khoảng 1h sáng 28-4-2007, vũ trường New Century, một điểm chơi đêm đình đám nhất Hà Nội bị lực lượng CA bất ngờ đột kích, tạm giữ hơn 1.000 đối tượng, thu được số lượng lớn ma túy tổng hợp. Kết quả test nhanh cho thấy, hơn 200 người có phản ứng dương tính với ma túy. Hôm qua 24-8, ngay từ đầu phiên tòa, Nguyễn Đại Dương đã có yêu cầu triệu tập thêm anh Lý Hồng Linh, nhân viên phục vụ bàn tầng 1 của vũ trường New Century ra làm nhân chứng. Theo bị cáo Dương, anh này đã bị mớm cung. Xét thấy tại cơ quan điều tra đã có lời khai của Lý Hồng Linh nên VKS cho rằng không cần thiết phải có mặt anh Linh tại phiên xét xử. Phần xét hỏi nhóm khách có hành vi mua bán, tàng trữ ma túy trong đêm vũ trường New Century bị đột kích đã mở màn cho phần xét hỏi. 7 bị cáo đều biện minh cho tội trạng bằng lý do tuổi trẻ, ham chơi. Riêng Nguyễn Đại Dương cho rằng mình vô tội (?). Phiên tòa sẽ kết thúc vào ngày 28 tới.

TABLE 1.2: Example of plagiarized netnews (two Vietnamese news reports on the same court case)
Chapter 2
Literature Review
In this chapter, we discuss the existing related literature and background research. The concept of a chunk in Vietnamese text and some strategies for selecting chunks are also presented. Some popular approaches which measure the similarity between two documents are introduced and analyzed. They are (1) the Overlap method, which is based on set theory, (2) the Cosine method, which is based on the vector space model, and (3) GST - Greedy String Tiling, which is based on substring matching. These methods have been used successfully in many fields, for example plagiarism and copy detection, information retrieval, and tracking similarity between files. Finally, we present and assess some existing tools which are related to our framework.
2.1 Concept of chunk in Vietnamese text
In the Overlap and Cosine methods (introduced later), each document is split into chunks before it is compared with other documents. A chunk is one or more successive syllables. In English, a word is a syllable, but in Vietnamese a word may be one or more successive syllables. So a chunk may be a syllable, two successive syllables, three successive syllables, etc., or a word or a sentence in a Vietnamese document. For example, in the following sentence: "Xử lý ngôn ngữ tự nhiên là một lĩnh vực rất khó", if a chunk is a word in Vietnamese then "xử lý", "ngôn ngữ tự nhiên", "là", "một", "lĩnh vực", "rất", "khó" are the chunks of the above sentence. We find that chunks represent the content of the document, and we can determine whether a document is plagiarized or not by comparing chunks.
2.2 Strategy of chunk selection
To determine how documents are split into chunks, we first consider the way of choosing the chunks. In designing a chunking process, there are three factors that need to be taken into consideration. The first is the size of the chunk, which is the number of syllables in a chunk. The second is the number of chunks which are used to build the set of chunks of one document. The third is the choice of the algorithm used to select syllables from the document. There are several strategies for selecting chunks [Schleimer, Wilkerson, and Aiken, 2003, Heintze, 1996], but we only discuss three popular strategies, sketched in code after this list:

- (A) One chunk equals n successive syllables, overlapping n-1 syllables. In this strategy, every substring of size n in the document is selected: the k-th chunk and the (k+1)-th chunk overlap in n-1 syllables. For example, we have a document ABCDEFGH where letters represent syllables. If we select n = 3 then the chunks of this document are: ABC, BCD, CDE, DEF, EFG and FGH. This strategy produces the largest number of chunks and is expensive for document storage. However, it could be expected to be the most effective strategy because it uses every substring of the document.

- (B) One chunk equals n successive syllables, with no overlapping syllables. This strategy is similar to strategy (A) but does not select overlapping substrings; that means the k-th chunk and the (k+1)-th chunk do not overlap. For example, with the same document and n = 3, the chunks are ABC, DEF and GH.

- (C) One chunk equals n successive syllables starting at the k-th syllable of a sentence. The chunks are created by selecting n successive syllables beginning at the k-th syllable of each sentence of the document.
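As referenced above, the three strategies can be sketched in Python as follows. This is a minimal illustration under the assumption that a document is given as a list of syllables; the function names are ours, not part of any existing tool:

    # Strategy (A): chunks of n successive syllables, overlapping n-1 syllables.
    def chunks_overlapping(syllables, n):
        return [syllables[i:i + n] for i in range(len(syllables) - n + 1)]

    # Strategy (B): chunks of n successive syllables with no overlap.
    def chunks_non_overlapping(syllables, n):
        return [syllables[i:i + n] for i in range(0, len(syllables), n)]

    # Strategy (C): one chunk of n successive syllables starting at the k-th
    # syllable of a sentence (k counted from 1).
    def chunk_at(sentence_syllables, k, n):
        return sentence_syllables[k - 1:k - 1 + n]

    doc = list("ABCDEFGH")                 # letters stand for syllables
    print(chunks_overlapping(doc, 3))      # ABC, BCD, CDE, DEF, EFG, FGH
    print(chunks_non_overlapping(doc, 3))  # ABC, DEF, GH
    print(chunk_at(doc, 3, 3))             # CDE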
2.3 Comparison methods
There are several methods which have been used for plagiarism and copy detection. In this chapter, we discuss and describe three important methods which have been used successfully in several fields by other researchers: Overlap, Cosine and Greedy String Tiling (GST). Each method has its own advantages and disadvantages, and a method may be effective in one domain but not work in another.
2.3.1 Overlap method
The first way of measuring similarity which we want to present is Overlap. Although this method is quite simple, it is useful. This measure is widely used in IR systems [Salton, 1992]: when the user gives a query to the system, the system searches its database to return documents related to the query. Similarly, we need to establish a metric that measures the overlap between an incoming document and a pre-registered document.

Let A, B refer to generic documents (registered or new). Suppose A is split into m chunks. We denote them by $t_{A,1}(w_{A,1}, s_{A,1}), t_{A,2}(w_{A,2}, s_{A,2}), \ldots, t_{A,m}(w_{A,m}, s_{A,m})$, where:

- $t_{A,i}$ : the $i$-th chunk in document A
- $w_{A,i}$ : the string in chunk $t_{A,i}$
- $s_{A,i}$ : the occurrence frequency of $w_{A,i}$ in document A
Similarly, document B is split into n chunks, denoted $t_{B,1}(w_{B,1}, s_{B,1}), t_{B,2}(w_{B,2}, s_{B,2}), \ldots, t_{B,n}(w_{B,n}, s_{B,n})$.

Suppose there are k similar chunks in both documents A and B. We denote them by $t_{O,1}(w_{O,1}, s'_{A,1}, s'_{B,1}), t_{O,2}(w_{O,2}, s'_{A,2}, s'_{B,2}), \ldots, t_{O,k}(w_{O,k}, s'_{A,k}, s'_{B,k})$, where, for i from 1 to k:

- $t_{O,i}$ : the $i$-th similar chunk in both documents A and B
- $w_{O,i}$ : the string in chunk $t_{O,i}$
- $s'_{A,i}$ : the occurrence frequency of $w_{O,i}$ in document A
- $s'_{B,i}$ : the occurrence frequency of $w_{O,i}$ in document B
Let $S_i = \min(s'_{A,i}, s'_{B,i})$ for all i from 1 to k. Let Overlap(A, B) denote the Overlap value between document A and document B, computed as follows:

$$\mathrm{Overlap}(A, B) = \frac{\sum_{i=1}^{k} S_i}{\sum_{i=1}^{m} s_{A,i}}$$

Similarly, let Overlap(B, A) denote the Overlap value between document B and document A, computed as follows:

$$\mathrm{Overlap}(B, A) = \frac{\sum_{i=1}^{k} S_i}{\sum_{i=1}^{n} s_{B,i}}$$

We find that the Overlap measure between document A and B is the quotient of the number of shared chunks (in both documents A and B) and the total number of chunks of document A. Similarly, the Overlap measure between document B and A is the quotient of the number of shared chunks and the total number of chunks of document B. After Overlap(A, B) and Overlap(B, A) are computed, we denote by S(A, B) the Overlap measure between documents A and B, computed as follows:
$$S(A, B) = \max\left(\mathrm{Overlap}(A, B),\ \mathrm{Overlap}(B, A)\right)$$
This measure ranges from 0 to 1. It indicates the proportion of shared chunks in the smaller document. Clearly, this method is simple and it is not normalized with respect to the sizes of the two documents. In some cases the Overlap measure is effective. For example, consider two sets, one containing 20 elements and the other 200 elements: if the intersection is 10 elements, this would account for 50% similarity.
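A minimal Python sketch of this Overlap measure (an illustrative helper of ours, assuming each document is already given as a list of chunk strings):

    from collections import Counter

    def overlap_similarity(chunks_a, chunks_b):
        # Frequencies s_{A,i}, s_{B,i} of every chunk string.
        freq_a, freq_b = Counter(chunks_a), Counter(chunks_b)
        # Sum of S_i = min(s'_{A,i}, s'_{B,i}) over chunks shared by A and B.
        shared = sum(min(freq_a[c], freq_b[c]) for c in freq_a.keys() & freq_b.keys())
        overlap_ab = shared / sum(freq_a.values())  # divided by total chunks of A
        overlap_ba = shared / sum(freq_b.values())  # divided by total chunks of B
        return max(overlap_ab, overlap_ba)          # S(A, B)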
But if we use the Cosine coefficient on the same two sets, the similarity measure may be very low.

2.3.2 Cosine similarity measure
Another popular similarity measure is the Cosine measure. Let us denote the set of chunks for document A as a vector with length n, $V_A = [w_{A,1}, w_{A,2}, w_{A,3}, \ldots, w_{A,n}]$, and their associated occurrence frequencies as another vector, $W_A = [s_{A,1}, s_{A,2}, s_{A,3}, \ldots, s_{A,n}]$, where $w_{A,i}$ and $s_{A,i}$ represent the string in the $i$-th chunk and its occurrence frequency in document A respectively. Similarly, we denote the set of chunks and their associated occurrence frequencies in document B as vectors $V_B = [w_{B,1}, w_{B,2}, w_{B,3}, \ldots, w_{B,m}]$ and $W_B = [s_{B,1}, s_{B,2}, s_{B,3}, \ldots, s_{B,m}]$ respectively. The similarity between $V_A$ and $V_B$ is determined in two steps:

Step 1: Since $V_A$ and $V_B$ might contain different numbers of elements, they must be normalized to the same length. A reference vector R is first generated as the union of the elements in both $V_A$ and $V_B$, i.e., $R = (V_A \cup V_B) = [w_{R,1}, w_{R,2}, \ldots, w_{R,k}]$ with $k \le m + n$. Let us denote the corresponding normalized vector of $V_A$ as $X_A = [x_{A,1}, x_{A,2}, x_{A,3}, \ldots, x_{A,k}]$, where $x_{A,i}$ is defined as follows:

$$x_{A,i} = \begin{cases} 0 & \text{if } w_{R,i} \notin V_A \\ s_{A,j} & \text{if } w_{R,i} = w_{A,j} \in V_A \end{cases}$$

Similarly, the normalized vector $X_B = [x_{B,1}, x_{B,2}, x_{B,3}, \ldots, x_{B,k}]$ of $V_B$ can be computed.
Step 2: the similarity $S(V_A, V_B)$ between $V_A$ and $V_B$ can be simply computed as follows:

$$S(V_A, V_B) = \frac{\sum_{i=1}^{k} x_{A,i} \, x_{B,i}}{\sqrt{\sum_{i=1}^{k} x_{A,i}^2} \cdot \sqrt{\sum_{i=1}^{k} x_{B,i}^2}}$$

Essentially, $S(V_A, V_B)$ is the cosine of the angle between the two chunk vectors in the k-dimensional space. If the angle is small, the value of $S(V_A, V_B)$ is large and the similarity measure of the two documents is higher.
This measure is simple, and it proves robust across collections. Because it is normalized, it can easily be used in document classification across queries. Besides, it has performed well with respect to other approaches. There are some problems with a vector-space model, including: terms are assumed to be independent, there is no theoretical basis for the assumption of a term space, and there is a lack of justification for the choice of term weights [Griswold, 1993].
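A corresponding Python sketch of the Cosine measure over chunk frequency vectors (illustrative only; the reference vector R is implicit in the union of chunk keys):

    import math
    from collections import Counter

    def cosine_similarity(chunks_a, chunks_b):
        xa, xb = Counter(chunks_a), Counter(chunks_b)
        # Chunks outside the intersection contribute 0 to the dot product, so
        # iterating over shared keys is equivalent to using the full normalized
        # vectors X_A and X_B over the reference vector R.
        dot = sum(xa[c] * xb[c] for c in xa.keys() & xb.keys())
        norm_a = math.sqrt(sum(v * v for v in xa.values()))
        norm_b = math.sqrt(sum(v * v for v in xb.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0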
2.3.3 Greedy String Tiling
The third method discussed and described in this chapter is Greedy String Tiling (GST) [Wise, 1993]. This algorithm aims at detecting the longest possible common strings between two documents (called the text and the pattern). We can measure the similarity of these two documents based on these longest common strings. For example, suppose we have the following two strings:
T string:

"Thanh tra Sở Y tế tiếp tục kiểm tra tại cơ sở Hồng Hoa Trường Lạc cho thấy, sản phẩm trân châu của công ty có công bố chất lượng. Tuy nhiên, cơ sở sản xuất chưa được cấp chứng nhận đủ điều kiện ATVSTP, công nhân trực tiếp sản xuất chưa tuân thủ đầy đủ quy định về ATVSTP. Thanh tra Sở Y tế Hà Nội đã tạm đình chỉ hoạt động sản xuất cơ sở này."

P string:

"Thanh tra Sở Y tế tiếp tục kiểm tra tại cơ sở sản xuất này, phát hiện chưa được cấp chứng nhận đủ điều kiện ATVSTP nên đã tạm đình chỉ hoạt động."
The longest possible common strings between the two documents are: "Thanh tra Sở Y tế tiếp tục kiểm tra tại cơ sở"; "sản xuất"; "chưa được cấp chứng nhận đủ điều kiện"; "đã tạm đình chỉ hoạt động".
Before presenting GST in detail, some terms are defined.

Maximal-match is the number of syllables of a longest possible common substring starting at position p of the pattern and position t of the text. For example, at start position 1 of the above P string and start position 1 of the above T string, the maximal-match is 12 (the number of syllables of the substring "Thanh tra Sở Y tế tiếp tục kiểm tra tại cơ sở").

Tiles are the substrings which are the longest common substrings between the text and the pattern. For example, the tiles of the above T string and P string are "Thanh tra Sở Y tế tiếp tục kiểm tra tại cơ sở"; "sản xuất"; "chưa được cấp chứng nhận đủ điều kiện"; "đã tạm đình chỉ hoạt động".

Minimum-match-length is an integer threshold below which maximal-matches are ignored. It means that the length of each tile is always greater than or equal to the minimum-match-length. For example, if the minimum-match-length is 3 then all two-syllable tiles are ignored.
A pseudo-code of the algorithm, as described by Prechelt et al. [2000], is given in Table 2.1:
According to the above pseudo-code, we find that the algorithm is separated into two phases. The first phase runs from line 7 to line 22 (called scanpattern in [Wise, 1993]). This phase searches for all maximal-matches. First, it sets the maxmatch variable equal to the Minimum-match-length value (line 5). In this phase, only unmarked tokens are processed. If a match is found between token $A_a$ of string A and token $B_b$ of string B, then this match is extended as far as possible; the extension finishes when a mismatch or a marked token is found (line 12). The length of this match, j, is compared with maxmatch. If they are equal, the match is added to the set of matches; if j is greater than maxmatch, the set of matches is replaced by this match alone and maxmatch is set to j.
 1  Greedy-String-Tiling(String A, String B) {
 2      tiles = {};
 3      do
 4      {
 5          maxmatch = Minimum-match-length;
 6          matches = {};
 7          Forall unmarked tokens A_a in A
 8          {
 9              Forall unmarked tokens B_b in B
10              {
11                  j = 0;
12                  while (A_{a+j} == B_{b+j} and Unmarked(A_{a+j}) and Unmarked(B_{b+j}))
13                      j = j + 1;
14                  if (j == maxmatch)
15                      matches = matches ⊕ match(a, b, j);
16                  else if (j > maxmatch)
17                  {
18                      matches = { match(a, b, j) };
19                      maxmatch = j;
20                  }
21              }
22          }
23          Forall match(a, b, maxmatch) in matches
24          {
25              For j = 0 to (maxmatch - 1)
26              {
27                  mark_token(A_{a+j});
28                  mark_token(B_{b+j});
29              }
30              tiles = tiles ∪ { match(a, b, maxmatch) };
31          }
32      } while (maxmatch > Minimum-match-length);
33      return tiles;
34  }

TABLE 2.1: The Greedy String Tiling algorithm for finding maximal matches between string A and string B
The second phase runs from line 23 to line 31 (called markarrays in [Wise, 1993]). This phase stores the tiles (line 30) which were obtained in the previous phase and marks the tokens belonging to these tiles (lines 25-29). When a token is marked, it cannot be used again. The algorithm finishes when there are no more matches longer than or equal to the minimum-match-length.
While the two above algorithms (Overlap and Cosine) use fixed-length chunks, this algorithm uses variable-length chunks. We hope that using this kind of chunk may be better than fixed-length chunks in some domains.
Once we have the set of tiles produced by GST, the similarity Sim(A, B) between the two documents can be quantified by the following two formulas:

$$\mathrm{coverage}(tiles) = \sum_{match \in tiles} \mathrm{len}(match)$$

$$\mathrm{Sim}(A, B) = \frac{2 \times \mathrm{coverage}(tiles)}{\|A\| + \|B\|}$$

where:

- A, B: the two input strings
- tiles: the set of tiles between string A and string B
- len(match): the length of a match in the set of tiles
- $\|A\|$, $\|B\|$: the lengths of string A and string B respectively
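The algorithm and the similarity formula can be sketched in Python as follows. This is a simplified quadratic version without the Karp-Rabin speed-up; the token lists and the mml parameter (assumed to be at least 1) are assumptions of this sketch:

    def gst(a, b, mml):
        # Greedy String Tiling over token lists a, b; returns tiles (pos_a, pos_b, length).
        marked_a, marked_b = [False] * len(a), [False] * len(b)
        tiles = []
        while True:
            maxmatch, matches = mml, []
            for i in range(len(a)):                     # phase 1: scan for maximal matches
                for j in range(len(b)):
                    k = 0
                    while (i + k < len(a) and j + k < len(b)
                           and a[i + k] == b[j + k]
                           and not marked_a[i + k] and not marked_b[j + k]):
                        k += 1
                    if k == maxmatch:
                        matches.append((i, j, k))
                    elif k > maxmatch:
                        matches, maxmatch = [(i, j, k)], k
            for i, j, k in matches:                     # phase 2: mark tokens, store tiles
                if not any(marked_a[i:i + k]) and not any(marked_b[j:j + k]):
                    for d in range(k):
                        marked_a[i + d] = marked_b[j + d] = True
                    tiles.append((i, j, k))
            if maxmatch <= mml:                         # no match longer than the MML left
                return tiles

    def gst_similarity(a, b, mml):
        coverage = sum(length for _, _, length in gst(a, b, mml))
        return 2.0 * coverage / (len(a) + len(b))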
2.4 Some Plagiarism Detection Systems
There are several existing systems which can detect similar documents, for example CHECK, COPS, SCAM, YAP3, etc. Recently, there has also been a plagiarism detection competition; we introduce and discuss this competition later in this section.
The CHECK system [Si, Leong, and Lau, 1997] only works with LaTeX documents. These documents are hierarchically structured, and it is easy to find their sections, subsections, etc. by their specially formatted keywords. This hierarchical structure is used to represent each document as a document tree. Each document can be viewed at multiple abstraction levels, which include the document itself, its sections, subsections, and paragraphs, with paragraphs representing the lowest level of abstraction and resembling the leaf nodes of the document tree. The system uses the Cosine measure and compares documents level by level: if two documents are similar at the root level, they continue to be compared at the lower levels of the document tree, and so on.
The COPS system [Brin et al., 1995] works with text documents. In COPS, registered documents are broken up into sentences or sequences of sentences, and are stored in a database. When we have a new document, it is broken up in the same way and compared against the registered documents in the database.
The SCAM system [Shivakumar and Molina, 1995] is based on the word occurrence frequencies of documents. It computes a frequency vector of the words occurring in the new document, and then compares this vector against the registered vectors in the database. In their experiments, the authors used 1233 netnews articles and presented results comparing the SCAM system against the COPS system. When two different sentences have the same semantics and differ in only a few words, plagiarism detection systems using sentence chunks usually do not produce good results; in this case, the result shown by SCAM is better because it uses word chunks. However, SCAM has more false positives than systems which use sentence chunks. False positives are pairs of documents that are reported to be similar but in fact are not.
A famous tool which uses the RKR-GST (Running-Karp-Rabin Greedy-String-Tiling) algorithm is YAP3 [Wise, 1996], created by Michael J. Wise. YAP3 is the third version of the YAP (Yet Another Plague) tool, a system for detecting plagiarism in computer programs; the third version has been developed for use with the English language. YAP3 works in two phases: the first phase generates a token sequence from the source text, and the second phase uses the RKR-GST algorithm to compare each token string with all other token strings. This tool is useful for finding plagiarism in computer programs.
Trang 21PAN’09 workshop! This competition divides plagiarism detection tasks into two kinds external and intrinsic plagiarism detection as we demented above The cor-
pus of this competition includes 20.611 suspicious documents and 20.612 source
documents Candidates of this competition have given a lot of approaches to solve plagiarism detection However, most of approaches are given to solve plagiarism
in the corpus of competition In each suspicious document in the corpus, there are some sentences or some paragraphs which are copied and pasted from one or some
source document Candidates have to find which sentence or paragraph is copied and from what source the documents are
Another popular tool is WCopyfind. If the input of this tool is a set of documents, then the output is the number of shared matches (one match is one or more words) between every pair of documents, and the user can view these shared matches. However, the tool only finds and shows shared matches between pairs of documents; it does not decide whether they constitute plagiarism or not.
The systems introduced above work only with English documents or computer programs. Each system often uses only one kind of chunk (sentence chunk or word chunk) and one comparison method. In this thesis, we design a unified plagiarism detection framework for Vietnamese documents. In our framework, we use several kinds of chunks and several comparison methods. For each different domain, the framework shows which methods and parameters are the most effective. After choosing the most effective method and parameters for a domain, users can compare a new document with all other documents in the database to detect plagiarism.
In this chapter, we discussed the existing related literature and background research. The three popular methods which are included in our framework were presented, and some well-known tools were introduced as well. We explained why we do not use those tools and instead propose our own framework. In the next chapter, we will present this framework in detail.
Chapter 3
System architecture
We have built a system called UPDFVD (A Unified Plagiarism Detection Framework for Vietnamese Documents) to test our ideas. The high-level architecture of the system is presented in Figure 3.1.
The inputs of the system are a set of trained documents and a list of pairs of similar documents. The trained documents are a subset of a new domain; within this set, all pairs of similar documents are stored in a List of similarity documents. The outputs of the system are the method or methods, together with their parameters, that are most effective in this domain.
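A minimal end-to-end sketch of how these inputs and outputs fit together (the names and the shape of the configuration triples are illustrative assumptions of this sketch; the real modules are described in the following sections):

    def run_framework(documents, gold_pairs, configurations, evaluate):
        # configurations: list of (name, parse, similarity) triples.
        # evaluate(scores, gold_pairs) -> (f_measure, threshold), e.g. the
        # threshold sweep described in Section 3.2.
        best = None
        for name, parse, similarity in configurations:
            parsed = [parse(doc) for doc in documents]
            scores = {(i, j): similarity(parsed[i], parsed[j])
                      for i in range(len(parsed)) for j in range(i + 1, len(parsed))}
            f, threshold = evaluate(scores, gold_pairs)
            if best is None or f > best[0]:
                best = (f, threshold, name)
        return best  # the most effective configuration for this domain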
3.1 Parsing module
The current version of the system only works with Unicode plain text. The three steps of this module are as follows:

- Step 1: In this step, the module gets the text from all files in the set of trained documents. The text of each file is considered as a unique string, so each document in the set of trained documents is also expressed as a unique string.

- Step 2: This is the pre-processing step. Before the strings are split into chunks or passed as input to the GST method, all punctuation marks (commas, semi-colons, and so on) are removed from the strings, and all characters of the strings are converted to lowercase.

- Step 3: In this step, all chunks of each document and the set of tiles of each document pair are generated. As mentioned in the above chapters, a chunk may be a word, some successive syllables, a sentence and so on.
After parsing, if the comparison method is the Overlap or Cosine method, this module returns the list of chunks of each document in the set of trained documents, together with their occurrence frequencies. For example, if document D is the input of the Parsing module, then the output of this module is a list of chunks formatted $t_i(w_i, s_i)$, where:

- $w_i$ : the string of the $i$-th chunk
- $s_i$ : the occurrence frequency of the $i$-th chunk in document D
If the comparison method is the GST method, this module returns the set of tiles of each pair of documents. For example, if documents P and T are the input documents of the Parsing module, then the output of the module is a set of tiles formatted $t_i(p, t, len)$, where:

- $t_i$ : the $i$-th tile
- $p$ : the position of the $i$-th tile in document P
- $t$ : the position of the $i$-th tile in document T
- $len$ : the length of the $i$-th tile
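A minimal sketch of these parsing steps in Python (hypothetical helper code; the chunking here uses n successive syllables, but any strategy from Chapter 2 could be substituted):

    import re
    from collections import Counter

    def preprocess(text):
        # Step 2: remove punctuation and convert to lowercase.
        return re.sub(r"[^\w\s]", " ", text).lower()

    def parse_document(text, n=2):
        # Step 3: produce the chunks t_i(w_i, s_i) with their occurrence frequencies.
        syllables = preprocess(text).split()
        chunks = [" ".join(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]
        return Counter(chunks)  # maps each chunk string w_i to its frequency s_i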
Now we discuss the length of a chunk and the minimum-match-length value. As introduced above, a chunk may be a word, some successive syllables, a sentence, a paragraph and so on. The bigger the chunk, the lower the probability of matching unrelated documents. For instance, consider two paragraphs that share 7 out of 8 identical sentences: with paragraph chunking, no match will be detected, while with sentence chunking, the pair will be detected as matching. However, the smaller the chunking unit, the higher the probability of matching unrelated documents. For example, when the chunking unit is a word, two documents may share a lot of words although they are unrelated documents. In this case, we say that the system has false positives: the system reports that two documents are similar although in fact they are unrelated.

Similarly, the value of the minimum-match-length (a parameter of the GST method) is very important. If the value is too large then the set of tiles is empty, but if the value is too low then the set of tiles is large even though the pair of documents may not be similar.
3.2 Comparing module
This module compares and computes the degree of similarity between all document pairs in the trained document set, using the sets of chunks or sets of tiles generated by the above module. The process of this module is divided into two steps as follows:

- Step 1: After all documents are parsed into lists of chunks (or sets of tiles), each document is compared with the remaining documents to compute the similarity measures between them. For example, if the trained document set contains N documents $D_1, D_2, \ldots, D_N$, then document $D_1$ is compared with the N-1 remaining documents $D_2, D_3, \ldots, D_N$, document $D_2$ is compared with the N-2 documents $D_3, D_4, \ldots, D_N$, and so on. This means that for each comparison method with each particular parameter setting there are (N*(N-1))/2 compared pairs, as sketched below.

- Step 2: After the similarity measures of all document pairs are computed and combined with the List of similarity document pairs (input data), the F-measure values (discussed in the next paragraphs) are computed by this module.
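The pairwise comparison of Step 1 can be sketched as follows (illustrative only; similarity() stands for any of the three methods):

    from itertools import combinations

    def compare_all(documents, similarity):
        # Compare every pair among N documents: N*(N-1)/2 comparisons in total.
        scores = {}
        for (i, doc_i), (j, doc_j) in combinations(enumerate(documents), 2):
            scores[(i, j)] = similarity(doc_i, doc_j)
        return scores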
To evaluate the effectiveness of each experimental comparison method, we use the recall and precision metrics as follows:

Recall metric: the recall metric measures the ability to retrieve a piece of information from the candidate information. It is defined as the percentage of documents identified as plagiarized with respect to the actual total number of plagiarized documents.

Precision metric: the precision metric represents the ability to retrieve a piece of information correctly. Here, it is defined as the percentage of correctly identified plagiarized documents with respect to the total number of documents reported as plagiarized. Denoting by P the set of document pairs reported as plagiarized and by R the set of actually plagiarized document pairs, precision and recall are computed as follows:

$$\mathrm{Precision} = \frac{|P \cap R|}{|P|} \qquad \mathrm{Recall} = \frac{|P \cap R|}{|R|}$$

We use the F-measure, defined as follows:

$$F\text{-}measure = \frac{2 \times (\mathrm{Precision} \times \mathrm{Recall})}{(\mathrm{Precision} + \mathrm{Recall})}$$
Given the similarity measures introduced above, there is a question of how to choose the threshold. The threshold is the value such that if the similarity measure of a document pair is larger than this value, we say that the two documents are similar. Choosing the threshold is very important because it determines the effectiveness of the system; with each kind of data or different domain, the threshold value may be different. Thus, for each method with its parameters we can find the best F-measure value by changing the threshold value from 0 to 1 in steps of 0.01.

For example, with a threshold value of 0.5, suppose there are n document pairs whose similarity measure values are larger than or equal to 0.5, while in fact there are m similar document pairs (m is the number of similar document pairs in the List of similarity documents). Supposing the n and m document pairs share k document pairs, the F-measure value is computed as follows:

$$\mathrm{Precision} = \frac{k}{n} \qquad \mathrm{Recall} = \frac{k}{m}$$

$$F\text{-}measure = \frac{2 \times (\mathrm{Precision} \times \mathrm{Recall})}{(\mathrm{Precision} + \mathrm{Recall})}$$
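The threshold search can be sketched as follows (assuming scores maps document pairs to similarity values and gold is the set of pairs from the List of similarity documents; the names are ours):

    def best_threshold(scores, gold):
        # Sweep the threshold from 0 to 1 in steps of 0.01, keeping the best F-measure.
        best_f, best_t = 0.0, 0.0
        for step in range(101):
            t = step / 100.0
            predicted = {pair for pair, s in scores.items() if s >= t}
            k = len(predicted & gold)
            if k == 0:
                continue
            precision, recall = k / len(predicted), k / len(gold)
            f = 2 * precision * recall / (precision + recall)
            if f > best_f:
                best_f, best_t = f, t
        return best_f, best_t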
FIGURE 3.2: Resulting showing module (relation between F-measure and threshold)
3.3 Resulting showing module
The outputs of the Comparing module are the F-measure values and threshold values of each method with its different parameters. This module shows these values as a graph. To visualize the results we use ZedGraph for creating 2D line and bar graphs of arbitrary datasets. ZedGraph is an open-source C# graph-plotting library distributed under the GNU Lesser General Public License. Figure 3.2 is an example of this module: the graph shows the results of three methods with their parameters. From this graph, users can easily choose the most effective method and the most efficient parameters for the domain. All of the F-measure values and threshold values computed by the above module are shown in the graph, so users can choose the method which has the highest F-measure value.
Chapter 4
Experimental results
In this chapter, our framework is tested and evaluated. We present and discuss several experiments to study the effectiveness of the framework. To illustrate that our framework can automatically identify which method is the most effective in a new domain, we use three different corpora in our experiments. Different data sets have different definitions of what plagiarism is, and this definition is implicitly encoded in the corpus. The first corpus consists of Vietnamese documents. We also try our framework on English corpora to support our conjecture that different domains may need different methods and corresponding parameters. The second corpus is a subset of the PAN corpus¹, and the third corpus is the corpus of Paul Clough². We describe the corpora in the following sections. For each experiment, the collection of data and the result of each compared method are displayed and discussed, and the purpose of the experiment is introduced. As mentioned above, we use three comparison methods (Cosine, Overlap and GST) in our framework, and in all of our experiments we use all of the methods with several different parameters.
¹ http://www.uni-weimar.de/cms-medien/webis/research/corpora/pan-pc-09.html
² http://ir.shef.ac.uk/cloughie/resources/corpus-final09.zip
4.1 Experiment with Vietnamese corpus
4.1.1 Data collection
In this experiment, we use over 800 netnews articles as the testing document set. These netnews articles were published on some popular Vietnamese websites, such as vnexpress.net, dantri.com.vn, laodong.com.vn, tienphong.vn, tuoitre.vn, hanoimoi.com.vn, etc., over 14 consecutive days. During this period, a large number of netnews articles on one website were copied from or overlapped with those on other websites. Therefore, the chosen document set is a good test for the system.

The document set is classified into five groups: economics, sports, law, medicine and mixed netnews. For each netnews group, we would like to test which method is the most effective.
4.1.2 Objective
All documents in this corpus are netnews articles collected from several websites on consecutive days, where some document pairs share a lot of similar words even though they are different documents. As mentioned above, the documents are classified into five netnews groups. Except for the mixed netnews group, within each remaining netnews group a document pair may share a lot of words although the documents are not related. The documents in the mixed netnews group belong to many different fields, for example education, culture, etc., so a pair of documents in this group may share very few similar words.

Given these characteristics of each netnews group, the corpus is a good device to test the capability of our framework to automatically identify which method is the most effective in a new domain. Our other purpose is to test the effect of the order of syllables and of word segmentation. In this corpus, two documents are considered plagiarized when one copies content from the other about the same event or problem; the creation of the list of plagiarized documents was done manually.
4.1.3 Implementation
In this experiment, for the Overlap and Cosine methods we used four kinds of chunks: one-syllable chunks (1-gram), two-successive-syllable chunks (2-gram), three-successive-syllable chunks (3-gram), and word chunks. We used the vnTokenizer tool for word segmentation; the vnTokenizer tool is distributed under the GNU General Public License.

For each kind of chunk, we distinguish two cases: using and not using the frequency of a chunk in a document. For the GST method, the MML values are 1, 2 and 3. In each netnews group, with the Overlap and Cosine methods, the number of comparison runs for each document pair is 16:

(number of methods: Overlap and Cosine) * (kinds of chunks) * (frequency or no frequency) = 2 * 4 * 2 = 16.

In each comparison run, we compare each document with all of the remaining documents in the netnews group to get the similarity measures, as sketched below.
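The full set of configurations tested per group can be enumerated as in this sketch (the names are illustrative):

    from itertools import product

    methods = ["Overlap", "Cosine"]
    chunk_kinds = ["1-gram", "2-gram", "3-gram", "word"]
    use_frequency = [True, False]

    # 2 methods * 4 chunk kinds * 2 frequency settings = 16 configurations,
    # plus the GST method with MML values 1, 2 and 3.
    configs = list(product(methods, chunk_kinds, use_frequency))
    configs += [("GST", "MML=%d" % mml, True) for mml in (1, 2, 3)]
    for method, chunk, freq in configs:
        print(method, chunk, freq)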
4.1.4 Result
Table 4.1 shows the most effective methods and their best parameters in each group. In this table, we find that in the law netnews group the most effective method is the GST method with MML = 2, while in the remaining netnews groups the most effective method is the Cosine method with 2-gram chunks. From Table 4.1, the average F-measure value is 88,76%, and the average precision and recall values are 90,2% and 87,4%. We find that the best chunk is the 2-gram (or MML = 2) for all three methods in all five netnews groups. In the next paragraphs, we present and discuss the results of all methods in each netnews group.
Order | Group     | Method | Chunk  | F     | P     | R     | Thres
1     | Economics | Cosine | 2-gram | 83,8% | 84,1% | 83,5% | 0.24
2     | Law       | GST    | MML=2  | 97,0% | 97,6% | 96,5% | 0.26
3     | Sport     | Cosine | 2-gram | 81,7% | 86,8% | 77,1% | 0.20
4     | Medicine  | Cosine | 2-gram | 91,7% | 93,3% | 90,1% | 0.20
5     | Mixed     | Cosine | 2-gram | 89,6% | 89,3% | 89,9% | 0.22
Table 4.1: The most effective method in each group

+ The detailed results of all methods in the economics netnews group

In Table 4.2, we present the results of the comparing methods with different parameters in the economics netnews group. From this table, we find that the best F-measure value is 83,8%, obtained with the Cosine method with 2-gram chunks, using frequency, and a threshold value of 0.24. The worst F-measure value is 47,7%, obtained with the GST method with MML = 1. In this netnews group, the Cosine method is more effective than the Overlap and GST methods. The top three results include two results of the Cosine method with different parameters; the remaining result is that of the GST method with MML = 3. In this table, the best result of the Overlap method is in fifth position. With both the Cosine and Overlap methods, the 2-gram chunk is the best chunk, but with the GST method the F-measure value is largest when the MML value is three. All three methods are ineffective with the 1-gram chunk or an MML value of one. Figure 4.1 presents the F-measure values and the best parameters of each method. In this netnews group, the most effective parameters of the Cosine and Overlap methods are the 2-gram chunk with frequency, while for the GST method the MML value is 3. The Cosine method gives its highest F-measure value at the threshold 0.24, and for the Overlap method this threshold is 0.14. The best threshold of the GST method is very low, at 0.09.
+ The detailed results of all methods in the law netnews group
Order | Method  | Chunk  | Freq? | F     | P     | R     | Thres
1     | Cosine  | 2-gram | 1     | 83,8% | 84,1% | 83,5% | 0.24
2     | GST     | MML=3  | 1     | 80,3% | 74,5% | 87,0% | 0.09
3     | Cosine  | 2-gram | 0     | 80,2% | 84,3% | 76,5% | 0.11
4     | Cosine  | 1-gram | 1     | 80,0% | 76,4% | 84,0% | 0.56
5     | Overlap | 2-gram | 1     | 78,1% | 70,4% | 87,7% | 0.14
6     | Overlap | 3-gram | 1     | 76,8% | 76,0% | 77,6% | 0.05
7     | Cosine  | 3-gram | 0     | 76,4% | 80,6% | 72,6% | 0.04
8     | GST     | MML=2  | 1     | 76,0% | 83,0% | 70,2% | 0.26
9     | Cosine  | word   | 1     | 75,3% | 75,8% | 74,7% | 0.52
10    | Cosine  | 3-gram | 1     | 74,9% | 75,0% | 74,7% | 0.06
11    | Overlap | 3-gram | 0     | 74,8% | 76,8% | 73,0% | 0.05
12    | Overlap | 2-gram | 0     | 74,2% | 68,6% | 80,8% | 0.13
13    | Cosine  | word   | 0     | 72,7% | 67,3% | 79,2% | 0.64
14    | Overlap | word   | 1     | 66,8% | 62,3% | 71,9% | 0.49
15    | Cosine  | 1-gram | 0     | 64,4% | 74,6% | 56,6% | 0.44
16    | Overlap | word   | 0     | 64,3% | 60,5% | 68,7% | 0.67
17    | Overlap | 1-gram | 1     | 63,6% | 58,7% | 69,4% | 0.52
18    | Overlap | 1-gram | 0     | 48,4% | 43,1% | 55,2% | 0.54
19    | GST     | MML=1  | 1     | 47,7% | 53,5% | 43,1% | 0.58
Table 4.2: Results of comparing methods in the economics netnews group

In Table 4.3 we present the results of the comparing methods in the law netnews group. The best F-measure is 97,2%, obtained with the GST method with an MML value of two. The worst F-measure is 69,6%, obtained with the Overlap method with the word chunk. The best F-measure values of the Overlap and Cosine methods are approximately equal (94,2% and 94,9%). The top three results include two results of the GST method with different parameters. Clearly, in this netnews group, the GST method is more effective than the Cosine and Overlap methods. Figure 4.2 presents the best parameters of each method in the law netnews group.
FIGURE 4.1: The best parameters of each method in economics netnews group
Order | Method  | Chunk  | Freq? | F     | P     | R     | Thres
1     | GST     | MML=2  | 1     | 97,2% | 97,6% | 96,8% | 0.26
2     | GST     | MML=3  | 1     | 95,7% | 95,9% | 95,6% | 0.12
3     | Cosine  | 2-gram | 1     | 94,9% | 96,6% | 93,2% | 0.20
4     | Overlap | 2-gram | 1     | 94,2% | 95,2% | 93,2% | 0.15
5     | Cosine  | 1-gram | 1     | 93,0% | 94,3% | 91,8% | 0.50
6     | Cosine  | 3-gram | 1     | 92,5% | 92,9% | 92,1% | 0.07
7     | Cosine  | 2-gram | 0     | 91,5% | 90,8% | 92,4% | 0.10
8     | Overlap | 2-gram | 0     | 90,5% | 93,1% | 87,9% | 0.14
9     | Overlap | 3-gram | 1     | 90,3% | 93,1% | 87,6% | 0.06
10    | Cosine  | 1-gram | 0     | 89,6% | 93,6% | 85,9% | 0.43
11    | Cosine  | word   | 0     | 89,2% | 92,7% | 85,9% | 0.35
12    | Overlap | 3-gram | 0     | 87,3% | 91,9% | 83,2% | 0.06
13    | Cosine  | 3-gram | 0     | 85,6% | 90,5% | 81,2% | 0.05
14    | Cosine  | word   | 1     | 85,5% | 89,1% | 82,1% | 0.47
15    | GST     | MML=1  | 1     | 83,8% | 81,6% | 86,2% | 0.57
16    | Overlap | 1-gram | 1     | 76,4% | 73,9% | 79,1% | 0.51
17    | Overlap | word   | 1     | 71,3% | 89,0% | 59,4% | 0.60
18    | Overlap | 1-gram | 0     | 70,1% | 67,0% | 73,5% | 0.54
19    | Overlap | word   | 0     | 69,6% | 78,3% | 62,6% | 0.48
Table 4.3: Results of comparing methods in the law netnews group
Both the Cosine method and the Overlap method are most effective with the 2-gram chunk, using frequency. The GST method has its highest F-measure value when the MML value is two. We find that all of the best F-measure values of the three methods are very high in this netnews group, and the best F-measure value of the Cosine method is higher than that of the Overlap method. In this netnews group, we find that both the Cosine and Overlap methods are not effective with the word chunk.

FIGURE 4.3: The best parameters of each method in sport netnews group
+ The detailed results of all methods in the sport netnews group
In Table 4.4, we present the results of the three comparing methods with different parameters in the sport netnews group. In this group, the best F-measure is 81,7%, obtained with the Cosine method with 2-gram chunks, using frequency and a threshold of 0.20. The worst F-measure value is 40,3%, obtained with the Overlap method with 1-gram chunks and no frequency. Table 4.4 shows that the Cosine method is more effective than the GST and Overlap methods; the top three results include two results of the Cosine method. In this netnews group, the Overlap method is not effective: its best F-measure value is only 71,4%, compared with 81,7% for the Cosine method and 79,4% for the GST method. Figure 4.3 presents the best parameters of each method in this netnews group.
Order | Method  | Chunk  | Freq? | F     | P     | R     | Thres
1     | Cosine  | 2-gram | 1     | 81,7% | 86,8% | 77,1% | 0.20
2     | Cosine  | 1-gram | 1     | 80,7% | 83,3% | 78,2% | 0.58
3     | GST     | MML=2  | 1     | 79,4% | 71,4% | 89,4% | 0.23
4     | Cosine  | 2-gram | 0     | 77,4% | 74,6% | 80,4% | 0.10
5     | GST     | MML=3  | 1     | 77,0% | 68,8% | 87,7% | 0.09
6     | Cosine  | 3-gram | 1     | 75,7% | 73,3% | 78,2% | 0.05
7     | Cosine  | word   | 0     | 74,3% | 74,3% | 74,3% | 0.38
8     | Cosine  | 1-gram | 0     | 72,5% | 70,7% | 74,3% | 0.43
9     | Overlap | 2-gram | 1     | 71,4% | 73,1% | 69,8% | 0.15
10    | Cosine  | word   | 1     | 70,1% | 68,8% | 71,5% | 0.56
11    | Cosine  | 3-gram | 0     | 69,5% | 77,4% | 63,1% | 0.04
12    | Overlap | 3-gram | 0     | 67,9% | 65,1% | 70,9% | 0.05
13    | Overlap | 2-gram | 0     | 67,3% | 74,3% | 61,5% | 0.15
14    | GST     | MML=1  | 1     | 66,9% | 67,8% | 65,9% | 0.60
15    | Overlap | 3-gram | 1     | 66,7% | 66,3% | 67,0% | 0.05
16    | Overlap | 1-gram | 1     | 46,6% | 56,3% | 39,7% | 0.58
17    | Overlap | word   | 1     | 45,3% | 63,6% | 35,2% | 0.56
18    | Overlap | word   | 0     | 43,9% | 48,3% | 40,2% | 0.51
19    | Overlap | 1-gram | 0     | 40,3% | 37,5% | 43,6% | 0.56
Table 4.4: Results of comparing methods in the sport netnews group

The best F-measure value of the Overlap method is much lower than the values of the Cosine and GST methods.
The Overlap method does not work well in this group.
+ The detailed results of all methods in the medicine netnews group
In Table 4.5, we present the results of the comparative methods with different parameters in the medicine netnews group. Figure 4.4 presents the relationship between the F-measure values and the thresholds of each method in the best case of this group, including the best F-measure value of the Overlap method.

Order | Method  | Chunk  | Freq? | F     | P     | R     | Thres
1     | Cosine  | 2-gram | 1     | 91,7% | 93,3% | 90,1% | 0.20
2     | GST     | MML=2  | 1     | 83,6% | 89,5% | 78,4% | 0.20
3     | GST     | MML=3  | 1     | 81,8% | 84,3% | 79,4% | 0.07
4     | Cosine  | 3-gram | 1     | 79,2% | 74,2% | 85,0% | 0.05
5     | Overlap | 2-gram | 1     | 78,7% | 77,8% | 79,7% | 0.14
6     | Cosine  | 1-gram | 1     | 77,2% | 73,0% | 82,0% | 0.52
7     | Cosine  | 2-gram | 0     | 76,7% | 71,6% | 82,7% | 0.10
8     | Overlap | 3-gram | 1     | 75,8% | 75,9% | 75,6% | 0.05
9     | Cosine  | 3-gram | 0     | 75,1% | 72,6% | 77,7% | 0.04
10    | Overlap | 3-gram | 0     | 74,5% | 71,7% | 77,5% | 0.05
11    | Overlap | 2-gram | 0     | 73,8% | 72,9% | 74,7% | 0.13
12    | Cosine  | word   | 1     | 72,2% | 75,2% | 69,4% | 0.53
13    | GST     | MML=1  | 1     | 68,7% | 63,4% | 74,9% | 0.36
14    | Overlap | word   | 1     | 68,5% | 66,3% | 70,9% | 0.42
15    | Cosine  | 1-gram | 0     | 68,3% | 66,9% | 69,8% | 0.42
16    | Overlap | 1-gram | 1     | 65,3% | 60,7% | 70,7% | 0.47
17    | Cosine  | word   | 0     | 64,4% | 57,2% | 73,7% | 0.34
18    | Overlap | word   | 0     | 59,2% | 57,6% | 60,8% | 0.42
19    | Overlap | 1-gram | 0     | 58,3% | 60,1% | 56,5% | 0.51

Table 4.5: Results of comparing methods in the medicine netnews group
FIGURE 4.4: The best parameters of each method in medicine netnews group
Both the Cosine method and the Overlap method are most effective when we use the 2-gram chunk with frequency; the GST method is most effective with an MML value of two. Table 4.5 shows the detailed results of the three methods with their different parameters. The highest F-measure value is 91,7% and the lowest F-measure value is 58,3%. In this medicine netnews group, the most effective method is the Cosine method with 2-gram chunks, using frequency, with a threshold value of 0.20. The GST method is more effective than the Overlap method. The F-measure values of the Overlap method show that this method does not suit this netnews group; its best F-measure value is 78,7%, with 2-gram chunks and frequency.
In this netnews group, although the top three results include two results of the GST method (with MML = 2 and MML = 3) while the remaining one belongs to the Cosine method, the Cosine method is still the most effective method.
+ The detailed results of all methods in the mixed netnews group

The documents in this group belong to many different fields, for example sports, law, education, and culture. In Table 4.6, we present the results of the comparing methods in the mixed netnews group.

Order | Method  | Chunk  | Freq? | F     | P     | R     | Thres
1     | Cosine  | 2-gram | 1     | 89,6% | 89,3% | 89,9% | 0.22
2     | Cosine  | 1-gram | 1     | 86,4% | 90,3% | 82,8% | 0.56
3     | GST     | MML=2  | 1     | 85,7% | 90,2% | 81,6% | 0.18
4     | GST     | MML=3  | 1     | 82,6% | 79,4% | 86,0% | 0.10
5     | Cosine  | 2-gram | 0     | 82,4% | 82,7% | 82,2% | 0.11
6     | GST     | MML=1  | 1     | 81,4% | 83,3% | 79,6% | 0.38
7     | Overlap | 2-gram | 1     | 81,3% | 77,6% | 85,4% | 0.15
8     | Cosine  | 3-gram | 1     | 79,7% | 80,5% | 79,0% | 0.07
9     | Overlap | 2-gram | 0     | 77,4% | 83,7% | 72,0% | 0.15
10    | Cosine  | 3-gram | 0     | 76,7% | 73,0% | 80,9% | 0.04
11    | Cosine  | word   | 1     | 76,7% | 84,6% | 70,1% | 0.55
12    | Cosine  | 1-gram | 0     | 76,4% | 79,9% | 73,2% | 0.46
13    | Overlap | 3-gram | 0     | 74,8% | 80,3% | 70,1% | 0.06
14    | Overlap | 3-gram | 1     | 74,8% | 70,0% | 80,3% | 0.05
15    | Cosine  | word   | 0     | 74,5% | 71,7% | 77,5% | 0.50
16    | Overlap | word   | 1     | 69,2% | 76,7% | 63,1% | 0.48
17    | Overlap | word   | 0     | 61,1% | 63,9% | 58,6% | 0.47
18    | Overlap | 1-gram | 1     | 55,8% | 53,2% | 58,6% | 0.55
19    | Overlap | 1-gram | 0     | 45,6% | 61,3% | 36,3% | 0.62
Table 4.6: Results of comparing methods in the mixed netnews group

In this mixed netnews group, as in several groups above, the most effective method is the Cosine method with 2-gram chunks, using frequency, with a threshold value of 0.22. The top three results include two results of the Cosine method and one remaining result of the GST method. The worst result is that of the Overlap method with 1-gram chunks and no frequency.
FIGURE 4.5: The best parameters of each method in mixed netnews group
Table 4.6 shows that the results of the Overlap method in this group are not bad. In the three cases of the GST method, the results do not differ much from each other. From Table 4.6, using the word chunk is not effective in this group, and the Cosine method is the most effective method because it has the top two results in this table and these values are quite high.
Figure 4.5 presents the relationship between the F-measure values and the thresholds of each method in the best case of the mixed netnews group. Both the Cosine method and