UNIVERSITY OF ENGINEERING AND TECHNOLOGY
DO THUY DUONG
RESEARCH AND APPLY EVOLUTIONARY COMPUTATION TECHNIQUES ON
AUTOMATIC TEXT SUMMARIZATION
MASTER THESIS IN INFORMATION TECHNOLOGY
Field: Information technology
Major: Software Engineering
Code: 60480103
MASTER THESIS IN INFORMATION TECHNOLOGY
Supervisor: Assoc. Prof. Nguyen Xuan Hoai
Declaration of authorship
I, Do Thuy Duong, declare that this thesis, 'Research and apply evolutionary computation techniques on automatic text summarization', and the work presented in it are my own.
I confirm that:
This work was done wholly or mainly while in candidature for a research degree at this University;
Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated;
Where I have consulted the published work of others, this is always clearly attributed;
I have acknowledged all main sources of help;
Acknowledgements
I am heartily thankful to my supervisor, Assoc. Prof. Nguyen Xuan Hoai, whose encouragement, guidance and support from the initial to the final level have enabled me to develop an understanding of the topic.
I would like to show my gratitude to the teachers at the University of Engineering and Technology, Vietnam National University, Hanoi, for helping me gain a large body of knowledge during my two years of study.
Contents

Declaration of authorship
Acknowledgements
Contents
List of figures
List of tables
1 Introduction
1.1 Motivation
1.2 Research Objectives
1.3 Thesis overview
2 Background knowledge
2.1 Automatic text summarization
2.1.1 Definition
2.1.2 Types of text summarization
2.1.3 Methodologies for automatic text summarization
2.2 Evolutionary computation
2.3 Differential evolution (DE)
2.4 Conclusion
3 Automatic text summarization using differential evolution algorithm
List of figures

Figure 2.1 A typical summarization system
Figure 2.2 A summarizer highlights all sentences included in an extractive summary
Figure 2.3 An example of the abstract summary
Figure 2.4 Multi-document summarization
Figure 2.5 The general scheme of an Evolutionary Algorithm in pseudo-code
Figure 2.6 General scheme of evolutionary algorithms
Figure 2.7 Correlation between number of generations and best fitness in population
Figure 2.8 Steps of differential evolution algorithm
Figure 2.9 Steps to get the next X1 (generation 1)
Figure 3.1 Illustration of mutation operation
Figure 3.2 Illustration of crossover operation
Figure 3.3 Changes in summary length in [DE] method on DUC2004
Figure 3.4 Changes in summary length in [DE] method on DUC2007
Figure 3.5 Summary length in [MultiDE] method on DUC2004
List of tables

Table 2.1 The basic evolutionary computation linking natural evolution to problem solving
Table 2.2 Fitness of six individuals at generation 0
Table 2.3 Creation of mutant vector V1
Table 2.4 Creation of trial vector Z1
Table 2.5 Values of X1 in generation 1
Table 3.1 Description of the datasets used in the experiment
Table 3.2 Parameter settings of the first experiment
Table 3.3 Summary lengths of some document collections in DUC2004 using [DE] method
Table 3.4 Summary lengths of some document collections in DUC2007 using [DE] method
Table 3.5 F-values of three evaluation measures of method [DE] on DUC2004 and DUC2007
Table 3.6 Parameter settings of the second experiment
Table 3.7 Summary lengths of some document collections in DUC2004 using [MultiDE] method
Table 3.8 Summary lengths of some document collections in DUC2007 using [MultiDE] method
Table 3.9 F-values of three evaluation measures of method [MultiDE] on DUC2004 and DUC2007
Chapter 1
Introduction
Automatic text summarization means detecting the important, condensed content of one or more documents. This is a very challenging problem, related to many scientific areas such as artificial intelligence, statistics and linguistics. Much research has been conducted worldwide since the 1950s, producing systems such as SUMMARIST, SweSUM, MEAD, SUMMONS, etc. However, this research area is still challenging and attracts more and more attention.

In this thesis, we study some evolutionary computation techniques and then apply the differential evolution algorithm to a practical problem, automatic text summarization, in particular multi-document summarization. Moreover, we attempt to deal with the constraint on summary length, which has not been handled effectively in these stochastic population-based methods.
1.1 Motivation
Evolutionary computation techniques use different algorithms to evolve a population of individuals over a number of generations. Operators such as mutation, crossover and selection are applied to this population to produce new offspring, which then compete with each other and with the previous generation to survive, based on some evaluation function. The process ends when a stopping criterion is reached and we have found the best individual: the best solution to our real-world problem.
1.2 Research Objectives

This thesis aims to study evolutionary computation techniques, especially the differential evolution algorithm, and its application to the problem of automatic text summarization. We identify the limitations in the ways other researchers handle the summary length in this algorithm, and then propose a new method that manages this length constraint to satisfy users' demands while still keeping the quality of the summary.
1.3 Thesis overview
The rest of this thesis is organized as follows. In Chapter 2, we review the background knowledge of text summarization and its classification, and introduce the main principles of evolutionary computation. In particular, the differential evolution algorithm is discussed.

Chapter 3 explains in detail the above algorithm applied to automatic text summarization, in our case to multi-document collections. An experiment is then performed to test the original differential evolution algorithm. Finally, we improve on the result of that experiment by dealing with the summary length, so that the document collection is compressed quickly and effectively.
Chapter 2
Background knowledge
In this chapter, text summarization is reviewed before we introduce and classify evolutionary computation. Then an evolutionary algorithm named differential evolution is discussed in detail.

2.1 Automatic text summarization

2.1.1 Definition
Automatic text summarization is the generation of a shorter version of a text by a computer program that still keeps the most important points of the original text [6].
The aim of automated text summarization is to take a source text, extract the most significant content from it, and present it in a condensed form and in a way sensitive to the user’s or application’s needs
[Figure: block diagram of a summarization system: document → preprocessing → document representation → summary representation → summary generation]
Figure 2.1 A typical summarization system
2.1.2 Types of text summarization
There are some ways to classify approaches to automatic text summarization as follows: [16]
- Content: Extract vs. abstract: an extractive summary is built from sentences taken verbatim from the source text (Figure 2.2), while an abstract summary restates the main content in new words (Figure 2.3).
[Figure: screenshot of the Gnome-Summarizer tool showing a news article about Syrian and Iranian support for terrorist groups in Lebanon, with the sentences selected for the extractive summary highlighted]
Figure 2.2 A summarizer highlights all sentences included in an extractive summary
[Figure: first page of the paper 'Evaluation Measures for Text Summarization' by Josef Steinberger and Karel Jezek, whose abstract is an example of an abstract summary]
Figure 2.3 An example of the abstract summary
- Audience:
• Generic: A generic summary reflects the author's point of view of the source text, paying the same attention to every aspect of the text.
• Query-oriented: A query-oriented (or user-oriented) summary prefers particular aspects of the text, depending on the aspects a user desires to learn about.
- Usage:
• Indicative: An indicative summary only indicates the main subject matter or domain of the input text without including its contents. After reading an indicative summary, one can explain what the input text was about, but not necessarily what was contained in it.
• Informative: An informative summary covers (some of) the content, and allows one to describe (parts of) what was in the input text.
• Background: Assumes readers do not have prior knowledge of the source text's topic.
• Just-the-news: Assumes the reader's prior knowledge is up to date.
- Monolingual vs. cross-lingual: Summarizes in the same language only, vs. summarizes and also translates into another language.
- Single-document vs. multi-document source: Summarizes only one source text, vs. fuses together many source texts. Figure 2.4 demonstrates a multi-document summarizer, which condenses five documents into a single summary.

[Figure: diagram of several source documents being merged into one summary]

Figure 2.4 Multi-document summarization

In this thesis, we intend to generate extractive summaries for multi-document collections. Summarizing a single text is challenging enough; summarizing a document collection poses even more difficulties: we have to avoid repetition and manage potential inconsistencies among the documents, while still covering all the essential information of the original texts.
2.1.3 Methodologies for automatic text summarization
Up to now, there have been many methods applied to summarize text automatically including [21]:
- Traditional methods: term, word, phrase frequencies
- Corpus-based approaches: combination of statistical features, learning to extract
- Discourse structures: WordNet, rhetorical analysis
- Knowledge rich approaches: different for particular domains
2.2 Evolutionary computation
In computer science, evolutionary computation is a subfield of artificial intelligence comprising several types of evolutionary algorithms based on Darwinian principles. These algorithms belong to the family of trial-and-error problem solvers and can be regarded as global optimization methods of a meta-heuristic or stochastic character, built around a population of candidate solutions [1].

Evolutionary computation repeatedly advances this population, using guided random search and selection, until a stopping condition is reached.
Automated problem solving using Darwinian principles started in the 1950s. However, three different interpretations of this idea began to be implemented in the 1960s, in three strands.

Evolutionary programming (EP) was invented by Lawrence J. Fogel in the US, while John Henry Holland suggested a method named the genetic algorithm (GA). Ingo Rechenberg and Hans-Paul Schwefel introduced evolution strategies (ES). Although these algorithms were proposed quite early, they have only been regarded as variants of a single technology, known as evolutionary computation, since the early nineties [1].
This concept is based on natural evolution. In nature, the plants and animals that exist and have adapted to the changing environment so far are the fittest, the ones not eliminated by natural selection. Individuals in a population act as parents, producing new offspring through mutation and crossover. These new children have to compete with the others, including their parents, to survive into the next generation. Overall, mutation and crossover diversify the properties of the offspring, while natural selection increases the quality (fitness) of the population [2]. Table 2.1 below shows the equivalent concepts between natural evolution and problem solving.

Evolution   | Problem solving
Environment | Problem
Individual  | Candidate solution
Fitness     | Quality

Table 2.1 The basic evolutionary computation linking natural evolution to problem solving

Figure 2.5 and Figure 2.6 illustrate the typical pseudo-code and scheme of evolutionary algorithms [3].

BEGIN
INITIALISE population with random candidate solutions;
EVALUATE each candidate;
REPEAT UNTIL ( TERMINATION CONDITION is satisfied ) DO
1 SELECT parents;
2 RECOMBINE pairs of parents;
3 MUTATE the resulting offspring;
4 EVALUATE new candidates;
5 SELECT individuals for the next generation;
OD
END

Figure 2.5 The general scheme of an Evolutionary Algorithm in pseudo-code
[Figure: cycle of Initialisation → Population → Parent selection → Parents → Recombination → Mutation → Offspring → Survivor selection, with Termination as the exit]
Figure 2.6 General scheme of evolutionary algorithms
Evolutionary computation consists of algorithms used to search for optimal solutions to a problem.

Figure 2.6 illustrates how a typical population is transformed until termination. An evolutionary algorithm starts by initializing a number of individuals to form a population. Each individual is evaluated by a fitness function, which varies between algorithm types and specific problems. Some or all of these individuals are chosen as parents and undergo reproduction operators to produce new children. The fitness values of those offspring are then calculated; in other words, the offspring's quality is assessed. The better individuals among the parents and children are chosen as members of the next generation. The process is repeated until the best individual is found, according to a certain stopping criterion.
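To make this generic loop concrete, the following minimal Python sketch implements the scheme of Figure 2.5 for a toy maximization problem. The fitness function, selection scheme and parameter values here are illustrative assumptions, not details taken from the thesis.

import random

def evolve(fitness, n_genes=3, pop_size=6, generations=100):
    # INITIALISE population with random candidate solutions
    population = [[random.random() for _ in range(n_genes)]
                  for _ in range(pop_size)]
    # REPEAT UNTIL the termination condition (here: a generation budget)
    for _ in range(generations):
        # SELECT parents: two random individuals
        parent1, parent2 = random.sample(population, 2)
        # RECOMBINE the pair of parents (one-point crossover)
        cut = random.randrange(1, n_genes)
        child = parent1[:cut] + parent2[cut:]
        # MUTATE the resulting offspring
        child[random.randrange(n_genes)] += random.gauss(0.0, 0.1)
        # EVALUATE the new candidate and SELECT survivors:
        # the child replaces the worst individual if it is better
        worst = min(range(pop_size), key=lambda j: fitness(population[j]))
        if fitness(child) > fitness(population[worst]):
            population[worst] = child
    return max(population, key=fitness)

# Example: maximize f(X) = x1 + x2 + x3
print(evolve(fitness=sum))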
Trang 19Cc OM © 2 Qo oO 2 j nd c Progress in 2" half œ) 0œ) ®œ = _ Progress in 1°! half œ) q) m >
Time (number of generations)
Figure 2.7 Correlation between number of generations and best fitness in population
The mechanisms that decide how children are created, and how survivors are chosen from among parents and children, vary between specific evolutionary algorithms. The following sections explain in detail a technique applied to a real-world problem, automatic text summarization; in this thesis, we deal with extractive multi-document summarization.

Typical evolutionary algorithms include differential evolution, genetic algorithms, genetic programming, evolutionary programming, etc. In this research, we focus on the first of these.
2.3 Differential evolution (DE)
[Figure: flowchart Initialization → Evaluation → Mutation → Crossover → Selection → End]

Figure 2.8 Steps of differential evolution algorithm

A candidate solution is encoded as a binary vector P, where P[i] = 1 if the i-th element is selected and otherwise P[i] = 0, i = {1, 2, ..., n}. Figure 2.8 illustrates the main steps of a typical differential evolution algorithm.
Pseudo-code of this algorithm is given below [15]:

Begin
  Generate randomly an initial population of solutions
  Calculate the fitness of the initial population
  Do
    For each parent
      Select three different solutions at random
      Create one offspring using DE operators (mutation, crossover)
      If offspring is the same or better than its parent (selection)
        Parent is replaced
    End For
  While the stopping condition is not satisfied
End

Example [7]:
The following numerical example demonstrates the DE algorithm. We have the objective/fitness function:

Maximize f(X) = x1 + x2 + x3, in which X = [x1, x2, x3].

Our goal is to find x1, x2, x3. We will follow the steps of the pseudo-code above to solve this problem.
Generate randomly an initial population of solutions:
Overall, we randomly generate six three-dimensional vectors X1, X2, X3, X4, X5 and X6 such that the elements of these vectors are bounded in (0, 1).

Calculate the fitness of the initial population (generation 0):

     | X1   | X2   | X3   | X4   | X5   | X6
x1   | 0.68 | 0.92 | 0.22 | 0.12 | 0.40 | 0.94
x2   | 0.89 | 0.92 | 0.14 | 0.09 | 0.81 | 0.63
x3   | 0.04 | 0.33 | 0.40 | 0.05 | 0.83 | 0.13
f(X) | 1.61 | 2.17 | 0.76 | 0.26 | 2.04 | 1.70

Table 2.2 Fitness of six individuals at generation 0

For each parent:

Now these six individuals are going to produce their own children. First, choose individual 1 as the first target vector (the first parent).

Select three different solutions at random:

We randomly select three different individuals, for example individuals 2, 4 and 6.
Create one offspring using DE operators:
Mutation: Mutant vector V1 = [v1,1, v1,2, v1,3], where

v1,i = x6,i + F x (x2,i - x4,i), i = {1, 2, 3}

     | X2   | X4   | Difference (X2 - X4) | Weighted difference (x F = 0.8) | X6   | Mutant vector (V1)
x1   | 0.92 | 0.12 | 0.80                 | 0.64                            | 0.94 | 1.58
x2   | 0.92 | 0.09 | 0.83                 | 0.66                            | 0.63 | 1.29
x3   | 0.33 | 0.05 | 0.28                 | 0.22                            | 0.13 | 0.35

Table 2.3 Creation of mutant vector V1
The result of the mutation operator is a mutant vector. We say the mutant vector corresponding to target vector X1 = [0.68, 0.89, 0.04] is V1 = [1.58, 1.29, 0.35].

Crossover: The mutant vector V1 does a crossover with the target vector X1 to create the trial vector Z1, as shown in Table 2.4:

z_{p,i} = v_{p,i}, if rand_{p,i} <= CR or i = k; otherwise z_{p,i} = x_{p,i}

k: random number, k in {1, 2, 3}
CR: crossover rate
rand_{p,i}: random number within [0,1], reassigned for each i-th component of the p-th vector

If k = 1, then:

     | Target (X1) | Mutant (V1) | Trial (Z1)
x1   | 0.68        | 1.58        | 1.58
x2   | 0.89        | 1.29        | 0.89
x3   | 0.04        | 0.35        | 0.04

Table 2.4 Creation of trial vector Z1

If the offspring is the same as or better than its parent (selection), the parent is replaced:

Since f(X1) = 1.61 < f(Z1) = 2.51, the trial vector becomes target vector X1 of the next generation (generation 1), as shown in Table 2.5:

     | X1   | X2 | X3 | X4 | X5 | X6
x1   | 1.58 |    |    |    |    |
x2   | 0.89 |    |    |    |    |
x3   | 0.04 |    |    |    |    |
f(X) | 2.51 |    |    |    |    |

Table 2.5 Values of X1 in generation 1
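The numbers in Tables 2.2-2.5 can be checked with a few lines of Python; this is only a verification of the worked example, with the crossover outcome fixed to the values shown above rather than drawn at random.

# Mutation: V1 = X6 + F * (X2 - X4), with F = 0.8
X1 = [0.68, 0.89, 0.04]
X2 = [0.92, 0.92, 0.33]
X4 = [0.12, 0.09, 0.05]
X6 = [0.94, 0.63, 0.13]
F = 0.8
V1 = [x6 + F * (x2 - x4) for x6, x2, x4 in zip(X6, X2, X4)]
print([round(v, 2) for v in V1])          # [1.58, 1.29, 0.35]

# Crossover with k = 1: component 1 comes from the mutant vector; the
# random numbers for i = 2, 3 are assumed to exceed CR, so those
# components stay with the target vector
Z1 = [V1[0], X1[1], X1[2]]

# Selection: f(Z1) = 2.51 >= f(X1) = 1.61, so Z1 becomes the next X1
f = sum
X1_next = Z1 if f(Z1) >= f(X1) else X1
print(round(f(X1_next), 2))               # 2.51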
[Figure: data flow of the example: X2 - X4 is scaled by F and added to X6 to give V1; V1 and X1 are crossed over to give Z1; selection between X1 and Z1 yields the next X1]
Figure 2.9 Steps to get the next X1 (generation 1)

On the whole, the properties we need to care about in DE are:
- Solution representation: a real-valued or binary vector
- Population size (number of individuals when initialized): The population is usually initialized randomly, with values bounded in an interval. The population size is the number of individuals in a generation and is an important parameter to decide. If it is too small, the algorithm converges too fast and the individuals reach only a small part of the search space; if it is too big, resources are wasted and the search is prolonged.
- Objective/fitness function: This function evaluates how good a solution is; therefore, it needs to be built carefully.
- Operators [22]:
o Mutation: The aim is to create mutant vectors that expand the search. A large mutant factor increases diversity, while a small mutant factor means convergence comes rapidly, but with a high probability that the algorithm stops in a local optimum.
o Crossover: The aim is to create trial vectors from mutant vectors. Trial vectors are a mixture of the parent and mutant vectors. The higher the crossover rate (CR), the more likely the trial vectors take properties from the mutant vectors rather than from the parent ones. In other words, CR decides the swap probability between target and trial vectors.
o Selection: This makes sure the next generation at least stays as good as, or gets better than, the previous one.
2.4 Conclusion
Chapter 3
Automatic text summarization using differential evolution algorithm

This chapter reexamines the differential evolution algorithm for the multi-document summarization task suggested in [5] by a group of researchers from the Institute of Information Technology of the National Academy of Sciences of Azerbaijan.

The algorithm can work with float values; nonetheless, we focus on the binary version (binary DE), because we are working with a sentence-selection problem (1 for selection, 0 for elimination).

In automatic text summarization, the task of balancing content coverage against content redundancy is essential, especially in multi-document summarization, where documents are likely to share the same content, leading to a high likelihood of redundancy.

3.1 Automatic text summarization using differential evolution (DE)

3.1.1 Document collection representation
We have a document collection

D = {d_1, d_2, ..., d_|D|},

in which |D| is the number of documents in the collection. Splitting the documents into sentences, the collection can also be written as

D = {s_1, s_2, ..., s_n},

where n is the number of sentences in collection D.

On the other hand, we can take T = {t_1, t_2, ..., t_m} as the collection of distinct terms in D; m is the total number of different terms in collection D.

Then, each sentence s_i is represented as

s_i = [w_i1, w_i2, ..., w_im],

in which w_ik is the weight of term t_k in sentence s_i:

w_ik = f_ik x log(n / n_k)   (1)

where f_ik is the number of occurrences of term t_k in sentence s_i, n_k is the number of sentences t_k appears in, and n is the total number of sentences in document collection D.

From this we can infer that if a term t_k appears many times in a certain sentence s_i but not in many other sentences, the weight w_ik will be high. In particular:
- w_ik is higher if t_k appears many times within a small number of sentences;
- w_ik is lower if t_k appears fewer times in a sentence or occurs in many sentences;
- w_ik is lowest if the term occurs in almost all sentences.
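A small Python sketch of this weighting scheme; the tokenized example sentences are made up for illustration, and a real implementation would add stemming and stop-word removal, which the thesis does not detail here.

import math

def sentence_weights(sentences):
    # sentences: list of token lists; returns one dense weight vector
    # [w_i1, ..., w_im] per sentence, following formula (1)
    n = len(sentences)
    vocab = sorted({t for s in sentences for t in s})   # T = {t_1, ..., t_m}
    n_k = {t: sum(t in s for s in sentences) for t in vocab}
    weights = []
    for s in sentences:
        weights.append([s.count(t) * math.log(n / n_k[t]) for t in vocab])
    return weights

sents = [["summarization", "is", "hard"],
         ["evolution", "is", "stochastic"],
         ["summarization", "by", "evolution"]]
w = sentence_weights(sents)
# "is" appears in 2 of 3 sentences -> low weight; "hard" in 1 of 3 -> higher
print([round(x, 2) for x in w[0]])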
Our goal is an output S ⊆ D: a set of sentences forming a summary.

3.1.2 Objective/fitness function

Sentences that score well individually may seem the natural choice; nevertheless, this does not assure that they will generate the best summary. For instance, a summary that includes a large number of duplicated sentences is absolutely not desirable. Overall, this approach takes care of three aspects of summarization:

- Content coverage: the summary should contain significant sentences covering the main content of the documents.
- Diversity: sentences carrying the same content should not all be in the summary.
- Length: the summary's length should be restricted.

Optimizing all three of these properties at once makes this an instance of a global summarization problem: whether a sentence is included in the summary depends not only on its own properties but also on the properties of every other sentence in the summary [20].
Consequently, the problem of summarization is now formalized as follows:
Find a vector U that maximizes

f(U) = f_cover(U) / f_diver(U)   (2)

in which we need to increase f_cover(U), the content coverage of the summary compared with the original collection, and decrease f_diver(U), the redundancy within the summary.
The fitness function is equivalent to
maximize f(U) = [ sim(O, O^S) x Σ_{i=1..n} sim(O, s_i) u_i ] / [ Σ_{i=1..n-1} Σ_{j=i+1..n} sim(s_i, s_j) u_i u_j ]   (3)

such that Σ_{i=1..n} l_i u_i <= L, u_i in {0, 1}   (4)

where:
O and O^S are the mean vectors of collection D and summary S, respectively. The k-th coordinate O_k of the mean vector O is

O_k = (1/n) Σ_{i=1..n} w_ik, k = 1, ..., m   (5)

and the k-th coordinate of O^S is

O^S_k = (1/n_S) Σ_{s_i in S} w_ik, k = 1, ..., m   (6)

n_S is the number of sentences in the summary S; l_i is the length (in words) of sentence s_i; u_i = 1 means sentence i is chosen to be included in the summary, otherwise u_i = 0.

This means the problem now carries a constraint on the summary length, which must be less than or equal to a specified L.
Radev et al. (2004) [18] affirmed that the centre of the document collection (the mean vector O here) indicates its main content. Accordingly, sim(O, O^S) assesses the importance of the summary, and Σ_{i=1..n} sim(O, s_i) u_i assesses the importance of each sentence in the summary. The denominator of formula (3) sums the similarity of each pair of sentences s_i (i = 1, 2, ..., n-1) and s_j (j = i+1, i+2, ..., n) selected for the summary.
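A sketch of the fitness function (3) in Python, building on the sentence_weights sketch above. Cosine similarity is used for sim(·,·); that choice is an assumption here, since this section does not pin the similarity measure down.

import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def fitness(U, weights):
    # U: binary selection vector u_1..u_n; weights: sentence weight vectors
    n, m = len(weights), len(weights[0])
    O = [sum(s[k] for s in weights) / n for k in range(m)]            # formula (5)
    chosen = [s for s, u in zip(weights, U) if u]
    if not chosen:
        return 0.0
    OS = [sum(s[k] for s in chosen) / len(chosen) for k in range(m)]  # formula (6)
    cover = cos_sim(O, OS) * sum(cos_sim(O, s) * u for s, u in zip(weights, U))
    diver = sum(cos_sim(weights[i], weights[j]) * U[i] * U[j]
                for i in range(n - 1) for j in range(i + 1, n))
    # Guard against a zero denominator (e.g., a single selected sentence)
    return cover / diver if diver else cover                          # formula (3)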
3.1.3 Main steps of differential evolution
This section explains, step by step, the operation of the differential evolution algorithm in solving the problem of automatic text summarization, in particular in maximizing (3) subject to (4).
Step 1: Initialization:
Generate randomly a population of P individuals. Each individual is a real-valued vector

U_p(t) = [u_{p,1}(t), ..., u_{p,n}(t)]

where p = 1, 2, ..., P (population size), n is the number of sentences in the document collection, and t is the generation number.

At generation 0, each element u_{p,i}(0) of individual U_p(0) is initialized as

u_{p,i}(0) = u_i^min + rand_{p,i} x (u_i^max - u_i^min)   (7)

in which rand_{p,i} is a random number within [0,1], reassigned for each element u_{p,i} of vector U_p(0); u_i^min and u_i^max are often set to -5 and 5, respectively.

Because we are working with the problem of text summarization, the solution vectors should be in binary representation.

Step 2: Binarization: Convert the P real-valued vectors to P binary vectors using the formula

u_{p,i}(t) = 1 if rand_{p,i} < sigm(u_{p,i}(t)), and 0 otherwise   (8)

sigm(z) = 1 / (1 + exp(-z))   (9)

where rand_{p,i} is a random number within [0,1], reassigned for each i-th component of the p-th vector.

Step 3: Evaluation: Calculate the fitness value of each of the P individuals in the population.
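Formulas (8) and (9) translate directly into Python; a minimal sketch:

import math
import random

def sigm(z):
    return 1.0 / (1.0 + math.exp(-z))                   # formula (9)

def binarize(U):
    # Each component becomes 1 with probability sigm(u_i) - formula (8)
    return [1 if random.random() < sigm(u) else 0 for u in U]

print(binarize([-5.0, 0.0, 5.0]))   # e.g. [0, 1, 1]: larger u -> likely 1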
Step 4: Mutation: The aim of this operator is to generate mutant vectors, making the algorithm expand its search directions and explore the search space.

For each target vector U_p(t), we choose three random vectors U_{p1}(t), U_{p2}(t) and U_{p3}(t), in which p, p1, p2 and p3 are all different.

The mutant vector:

V_p(t) = U_{p1}(t) + F x (U_{p2}(t) - U_{p3}(t))   (10)

F is the mutant factor, specifying the scale of the difference (U_{p2}(t) - U_{p3}(t)), often in the interval [0.4, 1.0] [19].

Figure 3.1 describes the position of vector V_p relative to vectors U_{p1}, U_{p2} and U_{p3}.

[Figure: vectors U_p1, U_p2 and U_p3 in the (u1, u2) plane; V_p results from adding the scaled difference F x (U_p2 - U_p3) to U_p1]
Figure 3.1 Illustration of mutation operation
Step 5: Check the boundary restriction:
Components of the mutant vector are examined for violation of the boundary constraints:

v_{p,i}(t) = 2 u_i^min - v_{p,i}(t), if v_{p,i}(t) < u_i^min
v_{p,i}(t) = 2 u_i^max - v_{p,i}(t), if v_{p,i}(t) > u_i^max
v_{p,i}(t) kept unchanged otherwise   (11)

Formula (11) makes sure that v_{p,i}(t) always stays in the interval (u_i^min, u_i^max).
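The reflection rule (11) as a small Python helper:

def reflect(v, u_min=-5.0, u_max=5.0):
    # Reflect an out-of-range mutant component back into the interval
    # (u_min, u_max) - formula (11)
    if v < u_min:
        return 2 * u_min - v
    if v > u_max:
        return 2 * u_max - v
    return v

print(reflect(-6.2))   # -3.8: reflected off the lower bound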
Step 6: Crossover: This operator is used so that offspring vectors inherit some features of their parents, diversifying the children's properties. The target vector is mixed with the mutant vector to generate a trial vector:

z_{p,i}(t) = v_{p,i}(t), if rand_{p,i} <= CR or i = k; otherwise z_{p,i}(t) = u_{p,i}(t)   (12)

rand_{p,i} is a random number within [0,1], refreshed for each i-th component of the p-th parameter vector; CR in [0,1] is the crossover constant.

k in {1, 2, ..., n} is randomly chosen for each p-th parameter vector to make sure the population evolves: at least one element of the trial vector is taken from the mutant vector rather than from the target/parent vector, since otherwise no new vector would be created. If CR is big, the trial vector is more likely to be built from mutant-vector elements than from target/parent-vector elements. Figure 3.2 gives us an example where rand_p <= CR at i = k = 4.

[Figure: component-wise construction of the trial vector Z_p(t) from the target vector U_p(t) and the mutant vector V_p(t)]

Figure 3.2 Illustration of crossover operation

Step 7: Binarization: The trial vectors are converted to binary form in the same way as in Step 2, using formulas (8) and (9).

Step 8: Constraint handling: We also have to satisfy the constraint on summary length. The authors manage this restriction as follows:

- Any feasible solution outweighs any infeasible solution.
- Two feasible solutions are compared based on their fitness values.
- Two infeasible solutions are compared based on how much they violate the constraint.

Here, feasible solutions are vectors/individuals satisfying the restriction; the others are infeasible. In this method, feasible solutions are emphasized more than infeasible ones; moreover, infeasible solutions with high fitness values can still be kept.
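These three rules collapse into one comparison function. In the sketch below the amount of violation is taken to be how far the selected sentences exceed the length budget L; that concrete measure is an assumption, as the thesis does not quote the paper's exact definition.

def violation(U, lengths, L):
    # How much the selected sentences exceed the length budget L
    return max(0, sum(l * u for l, u in zip(lengths, U)) - L)

def better(U1, U2, f, lengths, L):
    # Feasibility rules: feasible beats infeasible; two feasible solutions
    # are compared by fitness, two infeasible ones by violation
    v1, v2 = violation(U1, lengths, L), violation(U2, lengths, L)
    if v1 == 0 and v2 == 0:
        return f(U1) >= f(U2)
    if v1 > 0 and v2 > 0:
        return v1 <= v2
    return v1 == 0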
Step 9: Selection:
This operator is performed to keep the population size constant. We select the better vector between the target and the trial to survive into the next generation:

U_p(t+1) = Z_p(t), if f(Z_p(t)) >= f(U_p(t)); otherwise U_p(t+1) = U_p(t)   (13)

f(·) is the fitness function.

Thus, if the trial vector has a better or equal value of the fitness function, it replaces its target vector in the next generation; otherwise the target vector is maintained. That is why the population gets better or stays the same, but never gets worse.

Step 10: Stopping criteria: The process of evolving continues by going back to Step 2 until one of the following criteria is met:
- t_max is reached;
- the best fitness of the population does not change considerably over successive iterations;
- a specified CPU time limit is reached;
- a pre-specified fitness value is attained.

In this case, we choose the first one as the termination criterion.

Finally, return the best vector ever found as the final solution and, from it, build the summary.
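Putting steps 1-10 together, the sketch below condenses the whole loop, reusing the fitness, binarize, reflect and better helpers from the earlier sketches. It folds the feasibility rules of step 8 into the selection of step 9 and compares the binarized versions of the target and trial vectors; both are one reading of the description above, not a literal transcription of the original implementation.

import random

def summarize_de(weights, lengths, L, P=50, t_max=1000, F=0.6, CR=0.7):
    n = len(weights)
    f = lambda U: fitness(U, weights)                      # formula (3)
    # Step 1: initialization within [u_min, u_max] = [-5, 5] (formula (7))
    pop = [[random.uniform(-5.0, 5.0) for _ in range(n)] for _ in range(P)]
    best = binarize(pop[0])
    for _ in range(t_max):                                 # Step 10: stop at t_max
        for p in range(P):
            p1, p2, p3 = random.sample([q for q in range(P) if q != p], 3)
            # Step 4: mutation (10); Step 5: boundary check (11)
            V = [reflect(pop[p1][i] + F * (pop[p2][i] - pop[p3][i]))
                 for i in range(n)]
            # Step 6: crossover (12)
            k = random.randrange(n)
            Z = [V[i] if random.random() <= CR or i == k else pop[p][i]
                 for i in range(n)]
            # Steps 2 and 7: binarization; steps 8 and 9: constrained selection
            U_trial, U_target = binarize(Z), binarize(pop[p])
            if better(U_trial, U_target, f, lengths, L):
                pop[p] = Z
                if f(U_trial) >= f(best):
                    best = U_trial
    return best   # the best vector found; its 1-components give the summary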
3.1.4 Experiment, result and discussion
We implement the above algorithm for the task of multi-document summarization. The program is called [DE] for short.
3.1.4.1 Datasets
We used the DUC2004 and DUC2007 (Document Understanding Conference) datasets to test our summarization methods, as described in Table 3.1.

The DUC2004 dataset contains 50 document collections of 10 documents each. Each collection has from 150 to 650 sentences and is summarized by four experts, resulting in four reference summaries, each of which is about 6 sentences long on average.

The DUC2007 dataset contains 45 document collections of 25 documents each. Each collection has from 300 to 1000 sentences and is summarized by four experts, resulting in four reference summaries, each of which is no more than 250 words long (12 sentences on average).

The original document collections are all in XML format; therefore, we have to extract plain text before summarizing.

Properties                                        | DUC2004 | DUC2007
Number of collections                             | 50      | 45
Number of documents in each collection            | 10      | 25
Number of sentences in each collection            | 150-650 | 300-1000
Experts' summary length (in sentences on average) | 6       | 12
Table 3.1 Description of the datasets used in the experiment
3.1.4.2 Evaluation measures
We use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) package and take the average F-value to evaluate and compare our summaries [10, 11]. Three quantities are involved in summary evaluation: Precision, Recall and F-value.

Precision = correct / (correct + wrong)   (14)

Recall = correct / (correct + missed)   (15)

where correct = the number of text units extracted by both the system and the human, wrong = the number of text units extracted by the system but not by the human, and missed = the number of text units extracted by the human but not by the system.

Therefore, Precision reflects what percentage of the system's extracted sentences were good, and Recall reflects what percentage of the good sentences the system found. In even simpler terms, high recall means nothing was missed, but there may be many useless results to sift through (which would imply low precision); high precision means that everything returned was relevant, but not all relevant items may have been found (which would imply low recall).

The F-value is defined as a weighted average of Precision and Recall, best at 1 and worst at 0:

F = 2 x Precision x Recall / (Precision + Recall)   (16)
The F-value is always a number between the values of recall and precision, and is higher when recall and precision are closer
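As a quick sanity check, the three formulas in code (the counts are made up for illustration):

def f_value(correct, wrong, missed):
    precision = correct / (correct + wrong)                # formula (14)
    recall = correct / (correct + missed)                  # formula (15)
    return 2 * precision * recall / (precision + recall)   # formula (16)

# 4 units found by both system and human, 2 only by the system,
# 1 only by the human: precision ~ 0.67, recall 0.8, F ~ 0.73
print(round(f_value(4, 2, 1), 2))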
3.1.4.3 Experimental settings

Parameters                               | DUC2004 | DUC2007
Population size P                        | 50      | 50
Number of generations t_max              | 1000    | 1000
u^min                                    | -5      | -5
u^max                                    | 5       | 5
F                                        | 0.6     | 0.6
CR                                       | 0.7     | 0.7
Number of runs                           | 20      | 20
Goal: number of sentences in the summary | 6       | 12
Table 3.2 Parameter settings of the first experiment
Table 3.2 lists all the parameters that need values. Because this is a stochastic population-based algorithm, we run the program 20 times and take the mean value as the final result. These parameter settings all follow the experiments in [5].
3.1.4.4 Result and discussion
[Figure: plot of summary length against generation number (0-1000) for the [DE] method on one DUC2004 collection; the labelled points fall from 92 to 37 and then 25 sentences]

Original text: 212 sentences - Time: 135 mins
Figure 3.3 Changes in summary length in [DE] method on DUC2004
Table 3.3 presents the summary lengths of some randomly chosen document collections in DUC2004. As we can see, none of the summary lengths satisfies the goal of a 6-sentence summary.
[Figure: plot of summary length against generation number (0-1000) for the [DE] method on one DUC2007 collection; the labelled points fall from 230 to 139 and then 119 sentences]
Original text: 507 sentences - Time: 204 mins
Figure 3.4 Changes in summary length in [DE] method on DUC2007
Table 3.4 dipicts summary lengths of some randomly chosen document collections in DUC2007 to confirm that the summary is not shorten sufficiently because the objective is 12-sentence summaries
The next thing that needs attention is the summary quality. Table 3.5 lists the F-values of three ROUGE measures, ROUGE-1, ROUGE-2 and ROUGE-L, on DUC2004 and DUC2007.

Measures | DUC2004 | DUC2007
ROUGE-1  | 0.204   | 0.138
ROUGE-2  | 0.051   | 0.057
ROUGE-L  | 0.157   | 0.120

Table 3.5 F-values of three evaluation measures of method [DE] on DUC2004 and DUC2007

3.2 Improvement

3.2.1 Method