We choose to implement the above algorithm for the task of multi-document summarization. The program is called [DE] for short.
3.1.4.1. Datasets
We tested our summarization methods on the DUC2004 and DUC2007 (Document Understanding Conference) datasets, which are described in Table 3.1.
The dataset DUC2004 contains 50 document collections of 10 documents.
Overall, each collection has from 150 to 650 sentences. Each collection is summarized by four experts, resulting in four reference summaries, each of which is about 6 sentences long on average.
The dataset DUC2007 contains 45 document collections of 25 documents.
Overall, each collection has from 300 to 1000 sentences. Each collection is summarized by four experts, resulting in four reference summaries, each of which is no more than 250 words in length (12 sentences on average).
The original document collections are all in XML format, so we have to extract the plain text before summarizing.
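The extraction step can be sketched as follows. This is only a minimal illustration, assuming each document is well-formed XML whose body sits inside a `TEXT` element; the tag names (`DOC`, `DOCNO`, `TEXT`, `P`), the sample document and the helper `extract_plain_text` are hypothetical and are not taken from the actual DUC files or from the [DE] program.

```python
import xml.etree.ElementTree as ET


def extract_plain_text(xml_string: str, text_tag: str = "TEXT") -> str:
    """Return the text of every <text_tag> element, one element per line.

    The tag name is an assumption made for this sketch.
    """
    root = ET.fromstring(xml_string)
    chunks = []
    for node in root.iter(text_tag):
        # itertext() also collects text nested in child tags such as <P>;
        # split()/join() collapses line breaks and repeated spaces.
        chunks.append(" ".join("".join(node.itertext()).split()))
    return "\n".join(chunks)


# Hypothetical document, used only to exercise the function.
sample = """<DOC>
<DOCNO> EXAMPLE-0001 </DOCNO>
<TEXT>
<P>First sentence of the article.</P>
<P>Second sentence of the article.</P>
</TEXT>
</DOC>"""

plain = extract_plain_text(sample)
print(plain)
```

Files that are not well-formed XML would need a more tolerant parser; that case is outside this sketch.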
Properties                                         DUC2004    DUC2007
Number of collections                              50         45
Number of documents in each collection             10         25
Number of sentences in each collection             150-650    300-1000
Experts’ summary length (in sentences on average)  6          12
Table 3.1. Description of the datasets used in the experiment
(LUAN.van.THAC.si).RESEARCH.AND.APPLY.EVOLUTIONARY.COMPUTATION.TECHNIQUES.ON.AUTOMATIC.TEXT.SUMMARIZATION
3.1.4.2. Evaluation measures
We use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) package and take the average F-value to evaluate and compare our summaries [10, 11].
Summary evaluation relies on three measures: Precision, Recall and F-value.
\[ \text{Precision} = \frac{\text{correct}}{\text{correct} + \text{wrong}} \tag{14} \]
\[ \text{Recall} = \frac{\text{correct}}{\text{correct} + \text{missed}} \tag{15} \]
where correct is the number of text units extracted by both the system and the human, wrong is the number of text units extracted by the system but not by the human, and missed is the number of text units extracted by the human but not by the system.
Therefore, Precision is the fraction of the system's extracted sentences that are good, and Recall is the fraction of the good sentences that the system actually extracted. In even simpler terms, a high recall means you have not missed anything but you may have a lot of useless results to sift through (which would imply low precision). High precision means that everything returned was a relevant result, but you might not have found all the relevant items (which would imply low recall).
The F-value is defined as the harmonic mean of Precision and Recall, best at 1 and worst at 0.
\[ F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{16} \]
The F-value always lies between Recall and Precision and, for a given sum of the two, is higher when they are closer together.
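As a small worked example of equations (14) to (16), the sketch below computes the three measures from the counts correct, wrong and missed. The function name and the sample counts are chosen here purely for illustration and do not come from the experiments.

```python
def precision_recall_f(correct: int, wrong: int, missed: int):
    """Compute Precision (14), Recall (15) and F-value (16) from unit counts."""
    precision = correct / (correct + wrong) if (correct + wrong) else 0.0
    recall = correct / (correct + missed) if (correct + missed) else 0.0
    if precision + recall == 0:
        # No overlap at all: define F as 0 to avoid dividing by zero.
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Illustrative counts: the system extracts 10 units, 4 of which the human
# also chose; the human chose 6 units in total, so 2 were missed.
p, r, f = precision_recall_f(correct=4, wrong=6, missed=2)
print(f"Precision={p:.3f} Recall={r:.3f} F={f:.3f}")
```

Here Precision is 4/10 and Recall is 4/6, and the F-value falls between the two.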
In this case, we use three types of ROUGE measures: ROUGE-N, where N is the length of the n-gram (ROUGE-1 for unigrams, i.e. single words, and ROUGE-2 for bigrams, i.e. two-word sequences), and ROUGE-L (longest common subsequence). We take the F-value from the ROUGE output to compare summaries.
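To make the three measures concrete, the following sketch computes simplified ROUGE-N and ROUGE-L F-values for one candidate against one reference. It is not the ROUGE package used in the experiments: it assumes whitespace tokenization, lowercasing, a single reference, and no stemming or stop-word removal, and all function names are introduced here for illustration only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f_value(overlap, cand_total, ref_total):
    """F-value from an overlap count and the candidate/reference totals."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f(candidate, reference, n):
    """Simplified ROUGE-N: clipped n-gram overlap between two texts."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # & keeps the minimum count per n-gram
    return f_value(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f(candidate, reference):
    """Simplified ROUGE-L: LCS length relative to both text lengths."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return f_value(lcs_length(cand, ref), len(cand), len(ref))


candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
print(rouge_n_f(candidate, reference, 1))
print(rouge_n_f(candidate, reference, 2))
print(rouge_l_f(candidate, reference))
```

The candidate and reference share 5 of 6 unigrams and 3 of 5 bigrams, and their longest common subsequence has 5 words, which is why ROUGE-2 scores lower than ROUGE-1 and ROUGE-L in this toy case.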
3.1.4.3. Experimental settings
Parameters                                   DUC2004    DUC2007
Population size P                            50         50
Number of generations t_max                  1000       1000
u_min                                        -5         -5
u_max                                        5          5
F                                            0.6        0.6
CR                                           0.7        0.7
Number of runs                               20         20
Goal: number of sentences in the summary     6          12
Table 3.2. Parameter settings of the first experiment
Table 3.2 lists the parameters that must be set. Because this is a stochastic population-based algorithm, we run the program 20 times and take the mean over the runs as the final result. All parameter values follow the experimental settings in [5].
3.1.4.4. Result and discussion
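To illustrate how the parameters of Table 3.2 enter a differential evolution run, here is a minimal sketch of the classic DE/rand/1/bin scheme. It is an assumption that this generic scheme resembles the variant used in [DE]; the toy objective `sphere` (minimized here), the dimension and the function names are hypothetical and stand in for the summarization objective, which is not reproduced.

```python
import random


def differential_evolution(fitness, dim, pop_size=50, t_max=1000,
                           u_min=-5.0, u_max=5.0, F=0.6, CR=0.7, seed=0):
    """Generic DE/rand/1/bin minimizer; defaults mirror Table 3.2."""
    rng = random.Random(seed)
    pop = [[rng.uniform(u_min, u_max) for _ in range(dim)]
           for _ in range(pop_size)]
    scores = [fitness(x) for x in pop]
    for _ in range(t_max):
        new_pop, new_scores = [], []
        for i in range(pop_size):
            # Pick three distinct individuals, all different from i.
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            j_rand = rng.randrange(dim)  # guarantees one mutated component
            trial = []
            for j in range(dim):
                if rng.random() < CR or j == j_rand:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    v = min(max(v, u_min), u_max)  # clamp to [u_min, u_max]
                else:
                    v = pop[i][j]
                trial.append(v)
            s = fitness(trial)
            # Greedy selection: keep the trial only if it is not worse.
            if s <= scores[i]:
                new_pop.append(trial)
                new_scores.append(s)
            else:
                new_pop.append(pop[i])
                new_scores.append(scores[i])
        pop, scores = new_pop, new_scores
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def sphere(x):
    """Toy objective (assumption): sum of squares, minimum 0 at the origin."""
    return sum(v * v for v in x)


def mean_over_runs(runs=20, **kwargs):
    """Average the best score over independent runs with different seeds."""
    results = [differential_evolution(sphere, dim=5, seed=run, **kwargs)[1]
               for run in range(runs)]
    return sum(results) / len(results)


# t_max is shortened here only to keep the demonstration fast.
print(mean_over_runs(runs=20, t_max=50))
```

Averaging over seeded runs, as in `mean_over_runs`, is what makes results of a stochastic algorithm comparable between settings.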
After summarizing all collections and obtaining the ROUGE output, we average the F-values and record the summary length across generations. To show how the summary length changes during the run and how long the algorithm takes, we pick one typical collection from each dataset: a 212-sentence collection from DUC2004 and a 507-sentence collection from DUC2007.
Figure 3.3. Changes in summary length in [DE] method on DUC2004
Figure 3.3 shows how the summary length changes over 1000 generations on DUC2004. The algorithm needs 135 minutes to compress a collection of 212 sentences into a summary of 25 sentences.
Moreover, the length decreases more and more slowly: it falls from 92 sentences at generation 0 to 37 sentences at generation 500, but only to 25 sentences at generation 1000, which is still far above the goal of 6 sentences.
Document collections    Original length    Summary length
d30001t                 212                25
d30006t                 408                74
d30011t                 250                34
d30033t                 642                131
Table 3.3. Summary lengths of some document collections in DUC2004 using [DE] method
Table 3.3 presents the summary lengths of some randomly chosen document collections in DUC2004. None of them reaches the goal of a 6-sentence summary.
Figure 3.4. Changes in summary length in [DE] method on DUC2007
Figure 3.4 illustrates the run of the differential evolution algorithm on DUC2007. It takes 204 minutes to finish 1000 generations, and the length decreases from 230 sentences at generation 0 to 119 sentences at the end. In other words, the algorithm compresses the document collection of 507 sentences into a summary of 119 sentences over 1000 iterations. The length also decreases more slowly at the end of the run than at the beginning: the summary shrinks from 230 to 139 sentences over the first 500 generations, but only from 139 to 119 sentences over the next 500. Clearly, this method is not effective at reducing the summary length.
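The slowdown can be quantified with simple arithmetic on the reported checkpoints (generation 0, 500 and 1000) for the two example collections. The sketch below only restates the numbers given in the text; the helper name is introduced here for illustration.

```python
def reduction_per_generation(start_len, end_len, generations):
    """Average number of sentences removed per generation."""
    return (start_len - end_len) / generations


# Summary lengths at generations 0, 500 and 1000, as reported in the text.
checkpoints = {
    "DUC2004 (212 sentences)": (92, 37, 25),
    "DUC2007 (507 sentences)": (230, 139, 119),
}

for name, (g0, g500, g1000) in checkpoints.items():
    first_half = reduction_per_generation(g0, g500, 500)
    second_half = reduction_per_generation(g500, g1000, 500)
    print(f"{name}: {first_half:.3f} then {second_half:.3f} "
          f"sentences per generation")
```

On both datasets the second 500 generations remove sentences at less than a quarter of the rate of the first 500.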
Document collections    Original length    Summary length
D0704                   255                39
D0705                   330                58
D0706                   462                103
D0711                   507                119
Table 3.4. Summary lengths of some document collections in DUC2007 using [DE] method
Table 3.4 depicts the summary lengths of some randomly chosen document collections in DUC2007. It confirms that the summaries are not shortened sufficiently, since the objective is a 12-sentence summary.
The next aspect to consider is summary quality. Table 3.5 lists the F-values for the three ROUGE measures, ROUGE-1, ROUGE-2 and ROUGE-L, on DUC2004 and DUC2007.
Measures    DUC2004    DUC2007
ROUGE-1     0.204      0.138
ROUGE-2     0.051      0.057
ROUGE-L     0.157      0.120
Table 3.5. F-Values of three evaluation measures of method [DE] on DUC2004 and DUC2007