Experiment, result and discussion

We choose to implement the above algorithm for the task of multi-document summarization. The program is called [DE] for short.

3.1.4.1. Datasets

We test our summarization methods on the DUC2004 and DUC2007 (Document Understanding Conference) datasets, which are described in Table 3.1.

The dataset DUC2004 contains 50 document collections of 10 documents.

Overall, each collection has between 150 and 650 sentences. Each collection is summarized by four experts, resulting in four reference summaries, each of which is about 6 sentences long on average.

The dataset DUC2007 contains 45 document collections of 25 documents.

Overall, each collection has from 300 to 1000 sentences. Each collection is summarized by four experts, resulting in four reference summaries, each of which is no more than 250 words in length (12 sentences on average).

The original document collections are all in XML format, so we extract the plain text before summarizing.
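As a minimal sketch of this preprocessing step, the Python snippet below strips the markup from one document and keeps only its text. It assumes the input is well-formed XML; the function name `extract_plain_text` and the sample tags are hypothetical and are not meant to reproduce the actual DUC markup or the extraction code used in this work.

```python
import xml.etree.ElementTree as ET


def extract_plain_text(xml_string: str) -> str:
    """Return the text content of an XML document with all tags removed."""
    root = ET.fromstring(xml_string)
    # itertext() yields every text fragment in document order.
    fragments = (fragment.strip() for fragment in root.itertext())
    # Drop empty fragments and normalise the whitespace inside the others.
    return " ".join(" ".join(f.split()) for f in fragments if f)


# Hypothetical sample document; real DUC files use their own tag set.
sample = (
    "<DOC><HEADLINE>Storm hits coast</HEADLINE>"
    "<TEXT><P>Winds rose.</P><P>Rain fell.</P></TEXT></DOC>"
)
print(extract_plain_text(sample))
```

Files that are not strictly well-formed would need a more tolerant parser than the one assumed here.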

Properties                                           DUC2004    DUC2007
Number of collections                                50         45
Number of documents in each collection               10         25
Number of sentences in each collection               150-650    300-1000
Experts’ summary length (in sentences on average)    6          12

Table 3.1. Description of the datasets used in the experiment


3.1.4.2. Evaluation measures

We use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) package and take the average F-value to evaluate and compare our summaries [10, 11].

Summary evaluation relies on three related measures: Precision, Recall and F-value.

\[ \text{Precision} = \frac{\text{correct}}{\text{correct} + \text{wrong}} \tag{14} \]

\[ \text{Recall} = \frac{\text{correct}}{\text{correct} + \text{missed}} \tag{15} \]

where correct is the number of text units extracted by both the system and the human; wrong is the number of text units extracted by the system but not by the human; and missed is the number of text units extracted by the human but not by the system.

Precision therefore reflects the percentage of the system’s extracted sentences that were good, and Recall reflects the percentage of good sentences that the system found. In simpler terms, high recall means that little has been missed, but there may be many useless results to sift through (which would imply low precision). High precision means that everything returned is relevant, but not all relevant items may have been found (which would imply low recall).

The F-value is defined as the harmonic mean of Precision and Recall; it is best at 1 and worst at 0.

\[ F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{16} \]

The F-value always lies between the values of Recall and Precision and, for a given average of the two, it is higher the closer they are to each other.
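The three measures in equations (14) to (16) can be illustrated with a short Python sketch. The function name and the example counts are hypothetical and serve only to show the arithmetic; returning 0 when a denominator is empty is a convention assumed here.

```python
from typing import Tuple


def precision_recall_f(correct: int, wrong: int, missed: int) -> Tuple[float, float, float]:
    """Compute Precision (14), Recall (15) and F-value (16) from unit counts."""
    extracted = correct + wrong   # units chosen by the system
    relevant = correct + missed   # units chosen by the human
    precision = correct / extracted if extracted else 0.0
    recall = correct / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical counts: 4 shared sentences, 2 extra system sentences, 4 missed.
p, r, f = precision_recall_f(correct=4, wrong=2, missed=4)
print(f"Precision={p:.3f} Recall={r:.3f} F={f:.3f}")
```

With these counts Precision is 4/6, Recall is 4/8 and the F-value is 4/7, which lies between the other two.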

In this case, we use three types of ROUGE measures: ROUGE-N, where N is the length of the n-gram (ROUGE-1 for unigrams, i.e. single words, and ROUGE-2 for bigrams, i.e. two-word sequences), and ROUGE-L (longest common subsequence). We take the F-value from the ROUGE output to compare summaries.
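To make the three measures concrete, the following Python sketch computes simplified ROUGE-N and ROUGE-L F-values for one candidate and one reference. It is not the ROUGE package itself: it assumes a single reference, whitespace tokenization, no stemming or stop-word removal, and equal weighting of precision and recall. All function names are illustrative.

```python
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f_value(matches: int, candidate_total: int, reference_total: int) -> float:
    """Harmonic mean of precision and recall, or 0 if nothing matches."""
    if matches == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = matches / candidate_total
    recall = matches / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f(candidate: List[str], reference: List[str], n: int) -> float:
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return f_value(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(candidate: List[str], reference: List[str]) -> float:
    return f_value(lcs_length(candidate, reference), len(candidate), len(reference))


# Hypothetical one-sentence example.
candidate = "the cat sat on the mat".split()
reference = "the cat lay on the mat".split()
print(rouge_n_f(candidate, reference, 1))
print(rouge_n_f(candidate, reference, 2))
print(rouge_l_f(candidate, reference))
```

In this example five of six unigrams and three of five bigrams match, and the longest common subsequence has five words, so ROUGE-2 is the strictest of the three measures.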

3.1.4.3. Experimental settings

Parameters                                    DUC2004    DUC2007
Population size P                             50         50
Number of generations t_max                   1000       1000
u_min                                         -5         -5
u_max                                         5          5
F                                             0.6        0.6
CR                                            0.7        0.7
Number of runs                                20         20
Goal: number of sentences in the summary      6          12

Table 3.2. Parameter settings of the first experiment

Table 3.2 lists the parameters that must be set. Because this is a stochastic population-based algorithm, we run the program 20 times and take the mean over the runs as the final result. All parameter values follow the experimental settings in [5].
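To show where the values in Table 3.2 enter a differential evolution run, the sketch below builds one trial vector with the textbook DE/rand/1/bin scheme. This scheme, the clipping of values to [u_min, u_max], the vector length and all names are assumptions made for illustration; they are not necessarily the exact variant used by [DE] or in [5].

```python
import random
from typing import List

# Values from Table 3.2; the constant names are illustrative.
POP_SIZE, T_MAX = 50, 1000
U_MIN, U_MAX = -5.0, 5.0
F, CR = 0.6, 0.7
N_RUNS = 20


def de_trial_vector(population: List[List[float]], target_idx: int,
                    rng: random.Random) -> List[float]:
    """Build one trial vector with textbook DE/rand/1/bin (an assumed variant)."""
    target = population[target_idx]
    # Pick three distinct individuals, all different from the target.
    others = [i for i in range(len(population)) if i != target_idx]
    r1, r2, r3 = rng.sample(others, 3)
    j_rand = rng.randrange(len(target))  # this component is always mutated
    trial = []
    for j, x in enumerate(target):
        if rng.random() < CR or j == j_rand:
            v = population[r1][j] + F * (population[r2][j] - population[r3][j])
            x = min(max(v, U_MIN), U_MAX)  # assumed bound handling: clipping
        trial.append(x)
    return trial


rng = random.Random(0)
dim = 8  # hypothetical vector length
population = [[rng.uniform(U_MIN, U_MAX) for _ in range(dim)]
              for _ in range(POP_SIZE)]
trial = de_trial_vector(population, target_idx=0, rng=rng)
print(len(trial), all(U_MIN <= x <= U_MAX for x in trial))
```

A full run would repeat this step for every individual over T_MAX generations, and the whole procedure would be repeated N_RUNS times with different seeds before averaging.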

3.1.4.4. Result and discussion

After summarizing all collections and obtaining the ROUGE output, we record the average F-values as well as the summary length across generations. To show how the summary length changes during a run and how long the algorithm takes, we choose a typical 212-sentence collection from DUC2004 and a 507-sentence collection from DUC2007.

Figure 3.3. Changes in summary length in [DE] method on DUC2004

Figure 3.3 shows how the summary length changes over 1000 generations on DUC2004. The algorithm needs 135 minutes to compress a collection of 212 sentences to a summary of 25 sentences over 1000 generations.

Moreover, the length decreases more and more slowly: it falls from 92 sentences at generation 0 to 37 sentences at generation 500, yet the length at generation 1000 is still considerable, at 25 sentences.

Document collections    Original length    Summary length
d30001t                 212                25
d30006t                 408                74
d30011t                 250                34
d30033t                 642                131

Table 3.3. Summary lengths of some document collections in DUC2004 using [DE] method

Table 3.3 presents the summary lengths of some randomly chosen document collections in DUC2004. As we can see, none of the final summaries meets the goal of 6 sentences.

Figure 3.4. Changes in summary length in [DE] method on DUC2007

Figure 3.4 illustrates the run of the differential evolution algorithm on DUC2007. It takes 204 minutes to complete 1000 generations, and the length decreases from 230 sentences at generation 0 to 119 sentences at generation 1000. In other words, the algorithm compresses the 507-sentence collection to a summary of 119 sentences over 1000 iterations. The length also decreases more slowly toward the end of the run than at the beginning: the summary shrinks from 230 to 139 sentences over the first 500 generations, but only from 139 to 119 sentences over the next 500. This method is therefore not effective in reducing summary length.
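The slowdown can be quantified from the lengths reported at generations 0, 500 and 1000 for the two example collections. The short Python sketch below computes the average reduction per generation in each half of the run; the variable and function names are illustrative.

```python
from typing import Tuple

# Summary lengths (in sentences) reported at generations 0, 500 and 1000.
lengths = {
    "DUC2004": (92, 37, 25),
    "DUC2007": (230, 139, 119),
}
HALF = 500  # generations in each half of the run


def rates(l0: int, l500: int, l1000: int) -> Tuple[float, float]:
    """Average sentences removed per generation in the first and second half."""
    return (l0 - l500) / HALF, (l500 - l1000) / HALF


for name, values in lengths.items():
    first, second = rates(*values)
    print(f"{name}: {first:.3f} then {second:.3f} sentences per generation")
```

On both datasets the second half of the run removes sentences at less than a quarter of the rate of the first half.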

Document collections    Original length    Summary length
D0704                   255                39
D0705                   330                58
D0706                   462                103
D0711                   507                119

Table 3.4. Summary lengths of some document collections in DUC2007 using [DE] method

Table 3.4 presents the summary lengths of some randomly chosen document collections in DUC2007. It confirms that the summaries are not shortened sufficiently, since the goal is a 12-sentence summary.
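The gap between the obtained and the target lengths can be made explicit with a small Python sketch over the values in Tables 3.3 and 3.4. For each collection it reports the share of sentences retained and how many times longer the summary is than the goal; the names used are illustrative.

```python
from typing import Tuple

# Target summary lengths (in sentences) for each dataset.
TARGET = {"DUC2004": 6, "DUC2007": 12}

# (original length, summary length) taken from Tables 3.3 and 3.4.
RESULTS = {
    "DUC2004": {"d30001t": (212, 25), "d30006t": (408, 74),
                "d30011t": (250, 34), "d30033t": (642, 131)},
    "DUC2007": {"D0704": (255, 39), "D0705": (330, 58),
                "D0706": (462, 103), "D0711": (507, 119)},
}


def shortfall(dataset: str, original: int, summary: int) -> Tuple[float, float]:
    """Return the retained share and the ratio of summary length to target."""
    retained = summary / original
    excess = summary / TARGET[dataset]
    return retained, excess


for dataset, rows in RESULTS.items():
    for name, (original, summary) in rows.items():
        retained, excess = shortfall(dataset, original, summary)
        print(f"{name}: keeps {retained:.1%} of sentences, "
              f"{excess:.1f}x the target length")
```

Every listed summary is more than three times longer than its target.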

The next concern is summary quality. Table 3.5 lists the F-values for the three ROUGE measures, ROUGE-1, ROUGE-2 and ROUGE-L, on DUC2004 and DUC2007.

Measures    DUC2004    DUC2007
ROUGE-1     0.204      0.138
ROUGE-2     0.051      0.057
ROUGE-L     0.157      0.120

Table 3.5. F-values of three evaluation measures of method [DE] on DUC2004 and DUC2007

Methodologies for automatic text summarization

Main steps of differential evolution