(MASTER'S THESIS) RESEARCH AND APPLY EVOLUTIONARY COMPUTATION TECHNIQUES ON AUTOMATIC TEXT SUMMARIZATION

Structure

  • 1. Chapter 1
  • 2. Chapter 2
    • 2.1.1. Definition
    • 2.1.2. Types of text summarization
    • 2.1.3. Methodologies for automatic text summarization
  • 3. Chapter 3
    • 3.1.1. Document collection representation
    • 3.1.3. Main steps of differential evolution
    • 3.1.4. Experiment, result and discussion
    • 3.2.1. Method
    • 3.2.2. Experiment, result and discussion
  • 4. Chapter 4
  • 5. Reference

Content

Chapter 1

Automatic text summarization involves identifying and condensing key information from one or more documents, posing significant challenges across various scientific fields, including artificial intelligence, statistics, and linguistics. Since the 1950s, extensive research has led to the development of systems like SUMMARIST, SweSUM, MEAD, and SUMMON. Despite these advancements, the field remains complex and continues to attract increasing interest.

This thesis explores various evolutionary computation techniques and applies the differential evolution algorithm to the practical challenge of automatic text summarization, specifically focusing on multi-document summarization. Additionally, it addresses the constraint of summary length, which has been inadequately managed in existing stochastic population-based methods.

Evolutionary computation techniques employ various algorithms to improve a population of individuals across multiple generations. By utilizing operations like mutation, crossover, and selection, these populations generate new offspring that compete for survival against each other and their predecessors, guided by an evaluation function. This iterative process concludes when a predetermined stopping criterion is met, revealing the best individual, which is the best solution found for the real-world problem.

Evolutionary algorithms have been utilized across diverse fields, including automatic text summarization. However, they exhibit limitations in managing summary length compared to traditional sentence ranking methods. This research aims to enhance the capability of evolutionary algorithms in addressing this specific challenge.

This article explores the application of evolutionary computation techniques in the field of automatic text summarization. It emphasizes the significance of these techniques in enhancing the efficiency and accuracy of summarization processes. By leveraging evolutionary algorithms, the research aims to improve the extraction of key information from large text datasets, ultimately facilitating better comprehension and information retrieval. The findings suggest that integrating evolutionary computation can lead to more effective summarization methods, catering to the growing demand for concise and relevant content in various domains.

This thesis explores evolutionary computation techniques, focusing on the differential evolution algorithm and its application in automatic text summarization. It identifies limitations in existing methods for managing summary length and introduces a novel approach that effectively addresses length constraints while meeting user demands and maintaining summary quality.

This thesis is structured into several chapters, beginning with Chapter 2, which explores the foundational concepts of text summarization and its various classifications, while also introducing key principles of evolutionary computation, with a particular focus on the differential evolution algorithm.

Chapter 3 provides a detailed explanation of the algorithm's application in automatic text summarization, specifically focusing on multi-document collections. Additionally, an experiment is conducted to evaluate the performance of the original differential evolution algorithm.

In addition, we improve the result of the previous experiment by handling the summary length, so that the document collection is compressed quickly and effectively.

Chapter 4 recapitulates the thesis, presents our contributions and states some future research directions in this field.

Chapter 2

Definition

Automatic text summarization is the generation, by a computer program, of a shorter version of a text that still keeps the most important points of the original.

Automated text summarization seeks to extract and condense the most important content from a source text, tailoring the results to meet the specific needs of users or applications.

A summarization system follows several key steps to generate a summary from a document or a collection of documents. First, the document is preprocessed, which includes handling punctuation, normalizing case, and splitting the text into paragraphs, sentences, and words. Next, the document is represented as vectors, with each vector corresponding to a sentence. The critical phase is creating a summary representation by selecting specific vectors to include in the summary. Finally, the summary is formed from this representation during the summary generation stage.

Figure 2.1 represents a typical summarization system.

Types of text summarization

Approaches to automatic text summarization can be classified in several ways [16]:

An extract-type summary consists of selected units, from single words to entire paragraphs, taken verbatim from the original text. As illustrated in Figure 2.2, a summarization system identifies and selects key sentences to create an extractive summary.

Figure 2.2 A summarizer highlights all sentences included in an extractive summary

An abstract is a concise summary that captures the main content of a source text, which requires the summarizer to understand the topic beforehand. Figure 2.3 shows an example of an abstract that summarizes the content of an entire paper.

Figure 2.3 An example of the abstract summary

  • Generic: A generic summary presents the author's point of view of the source text, paying the same attention to every aspect of the text.

  • Query-oriented: A query-oriented (or user-oriented) summary favors particular aspects of the text, depending on what the user wants to learn about.

  • Indicative: An indicative summary highlights the main subject or domain of a text without revealing its specific content. By reading an indicative summary, readers can grasp the general topic of the input material, although they may not learn the details contained in it.

  • Informative: An informative summary covers (some of) the content, and allows one to describe (parts of) what was in the input text.

  • Background: Assumes readers have no prior knowledge of the topic of the source text.

  • Just-the-news: Assumes the reader's prior knowledge is up to date.

- Monolingual vs cross-lingual: A monolingual summary is written in the same language as the source, whereas a cross-lingual summary is also translated into another language.

When comparing single-document and multi-document summarization, the former condenses a single source text, while the latter integrates information from multiple sources into one cohesive summary. As illustrated in Figure 2.4, a multi-document summarizer can consolidate five different documents into a single summary.

This thesis focuses on creating extractive summaries for collections of multiple documents, which is more complex than summarizing a single text. Key challenges include eliminating repetitions and managing inconsistencies across documents while ensuring that all vital information from the original texts is captured.

Methodologies for automatic text summarization

Many methods have been applied to automatic text summarization so far, including [21]:

- Traditional methods: term, word, phrase frequencies

- Corpus-based approaches: combination of statistical features, learning to extract

- Discourse structures: WordNet, rhetorical analysis

- Knowledge-rich approaches: different for particular domains

Evolutionary computation is a new approach to automatic text summarization, in which solutions are evolved until a certain benchmark is satisfied.

Evolutionary computation, a subfield of artificial intelligence in computer science, encompasses various evolutionary algorithms rooted in Darwinian principles. These algorithms function as trial-and-error problem solvers and are classified as global optimization methods with meta-heuristic or stochastic optimization characteristics, utilizing a population of candidate solutions.

Evolutionary computation evolves the population continuously: in each generation, individuals are selected in a guided random search, and the process repeats until the required stopping condition is reached.

Automated problem solving based on Darwinian principles started in the 1950s. However, three different interpretations of this idea began to be implemented in the 1960s, in three strands.

Evolutionary programming (EP), developed by Lawrence J. Fogel in the US, along with John Henry Holland's genetic algorithm (GA) and the evolution strategies (ES) introduced by Ingo Rechenberg and Hans-Paul Schwefel, are foundational concepts in the field of evolutionary computation. Although they were introduced separately, since the early 1990s these algorithms have been regarded as distinct variations of a single technology.

Natural evolution is a process in which plants and animals that adapt to changing environments thrive, while those unable to do so are eliminated through natural selection. In any given population, parents produce offspring through mutation and crossover, leading to diverse traits among the young. These offspring must compete with each other, and with their parents, for survival in future generations. Ultimately, mutation and crossover enhance the diversity of traits, while natural selection improves the overall fitness and quality of the population [2].

Table 2.1 below shows equivalent concepts between natural evolution and problem solving [3].

Table 2.1 The basic evolutionary computation linking natural evolution to problem solving

Figure 2.5 and Figure 2.6 illustrate the typical pseudo-code and scheme of evolutionary algorithms [3].

Figure 2.5 The general scheme of an Evolutionary Algorithm in pseudo-code

Figure 2.6 General scheme of evolutionary algorithms

Evolutionary computation consists of a family of algorithms that are used to search for optimal solutions to a problem.

Figure 2.6 depicts the evolution of a typical population throughout the algorithm's process. An evolutionary algorithm begins by initializing a population of individuals, each evaluated using a fitness function tailored to the specific algorithm and problem. Selected individuals serve as parents and reproduce to generate offspring, whose fitness values are then assessed. The best individuals, whether parents or offspring, are chosen to advance to the next generation. This iterative process continues until a predetermined stopping criterion is met, and the best individual found is returned.

According to A. E. Eiben and J. E. Smith, the typical progression of fitness in a run is shown in Figure 2.7 [3]:

Figure 2.7 Correlation between number of generations and best fitness in population

Evolutionary algorithms differ in their mechanisms for generating offspring and for selecting parents and children. This thesis focuses on a specific application of these algorithms in automatic text summarization, namely extractive multi-document summarization.

Typical evolutionary algorithms include differential evolution, genetic algorithm, genetic programming, evolutionary programming, etc. In this research, we focus on the first of these.

Differential Evolution (DE) originated when Ken Price sought to solve the Chebyshev polynomial fitting problem posed to him by Rainer Storn. A significant advance occurred when Ken introduced the idea of using vector differences to perturb the vector population. The collaboration between Ken and Rainer, along with extensive computer simulations, led to numerous enhancements, establishing DE as a versatile and robust optimization tool.

Since its inception in the years 1994 to 1996, the DE community has grown significantly, with an increasing number of researchers working on DE. Ken and Rainer hope for continued advances in DE through global scientific collaboration, aiming to make it more useful to users in their daily work. This vision is also why DE remains unpatented, which fosters open development and innovation.

Figure 2.8 Steps of differential evolution algorithm

This algorithm begins by initializing a population of individuals represented as float-valued vectors within a defined range. These target vectors are then binarized and assessed using a fitness or objective function. The core idea is to generate new individuals through mutation, which relies on the differences between randomly selected pairs of vectors, and crossover, which exchanges elements between pairs of vectors to increase offspring diversity. A selection step then retains the best individuals from both parents and offspring for the next generation. This iterative process continues until either the maximum number of generations is reached or a specified fitness threshold is met. Finally, the algorithm returns the best individual, represented by a float-valued or binary vector of n dimensions, where n is the number of sentences in the document collection to be summarized.

The main steps of the differential evolution algorithm are illustrated in Figure 2.8, where P[i] = 0 for i = 1, 2, …, n.

Pseudo-code of this algorithm is given below [15]:

Begin
  Generate randomly an initial population of solutions
  Calculate the fitness of the initial population
  Do
    For each parent
      Select three different solutions at random
      Create one offspring using DE operators (mutation, crossover)
      If offspring is the same or better than its parent (selection)
        Parent is replaced
      End If
    End For
  While the stopping condition is not satisfied
End
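The pseudo-code above can be sketched in Python as follows. This is a minimal illustration rather than the thesis implementation: the function name `differential_evolution`, the `sphere` test function, the parameter values (F = 0.6, CR = 0.7, a population of 20, 200 generations), the clipping of mutant components to the bounds, and the use of a fixed generation count as the stopping condition are all assumptions. The sketch minimizes its objective, so "better" means "smaller" here.

```python
import random


def differential_evolution(fitness, dim, bounds=(-5.0, 5.0), pop_size=20,
                           F=0.6, CR=0.7, generations=200, seed=0):
    """Minimise `fitness` with a basic DE loop (illustrative parameters)."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Generate an initial population at random and compute its fitness.
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    scores = [fitness(x) for x in pop]
    for _ in range(generations):          # assumed stopping condition
        for p in range(pop_size):         # for each parent
            # Select three different solutions, none of them the parent.
            a, b, c = rng.sample([q for q in range(pop_size) if q != p], 3)
            k = rng.randrange(dim)        # position always taken from the mutant
            child = []
            for i in range(dim):
                if rng.random() <= CR or i == k:
                    v = pop[a][i] + F * (pop[b][i] - pop[c][i])   # mutation
                    child.append(min(max(v, lo), hi))             # clip to bounds
                else:
                    child.append(pop[p][i])                       # keep parent value
            child_score = fitness(child)
            # Selection: the offspring replaces the parent if it is the same or better.
            if child_score <= scores[p]:
                pop[p], scores[p] = child, child_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def sphere(x):
    """Hypothetical test objective: sum of squares, minimum 0 at the origin."""
    return sum(v * v for v in x)


best_x, best_f = differential_evolution(sphere, dim=3)
```

Replacing a parent immediately, inside the loop over parents, is one common variant; building the whole next generation first and swapping it in afterwards is another.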

The following numerical example is given to demonstrate the DE algorithm. We have the objective/fitness function:

Our goal is to find x1, x2, x3. We will follow the steps in the pseudo-code above to solve this problem.

Generate randomly an initial population of solutions:

Each individual or solution is represented as a three-dimensional vector, denoted as Xp = [xp.1, xp.2, xp.3], where values for x1, x2, and x3 are defined We initialize P individuals within the specified bounds of the interval [xmin, xmax], with xmin set to 0.

Chapter 3

Document collection representation

We have a document collection \( D = \{d_1, d_2, \dots, d_{|D|}\} \), in which \( |D| \) is the number of documents in the collection. Alternatively, D can be represented as the set of sentences in the collection:

\( D = \{s_1, s_2, \dots, s_n\} \), where n is the number of sentences in collection D.

Let \( T = \{t_1, t_2, \dots, t_m\} \) be the set of distinct terms in the collection D, where m is the number of distinct terms. Each sentence \( s_i \) can be expressed as \( s_i = \{w_{i1}, w_{i2}, \dots, w_{im}\} \), where \( w_{ik} \) is the weight of term \( t_k \) in sentence \( s_i \), calculated as

\[ w_{ik} = f_{ik} \times \log\left(\frac{n}{n_k}\right) \quad (1) \]

where \( f_{ik} \) is the number of occurrences of term \( t_k \) in sentence \( s_i \), \( n_k \) is the number of sentences in which \( t_k \) appears, and n is the total number of sentences in document collection D.

In a document collection, when a term \( t_k \) frequently appears in a specific sentence \( s_i \) but is infrequently found throughout the entire collection, the weight \( w_{ik} \) assigned to that term will be significantly high.

- \( w_{ik} \) is higher if \( t_k \) appears many times within a small number of sentences

- \( w_{ik} \) is lower if \( t_k \) appears fewer times in a sentence or occurs in many sentences

- \( w_{ik} \) is lowest if the term occurs in almost all sentences
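As a small worked illustration of formula (1), the sketch below computes the weight matrix for three toy sentences. The function name `term_weights`, the pre-tokenized input, and the choice of the natural logarithm are assumptions; the text does not state the base of the logarithm.

```python
import math
from collections import Counter


def term_weights(sentences):
    """Return (terms, W) with W[i][k] = f_ik * log(n / n_k), as in formula (1)."""
    n = len(sentences)
    counts = [Counter(s) for s in sentences]     # f_ik: term counts per sentence
    terms = sorted(set().union(*sentences))      # distinct terms t_1 .. t_m
    # n_k: number of sentences in which term t_k appears
    n_k = {t: sum(1 for c in counts if t in c) for t in terms}
    W = [[c[t] * math.log(n / n_k[t]) for t in terms] for c in counts]
    return terms, W


# Toy, already tokenized sentences (hypothetical data).
sentences = [
    ["storm", "hits", "coast", "storm"],
    ["storm", "damage", "coast"],
    ["rescue", "teams", "storm"],
]
terms, W = term_weights(sentences)
```

Here "storm" occurs in every sentence, so its weight is zero everywhere, while "rescue", which occurs in a single sentence, receives the largest per-occurrence weight.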

Our goal is an output \( S \subset D \), a set of sentences forming a summary.

Sentence-extractive summarization algorithms rely on two key criteria: relevance, which ensures that the selected sentences are essential, and non-redundancy, which eliminates overlapping content. These criteria are typically assessed independently to identify candidates that balance relevance and redundancy (Huang et al., 2010).

However, such methods do not guarantee an optimal summary, particularly if they produce many duplicate sentences. The approach used here focuses on three aspects of an effective summary:

- Content coverage: summary should contain significant sentences covering the main content of the documents

- Diversity: sentences carrying the same content should not be all in the summary

- Length: summary’s length should be restricted

Optimizing these three properties represents a global summarization challenge, where the inclusion of sentences in the summary is influenced not only by their individual characteristics but also by the attributes of all other sentences within the summary.

The summarization problem is defined as finding a vector \( U = [u_1, \dots, u_n] \) that maximizes \( f(U) = f_{cover}(U) / f_{diver}(U) \). This means increasing \( f_{cover}(U) \), the content coverage of the summary with respect to the original collection, while decreasing \( f_{diver}(U) \), which measures redundancy within the summary.

The fitness function is equivalent to maximizing

\[ f(U) = \frac{sim(O, O^S) \cdot \sum_{i=1}^{n} sim(O, s_i)\, u_i}{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} sim(s_i, s_j)\, u_i u_j} \quad (3) \]

subject to

\[ \sum_{i=1}^{n} l_i u_i \le L, \quad u_i \in \{0, 1\} \]

\( O \) and \( O^S \) are the mean vectors of collection D and summary S, respectively. The k-th coordinate \( O_k \) of the mean vector \( O \) is

\[ O_k = \frac{1}{n} \sum_{i=1}^{n} w_{ik}, \quad k = 1, \dots, m \quad (5) \]

The k-th coordinate \( O_k^S \) of the mean vector \( O^S \) is

\[ O_k^S = \frac{1}{n_S} \sum_{s_i \in S} w_{ik}, \quad k = 1, \dots, m \quad (6) \]

where \( n_S \) is the number of sentences in the summary S, \( l_i \) is the length (in words) of sentence \( s_i \), and \( u_i = 1 \) means sentence i is chosen to be included in the summary; otherwise \( u_i = 0 \).

It means the problem now has a constraint on summary length, which must be less than or equal to a specified L.

Radev et al. (2004) highlighted that the center of a document collection, \( O \), reflects its core content. The similarity \( sim(O, O^S) \) therefore evaluates the significance of the summary as a whole, while the sum \( \sum_{i=1}^{n} sim(O, s_i)\, u_i \) measures the relevance of the individual sentences in the summary. The denominator in formula (3) accumulates the similarity between each pair of sentences \( s_i \) (i = 1, …, n−1) and \( s_j \) (j = i+1, …, n).
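The quantities described above can be put together in a short sketch: mean vectors as in (5) and (6), a coverage term built from sim(O, O^S) and the per-sentence similarities to O, and a redundancy term summing pairwise similarities of the chosen sentences. Several points are assumptions rather than statements from the text: cosine similarity is used for sim, the helper names are hypothetical, and returning the coverage term alone when the pairwise sum is zero is just one way to avoid a division by zero.

```python
import math


def cosine(a, b):
    """Cosine similarity; assumed here as the sim(., .) measure."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fitness(W, u):
    """Coverage divided by redundancy for the binary selection vector u."""
    chosen = [w for w, ui in zip(W, u) if ui]
    if not chosen:
        return 0.0                      # empty summary: nothing is covered
    O = mean_vector(W)                  # centre of the collection
    O_S = mean_vector(chosen)           # centre of the summary
    cover = cosine(O, O_S) * sum(cosine(O, w) for w in chosen)
    diver = sum(cosine(chosen[i], chosen[j])
                for i in range(len(chosen) - 1)
                for j in range(i + 1, len(chosen)))
    # Assumption: with no pairwise overlap, fall back to the coverage term.
    return cover / diver if diver > 0 else cover


def summary_length(lengths, u):
    """Total length of the selected sentences, to compare against the limit L."""
    return sum(l for l, ui in zip(lengths, u) if ui)


# Toy sentence vectors and lengths (hypothetical data).
W = [[1.0, 0.0, 1.0], [1.0, 0.0, 0.9], [0.0, 1.0, 0.5]]
lengths = [12, 9, 7]
score = fitness(W, [1, 1, 0])
```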

Main steps of differential evolution

This section explains step by step the operation of the differential evolution algorithm in solving the problem of automatic text summarization, in particular, solving (1)

Generate randomly a population of P individuals:

Each individual is a real-valued vector:

\[ U_p(t) = [u_{p,1}(t), \dots, u_{p,n}(t)], \quad p = 1, 2, \dots, P \]

where P is the population size, n is the number of sentences in the document collection, and t is the generation number.

At generation 0, each element \( u_{p,i}(0) \) of individual \( U_p(0) \) is initialized as:

\[ u_{p,i}(0) = u_i^{min} + (u_i^{max} - u_i^{min}) \cdot rand_{p,i} \quad (7) \]

\( rand_{p,i} \) is a random value in [0,1], regenerated for each element of the vector \( U_p(0) \). The bounds \( u_i^{min} \) and \( u_i^{max} \) are typically set to -5 and 5, respectively.
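The initialization rule (7) can be illustrated with a few lines of Python. The function name `init_population` and the optional `seed` argument are assumptions added for reproducibility; the bounds default to the -5 and 5 mentioned in the text.

```python
import random


def init_population(pop_size, n, u_min=-5.0, u_max=5.0, seed=None):
    """Create pop_size real-valued vectors of length n, following formula (7)."""
    rng = random.Random(seed)
    # A fresh random number in [0, 1) is drawn for every element.
    return [[u_min + (u_max - u_min) * rng.random() for _ in range(n)]
            for _ in range(pop_size)]


population = init_population(pop_size=4, n=6, seed=1)
```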

Because we are working on the problem of text summarization, the solution vectors should be in binary representation.

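Convert the P real-valued vectors to P binary vectors using the formula:

\[ u_{p,i}(t) = \begin{cases} 1, & \text{if } rand_{p,i} < sigm(u_{p,i}(t)) \\ 0, & \text{otherwise} \end{cases} \quad (8) \]

where

\[ sigm(z) = \frac{1}{1 + \exp(-z)} \quad (9) \]

\( rand_{p,i} \) is a random number within [0,1], regenerated for each i-th component of the p-th vector.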
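The conversion from a real-valued vector to a binary one can be sketched as below. The names `sigm` and `binarize` and the explicit `rng` argument are illustrative assumptions; the comparison itself follows the rule stated in the text, where a component becomes 1 when a fresh random number falls below its sigmoid value.

```python
import math
import random


def sigm(z):
    """Sigmoid function from formula (9)."""
    return 1.0 / (1.0 + math.exp(-z))


def binarize(vector, rng):
    """Map each real component to 1 with probability sigm(component), else 0."""
    return [1 if rng.random() < sigm(x) else 0 for x in vector]


rng = random.Random(0)
bits = binarize([2.5, -3.0, 0.0, 4.8], rng)
```

Because the mapping is probabilistic, the same real-valued vector can yield different binary vectors on different draws; large positive components almost always map to 1 and large negative ones to 0.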

Calculate fitness value of each of P individuals in the population

The aim of this operator is to generate mutant vectors, which lets the algorithm widen its search directions and explore the search space.

For each target vector \( U_p(t) \), we choose three random vectors \( U_{p1}(t) \), \( U_{p2}(t) \) and \( U_{p3}(t) \), in which p, p1, p2 and p3 are all different from each other. The mutant vector is

\[ V_p(t) = U_{p1}(t) + F \cdot (U_{p2}(t) - U_{p3}(t)) \quad (10) \]

F: mutation factor, specifying the scale of the difference \( (U_{p2}(t) - U_{p3}(t)) \), often in the interval [0.4, 1.0] [19]

Figure 3.1 shows the position of vector \( V_p \) relative to vectors \( U_{p1} \), \( U_{p2} \) and \( U_{p3} \).

Figure 3.1 Illustration of mutation operation
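A minimal sketch of the mutation step is given below. It assumes the common form in which the first random vector is shifted by F times the difference of the other two, which matches the description of F as the scale of the difference between the second and third vectors; the function name `mutate` and the toy population are hypothetical.

```python
import random


def mutate(population, p, F, rng):
    """Build a mutant vector for target index p from three other individuals."""
    # p1, p2, p3 are distinct from each other and from p.
    candidates = [q for q in range(len(population)) if q != p]
    p1, p2, p3 = rng.sample(candidates, 3)
    return [a + F * (b - c)
            for a, b, c in zip(population[p1], population[p2], population[p3])]


rng = random.Random(0)
population = [[0.5, -1.0], [1.0, 2.0], [3.0, -4.0], [-2.0, 0.5], [4.0, 4.0]]
mutant = mutate(population, p=0, F=0.6, rng=rng)
```

The population needs at least four individuals, otherwise three partners different from the target cannot be drawn.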

Step 5: Check the boundary restriction:

The components of the mutant vector are checked for violations of the boundary constraints.

Formula (11) makes sure that \( v_{p,i}(t) \) always lies in the interval \( (u_i^{min}, u_i^{max}) \).

This operator enables offspring vectors to inherit characteristics from their parents, enhancing the diversity of their attributes. By combining the target vector with the mutant vector, a trial vector is produced.

The trial vector is built component by component: its i-th element is taken from the mutant vector, \( v_{p,i}(t) \), if \( rand_{p,i} \le CR \) or \( i = k \), and from the target vector, \( u_{p,i}(t) \), otherwise (12). \( rand_{p,i} \) is a random number within [0,1], refreshed for each i-th component of the p-th parameter vector.

The crossover constant CR, which ranges from 0 to 1, controls how many elements of the trial vector come from the mutant vector. For each parameter vector, a random integer k is chosen from the set {1, 2, …, n} to ensure that the population evolves. This guarantees that at least one element of the trial vector is taken from the mutant vector rather than from the target (parent) vector; without it, the trial vector could be identical to its parent and no new vector would be produced.

When the crossover rate CR is high, the trial vector is more likely to take most of its elements from the mutant vector rather than from the target (parent) vector. Figure 3.2 illustrates this for p = 2 and i = k = 4.

Figure 3.2 Illustration of crossover operation
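The crossover rule described above can be sketched as follows. The function name `crossover` is hypothetical, and indices are 0-based here whereas the text counts positions from 1.

```python
import random


def crossover(target, mutant, CR, rng):
    """Mix target and mutant into a trial vector (binomial crossover)."""
    n = len(target)
    k = rng.randrange(n)   # this position is always inherited from the mutant
    return [mutant[i] if (rng.random() <= CR or i == k) else target[i]
            for i in range(n)]


rng = random.Random(0)
target = [0.0, 0.0, 0.0, 0.0, 0.0]
mutant = [1.0, 1.0, 1.0, 1.0, 1.0]
trial = crossover(target, mutant, CR=0.7, rng=rng)
```

With CR close to 0 the trial vector differs from its parent in little more than the forced position k, while with CR = 1 it is a copy of the mutant vector.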

Convert real-valued trial vectors to binary trial vectors (the same as step 2)

However, the constraint on summary length must also be satisfied. These researchers manage this restriction as follows:

- Any feasible solution is preferred over any infeasible solution

- Two feasible solutions will be compared based on their fitness values

- Two infeasible solutions will be compared based on the extent of their constraint violations

Feasible solutions, which satisfy all restrictions, are thus prioritized over infeasible ones. However, the approach allows for the retention of infeasible solutions that possess high fitness values, ensuring that valuable alternatives are not discarded.
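The three comparison rules can be written as a small predicate. This is an illustrative sketch: representing a solution as a (fitness, length) pair, measuring the violation as the number of words over the limit L, and preferring the smaller violation between two infeasible solutions are assumptions, as are the names `violation` and `is_at_least_as_good`.

```python
def violation(length, L):
    """Amount by which a summary exceeds the length limit L (0 if feasible)."""
    return max(0, length - L)


def is_at_least_as_good(a, b, L):
    """Compare solutions a and b, each given as a (fitness, length) pair."""
    va, vb = violation(a[1], L), violation(b[1], L)
    if va == 0 and vb == 0:
        return a[0] >= b[0]     # both feasible: compare fitness values
    if va == 0 or vb == 0:
        return va == 0          # exactly one feasible: the feasible one wins
    return va <= vb             # both infeasible: smaller violation wins


L = 100
feasible_low = (0.2, 90)
infeasible_high = (0.9, 120)
prefer_feasible = is_at_least_as_good(feasible_low, infeasible_high, L)
```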

This operator keeps the population size constant. Between the target vector and the trial vector, the better one is selected to survive to the next generation.

In each generation, if the trial vector has a fitness value equal to or better than that of the target vector, it replaces the target vector; otherwise, the target vector is kept. This ensures that the quality of the population either improves or stays the same, and never declines.
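As a rough sketch of this one-to-one selection, assuming that a higher fitness value is better and that fitness values have already been computed, the step might look as follows. The function name `select` and the list-based representation are illustrative assumptions.

```python
def select(population, fitnesses, trials, trial_fitnesses):
    """One-to-one survivor selection: each trial competes only with its own target."""
    next_population, next_fitnesses = [], []
    for target, f_target, trial, f_trial in zip(
        population, fitnesses, trials, trial_fitnesses
    ):
        if f_trial >= f_target:
            # The trial is at least as good: it replaces the target.
            next_population.append(trial)
            next_fitnesses.append(f_trial)
        else:
            # Otherwise the target survives unchanged.
            next_population.append(target)
            next_fitnesses.append(f_target)
    return next_population, next_fitnesses
```

Because every slot keeps the better of two candidates, the population size stays constant and no individual's fitness decreases from one generation to the next.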

The evolution process continues by going back to step 2 until one of the following criteria is met:

- the best fitness of the population does not change considerably over successive iterations

- a specified CPU time limit is reached

- a pre-specified fitness value is attained

In this case, we choose the first one as the termination criterion.
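One possible way to test the first criterion is to compare the current best fitness with the best fitness a fixed number of generations earlier. The text does not give the window or the tolerance, so `window` and `tol` below are purely hypothetical values chosen for illustration.

```python
def has_stagnated(best_history, window=50, tol=1e-6):
    """Return True if the best fitness changed by less than `tol` over the last `window` generations.

    `best_history` holds the best fitness of each generation, oldest first.
    Both `window` and `tol` are assumed values, not taken from the thesis.
    """
    if len(best_history) <= window:
        # Not enough generations yet to judge stagnation.
        return False
    return abs(best_history[-1] - best_history[-1 - window]) < tol
```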

Return the best vector found so far as the final solution, and build the summary from it.

Experiment, result and discussion

We implemented the above algorithm for the task of multi-document summarization. The program is called [DE] for short.

We used the DUC2004 and DUC2007 (Document Understanding Conference) datasets to test our summarization methods, as shown in Table 3.1.

The DUC2004 dataset comprises 50 collections, each containing 10 documents, with a total of 150 to 650 sentences per collection. Each collection comes with four reference summaries written by four experts, averaging around six sentences in length.

The DUC2007 dataset comprises 45 document collections, each containing 25 documents, with a total of 300 to 1000 sentences per collection. Each collection comes with four expert-written reference summaries, each limited to 250 words and averaging around 12 sentences.

The original document collections are all in XML format, so we have to extract plain text before summarizing.

| | DUC2004 | DUC2007 |
|---|---|---|
| Number of documents in each collection | 10 | 25 |
| Number of sentences in each collection | 150-650 | 300-1000 |

Table 3.1 Description of the datasets used in the experiment
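The plain-text extraction step mentioned above could be sketched with Python's standard XML parser. The element names used here (`DOC`, `TEXT`, `P`) are assumptions made for the example; the actual layout of the collection files should be checked before relying on them, and files that are not well-formed XML would need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

def extract_plain_text(xml_string, text_tag="TEXT"):
    """Collect the text inside every <text_tag> element, with whitespace normalized.

    `text_tag` is a hypothetical tag name used only for illustration.
    """
    root = ET.fromstring(xml_string)
    parts = []
    for node in root.iter(text_tag):
        # itertext() walks nested elements such as paragraphs.
        raw = " ".join(node.itertext())
        parts.append(" ".join(raw.split()))
    return "\n".join(p for p in parts if p)
```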

We use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) package and take the average F-value to evaluate and compare our summaries [10, 11].

Several terms are related to summary evaluation, such as Precision, Recall and F-value.

When the text units extracted by the system are compared with those extracted by humans, "correct" is the number of text units extracted by both the system and the humans, "wrong" is the number of text units extracted by the system but not by the humans, and "missed" is the number of text units extracted by the humans but not by the system.

Precision measures the percentage of sentences extracted by the system that are relevant, while Recall measures the percentage of relevant sentences that the system extracted. In simpler terms, high Recall means that few relevant sentences were overlooked, but the output may also contain many irrelevant ones, which indicates low Precision. Conversely, high Precision means that the returned results are relevant, but some pertinent items may not have been found, which leads to low Recall.

The F-value is a weighted average (harmonic mean) of Precision and Recall; it is best at 1 and worst at 0.

The F-value is always a number between the values of Recall and Precision, and is higher when Recall and Precision are closer.
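The definitions above translate directly into a small calculation. The sketch below uses the usual formulas Precision = correct / (correct + wrong), Recall = correct / (correct + missed), and the weighted harmonic mean for the F-value; the function name and the `beta` weighting parameter (1.0 gives equal weight) are assumptions added for illustration.

```python
def precision_recall_f(correct, wrong, missed, beta=1.0):
    """Compute Precision, Recall and F-value from counts of text units."""
    precision = correct / (correct + wrong) if correct + wrong else 0.0
    recall = correct / (correct + missed) if correct + missed else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    # Weighted harmonic mean of precision and recall.
    f_value = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_value
```

For example, a system that extracts four sentences, three of which are among the six chosen by humans, has correct = 3, wrong = 1 and missed = 3, so its F-value lies between its Recall and its Precision.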

We use three ROUGE measures to evaluate the summaries: ROUGE-N, in the forms ROUGE-1 (unigrams) and ROUGE-2 (bigrams), and ROUGE-L, which is based on the longest common subsequence. The F-value from the ROUGE output is used to compare summaries.
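To make the three measures concrete, here is a simplified single-reference sketch of the ideas behind ROUGE-N and ROUGE-L. It is not the ROUGE package: tokenization is plain whitespace splitting, and there is no stemming, stop-word handling or support for multiple references. All function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f(candidate, reference, n=1):
    """F-value of clipped n-gram overlap between a candidate and one reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped match counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (basis of ROUGE-L)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

Calling `rouge_n_f(summary, reference, n=1)` and `rouge_n_f(summary, reference, n=2)` corresponds to the ROUGE-1 and ROUGE-2 settings, while `lcs_length` gives the match count from which ROUGE-L precision and recall are derived.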

| Parameter | DUC2004 | DUC2007 |
|---|---|---|
| Number of generations t_max | 1000 | 1000 |
| u_min | -5 | -5 |
| u_max | 5 | 5 |
| Goal: number of sentences in the summary | 6 | 12 |

Table 3.2 Parameter settings of the first experiment

Table 3.2 lists the parameters that need to be assigned values. Because this method is a stochastic population-based algorithm, we run the program 20 times and take the average as the final result. These parameters are the same as in the experimental setup described in [5].

We summarize the collections and use the ROUGE output to calculate the average F-values and the lengths of the generated summaries. To show how the summary length changes during summarization and how long the algorithm takes, we follow one representative document collection from DUC2004, which consists of 212 sentences, and one 507-sentence collection from DUC2007.

Figure 3.3 Changes in summary length in [DE] method on DUC2004

Figure 3.3 shows the changes in summary length during 1000 generations on DUC2004. The algorithm requires 135 minutes to compress a collection of 212 sentences into a summary of 25 sentences over 1000 generations. Notably, the reduction in length slows over time: the count decreases from 92 sentences at generation 0 to 37 sentences by generation 500, and then only to 25 sentences by generation 1000.

| Document collections | Original length | Summary length |
|---|---|---|
| d30001t | 212 | 25 |
| d30006t | 408 | 74 |
| d30011t | 250 | 34 |
| d30033t | 642 | 131 |

Table 3.3 Summary lengths of some document collections in DUC2004 using

Table 3.3 shows the summary lengths of several randomly selected document collections from DUC2004. None of the summaries meets the target length of six sentences.

Figure 3.4 Changes in summary length in [DE] method on DUC2007

Figure 3.4 illustrates the run of the differential evolution algorithm on DUC2007. The algorithm takes 204 minutes to complete 1000 generations, reducing a 507-sentence document collection to a summary of 230 sentences at generation 0 and 119 sentences at generation 1000. The length decreases more slowly in the latter half of the process: from 230 to 139 sentences in the first 500 generations, and then only from 139 to 119 sentences in the subsequent 500 generations. Overall, the method is not effective at reducing the summary length significantly.

Document collections Original length Summary length

Table 3.4 Summary lengths of some document collections in DUC2007 using

Table 3.4 shows the summary lengths of some randomly chosen document collections in DUC2007, confirming that the summaries are not shortened sufficiently, given that the objective is 12-sentence summaries.

Method

The [DE] method outlined in section [3.1.4] needs to be improved, chiefly because it reduces the summary length too slowly. Condensing 507 sentences to approximately 120 took 204 minutes, and the result is still far longer than our target of a 12-sentence summary. In addition, the F-score remains suboptimal, mainly because the summary is too long.

In sentence extraction methods based on ranking, each sentence is scored individually, so the summary can be compressed flexibly by selecting the highest-scoring sentences. In the current stochastic population-based approach, however, solutions are generated by various operators, which makes the summary length hard to control. The current method has the following limitations, which call for a new method with better control over summary length:

- It takes a very long time to summarize a document collection containing a large number of sentences

- The summary length is reduced more and more slowly during the summarization process

- The F-values are low when our summaries are compared with the experts' summaries

Our approach is multi-step summarization: we repeatedly summarize the previous summary until it reaches an acceptable length. This works because the early part of a run produces most of the reduction in length. By summarizing the output of the first round again, rather than continuing a run that has slowed down, the user receives a satisfactory summary much sooner.

We reduce the number of generations from 1000 to 150 for the DUC2004 dataset and from 1000 to 100 for the DUC2007 dataset, while keeping all other parameters the same as in the first experiment. After the first run of 100-150 generations, the generated summary is summarized again with another run of 100-150 generations, and this is repeated until the resulting summary reaches an acceptable length.

This method reduces the search space and significantly accelerates the search. As a result, summarization takes less time and the summary length is easier to control.

We call this method [MultiDE] for short.
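The multi-step procedure can be outlined as a simple loop around a single short run. In this sketch, `summarize_once` is a hypothetical stand-in for one DE run of 100-150 generations that takes a list of sentences and returns a shorter list; `max_rounds` and the no-progress guard are safety additions that are not part of the description above.

```python
def multi_step_summarize(sentences, summarize_once, target_length, max_rounds=20):
    """Repeatedly summarize the previous summary until it is short enough.

    `summarize_once` is an assumed callable: list of sentences -> shorter list.
    """
    current = list(sentences)
    for _ in range(max_rounds):
        if len(current) <= target_length:
            break
        shorter = summarize_once(current)
        if len(shorter) >= len(current):
            # No further reduction: stop instead of looping forever.
            break
        current = shorter
    return current
```

Each round works on a smaller input than the previous one, which is why the search space shrinks from step to step.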

Experiment, result and discussion

The datasets are the same as for the previous method [DE].

The ROUGE package is again used to evaluate our results.

| Parameter | DUC2004 | DUC2007 |
|---|---|---|
| Number of generations t_max | 150 | 100 |
| u_min | -5 | -5 |
| u_max | 5 | 5 |
| Goal: number of sentences in the summary | 6 | 12 |

Table 3.6 Parameter settings of the second experiment

We run the program with the settings in Table 3.6 to obtain a summary, then continue summarizing the returned summary until the summary length is satisfactory.

The results of our experiment are as follows.

Figure 3.5 Summary length in [MultiDE] method on DUC2004

Figure 3.6 Summary length in [MultiDE] method on DUC2007

Figures 3.5 and 3.6 illustrate the effective use of multi-step summarization in differential evolution, yielding promising results A 6-sentence summary for DUC2004 was generated in just 12 minutes, while a more detailed 12-sentence summary for DUC2007 took 114 minutes to complete.

| Document collections | Original length | Summary length |
|---|---|---|
| d30001t | 212 | 8 |
| d30006t | 408 | 8 |
| d30011t | 250 | 2 |
| d30033t | 642 | 8 |

Table 3.7 Summary lengths of some document collections in DUC2004 using

Document collections Original length Summary length

Table 3.8 Summary lengths of some document collections in DUC2007 using

Table 3.7 and Table 3.8 show the summary lengths of four randomly chosen document collections in DUC2004 and DUC2007 respectively, confirming that the summaries are shortened sufficiently.

Table 3.9 presents the summary quality obtained with the differential evolution algorithm combined with the multi-step summarization method.

Table 3.9 F-Values of three evaluation measures of method [MultiDE] on

The comparison of summary quality between the two methods, [DE] and [MultiDE], shows that multi-step summarization yields summaries that are closer to the expert summaries. This improvement in quality is shown in Figures 3.7 and 3.8.

Figure 3.7 Comparison between F-values of [DE] and [MultiDE] on DUC2004

Figure 3.8 Comparison between F-values of [DE] and [MultiDE] on DUC2007

This chapter introduces the DE algorithm for automatic text summarization and conducts two experiments to compare its effectiveness in controlling summary length The results indicate that our method efficiently meets user requirements for summary length while simultaneously enhancing summary quality.

Chapter 4

This chapter summarizes the contributions of this thesis and outlines some future extensions.

In this thesis, we studied evolutionary algorithms, in particular differential evolution, and applied DE to a practical problem: automatic text summarization. A new method of handling summary length has been proposed.

In particular, 45 collections of 25 documents each from DUC2007 and 50 collections of 10 documents each from DUC2004 were summarized using both the original and the improved DE methods. These summaries were evaluated against those created by experts, showing that our proposed method outperformed the earlier approach suggested by other researchers.

In future studies, we will explore other evolutionary algorithms, including genetic algorithms (GA) and genetic programming (GP), focusing on their application to both single-document and multi-document text summarization. We also aim to test various methods for handling constraints, particularly the constraint on summary length.

Date posted: 17/12/2023, 01:51
