
Summarizing Reviews For E-Commerce Website.pdf


DOCUMENT INFORMATION

Basic information

Title: Summarizing Reviews For E-Commerce Website
Author: Nguyễn Minh Khôi
Supervisor: Dr. Lê Thị Ngọc Thơ
School: Ho Chi Minh City University of Technology
Major: Information Technology
Document type: Graduation Project
Year of publication: 2024
City: Ho Chi Minh City
Format: PDF
Pages: 49
Size: 3.72 MB

Content


HUTECH
MINISTRY OF EDUCATION & TRAINING
Đại học Công nghệ Tp.HCM
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

GRADUATION PROJECT

SUMMARIZING REVIEWS FOR E-COMMERCE WEBSITE

Supervisor: Dr. Lê Thị Ngọc Thơ

1. Nguyễn Minh Khôi, Student's ID: 2011063897, Class: 20DTHQA1

Ho Chi Minh City, June 2024


COMMENTS OF THE LECTURER

Student's full name:

1. Nguyễn Minh Khôi, Student's ID: 2011063897, Class: 20DTHQA1

Lecturer's comments:

TABLE OF CONTENTS

2.2.1 Sequence-to-sequence .......................... 8
2.2.2 Encoder-Decoder Architecture .................. 10
2.2.3 Fine-Tuning BART for Summarization ............ 11
2.3 Tool And Environment ............................ 13
2.3.1 Anaconda ...................................... 13
2.3.2 … ............................................. 16
Chapter 3: APPROACH ................................. 19
3.1 Transformers Model .............................. 19
3.2 Data preparation ................................ 20
3.2.1 Text Collection and Pre-processing ............ 20
3.2.2 Training Data Curation ........................ 22
3.2.3 Data Evaluation and Refinement ................ 24
Chapter 4: IMPLEMENTATION AND RESULTS ............... 25
4.1 Implementing and Training ....................... 25
4.1.1 Using Fine-Tuning ............................. 25
… ................................................... 29
4.2 … ............................................... 30
4.3 Deploy model on the e-commerce website .......... 34
Chapter 5: CONCLUSIONS .............................. 36

LIST OF ACRONYMS

RNN: Recurrent Neural Network
LSTM: Long Short-Term Memory
T5: Text-to-Text Transfer Transformer
Abstractive Summarization (acronym unreadable in the scan)

LIST OF FIGURES

Figure 1: Encoder-decoder architecture .............. 9
Figure 2: Example Encoder-decoder ................... 9
Figure 3: Using BART for text summarization ......... 11
Figure 5: Sample of Mobile Reviews .................. 30
Figure 6: Result of original model .................. 30
Figure 7: Sample of Mobile Reviews .................. 30
Figure 8: Result of Summarizing Mobile Reviews ...... 31
Figure 9: Sample of Houseware Reviews ............... 31
Figure 10: Result of Summarizing Houseware Reviews .. 32
Figure 11: Sample of Spelling Error Reviews ......... 33
Figure 12: Result of Summarizing Spelling Error Reviews 33
Figure 13: Mobile E-Commerce Website ................ 34
Figure 14: Comment and Summaries Comment ............ 34
Figure 15: … ........................................ 35
Figure 16: Place to Customer Comment ................ 35

LIST OF TABLES

Table 1: Sample of crawling comments
Table 2: Example of Summaries

GUARANTEE

I guarantee that this is my own research. The figures and results stated in this Assignment are honest and have never been published by anyone in any other project.

I guarantee that all help received for this thesis has been acknowledged and that the information cited in this Assignment has had its origin indicated.

ACKNOWLEDGEMENT

I would like to express my sincere gratitude to my supervisor, Dr. Lê Thị Ngọc Thơ, for her invaluable guidance, support, and encouragement throughout the development of this graduation project. Her expertise, insightful feedback, and dedication have been instrumental in shaping the direction and quality of this work.

I would also like to extend my appreciation to the faculty and staff of HUTECH for providing the necessary resources, infrastructure, and academic environment that enabled me to successfully complete this project.

Furthermore, I am grateful to my family and friends for their unwavering support and encouragement during this journey. Their understanding and motivation have been a constant source of strength and inspiration.

Finally, I would like to acknowledge the contribution of the e-commerce platforms and the users whose reviews and comments were utilized in this study. Without their valuable data, this research would not have been possible.

ABSTRACT

This research project aims to develop a system that aggregates user-generated product reviews, ratings, and comments on Vietnamese e-commerce sites. The main purpose is to create concise but high-quality summaries, helping users easily grasp the necessary information without being overloaded. The study exploits both extractive and abstractive summarization methods to achieve this goal. The background chapter explains the two main text summarization techniques: extractive and abstractive. It also discusses the use of Transformers and the Encoder-Decoder architecture for summarization, introducing the BART model and its application to summarization. Additionally, the chapter provides an overview of the Anaconda environment used in the project.

The approach focuses on the Transformers model and outlines the data preparation steps, including text collection, pre-processing, training data curation, and data evaluation and refinement. This ensures the quality and relevance of the data used to train the summarization model.

The experiment section describes the implementation and training of the summarization model using fine-tuning. It also evaluates the empirical results of the model, providing insights into its performance.

In the conclusion, the project summarizes the key findings, the results achieved, and the contributions of the work. Additionally, it discusses new proposals for future research and development to further enhance the capabilities of the summarization system.

Overall, the project aims to provide a valuable tool for both customers and e-commerce businesses by generating concise and informative product review summaries, improving the overall user experience on Vietnamese e-commerce platforms.


Chapter 1: INTRODUCTION

1.1 Overview of the Project

This research project aims to develop a system that aggregates user-generated product reviews, ratings, and comments on Vietnamese e-commerce sites. The main purpose is to create concise but high-quality summaries, helping users easily grasp the necessary information without being overloaded.

This study will exploit both extractive and abstractive summarization methods. The extractive method identifies the most important sentences or phrases from the original text and combines them. Meanwhile, the abstractive method creates new sentences that capture the essence of the text, rather than simply rearranging existing content.

A significant feature of this project is the processing of Vietnamese data, which differs from English and some other languages. Issues such as handling language specificity, handling encoding conversions, and using contextual and cultural information need to be addressed.

In short, this project is important in improving the user experience with product reviews on Vietnamese e-commerce sites. I hope this research will achieve positive results, contributing to improving the quality of information and the user experience.

This project will include several main parts:

- Pre-process data: Collect and clean raw text data from e-commerce platforms. Encode the text data to prepare it for the next processing steps.

- Model choosing: Evaluate and choose advanced machine learning models, such as BART or T5, suitable for review summarization. The chosen model will be used to generate summaries from the input data.

- Refine the model: Adjust the selected models to fit the specific requirements of product review summarization on the e-commerce platform. The tuning process will help the models achieve optimal performance in the specific application context.

- Compare and evaluate: Measure the quality of the summaries produced using appropriate metrics, taking into account limitations such as the training loss over epochs. This evaluation will help determine the performance of the models and guide the tuning process, thereby improving the results.

The main goal of the project is to provide a valuable tool for both buyers and e-commerce businesses. Customers will benefit from concise summaries, helping them make more informed purchasing decisions. For businesses, the system will help them better understand customer sentiment and feedback, thereby improving service quality to better meet consumer needs.

1.2 Motivation

The main motivation for this research project was the major challenge posed by the large amount of user-generated content on e-commerce platforms. As Vietnamese e-commerce continues to grow, these platforms are accumulating more and more product reviews, ratings, and comments from customers. With the ever-increasing volume of user-generated data, effectively analyzing and summarizing this content becomes a significant challenge. An automatic review summarization system will help solve this problem, benefiting both buyers and businesses.

For customers, considering and evaluating product reviews on e-commerce platforms can be a difficult and time-consuming job. It is not practical to manually read and analyze all reviews to find the important information needed for a purchasing decision. Information overload can make buyers feel tired and discouraged, which can hinder the overall shopping experience. They need a good tool to summarize reviews, helping them quickly grasp core product information and make more informed decisions.

For businesses, understanding customer feedback is essential to improve products, services, and overall satisfaction, and they need good tools to do so: collecting, compiling, analyzing, synthesizing, and extracting meaningful insights from extensive customer feedback data. Processing and aggregating large volumes of user-generated content is an extremely difficult and resource-intensive endeavor. Therefore, applying automation solutions will help businesses improve the efficiency and quality of customer feedback analysis, thereby making data-based decisions instead of relying primarily on intuition. This will help improve products, services, and customer experiences.

1.3

Developing an automated review summarization system can address these challenges by providing concise and informative summaries of user-generated content. By leveraging the latest advances in natural language processing, especially the power of large transformer models, this system can automatically generate high-quality summaries, helping businesses significantly improve the efficiency and quality of customer feedback analysis. These summaries will provide managers and strategists with important, detailed, and reliable information about customer feedback and sentiment, from which they can make informed, data-driven decisions to improve products, services, and customer experiences.

By creating high-quality summaries, this project can bring significant benefits to both customers and businesses, while contributing to the development of the field of natural language processing in Vietnam. By addressing these motivations, this research project aims to have a meaningful impact on both the user experience and the operational efficiency of Vietnamese e-commerce platforms.

Purpose of the research project

In this research project, our purpose is twofold: technical exploration and practical application.

a. Technical Exploration:

The project will investigate the effectiveness of large transformer models for the task of e-commerce review summarization in the Vietnamese language. This will involve exploring the capabilities and limitations of various transformer architectures, such as BART and T5, and fine-tuning them to achieve optimal performance on the specific requirements of the summarization task.

Through this technical exploration, the project aims to:

- Understand the suitability of large transformer models for handling Vietnamese e-commerce data, which may have unique linguistic and contextual challenges compared to English-language counterparts.

- Experiment with different fine-tuning techniques and hyperparameter configurations to enhance the accuracy and coherence of the generated summaries.

- Contribute to the ongoing research in the field of natural language processing, particularly in the areas of text summarization and the application of transformer models to non-English languages.

b. Practical Application:

In addition to the technical exploration, the project also aims to develop a usable system that can provide valuable summaries to users of Vietnamese e-commerce platforms. This practical application component of the project will focus on addressing real-world challenges, such as:

- Handling large-scale e-commerce data, including efficiently processing and storing the user-generated content.

- Designing an intuitive user interface that seamlessly integrates the summarization functionality into the e-commerce platform.

- Ensuring the summaries generated by the system are accurate, concise, and informative, meeting the needs of both shoppers and businesses.

By addressing both the technical and practical aspects of the project, the research aims to contribute to the advancement of natural language processing techniques while also providing a tangible solution to improve the user experience and operational efficiency of Vietnamese e-commerce platforms.


There are two main approaches to text summarization:

Sentence Scoring: The first step is to score each sentence in the input text based on its importance or relevance to the overall content. This is typically done by extracting various features from the sentence, such as term frequency, position in the document, presence of proper nouns, sentence length, and sentence-to-sentence similarity.

Sentence Selection: Once the sentences have been scored, the next step is to select the most important sentences to include in the summary. This can be done using optimization techniques, such as submodular optimization, integer linear programming, or greedy algorithms, to choose the subset of sentences that maximizes the overall quality of the summary.
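As a concrete, hypothetical sketch (not code from this thesis), the score-then-select pipeline can be illustrated with a minimal frequency-based scorer and greedy selection; a real system would add stopword removal, position features, and redundancy control:

```python
import re
from collections import Counter

def summarize_extractive(text, num_sentences=2):
    """Score sentences by average word frequency, then pick the top ones."""
    # Naive sentence splitting on ., !, ?
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))

    # Sentence Scoring: mean corpus frequency of the sentence's words
    def score(sentence):
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Sentence Selection: greedily keep the highest-scoring sentences,
    # restoring their original order in the document
    ranked = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return [s for s in sentences if s in ranked]
```

For instance, `summarize_extractive("The battery is great. The battery lasts long. Shipping was slow.", 1)` keeps a battery sentence, since "battery" is the most frequent content word.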

TextRank

One popular sentence scoring algorithm used in extractive summarization is TextRank. TextRank is a graph-based algorithm that models the text as a network, with sentences as nodes and edges representing the similarity between sentences. The algorithm then uses the concept of eigenvector centrality to identify the most important sentences.

The TextRank algorithm works as follows:


Construct the Graph: The first step is to construct a graph representation of the text, where each sentence is a node, and the edges between nodes represent the similarity between the corresponding sentences. The similarity between two sentences can be computed using various techniques, such as cosine similarity between the sentence vectors.

Compute Sentence Scores: The algorithm then applies the PageRank algorithm to the constructed graph to compute the centrality score of each sentence. The PageRank score reflects the importance of a sentence based on its connections (similarities) to other important sentences in the text.

Select Top Sentences: Finally, the algorithm selects the sentences with the highest PageRank scores to include in the summary, up to a target summary length or compression ratio.
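As an illustrative sketch (again, not the project's code), these three steps can be implemented with bag-of-words cosine similarity and a power-iteration PageRank:

```python
import re
import numpy as np

def textrank_summary(text, num_sentences=2, d=0.85, iters=50):
    """Toy TextRank: sentence-similarity graph + PageRank scores."""
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    vocab = sorted(set(re.findall(r"\w+", text.lower())))
    index = {w: i for i, w in enumerate(vocab)}

    # Bag-of-words vector for each sentence
    vectors = np.zeros((len(sentences), len(vocab)))
    for i, s in enumerate(sentences):
        for w in re.findall(r"\w+", s.lower()):
            vectors[i, index[w]] += 1

    # 1) Construct the graph: cosine similarity between sentence vectors
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1, norms)
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0)

    # 2) Compute sentence scores with PageRank (power iteration)
    n = len(sentences)
    out = sim.sum(axis=1, keepdims=True)
    transition = sim / np.where(out == 0, 1, out)  # row-stochastic
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1 - d) / n + d * transition.T @ scores

    # 3) Select the top-scoring sentences, kept in document order
    top = set(np.argsort(scores)[-num_sentences:])
    return [s for i, s in enumerate(sentences) if i in top]
```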

The TextRank algorithm has the advantage of being unsupervised and language-independent, making it applicable to a wide range of summarization tasks. It has been shown to perform well in comparison to other extractive summarization approaches, particularly for longer and more complex documents.

In addition to TextRank, there are several other sentence scoring algorithms used in extractive summarization, such as LexRank and Latent Semantic Analysis (LSA), each with its own strengths and characteristics.

Abstractive summarization is generally considered more challenging than extractive summarization, as it requires advanced natural language processing capabilities, such as semantic understanding, logical reasoning, and language generation. However, abstractive summaries can potentially provide more insightful and comprehensive representations of the original text.

The choice between extractive and abstractive summarization often depends on the specific requirements of the task, the availability of training data, and the capabilities of the underlying natural language processing models.

Abstractive Summarization

Abstractive summarization is a more sophisticated approach to text summarization, where the model generates new, concise sentences that capture the key information and essence of the input text. Unlike extractive summarization, which selects and combines existing sentences, abstractive summarization aims to produce novel, human-like summaries.

The key distinctions between abstractive and extractive summarization approaches are as follows:

Sentence Generation: Extractive summarization selects and extracts existing sentences from the input text to form the summary, while abstractive summarization generates novel, human-like sentences that convey the main ideas and information in a concise manner.

Semantic Understanding: Extractive summarization primarily focuses on identifying the most important sentences based on surface-level features like word frequency, position, and keyword matching. In contrast, abstractive summarization requires a deeper understanding of the semantics, context, and relationships within the text to produce meaningful and relevant summaries.

Paraphrasing and Abstraction: Extractive summarization tends to retain the original wording and structure of the selected sentences, whereas abstractive summarization involves paraphrasing the key points and representing them in a more abstract way, leading to more concise and readable summaries.

Implementation Complexity: Extractive summarization is relatively simpler to implement, as it often relies on statistical or rule-based approaches to identify the most important sentences. Abstractive summarization, on the other hand, is generally more challenging to design and train, as it requires the model to learn to generate coherent and fluent text while maintaining accuracy and relevance to the input.

Output Quality: While extractive summarization may sometimes lack cohesion and readability due to the patchwork of extracted sentences, abstractive summarization has the potential to produce more coherent, fluent, and informative summaries by generating new text that accurately captures the essence of the input.


Overall, the tradeoff between the two approaches is that abstractive summarization is more complex to implement but can potentially yield higher-quality and more concise summaries by generating novel sentences that reflect a deep understanding of the semantics and context of the input text.

2.2 Transformers in Summarization

The Transformer [1] architecture relies heavily on self-attention mechanisms to process input sequences. This allows the model to weigh the importance of different words in a sentence relative to each other, enabling it to capture context and meaning more effectively than previous recurrent models [2].

2.2.1 Sequence-to-sequence

Sequence-to-sequence (seq2seq) is a deep learning model architecture that is commonly used for tasks that involve transforming one sequence of text or data into another sequence, such as:

- Machine translation - Translating a sequence of text from one language to another

- Text summarization - Summarizing a long document or text into a shorter, concise version

- Question answering - Generating an answer given a question

- Text generation - Generating new text given some starting prompt

The key characteristics of a seq2seq model are:

- Encoder: The encoder takes the input sequence (e.g. a sentence in one language) and encodes it into a fixed-length vector representation, or "thought vector".

- Decoder: The decoder takes the encoded vector representation and generates the output sequence (e.g. the translated sentence in another language) one token at a time.

[Figure 1: Encoder-decoder architecture]

The seq2seq approach allows the model to handle variable-length input and output sequences, which is useful for tasks where the length of the input and output can vary significantly. During training, the model learns to map the input sequence to the corresponding output sequence. For example:

[Figure 2: Example Encoder-decoder [5]: the encoder builds a representation of the source sentence and gives it to the decoder, which generates the target sentence token by token until <eos>.]

Seq2seq models have been widely successful in a variety of natural language processing and generation tasks, as they provide a flexible and powerful way to transform one sequence into another.


2.2.2 Encoder-Decoder Architecture

Encoder

The encoder processes the input sequence and transforms it into a context-rich representation. In the context of Transformers, the encoder consists of multiple layers, each comprising the following main components:

a. Multi-Head Self-Attention Mechanism:

- This mechanism allows the model to focus on different parts of the input sequence when computing the representation for each word.

- Multiple attention heads enable the model to capture various types of dependencies and relationships within the input sequence.

b. Feed-Forward Neural Network (FFN):

- Each position's output from the self-attention layer is passed through a feed-forward network.

- Typically consists of two linear transformations with a ReLU activation in between.

- Applies the same set of weights to all positions, making the process parallelizable.

c. Add & Norm:

- The outputs of the self-attention and feed-forward layers are processed through residual connections followed by layer normalization. This helps in stabilizing and speeding up the training process.

Each encoder layer takes the previous layer's output as input, with the initial layer receiving the embedded representation of the input tokens combined with positional encodings.
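The components above can be sketched in a few lines of numpy. This is an illustrative single-head simplification, not the project's implementation; the dimensions and random weights are toy values:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean, unit variance
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def encoder_layer(x, Wq, Wk, Wv, W1, W2):
    """One simplified encoder layer: self-attention -> Add&Norm -> FFN -> Add&Norm."""
    d_k = Wq.shape[1]
    # Scaled dot-product self-attention (single head for brevity)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_k)) @ v
    x = layer_norm(x + attn)          # residual connection + layer norm
    # Position-wise feed-forward network with ReLU, same weights at every position
    ffn = np.maximum(0, x @ W1) @ W2
    return layer_norm(x + ffn)        # second residual connection + layer norm

# Example: 3 tokens with model dimension 4
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 4))
out = encoder_layer(x, Wq, Wk, Wv, W1, W2)
```

The output has the same shape as the input, which is what lets encoder layers be stacked.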

Decoder

The decoder likewise consists of multiple layers. In addition to masked multi-head self-attention over the previously generated tokens, each decoder layer contains:

Multi-Head Attention Mechanism over the Encoder's Output:

This layer allows each position in the decoder to attend to all positions in the encoder's output. It helps the decoder utilize the context provided by the encoder and facilitates the alignment between the input and the output sequences.

Feed-Forward Neural Network (FFN):

Similar to the feed-forward network in the encoder layers. Applies the same set of weights to all positions.

Add & Norm: As with the encoder, residual connections and layer normalization are applied after each sub-layer.

2.2.3 Fine-Tuning BART for Summarization

BART (Bidirectional and Auto-Regressive Transformers) is a large transformer model that has been pre-trained on a large corpus of text data. It is particularly well-suited for text generation tasks, including summarization.

The process of fine-tuning BART for summarization involves the following steps:

a. Pre-training BART:

Before fine-tuning, BART goes through an important pre-training stage. During this period, the model is trained on a large and diverse text dataset using self-supervised objectives, such as masked and corrupted (denoising) language modeling. These techniques help BART learn robust language representations, as well as gain general knowledge about text.

This pre-training plays an important role, providing a solid foundation before the task-specific training. It helps the model learn the characteristics and rules of the language more effectively in the later fine-tuning process.

b. Adapting the Model Architecture:

For summarization, the BART model architecture is adjusted to suit the task. Specifically, the output layer of the model is modified: instead of reconstructing the input text, it generates a target summary.

The model's encoder-decoder structure is very suitable for summarization tasks. The encoder can grasp the semantics and context of the input text, while the decoder uses this information to create a high-quality summary.

This architectural adjustment plays to BART's strengths. The encoder can deeply understand the content and context of the text, while the decoder focuses on creating a summary that is as concise as possible. As a result, BART can produce summaries of outstanding quality.

c. Fine-Tuning on Summarization Data:

The pre-trained BART model is then fine-tuned on a dataset of texts and their corresponding summaries. In this process, the model parameters are updated to optimize performance on the summarization task, taking advantage of the knowledge gained from the pre-training stage.

Fine-tuning has helped improve the performance of the model on the summarization task. By updating the parameters of the model based on summarization data, the BART model learns how to make the most of the knowledge gained from the initial pre-training stage.

As a result, the BART model can create higher-quality summaries that are better suited to users' needs. This brings many practical benefits in real applications, such as helping users save time when reading long documents, as well as supporting the quick summarization and synthesis of information.

Fine-tuning demonstrates the importance of maximizing information from previous training stages, helping to improve the performance of the model on specific tasks. This is a typical example of how machine learning can be effectively applied in practice.

d. Inference and Generation:

When fine-tuning is completed, the model can be used to generate summaries for new input text. The model takes the input text, encodes it through the transformer encoder, and then uses the decoder to generate the summary token by token.

By fine-tuning the powerful BART model on summarization data, the system can leverage the model's strong language understanding capabilities and adapt them to the specific task of generating high-quality, concise summaries. This approach has been shown to produce state-of-the-art results on various text summarization benchmarks.
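A hypothetical sketch of this inference step using the Hugging Face transformers library is shown below. The checkpoint name facebook/bart-large-cnn and the generation settings are illustrative assumptions, not values taken from the thesis, and build_input is a made-up helper for concatenating reviews:

```python
# Illustrative sketch of BART-based review summarization; the checkpoint
# name and generation settings are assumptions, not the thesis's values.

def build_input(reviews, max_chars=1024):
    """Concatenate raw reviews into one input document for the model,
    truncating to a rough character budget."""
    joined = " ".join(r.strip() for r in reviews if r.strip())
    return joined[:max_chars]

def summarize(reviews):
    # Deferred import so the pure-Python helper above stays testable offline.
    from transformers import BartForConditionalGeneration, BartTokenizer

    name = "facebook/bart-large-cnn"  # assumed checkpoint
    tokenizer = BartTokenizer.from_pretrained(name)
    model = BartForConditionalGeneration.from_pretrained(name)

    inputs = tokenizer(build_input(reviews), return_tensors="pt",
                       truncation=True, max_length=1024)
    # Beam search decodes the summary token by token, as described above
    ids = model.generate(inputs["input_ids"], num_beams=4,
                         max_length=80, early_stopping=True)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

A Vietnamese deployment would substitute a checkpoint fine-tuned on Vietnamese review/summary pairs, since bart-large-cnn is trained on English news.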

2.3 Tool And Environment

2.3.1 Anaconda

* Overview

As of January 2022, there was no widely recognized "Anaconda Environmental Model" [6], though there may have been new developments or releases since then. The following is an overview of Anaconda in the field of data science and Python package management.

a. Anaconda Distribution

Description: Anaconda is an open-source distribution of the Python and R programming languages for data science, machine learning, and scientific computing. It aims to simplify the process of installing, managing, and deploying data science environments and libraries.

Components: The Anaconda distribution includes a core Python or R interpreter along with a collection of commonly used libraries and tools for data science, such as NumPy, pandas, Jupyter, scikit-learn, and many others.

b. Conda Package Manager

Description: Conda is the package manager and environment manager included with the Anaconda Distribution. It allows users to easily install, update, and manage dependencies in a variety of software environments.

Features: Conda allows creating isolated environments, making it easier to manage different library versions and avoid conflicts between dependencies. It supports both Python and non-Python packages.
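As a brief illustration of this workflow (the environment name and package list here are hypothetical, not taken from the project's setup):

```shell
# Create an isolated environment with a pinned Python version
conda create -n review-summarizer python=3.10 -y

# Activate it and install commonly used data-science packages
conda activate review-summarizer
conda install numpy pandas jupyter scikit-learn -y

# Inspect available environments and installed packages
conda env list
conda list
```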

c. Anaconda Navigator

Description: Anaconda Navigator is a graphical user interface (GUI) for managing Conda packages and environments. It provides an easy way to create, manage, and switch between environments.

Functionality: The Navigator includes a package manager, an environment manager, and a suite of popular data science applications, all accessible through a user-friendly interface.

Posted: 19/08/2024, 19:16