MINISTRY OF EDUCATION & TRAINING
HUTECH - HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

GRADUATION PROJECT

SUMMARIZING REVIEWS FOR E-COMMERCE WEBSITE

Supervisor: Dr. Lê Thị Ngọc Thơ
1. Nguyễn Minh Khôi - Student's ID: 2011063897 - Class: 20DTHQA1

Ho Chi Minh City, June 2024
COMMENTS OF THE LECTURER

Student's full name:
1. Nguyễn Minh Khôi - Student's ID: 2011063897 - Class: 20DTHQA1

Lecturer's comments:
TABLE OF CONTENTS

2.2.1 Sequence-to-sequence .......... 8
2.2.2 Encoder-Decoder Architecture .......... 10
2.2.3 Fine-Tuning BART for Summarization .......... 11
2.3 Tool And Environment .......... 13
2.3.1 Anaconda .......... 13
Chapter 3: APPROACH .......... 19
3.1 Transformers Model .......... 19
3.2 Data preparation .......... 20
3.2.1 Text Collection and Pre-processing .......... 20
3.2.2 Training Data Curation .......... 22
3.2.3 Data Evaluation and Refinement .......... 24
Chapter 4: IMPLEMENTATION AND RESULTS .......... 25
4.1 Implementing and Training .......... 25
4.1.1 Using Fine-Tuning .......... 25
4.2 Empirical results .......... 30
4.3 Deploy model on the e-commerce website .......... 34
Chapter 5: CONCLUSIONS .......... 36
LIST OF ACRONYMS

RNN: Recurrent Neural Network
LSTM: Long Short-Term Memory
T5: Text-to-Text Transfer Transformer
AS: Abstractive Summarization
LIST OF FIGURES

Figure 1: Encoder-decoder architecture .......... 9
Figure 2: Example Encoder-decoder .......... 9
Figure 3: Using BART for text summarization .......... 11
Figure 5: Sample of Mobile Reviews .......... 30
Figure 6: Result of original model .......... 30
Figure 7: Sample of Mobile Reviews .......... 30
Figure 8: Result of Summarizing Mobile Reviews .......... 31
Figure 9: Sample of Houseware Reviews .......... 31
Figure 10: Result of Summarizing Houseware Reviews .......... 32
Figure 11: Sample of Spelling Error Reviews .......... 33
Figure 12: Result of Summarizing Spelling Error Reviews .......... 33
Figure 13: Mobile E-Commerce website .......... 34
Figure 14: Comment and Summaries Comment .......... 34
Figure 16: Place for Customer Comment .......... 35
LIST OF TABLES

Table 1: Sample of crawled comments
Table 2: Example of Summaries
GUARANTEE
I guarantee that this is my own research. The figures and results stated in this Assignment are honest and have never been published by anyone in any other project.

I guarantee that all help received for this thesis has been acknowledged and that the information cited in the Assignment has been indicated with its origin.
ACKNOWLEDGEMENT
I would like to express my sincere gratitude to my supervisor, Dr. Lê Thị Ngọc Thơ, for her invaluable guidance, support, and encouragement throughout the development of this graduation project. Her expertise, insightful feedback, and dedication have been instrumental in shaping the direction and quality of this work.

I would also like to extend my appreciation to the faculty and staff of HUTECH for providing the necessary resources, infrastructure, and academic environment that enabled me to successfully complete this project.

Furthermore, I am grateful to my family and friends for their unwavering support and encouragement during this journey. Their understanding and motivation have been a constant source of strength and inspiration.

Finally, I would like to acknowledge the contribution of the e-commerce platforms and the users whose reviews and comments were utilized in this study. Without their valuable data, this research would not have been possible.
ABSTRACT
This research project aims to develop a system that aggregates user-generated product reviews, ratings, and comments on Vietnamese e-commerce sites. The main purpose is to create concise but high-quality summaries, helping users easily grasp the necessary information without being overloaded. The study exploits both extractive and abstractive summarization methods to achieve this goal. The background chapter explains these two main text summarization techniques. It also discusses the use of Transformers and the Encoder-Decoder architecture for summarization, introducing the BART model and its application to summarization. Additionally, the chapter provides an overview of the Anaconda environment used in the project.
The approach focuses on the Transformers model and outlines the data preparation steps, including text collection, pre-processing, training data curation, and data evaluation/refinement. This ensures the quality and relevance of the data used to train the summarization model.

The experiment section describes the implementation and training of the summarization model using fine-tuning. It also evaluates the empirical results of the model, providing insights into its performance.

In the conclusion, the project summarizes the key findings, the results achieved, and the contributions of the work. Additionally, it discusses new proposals for future research and development to further enhance the capabilities of the summarization system.

Overall, the project aims to provide a valuable tool for both customers and e-commerce businesses by generating concise and informative product review summaries, improving the overall user experience on Vietnamese e-commerce platforms.
Chapter 1: INTRODUCTION
1.1 Overview of the Project
This research project aims to develop a system that aggregates user-generated product reviews, ratings, and comments on Vietnamese e-commerce sites. The main purpose is to create concise but high-quality summaries, helping users easily grasp the necessary information without being overloaded.
This study will exploit both extractive and abstractive summarization methods. The extractive method identifies the most important sentences or phrases from the original text and combines them. Meanwhile, the abstractive method generates new sentences that capture the essence of the text, rather than simply rearranging existing content.

A significant feature of this project is the processing of Vietnamese data, which differs from English and some other languages. Issues such as handling language specificity, handling encoding conversions, and using contextual and cultural information need to be addressed.
In short, this project is important for improving the user experience with product reviews on Vietnamese e-commerce sites. I hope this research will achieve positive results, contributing to improving the quality of information and the user experience.
This project will include several main parts:

- Data processing: Collect and clean raw text data from e-commerce platforms, and encode the text data to prepare it for the next processing steps.
- Model selection: Evaluate and choose advanced machine learning models, such as BART or T5, suitable for review summarization. The chosen model will be used to generate summaries from the input data.
- Model refinement: Fine-tune the selected models to fit the specific requirements of summarizing product reviews on the e-commerce platform. The tuning process will help the models achieve optimal performance in the specific application context.
- Evaluation: Measure the quality of the generated summaries using appropriate metrics, taking into account factors such as the training loss across epochs. This evaluation will help determine the performance of the models during the fine-tuning process, thereby improving the final results.

The main goal of the project is to provide a valuable tool for both buyers and e-commerce businesses. Customers will benefit from concise summaries that help them make more informed purchasing decisions. Businesses will gain a better understanding of customer sentiment and feedback, thereby improving service quality to better meet consumer needs.

1.2 Motivation
The main motivation for this research project is the major challenge posed by the large amount of user-generated content on e-commerce platforms. As Vietnamese e-commerce continues to grow, these platforms are accumulating more and more product reviews, ratings, and comments from customers. With the ever-increasing volume of user-generated data, effectively analyzing and summarizing this content becomes a significant challenge. An automatic review summarization system will help solve this problem, benefiting both buyers and businesses.
For customers, considering and evaluating product reviews on e-commerce platforms can be a difficult and time-consuming job. It is not practical to manually read and analyze all reviews to find the important information needed for a purchasing decision. Information overload can make buyers feel tired and discouraged, which can hinder the overall shopping experience. They need a good tool to summarize reviews, helping them quickly grasp core product information and make more informed decisions.
For businesses, understanding customer feedback is essential to improving products, services, and overall satisfaction, and they need good tools to access it easily. Collecting, compiling, analyzing, synthesizing, and extracting meaningful insights from extensive customer feedback data, and processing and aggregating large volumes of user-generated content, is an extremely difficult and resource-intensive endeavor. Therefore, applying automation solutions will help businesses improve the efficiency and quality of customer feedback analysis, thereby making data-based decisions instead of relying primarily on intuition. This will help improve products, services, and customer experiences.
Developing an automated review summarization system can address these challenges by providing concise and informative summaries of user-generated content. By leveraging the latest advances in natural language processing, especially the power of transformer-based models, this system can automatically generate high-quality summaries, helping businesses significantly improve the efficiency and quality of customer feedback analysis. These summaries will provide managers and strategists with important, detailed, and reliable information about customer feedback and sentiment, from which they can make informed, data-driven decisions to improve products, services, and customer experiences.
By creating high-quality summaries, this project can bring significant benefits to both customers and businesses, while contributing to the development of the field of natural language processing in Vietnam. By addressing these motivations, this research project aims to have a meaningful impact on both the user experience and the operational efficiency of Vietnamese e-commerce platforms.
1.3 Purpose of the research project

In this research project, our purpose is twofold: technical exploration and practical application.

a. Technical Exploration:

The project will investigate the effectiveness of transformer-based models for the task of e-commerce review summarization in the Vietnamese language. This will involve exploring the capabilities and limitations of various transformer architectures, such as BART and T5, and fine-tuning them to achieve optimal performance on the specific requirements of the summarization task.
Through this technical exploration, the project aims to:

- Understand the suitability of transformer-based models for handling Vietnamese e-commerce data, which may have unique linguistic and contextual challenges compared to English-language counterparts.
- Experiment with different fine-tuning techniques and hyperparameter configurations to enhance the accuracy and coherence of the generated summaries.
- Contribute to the ongoing research in the field of natural language processing, particularly in the areas of text summarization and the application of transformer models to non-English languages.
b. Practical Application:
In addition to the technical exploration, the project also aims to develop a usable system that can provide valuable summaries to users of Vietnamese e-commerce platforms. This practical application component of the project will focus on addressing real-world challenges, such as:

- Handling large-scale e-commerce data, including efficiently processing and storing user-generated content.
- Designing an intuitive user interface that seamlessly integrates the summarization functionality into the e-commerce platform.
- Ensuring the summaries generated by the system are accurate, concise, and informative, meeting the needs of both shoppers and businesses.
By addressing both the technical and practical aspects of the project, the research aims to contribute to the advancement of natural language processing techniques while also providing a tangible solution to improve the user experience and operational efficiency of Vietnamese e-commerce platforms.
Chapter 2: BACKGROUND

There are two main approaches to text summarization: extractive and abstractive.

Extractive Summarization

Extractive summarization builds a summary from sentences taken directly from the input text. It typically involves two main steps:
Sentence Scoring: The first step is to score each sentence in the input text based on its importance or relevance to the overall content. This is typically done by extracting various features from the sentence, such as term frequency, position in the document, presence of proper nouns, sentence length, and sentence-to-sentence similarity.
Sentence Selection: Once the sentences have been scored, the next step is to select the most important sentences to include in the summary. This can be done using optimization techniques, such as submodular optimization, integer linear programming, or greedy algorithms, to choose the subset of sentences that maximizes the overall quality of the summary.
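To make these two steps concrete, here is a minimal sketch in Python, assuming a toy frequency-based scorer: sentences are scored by the average corpus frequency of their words, and the top-scoring ones are selected greedily. The naive regular-expression splitting assumes English-style punctuation; a real pipeline for this project would need a Vietnamese tokenizer.

```python
# Minimal extractive summarization sketch: frequency-based sentence scoring
# followed by greedy sentence selection.
import re
from collections import Counter

def summarize_extractive(text: str, num_sentences: int = 2) -> str:
    # Naive sentence and word splitting, for illustration only.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    # Step 1 - sentence scoring: average frequency of the sentence's words.
    def score(sentence: str) -> float:
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Step 2 - sentence selection: greedily keep the top-scoring sentences,
    # then restore them to their original document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)
```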
TextRank

One popular sentence scoring algorithm used in extractive summarization is TextRank. TextRank is a graph-based algorithm that models the text as a network, with sentences as nodes and edges representing the similarity between sentences. The algorithm then uses the concept of eigenvector centrality to identify the most important sentences.
The TextRank algorithm works as follows:
Construct the Graph: The first step is to construct a graph representation of the text, where each sentence is a node, and the edges between nodes represent the similarity between the corresponding sentences. The similarity between two sentences can be computed using various techniques, such as cosine similarity between the sentence vectors.

Compute Sentence Scores: The algorithm then applies the PageRank algorithm to the constructed graph to compute the centrality score of each sentence. The PageRank score reflects the importance of a sentence based on its connections (similarities) to other important sentences in the text.

Select Top Sentences: Finally, the algorithm selects the sentences with the highest PageRank scores to include in the summary, up to a target summary length or compression ratio.
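The following sketch illustrates these three steps, assuming TF-IDF cosine similarity for the edge weights and the PageRank implementation from networkx; it is a simplified illustration, not the project's production code.

```python
# TextRank sketch: build a sentence-similarity graph, run PageRank,
# and keep the top-ranked sentences. Requires scikit-learn and networkx.
import re
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(text: str, num_sentences: int = 3) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    if len(sentences) <= num_sentences:
        return text

    # Step 1: construct the graph, with TF-IDF cosine similarity as edge weight.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    graph = nx.from_numpy_array(cosine_similarity(tfidf))

    # Step 2: compute sentence centrality scores with PageRank.
    scores = nx.pagerank(graph)

    # Step 3: select the top sentences, restored to document order.
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:num_sentences])
    return " ".join(sentences[i] for i in top)
```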
The TextRank algorithm has the advantage of being unsupervised and language-independent, making it applicable to a wide range of summarization tasks. It has been shown to perform well in comparison to other extractive summarization approaches, particularly for longer and more complex documents.

In addition to TextRank, there are several other sentence scoring algorithms used in extractive summarization, such as LexRank and Latent Semantic Analysis (LSA), each with its own strengths and characteristics.
Abstractive summarization is generally considered more challenging than extractive summarization, as it requires advanced natural language processing capabilities, such as semantic understanding, logical reasoning, and language generation. However, abstractive summaries can potentially provide more insightful and comprehensive representations of the original text.

The choice between extractive and abstractive summarization often depends on the specific requirements of the task, the availability of training data, and the capabilities of the underlying natural language processing models.
Abstractive Summarization
Abstractive summarization is a more sophisticated approach to text summarization, where the model generates new, concise sentences that capture the key information and essence of the input text. Unlike extractive summarization, which selects and combines existing sentences, abstractive summarization aims to produce novel, human-like summaries.
The key distinctions between abstractive and extractive summarization approaches are as follows:
Sentence Generation: Extractive summarization selects and extracts existing sentences from the input text to form the summary, while abstractive summarization generates novel, human-like sentences that convey the main ideas and information in a concise manner.

Semantic Understanding: Extractive summarization primarily focuses on identifying the most important sentences based on surface-level features like word frequency, position, and keyword matching. In contrast, abstractive summarization requires a deeper understanding of the semantics, context, and relationships within the text to produce meaningful and relevant summaries.

Paraphrasing and Abstraction: Extractive summarization tends to retain the original wording and structure of the selected sentences, whereas abstractive summarization involves paraphrasing the key points and representing them in a more abstract way, leading to more concise and readable summaries.

Implementation Complexity: Extractive summarization is relatively simpler to implement, as it often relies on statistical or rule-based approaches to identify the most important sentences. Abstractive summarization, on the other hand, is generally more challenging to design and train, as it requires the model to learn to generate coherent and fluent text while maintaining accuracy and relevance to the input.

Output Quality: While extractive summarization may sometimes lack cohesion and readability due to the patchwork of extracted sentences, abstractive summarization has the potential to produce more coherent, fluent, and informative summaries by generating new text that accurately captures the essence of the input.
Overall, the tradeoff between the two approaches is that abstractive summarization is more complex to implement but can potentially yield higher-quality and more concise summaries by generating novel sentences that deeply capture the semantics and context of the input text.

2.2 Transformers in Summarization
The Transformer [1] architecture relies heavily on self-attention mechanisms to process input sequences. This allows the model to weigh the importance of different words in a sentence relative to each other, enabling it to capture context and meaning more effectively than previous recurrent models [2].

2.2.1 Sequence-to-sequence
Sequence-to-sequence (seq2seq) is a deep learning model architecture that is commonly used for tasks that involve transforming one sequence of text or data into another sequence, such as:
- Machine translation - Translating a sequence of text from one language to another
- Text summarization - Summarizing a long document or text into a shorter, concise version
- Question answering - Generating an answer given a question
- Text generation - Generating new text given some starting prompt
The key characteristics of a seq2seq model are:
- Encoder: The encoder takes the input sequence (e.g. a sentence in one language) and encodes it into a fixed-length vector representation, or "thought vector".
- Decoder: The decoder takes the encoded vector representation and generates the output sequence (e.g. the translated sentence in another language) one token at a time.
Figure 1: Encoder-decoder architecture

The seq2seq approach allows the model to handle variable-length input and output sequences, which is useful for tasks where the length of the input and output can vary significantly. During training, the model learns to map the input sequence to the corresponding output sequence. For example:
Figure 2: Example Encoder-decoder [5] (the encoder builds a representation of the source sentence and gives it to the decoder)
Seq2seq models have been widely successful in a variety of natural language processing and generation tasks, as they provide a flexible and powerful way to transform one sequence into another.
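As a concrete illustration, the snippet below runs a pre-trained seq2seq model through the Hugging Face transformers summarization pipeline. The checkpoint facebook/bart-large-cnn is a common English example chosen here only for illustration; the system described in this project would substitute a model fine-tuned on Vietnamese review data.

```python
# Running a pre-trained seq2seq summarization model via the pipeline API.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

review_text = (
    "The phone arrived quickly and the screen is gorgeous. Battery life "
    "easily lasts a full day. The camera struggles in low light, though, "
    "and the speaker is a bit quiet."
)

# The encoder reads the review; the decoder generates the summary.
result = summarizer(review_text, max_length=40, min_length=10, do_sample=False)
print(result[0]["summary_text"])
```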
2.2.2 Encoder-Decoder Architecture
Encoder

The encoder processes the input sequence and transforms it into a context-rich representation. In the context of Transformers, the encoder consists of multiple layers, each comprising the following components:

a. Multi-Head Self-Attention Mechanism:

- This mechanism allows the model to focus on different parts of the input sequence when computing the representation for each word.
- Multiple attention heads enable the model to capture various types of dependencies and relationships within the input sequence.

b. Feed-Forward Neural Network (FFN):

- Each position's output from the self-attention layer is passed through a feed-forward network.
- Typically consists of two linear transformations with a ReLU activation in between.
- Applies the same set of weights to all positions, making the process parallelizable.

c. Add & Norm:

- The outputs of the self-attention and feed-forward layers are processed through residual connections followed by layer normalization. This helps in stabilizing and speeding up the training process.

Each encoder layer takes the previous layer's output as input, with the initial layer receiving the embedded representation of the input tokens combined with positional encodings.
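A simplified PyTorch sketch of one encoder layer is shown below, illustrating the self-attention, feed-forward, and add & norm pattern described above. Dropout and padding masks are omitted for brevity, and real implementations such as BART's differ in detail.

```python
# One Transformer encoder layer: multi-head self-attention + FFN,
# each wrapped in a residual connection and layer normalization.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Multi-head self-attention, then add & norm.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Position-wise feed-forward network, then add & norm.
        return self.norm2(x + self.ffn(x))

# Usage: a batch of 2 sequences, 10 tokens each, embedding size 512.
out = EncoderLayer()(torch.randn(2, 10, 512))
```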
Decoder

Each decoder layer mirrors the encoder structure, with an additional attention sub-layer over the encoder's output:

Multi-Head Attention Mechanism over Encoder's Output:

This layer allows each position in the decoder to attend to all positions in the encoder's output. It helps the decoder utilize the context provided by the encoder and facilitates the alignment between the input and output sequences.

Feed-Forward Neural Network (FFN):

Similar to the feed-forward network in the encoder layers, it applies the same set of weights to all positions.

Add & Norm: As with the encoder, residual connections and layer normalization are applied after each sub-layer.
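The decoder's distinctive sub-layer is the cross-attention over the encoder output. The fragment below extends the encoder-layer sketch with just that component; the masked self-attention over previously generated tokens is omitted for brevity.

```python
# Decoder cross-attention: queries come from the decoder state,
# keys and values from the encoder's output ("memory").
import torch
import torch.nn as nn

class DecoderCrossAttention(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # Attend over the encoder output, then add & norm.
        attn_out, _ = self.cross_attn(x, memory, memory)
        return self.norm(x + attn_out)
```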
2.2.3 Fine-Tuning BART for Summarization
BART (Bidirectional and Auto-Regressive Transformers) is a transformer-based model that has been pre-trained on a large corpus of text data. It is particularly well suited for text generation tasks, including summarization.

The process of fine-tuning BART for summarization involves the following steps:
a. Pre-training BART:

Before fine-tuning, BART undergoes an important pre-training phase, during which the model is trained on a large and diverse text corpus. Pre-training uses self-supervised objectives, such as masked and shuffled (denoising) language modelling. These techniques help BART learn robust language representations and acquire general knowledge about text.

This pre-training plays an important role, providing a solid foundation before task-specific training. It helps the model learn the characteristics and rules of the language more effectively during the later training process.
b. Adapting the Model Architecture:

For summarization, the BART architecture is adapted to the task. Specifically, the output layer of the model is configured to generate a target summary instead of reconstructing the input text.

The model's encoder-decoder structure is well suited to summarization tasks. The encoder can grasp the semantics and context of the input text, while the decoder uses this information to create a high-quality summary.

This adaptation plays to BART's strengths: the encoder can deeply understand the content and context of the text, while the decoder focuses on creating a summary that is as concise as possible. As a result, BART can produce text summaries of outstanding quality.
c. Fine-Tuning on Summarization Data:

The pre-trained BART model is then fine-tuned on a dataset of texts paired with their corresponding summaries. In this process, the model parameters are updated to optimize performance on the summarization task, taking advantage of the knowledge gained from the pre-training stage.

Fine-tuning substantially improves the model's performance on summarization. By updating the parameters of the model based on summary data, BART learns to make the most of the knowledge gained from the initial training stage.

As a result, the BART model can create higher-quality summaries that are better suited to users' needs. This brings many practical benefits in real applications, such as helping users save time when reading long documents and supporting the rapid summarization and synthesis of information.

Fine-tuning demonstrates the importance of maximizing information from previous training stages to improve a model's performance on specific tasks. This is a typical example of how machine learning can be applied effectively in practice.
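A condensed sketch of such fine-tuning with the Hugging Face Seq2SeqTrainer is shown below. The checkpoint, the toy review/summary pair, the column names, and the hyperparameters are illustrative assumptions, not the exact configuration used in this project.

```python
# Fine-tuning BART on (review, summary) pairs with the Trainer API.
from datasets import Dataset
from transformers import (
    BartForConditionalGeneration, BartTokenizer,
    DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments,
)

model_name = "facebook/bart-base"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Hypothetical pairs; a real run would load the curated review corpus.
data = Dataset.from_dict({
    "review": ["Great phone, fast delivery, battery could be better."],
    "summary": ["Positive overall; battery is a weak point."],
})

def preprocess(batch):
    # Tokenize inputs and targets; the target token ids become the labels.
    inputs = tokenizer(batch["review"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=64, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = data.map(preprocess, batched=True, remove_columns=data.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="bart-review-summarizer",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=3e-5,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```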
d. Inference and Generation:

Once fine-tuning is complete, the model can be used to generate summaries for new input text. The model takes the input text, encodes it through the transformer encoder, and then uses the decoder to generate the summary token by token.

By fine-tuning the powerful BART model on summarization data, the system can leverage the model's strong language understanding capabilities and adapt them to the specific task of generating high-quality, concise summaries. This approach has been shown to produce state-of-the-art results on various text summarization benchmarks.
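Inference with the fine-tuned model can then be sketched as follows; the checkpoint directory bart-review-summarizer is the hypothetical output of the training sketch above, and the decoding parameters are illustrative.

```python
# Generating a summary with the fine-tuned model using beam search.
from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained("bart-review-summarizer")
tokenizer = BartTokenizer.from_pretrained("bart-review-summarizer")

text = "The blender is powerful and easy to clean, but it is quite loud."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)

# The decoder emits the summary token by token; beam search keeps the
# most promising candidate sequences.
summary_ids = model.generate(**inputs, num_beams=4, max_length=64)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```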
2.3 Tool And Environment
2.3.1 Anaconda
* Overview
This section provides an overview of Anaconda [6] and its role in data science and Python package management.
a. Anaconda Distribution

Description: Anaconda is an open-source distribution of the Python and R programming languages for data science, machine learning, and scientific computing. It aims to simplify the process of installing, managing, and deploying data science environments and libraries.

Components: The Anaconda distribution includes a core Python or R interpreter along with a collection of commonly used libraries and tools for data science, such as NumPy, pandas, Jupyter, and scikit-learn.
b. Conda Package Manager

Description: Conda is the package manager and environment manager included with the Anaconda Distribution. It allows users to easily install, update, and manage dependencies in a variety of software environments.

Features: Conda allows creating isolated environments, making it easier to manage different library versions and avoid conflicts between dependencies. It supports both Python and non-Python packages.
c. Anaconda Navigator

Description: Anaconda Navigator is a graphical user interface (GUI) for managing Conda packages and environments. It provides an easy way to create, manage, and switch between environments.

Functionality: The Navigator includes a package manager, an environment manager, and a suite of popular data science applications, all accessible through a user-friendly interface.