VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
FACULTY OF INFORMATION SYSTEMS
NGUYEN THI KHANH HA - 18520692
VU THI QUY - 18521317
GRADUATION THESIS
SENTIMENT ANALYSIS OF CUSTOMER REVIEWS
OF FOOD DELIVERY SERVICES
INFORMATION SYSTEMS ENGINEERING
INSTRUCTOR
PhD DO TRONG HOP
MSc NGUYEN THU THUY
HO CHI MINH CITY, 2023
In order to complete this graduation thesis, in addition to our personal efforts and unwavering commitment, it is imperative to acknowledge the indispensable support and assistance rendered by the faculty members at the University of Information Technology, VNU-HCMC. We would like to express our profound and sincere appreciation to Dr. Do Trong Hop, our esteemed supervisor, who wholeheartedly aided us from the inception of our Deep Learning studies. Dr. Hop's unwavering trust and encouragement during challenging times throughout the thesis composition have been invaluable. Furthermore, we are immensely grateful for his insightful contributions from the early stages of topic selection. His astute guidance has accompanied us throughout the entire research and writing process of this thesis. We could not have envisioned a better advisor and mentor for our academic pursuit.
Although we have diligently endeavored to acquire knowledge, conduct research, experiment, and achieve initial promising outcomes, the limitations of our expertise and experience mean that shortcomings remain, and we welcome constructive feedback to refine and enhance the thesis.
Advisor Nguyễn Thi Khánh Hà
Vũ Thị Quý
2.3.1 Sentiment Analysis
2.3.2 Word Cloud
2.3.3 Topic Modeling
2.4 Current Approaches
2.4.1 Current Approaches in Sentiment Analysis
2.4.2 Topic Modeling Approaches
2.5 Algorithm and Fundamental Concept
2.5.1 Vietnamese Tokenization
2.5.2 RNN, LSTM and Transformer Architecture
2.5.3 Pretrained BERT Model Theory
2.5.4 The theory of PhoBERT pretraining model
2.5.5 The theory of LDA model
2.6 Problem Statement
Chapter 3 THEORETICAL DATASET AND SOLUTION APPROACH
3.3 Data Crawling
3.4 Introduction to the Training Dataset
3.5 Process and Algorithms used in Sentiment Analysis
3.5.1 Data Preprocessing
3.5.2 Tokenization and Encoding
3.5.3 Sentiment Analysis Model building
Chapter 4 SYSTEM DESIGN & IMPLEMENTATION
4.3 Overview of System Design
4.4 Deployment of System
4.4.1 Systems to be Deployed and Technologies Used
4.2.2 Database
4.2.3 Visualization of Analytical Results
Chapter 5 RESULTS AND CONCLUSIONS
5.1 Implementation Result
5.2 Limitations
LIST OF FIGURES
Figure 1-1 Statista (2022, March 29) Online food delivery in Vietnam Report [1]
Figure 1-2 General Online Review Statistics [2]
Figure 2-1 Uncover emotion: Social media sentiment analysis [3]
Figure 2-2 Word example
Figure 2-3 Visual illustration of a word cloud with stop words not removed (English)
Figure 2-4 Sentiment analysis levels
Figure 2-5 Overview of the sentiment analysis approaches
Figure 2-6 The chronological release of these proposed models over the years
Figure 2-7 Structure of words in Vietnamese
Figure 2-8 Tokenization
Figure 2-9 Compare the difference of word separation at different levels
Figure 2-10 A simple RNN
Figure 2-11 Visualization of the repeating component of an LSTM Network
Figure 2-12 RNN, LSTM and Transformer illustration
Figure 2-13 Architecture of Transformers Neural Network
Figure 2-14 BERT input representation
Figure 2-15 BERT pre-training and fine-tuning
Figure 3-1 Overview of the data processing pipeline in the system to address the three tasks of Word Cloud, Topic Modeling, and Sentiment Analysis
Figure 3-2 Sentiment Analysis Processing
Figure 3-3 Example of lowercase conversion in preprocessing
Figure 3-4 Emojis on Facebook
Figure 3-5 Example of tokenization
Figure 3-6 Compare result between different tools
Figure 3-7 Number of tokens per comment
Figure 3-8 WordCloud Processing
Figure 3-9 Word cloud image result
Figure 3-10 Topic modeling step
Figure 3-12 Results of the model execution
Figure 4-1 Illustration of the batch processing workflow
Figure 4-2 Overview system architecture diagram
Figure 4-3 Dashboard
Figure 4-4 Dashboard of result
LIST OF TABLES
Table 1: Comparison of Transformer models
Table 2: LDA Result file
Table 3: PostgreSQL database
Table 4: Tables of the database Structure
LIST OF ACRONYMS
AI: Artificial Intelligence
API: Application Programming Interface
ANN: Artificial Neural Network
CNN: Convolutional Neural Network
DBMS: Database Management System
MLP: Multilayer Perceptron
NN: Neural Network
FC Layer: Fully Connected Layer
RNN: Recurrent Neural Network
ReLU: Rectified Linear Unit
ResNet: Residual Network
Sentiment analysis plays a crucial role in understanding and evaluating customer opinions and feedback. In the context of food delivery services, analyzing customer reviews can provide valuable insights into the overall satisfaction and sentiment of customers towards the services provided.
This study focuses on sentiment analysis of customer reviews specifically related to food delivery services. The aim is to analyze and classify customer sentiments as positive, neutral, or negative based on their reviews. By applying natural language processing techniques and machine learning algorithms, the study aims to extract meaningful information from textual data and identify the sentiment expressed by customers.
The analysis of customer reviews can help service providers gain a deeper understanding of customer preferences, identify areas of improvement, and make informed decisions to enhance the quality of their services. It also enables them to address customer concerns and complaints promptly, thereby improving customer satisfaction and loyalty.
The findings of this sentiment analysis can be used by food delivery service providers to monitor customer sentiments over time, track the effectiveness of service improvements, and make data-driven decisions to enhance their overall customer experience.
In conclusion, sentiment analysis of customer reviews of food delivery services provides valuable insights into customer sentiment, allowing service providers to make informed decisions and improvements to enhance customer satisfaction and loyalty.
Chapter 1 INTRODUCTION
1.1 Problem statement
Vietnam's culinary landscape is widely acknowledged for its diverse offerings and embodies a dynamic food culture. Food holds deep significance within Vietnamese society and has undergone transformative changes over time.
Examining the market trends, the online food delivery industry in Vietnam experienced remarkable growth from 2016 to 2020, with an annual growth rate of 96.8 percent. From 2021 to 2025, however, the growth rate gradually declines to 35.8 percent per year. Despite this decline, it is projected that by 2025 the Vietnamese food delivery market will attain a substantial value of 2,709.7 million USD, reflecting the sustained demand and future potential of the industry.
Figure 1-1 Statista (2022, March 29) Online food delivery in Vietnam Report [1]
In recent years, a notable shift in consumer preferences and behaviors can be observed, as contemporary Vietnamese consumers increasingly gravitate towards digital modes of consumption, driven by their fast-paced lifestyles that prioritize convenience. This shift has resulted in a significant surge in the demand for online food delivery services in Vietnam.
The emergence of online platforms has given customers the power to voice their opinions, both positive and negative, at any hour of the day. According to a recent survey conducted by Podium, 93% of consumers say that an online review has impacted their purchase habits. In more detail, consumers expect high standards from the brands they do business with, and most say they will not engage with a business or product that has less than a 3.3-star rating. In order to uphold their reputation, enhance the guest experience, and prevent negative trends from affecting sales, food brands are increasingly resorting to sentiment analysis as a valuable tool.
• 93% of consumers say online reviews impact purchase decisions
• 82% of consumers say the content of a review has convinced them to make a purchase
• 80% of consumers say online reviews for local businesses are as helpful as product reviews on sites like Amazon.com
• 3.3 is, on average, the minimum star rating of a business consumers would consider engaging with
Figure 1-2 General Online Review Statistics [2]
Nevertheless, manually scrutinizing and interpreting customer reviews of online food delivery services is an onerous task for businesses due to various factors. Firstly, there is a voluminous amount of textual data to process and interpret, which can be time-consuming and laborious for businesses. Additionally, customer reviews are often informal and lack a structured format, making it arduous to extract meaningful insights from the data. Secondly, different customers may express their opinions and sentiments in various ways, including sarcasm, irony, and humor, which may necessitate additional effort to comprehend accurately. Thirdly, businesses may encounter difficulties in detecting and categorizing different aspects of customer satisfaction, such as food quality, delivery speed, and customer service. Lastly, businesses must stay abreast of the evolving trends in customer preferences and needs, which exacerbates the complexity of manual analysis.
1.2 Problem Solution
In an era where customer confidence is paramount, businesses require effective tools and techniques to analyze and comprehend customer reviews of online food delivery services in Vietnam. Sentiment analysis, a Natural Language Processing technique, can automatically classify the polarity of textual data, providing insights into the emotions and opinions expressed in customer reviews. By leveraging sentiment analysis, businesses can promptly and proficiently process extensive volumes of customer reviews, discern patterns and trends in customer feedback, extract meaningful insights, and enhance their business strategies accordingly.
The problem of sentiment analysis can be regarded as a subtask within the field of Natural Language Processing (NLP), aiming to extract the affective states and subjective opinions expressed by individuals through textual means. Numerous research endeavors have been dedicated to developing applications based on sentiment analysis. However, the task of comprehending human sentiment is inherently intricate. In practice, it necessitates discerning the underlying intentions of commenters, such as identifying instances of irony, sarcasm, or subjectivity embedded within the text. Moreover, user-generated reviews often diverge from the formal linguistic conventions found in traditional literature or journalism, exhibiting a range of linguistic imperfections such as orthographic errors, informal vernacular, slang expressions, or abbreviations.
Furthermore, because the target domain is natural language data, each language exhibits distinct characteristics that warrant different technological approaches and preprocessing procedures. Consequently, despite the abundance of solutions proposed for sentiment analysis, predominantly tailored for the English language, their direct applicability to sentiment analysis tasks involving Vietnamese comment data remains limited. Therefore, the application of sentiment analysis models demands adaptability and customization to address the intricacies of Vietnamese language data effectively.
1.3 Goal and Study Scope
The first objective of this study is to thoroughly analyze Vietnamese users' comments on a food finding and reviewing website. Comments are a valuable source of opinions and feedback that users provide after using a product or service or visiting a place, and they can be positive, negative, or neutral.
In this thesis, we crawl food reviews for all cities and provinces in Vietnam from Foody, a prominent platform for searching for and reviewing food locations across the country.
Furthermore, a crucial aim of this research is to employ a range of analytical techniques, including Word Cloud, Latent Dirichlet Allocation, and the LSTM and PhoBERT sentiment analysis models, and to visualize their outcomes comprehensively. This multifaceted visualization approach facilitates an in-depth examination of the sentiment distribution in the comments, enabling the identification of patterns and trends that are relevant for informed business decision-making. Integrating these methods clarifies the sentiment landscape from a macroscopic overview down to fine details, fostering a holistic understanding of the data.
Finally, we will build a system that automates the entire process of data processing, analysis, and display of the results on a web platform. This system will help us efficiently analyze a large volume of comments and present the results in a user-friendly and interactive way, allowing users to interact with the data and gain deeper insights. Overall, this study will contribute to the field of sentiment analysis in the Vietnamese language and provide valuable insights for businesses looking to improve their products and services based on customer feedback.
Chapter 2: Background Knowledge and Current Works
This chapter presents an in-depth overview of the analysis problem, focusing on the concept of sentiment analysis and its relevance in the field. It also explores the two key techniques employed in this study, namely Word Cloud and Latent Dirichlet Allocation (LDA), discussing their principles and applications in sentiment analysis.
Chapter 3: Dataset and Solution Approach
Chapter 3 delves into the details of the dataset used in this research, outlining the process of crawling and preprocessing the data to create the training dataset. Furthermore, it provides insights into the algorithms and techniques employed in the sentiment analysis process.
Chapter 4: System Design & Implementation
Chapter 4 focuses on the system design and implementation aspects. It elaborates on the experiments conducted for each of the techniques and models employed in this study, providing a detailed account of the setup and configuration used for implementation.
Chapter 5: Results and Conclusions
Chapter 5 presents the results obtained from the implementation and experimentation process. It includes an analysis of the implementation outcomes, achievements, and limitations encountered during the study. Additionally, this chapter offers insights into potential future work and areas of improvement for further research.
Overall, this thesis structure encompasses the necessary components to provide a comprehensive understanding of the research, from the initial overview to the detailed technical aspects, and finally, concluding with the results and implications of the study.
Chapter 2 THEORETICAL FOUNDATIONS
2.3 Overview of the Analysis Problem
As mentioned in section 1.1, understanding user sentiment is a key factor in improving the service/product experience. When users perceive that their emotions are being acknowledged, they tend to trust and establish long-term connections with the business/organization. Therefore, the problem at hand is how to understand the sentiments and opinions of users. Merely concluding the sentiment level of each comment as positive, neutral, or negative is not truly effective in understanding user preferences.
The analysis problem in this context involves going beyond predicting sentiment labels for each comment and instead analyzing various aspects. Specifically, the problem can be divided into three sub-problems:
• Sentiment Analysis: This task involves predicting labels for each comment. The labels consist of three categories: positive, neutral, or negative, represented as 1, 0, and -1, respectively.
• Word Cloud: This task involves displaying the most frequently occurring meaningful words or word clusters in a text. It aims to visually represent the importance or relevance of different words based on their frequency of occurrence in the text. A word cloud provides a quick overview of the key terms or concepts present in the text, with larger and bolder words indicating higher frequency.
• Topic Modeling: This task aims to discover hidden topics within the text. Each topic is represented by a group of keywords and their respective frequencies of occurrence.
By addressing these sub-problems, we can gain a comprehensive understanding of user sentiments, preferences, and the underlying topics discussed in the comments.
2.3.1 Sentiment Analysis
Definition
Sentiment Analysis, also known as Opinion Mining, is the task of extracting and analyzing opinions, emotions, attitudes, and perceptions of individuals regarding various entities such as topics, products, and services. The rapid development of Internet applications, such as websites, social networks, and blogs, has led users to generate a vast amount of opinions and reviews about products, services, and daily activities. Sentiment analysis is considered a powerful tool for businesses, governments, and researchers to extract and analyze the mood and viewpoints of the public, gaining deep insights into their operations and making better decisions.
POSITIVE: "Great service for an affordable price. We will definitely be booking again."
NEUTRAL: "Just booked two nights at this hotel."
NEGATIVE: "Horrible services. The room was dirty and unpleasant. Not worth the money."
Figure 2-1 Uncover emotion: Social media sentiment analysis [3]
Importance
Sentiment analysis is one of the crucial tasks in the field of natural language processing. It holds significance not only in academia and research but also plays a vital role in various industries and services, specifically in understanding the behavior and attitudes of customers towards the products and services they use. We are living in the digital age, and in recent years in particular, social media platforms and online websites have had millions of users worldwide, with diverse cultural backgrounds, perspectives, and knowledge levels, generating a tremendous amount of information and content on a daily basis. Information about online events can also be collected from media sources. The wide-ranging impact of media channels on our lives has propelled the application of sentiment analysis in text across various domains of social life, including brand management, customer opinion surveys, and psychological behavior analysis.
Data sources
Sentiment analysis is used as a powerful tool to automate the process of analyzing and evaluating user opinions. In this thesis, we crawl food review data for all cities and provinces in Vietnam from Foody, a prominent platform for searching for and reviewing food locations across the country.
In customer feedback systems, comments are often collected in the form of rating scales (e.g., 1-5 stars or 1-10 stars) or satisfaction levels (e.g., very unsatisfied, unsatisfied, neutral, satisfied, very satisfied) provided by customers. These rating scales or satisfaction levels reflect the customers' satisfaction, perspectives, and opinions as negative, neutral, or positive emotional values. In addition to recording ratings, the system also collects user opinions in the form of textual comments.
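The relationship between a rating scale and the three sentiment values can be sketched as follows. This is only an illustration: the function name `rating_to_label` and the thresholds (4-5 stars positive, 3 neutral, 1-2 negative) are our assumptions for a 1-5 star scale, not a rule taken from the dataset. The 1, 0, -1 encoding follows the label convention used in this thesis.

```python
# Sentiment labels encoded as in this thesis: 1 positive, 0 neutral, -1 negative.
LABELS = {"positive": 1, "neutral": 0, "negative": -1}


def rating_to_label(stars: int) -> int:
    """Map a 1-5 star rating to a sentiment label.

    The thresholds are an assumption made for illustration only.
    """
    if stars not in range(1, 6):
        raise ValueError("stars must be an integer between 1 and 5")
    if stars >= 4:
        return LABELS["positive"]
    if stars == 3:
        return LABELS["neutral"]
    return LABELS["negative"]


if __name__ == "__main__":
    for stars in range(1, 6):
        print(stars, rating_to_label(stars))
```

A 1-10 scale or a five-level satisfaction scale would need its own thresholds, chosen after inspecting how ratings and comment text relate in the collected data.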
Objectives of Sentiment Analysis
The objective of sentiment analysis is to determine and classify text into different sentiments, such as positive, negative, or neutral, or into specific emotions like happiness, sadness, anger, or fear. It aims to identify a person's attitude towards a particular subject or entity. In this thesis, the output of the sentiment analysis task is one of three labels assigned to each comment: positive, negative, or neutral.
2.3.2 Word Cloud
Definition:
A Word Cloud, also known as a tag cloud, is a visual representation method that allows us to describe and highlight the most frequently used keywords in a piece of text. True to its name, the data in the given text is represented in the form of a cloud of words, where the words are displayed in varying sizes and intensities based on their frequency of occurrence in the original text. The more often a word appears, the larger and bolder it is drawn in the word cloud, indicating its popularity and significance in the text.
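The frequency-to-size idea can be sketched in a few lines. The function name `word_sizes`, the whitespace tokenization, and the linear scaling between `min_size` and `max_size` are assumptions chosen for illustration; real word cloud tools may use different scaling and layout rules.

```python
from collections import Counter


def word_sizes(text: str, min_size: int = 10, max_size: int = 48) -> dict:
    """Return a font size per word, scaled linearly by word frequency."""
    counts = Counter(text.lower().split())
    if not counts:
        return {}
    lowest, highest = min(counts.values()), max(counts.values())
    span = highest - lowest
    sizes = {}
    for word, count in counts.items():
        # When every word is equally frequent, draw all of them at max_size.
        scale = (count - lowest) / span if span else 1.0
        sizes[word] = round(min_size + scale * (max_size - min_size))
    return sizes


if __name__ == "__main__":
    print(word_sizes("pho pho pho bun bun com"))
```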
Figure 2-2 Word example
In any text, there are certain words called stop words, such as "thì," "và," "là," "mà," "to," etc. These words account for about 25% of the text. Stop words often carry little meaningful value and do not differ significantly across different texts. Imagine if the analysis result only showed words like "mình," "này," "để," etc. Clearly, this would distort the analysis result and would not provide much assistance to the intended audience reading it.
Figure 2-3 Visual illustration of a word cloud with stop words not removed (English)
In NLP, it is common practice to filter out noise words, including stop words. To remove stop words from a text, there are several approaches, but two main methods are commonly used:
o Dictionary-based approach: In this method, we filter the text by removing words that appear in a predefined stop word dictionary.
o Frequency-based approach: With this method, we count the occurrences of each word in the text and then remove words that appear frequently (or infrequently). Research has shown that the most frequently appearing words often carry less meaningful information.
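The two approaches above can be sketched as follows. The stop word set `STOP_WORDS`, the function names, and the `top_k` cutoff are hypothetical choices made for this example; a real pipeline would use a full Vietnamese stop word dictionary and a cutoff tuned on the corpus.

```python
from collections import Counter

# Tiny hypothetical stop word dictionary, for illustration only.
STOP_WORDS = {"thì", "và", "là", "mà"}


def remove_by_dictionary(tokens, stop_words=STOP_WORDS):
    """Dictionary-based approach: drop tokens found in the stop word set."""
    return [token for token in tokens if token.lower() not in stop_words]


def remove_by_frequency(tokens, top_k=1):
    """Frequency-based approach: drop the top_k most frequent tokens."""
    counts = Counter(token.lower() for token in tokens)
    frequent = {word for word, _ in counts.most_common(top_k)}
    return [token for token in tokens if token.lower() not in frequent]


if __name__ == "__main__":
    tokens = "quán này ngon và rẻ và sạch".split()
    print(remove_by_dictionary(tokens))
    print(remove_by_frequency(tokens, top_k=1))
```

The dictionary-based variant is predictable but depends on the quality of the dictionary, while the frequency-based variant adapts to the corpus but may remove frequent words that are still meaningful.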
Benefits of the Word Cloud task:
There are several advantages of using a Word Cloud to visualize and present data:
o Word Clouds are highly visual and provide a clear representation of the data,
making it easy to understand
o Word Clouds are user-friendly and easy to interpret
o Word Clouds offer a more visually appealing alternative to traditional data
presentations such as tables or charts
2.3.3 Topic Modeling
The concept of "topic" in topic modeling can be understood as follows. According to the Cambridge Dictionary, a topic is a subject that is discussed, written about, or studied. In the Oxford Dictionary, a topic is a matter presented in a text, essay, or conversation. Hidden topics refer to the unidentified (unlabeled) topics in the process of generating user-generated content.
Topic modeling is a technique used to examine and explore textual data by searching for and statistically analyzing the words related to topics in each document, thus discovering latent topics within the text. In this process, terms or words that exhibit similarity are grouped together, and topics are identified based on the statistical probability of the occurrence of those words. The topic modeling approach was initially proposed by Deerwester and colleagues in 1990, and further developed by research groups such as Hofmann in 1999 and Blei in 2003. Current approaches to topic modeling are based on the idea of estimating the probability distribution of each distinct word in the document. This distribution treats the document as a mixture of multiple topics, with each topic being a combination of multiple words accompanied by its own probability distribution. Typically, textual data is not limited to a single topic but may cover multiple topics. Therefore, the task of topic modeling is to identify the topics present in the textual data. Most topic models rely on the following assumptions:
• Each document consists of multiple topics
• Each topic consists of multiple words
The goal of topic modeling is to explore the hidden topics within documents by identifying the words associated with each topic.
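The two assumptions above can be illustrated with a small generative sketch: a document is a mixture of topics, and each topic is a probability distribution over words. The topics, words, and probabilities in `TOPICS` are invented for this example and are not results from our data; the code only shows the assumed generative view, not how a topic model is fitted.

```python
import random

# Hypothetical topics: each topic is a probability distribution over words.
TOPICS = {
    "food": {"tasty": 0.5, "fresh": 0.3, "hot": 0.2},
    "delivery": {"fast": 0.6, "late": 0.3, "driver": 0.1},
}


def generate_document(topic_mix, length, rng):
    """Sample words by first picking a topic, then a word from that topic."""
    topic_names = list(topic_mix)
    topic_weights = [topic_mix[name] for name in topic_names]
    words = []
    for _ in range(length):
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        vocab = TOPICS[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()), k=1)[0]
        words.append(word)
    return words


def word_probability(word, topic_mix):
    """P(word | document) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(weight * TOPICS[topic].get(word, 0.0)
               for topic, weight in topic_mix.items())


if __name__ == "__main__":
    mix = {"food": 0.7, "delivery": 0.3}
    print(generate_document(mix, 10, random.Random(0)))
    print(word_probability("tasty", mix))
```

Topic modeling works in the opposite direction: given only the observed words, it estimates the topic mixture of each document and the word distribution of each topic.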
Benefits and Applications of Topic Modeling:
Topic modeling offers several benefits and applications. Readers can easily select their preferred genre of news articles through the assigned topics. Book buyers can choose books related to topics they are interested in based on the identified topics. News providers can summarize the content of news articles using topics. In summary, topic modeling helps us understand the content, issues, and characteristics of different topics, enabling readers to quickly and accurately determine the content of analyzed results.
Models for Topic Modeling:
In text mining, we often gather documents such as blog posts or news articles that we want to divide into natural groups so that each group can be understood separately. Topic modeling is an unsupervised classification method for such documents, similar to clustering data into a predefined number of clusters: it helps to identify the natural groups when we do not know the exact composition of each group in advance. Therefore, topic modeling reveals hidden semantic structures and provides profound insights into unstructured data, which is abundant on the internet. Some popular topic models include Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (pLSA), Latent Dirichlet Allocation (LDA), and more.
2.4 Current Approaches
2.4.1 Current Approaches in Sentiment Analysis
• Regarding the level:
Currently, there are multiple levels of sentiment analysis tasks. However, according to Marouane et al. [19], there are three main levels for detecting sentiment: document-level, sentence-level, and aspect-level. Figure 2-4 illustrates the levels of sentiment analysis. Marouane et al. [19] suggest that aspect-level sentiment analysis is more challenging because it operates at a finer-grained, more detailed level.
Figure 2-4 Sentiment analysis levels
• Document Level Sentiment Analysis:
At the document level, sentiment analysis is performed on the entire document, assigning a single polarity to the entire document. This level of analysis is not commonly used but can be used to classify chapters or pages of a book as positive, negative, or neutral. Both supervised and unsupervised learning approaches can be employed for document classification. Cross-domain and cross-language sentiment analysis are significant challenges at the document level. Domain-specific sentiment analysis achieves high accuracy by utilizing domain-specific and limited feature vectors.
• Sentence Level Sentiment Analysis:
At the sentence level, each sentence is analyzed individually, and a corresponding polarity is assigned. This level of analysis is useful when a document contains a variety of sentiments. Subjectivity classification is associated with sentence-level analysis. The polarity of each sentence is determined independently using methodologies similar to those of document-level analysis, but with more training data and processing resources. The sentence polarities can be aggregated to determine the sentiment of the document or used individually. Sentence-level analysis is particularly important for working with conditional sentences or ambiguous statements.
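Aggregating sentence polarities into a document sentiment can be sketched as follows. The rule used here, taking the sign of the summed sentence labels, and the function name `aggregate_document_polarity` are assumptions for illustration; other aggregation rules, such as majority voting or weighting by sentence length, are equally possible.

```python
def aggregate_document_polarity(sentence_labels):
    """Combine per-sentence labels (1, 0, -1) into one document label.

    Assumed rule: the sign of the sum of the sentence labels.
    """
    total = sum(sentence_labels)
    if total > 0:
        return 1
    if total < 0:
        return -1
    return 0


if __name__ == "__main__":
    # Two positive sentences and one negative sentence.
    print(aggregate_document_polarity([1, 1, -1]))
```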
• Phrase Level Sentiment Analysis:
Phrase-level sentiment analysis involves mining opinion words at the phrase level and performing classification. Each phrase may contain multiple aspects or a single aspect. This level of analysis is applicable in product reviews with multiple lines where a single aspect is expressed in a phrase. It has gained significant attention from researchers recently. While document-level analysis focuses on categorizing the entire document as subjective or objective, sentence-level analysis is more beneficial as a document usually contains both positive and negative statements. Words, as the basic unit of language, are closely related to the subjectivity of the sentences or documents in which they appear. Sentences containing adjectives are highly likely to be subjective. Additionally, the choice of terms used for expression reflects demographic characteristics, desires, social standing, personality, and other psychological and social characteristics. Therefore, terms form the foundation for text sentiment analysis.
• Aspect Level Sentiment Analysis:
Aspect-level sentiment analysis is performed at the aspect level. Each sentence may contain multiple aspects, and this level of analysis focuses on all aspects present in the sentence. Polarity is assigned to each aspect, and an aggregate sentiment is calculated for the entire sentence.
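As a small illustration of the aspect-level idea, the sketch below assigns a polarity to each aspect of one sentence and then computes an aggregate score. The aspect names, their polarities, and the use of a plain average as the aggregate are assumptions for this example only.

```python
def aggregate_aspects(aspect_polarity):
    """Average the per-aspect polarities (1, 0, -1) of one sentence."""
    if not aspect_polarity:
        return 0.0
    return sum(aspect_polarity.values()) / len(aspect_polarity)


if __name__ == "__main__":
    # Hypothetical aspects for a sentence such as
    # "The food was great and the staff were kind, but delivery was slow."
    review = {"food quality": 1, "delivery speed": -1, "customer service": 1}
    for aspect, polarity in review.items():
        print(aspect, polarity)
    print("aggregate:", aggregate_aspects(review))
```

Keeping the per-aspect values alongside the aggregate preserves the detail that a single sentence-level label would hide, here a negative opinion about delivery inside an overall positive sentence.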
Sentiment analysis constitutes a dynamic and thriving domain of research, with diverse applications across multiple fields. Consequently, scholars consistently propose, assess, and contrast various methodologies. The primary objective is to enhance the efficacy of sentiment analysis and address the challenges inherent in this field. Moreover, the incorporation of sentiment analysis into novel domains offers substantial motivation and elevates the significance of this undertaking. However, the careful selection of a suitable approach for sentiment analysis is critically important. Hence, this section aims to furnish a comprehensive outline of the predominant methodologies employed for conducting sentiment analysis, which are widely recognized in academic circles.
The existing methodologies for sentiment analysis can be classified based on different perspectives, such as textual viewpoint and level of textual analysis depth [136]. However, in most scholarly literature, sentiment analysis approaches are typically categorized into three distinct groups: Machine Learning approaches, Lexicon-Based approaches, and Hybrid approaches [29,137,138].
Machine learning represents the predominant approach extensively employed insentiment analysis It relies on machine learning algorithms and linguistic features toaccomplish sentiment classification Conversely, the lexicon-based approach utilizessentiment lexicons, which encompass a compilation of words and phrases commonlyemployed to convey positive or negative sentiments [139] Hybrid approaches, on the
other hand, amalgamate machine learning and lexicon-based techniques to enhancethe performance of sentiment analysis Figure 2-5 provides an overview of the
sentiment analysis approaches
Figure 2-5 Overview of the sentiment analysis approaches
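To make the lexicon-based approach concrete, the following Python sketch scores a review by summing word polarities from a lexicon. The lexicon here is a toy dictionary invented for illustration (it is not SentiWordNet or any published resource), and the thresholding rule is an assumption.

```python
# Toy sentiment lexicon; the words and scores are made up for illustration.
LEXICON = {"good": 1, "delicious": 1, "fast": 1, "bad": -1, "cold": -1, "late": -1}

def lexicon_polarity(text):
    """Sum the lexicon scores of the words in `text` and map the total to a label."""
    words = text.lower().split()
    # Strip trailing punctuation so "cold." still matches the entry "cold".
    total = sum(LEXICON.get(w.strip(".,!?"), 0) for w in words)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

label = lexicon_polarity("Delicious food, but the delivery was late and cold.")
print(label)
```

In a hybrid approach, a score of this kind would typically be passed to a machine learning classifier as one feature among others rather than used as the final decision.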
Prior to 2010, traditional methods dominated the field of text classification. These methods demonstrated clear advantages in accuracy and stability when compared to previous rule-based approaches. However, they had certain limitations. One drawback was the need for feature extraction, which was a time-consuming and expensive process. Additionally, traditional methods often overlooked the natural sequential structure and contextual information present in text data, thus hindering the learning of semantic information encoded in words. These factors presented significant challenges for text classification.

In the 2010s, there was a gradual shift from traditional models to deep learning models in the field of text classification. Deep learning methods offered several advantages over their traditional counterparts. They eliminated the requirement for manual rule design and feature engineering by leveraging automated techniques to generate semantically meaningful representations for text mining. Consequently, a majority of text classification research focused on Deep Neural Networks (DNN), which offered a data-driven approach albeit with higher computational complexity.

Text classification is the process of extracting features from raw text data and predicting their corresponding categories. Over the past few decades, numerous models have been proposed for text classification. Among the traditional models, the Naive Bayes model was the first to be used for text classification tasks. Subsequently, general classification models such as K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Random Forest (RF) emerged as popular classifiers in the domain of text classification.
In the realm of deep learning models, TextCNN garnered the highest number of references. This model introduced the Convolutional Neural Network (CNN) to tackle text classification problems effectively. Additionally, the Bidirectional Encoder Representations from Transformers (BERT), although not specifically designed for text classification tasks, found extensive usage in designing text classification models. Its effectiveness has been evaluated on various text classification datasets, particularly in the domain of emotion analysis.

As a result, several pre-training models based on BERT have been developed, including ALBERT, RoBERTa, XLNet, DistilBERT, BART, and others. Figure 2-6 illustrates the chronological release of these proposed models over the years.
[Figure: timeline of pre-trained language model releases; labels recoverable from the figure include Ai2, OpenAI, Transformer, ELMo, GPT, BERT, MT-DNN, XLM, and XLM-R]
Figure 2-6 The chronological release of these proposed models over the years
2.4.2 Topic Modeling Approaches
In recent years, with the advancement of technology and the internet, users can easily express their opinions and reviews about products and services online. These user-generated texts are stored as textual data, which presents a vast source for data mining. Due to the strong growth in this field and the development of e-commerce and the internet, customers now have more choices when shopping or using services. Therefore, businesses need to understand their customers to meet their needs in a timely manner. One method is to understand the issues that customers comment on. As a result, many studies have been conducted using various methods and models to analyze customer experiences and improve the quality of products and services.
One such study by Raut & Londhe, as presented by Van-Ho et al. (2020) [24], utilized machine learning techniques and SentiWordNet to extract opinions from hotel reviews. The project relied on the relevance between sentences to aggregate topics related to hotel reviews. The results successfully classified and summarized hotel reviews, enabling businesses to understand customer preferences. Van-Ho et al. (2020) [24] also presented a study on analyzing customer feedback in the tourism industry by proposing a text summarization technique to identify topics. Another study [24] examined content and sentiment similarity to determine the similarity between two comment sentences. The study used the k-medoids clustering algorithm to group sentences into k clusters.

In Berezina's study [24], an evaluation of 2,510 online hotel reviews collected from TripAdvisor.com for Sarasota, Florida, was conducted. The research findings revealed common "themes" used in both positive and negative reviews, including business facilities (e.g., hotel and interior design, staff, and amenities). The study also indicated that satisfied customers were more likely to recommend the hotel to others, mentioning intangible factors related to their stay, such as the staff's behavior, more frequently than unsatisfied customers. On the other hand, unsatisfied customers more frequently referred to tangible aspects of the hotel, such as the interior and financial aspects (cost, price). The research provided theoretical implications and clear managerial insights into understanding customer satisfaction and dissatisfaction through text mining and hotel ratings obtained from review websites, social media, blogs, and other online platforms.
2.5 Algorithm and Fundamental Concept
2.5.1 Vietnamese Tokenization
Tokenization, also known as word segmentation, is the process of splitting text into individual tokens or segments. In the case of this thesis, the data is in Vietnamese, so to perform effective tokenization, we need knowledge about words and word structures in Vietnamese. Before delving into the definition and types of tokenization, let's explore the characteristics of word structures in Vietnamese.
2.5.1.1 Characteristics of Vietnamese Word Structures
Each text is composed of sentences, and within each sentence, words are combined to form a complete sentence. Each word is formed from individual syllables. This means:
• A word is the smallest linguistic unit used to form a sentence.
[Figure: classification of Vietnamese words by structure. Words divide into simple words (từ đơn), which are either single-syllable or multi-syllable, and complex words (từ phức), which are either compound words (từ ghép: composite or classifying) or reduplicated words (từ láy: complete or partial)]
Figure 2-7 Structure of words in Vietnamese
In the simple-word branch, a simple word is either a single-syllable word or a multi-syllable word:
• Single-syllable words: Single-syllable words consist of only one syllable.
• Multi-syllable words: Multi-syllable words are composed of multiple syllables.
For example, names of certain animals like “ba ba”, “chuồn chuồn”, “châu chấu”; borrowed words from foreign languages like ti vi (TV), cà phê (coffee), in-ter-net (internet).
In the complex-word branch, complex words are words that consist of two or more syllables. For example: sạch sẽ (clean), sạch sành sanh (spotlessly clean), lúng ta lúng túng (confused). The classification of complex words includes:
• Compound words: Compound words are complex words formed by combining words that have a semantic relationship. For example, cao lớn (tall and big, where both words have an equal semantic relationship), cao vút (tall and towering, where "cao" is the main word and "vút" is the secondary word that adds meaning to the main word).
• Reduplicated words: Reduplicated words are complex words formed by repeating the same initial sound, rhyme, or both. For example, đo đỏ (reddish, where both words have the same initial sound and rhyme), lao xao (tumultuous, where both words have the same rhyme), xôn xao (stirred up, where both words have the same initial sound).
• Composite compound words: For example, trong xanh (fresh green), where "trong" and "xanh" have an equal semantic relationship.
• Classifying compound words: For example, xanh rì (greenish-blue), where "xanh" is the main word and "rì" is the secondary word that adds meaning to the main word.
• Complete reduplicated words: For example, xanh xanh (greenish-green), where both words are completely identical.
• Partial reduplicated words: For example, xanh xao (greenish-faint), where both words have the same initial sound.
After understanding the structural features of Vietnamese words, we will explore the definition of tokenization and the current approaches to tokenization.
2.5.1.2 Tokenization
Tokenization is one of the most important steps in text preprocessing. Tokenization is the process of splitting a group of words, a sentence, a paragraph, and one or more documents into smaller units. Each of these smaller units is called a token. Tokens can be considered as the building blocks of NLP, and all NLP models process raw text at the token level. They are used to create a vocabulary in a corpus (a dataset in NLP). This vocabulary is then converted into numbers as an ID representation, which helps us build models. A token can be a word, a sub-word, or a character.
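The step from tokens to a vocabulary and then to ID numbers can be sketched as follows. This is a minimal illustration that assumes whitespace tokenization and reserves ID 0 for a hypothetical `<unk>` entry; the function names `build_vocab` and `encode` are introduced here for illustration only.

```python
def build_vocab(corpus):
    """Assign an integer ID to each distinct whitespace-separated token."""
    vocab = {"<unk>": 0}  # ID 0 is reserved for out-of-vocabulary tokens
    for sentence in corpus:
        for token in sentence.split():
            # setdefault only inserts the token if it is not present yet
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Convert a sentence into a list of token IDs, using <unk> for unseen tokens."""
    return [vocab.get(token, vocab["<unk>"]) for token in sentence.split()]

corpus = ["I like you", "you like tea"]
vocab = build_vocab(corpus)
ids = encode("I like coffee", vocab)
print(vocab)
print(ids)
```

Because "coffee" never appears in the toy corpus, it is mapped to the `<unk>` ID, which is the out-of-vocabulary behavior that subword methods aim to reduce.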
Different algorithms follow different procedures in performing tokenization. There are three levels of tokenization algorithms: word-based tokenization, character-based tokenization, and subword-based tokenization. The differences between these three types of tokenization are outlined below:
• Word-based tokenization: This is a commonly used tokenization technique in text analysis. It divides a text into words (e.g., in English) or syllables (e.g., in Vietnamese) based on whitespace as the separator. For example, the sentence "I like you." would be tokenized into ['I', 'like', 'you.']. Word tokenization can be easily done using the split() method in Python. There are also many Python libraries that support word tokenization, such as NLTK, spaCy, Keras, Gensim, etc. Depending on the NLP models used, appropriate tokenization methods are applied for different languages. Depending on the task, the same text can be processed under different token types. Each token typically has a unique representation and is encoded as an ID, which serves as a way to encode or identify tokens in a numerical space. Figure 2-8 illustrates how tokens are represented as numbers. One limitation of this technique is that it leads to a large vocabulary and a large number of tokens, making the model cumbersome and computationally resource-intensive. Additionally, a misspelled word is still treated as a separate token after tokenization. For example, the token list may include both "minh" and "mih", a misspelled version of "minh", and the model assigns the OOV (Out of Vocabulary) token to the form it has not seen. To address these issues, researchers have proposed character-based tokenization techniques.
• Character-based tokenization: This technique splits raw text into individual characters. For example, the sentence "I like you." would be tokenized into ['I', ' ', 'l', 'i', 'k', 'e', ' ', 'y', 'o', 'u', '.']. The idea behind this tokenization method is that a language may have many different words, but only a fixed number of characters. This leads to a smaller vocabulary size. For example, English text can be covered by roughly 256 different characters, including letters, numbers, punctuation marks, and special characters, while English has nearly 170,000 words in its vocabulary. Figure 2-8 provides an example of the difference when converting tokens into numerical format. Each token corresponds to a number. The figure illustrates both word-based and character-based tokenization algorithms.
[Figure 2-8: example sentences ("... grew a pretty little fir-tree; and yet it was not happy", "Rejoice with us," said the air and the sunlight ...) tokenized at the word level and at the character level, with each token mapped to a number]
This type of tokenization is simpler and can reduce memory and time. However, a character usually does not carry as much meaning or information as a word, which makes it difficult for models to learn the semantics of input representations. For example, learning the meaning of the character "t" based on context is harder than learning the meaning of the word "like" based on context. Additionally, while this technique helps reduce the vocabulary size, it increases the length of the tokenized sequence. Each word is divided into individual characters, resulting in a much longer token sequence than word-based tokenization produces. Therefore, although it solves many challenges faced by word-based tokenization, character-based tokenization has limitations of its own that need to be considered.
• Subword-based tokenization: This is another popular technique for tokenization, which is based on subword units. Figure 2-9 provides an example of the subword-based tokenization algorithm. It is a solution that combines ideas from word-based and character-based tokenization. The main idea is to simultaneously address the issues of word-based tokenization (large vocabulary size, many OOV tokens, tokens with similar semantics) and character-based tokenization (long sequences and less meaningful individual tokens). Subword-based tokenization algorithms follow these principles:
  • Do not split frequently occurring words into smaller subwords.
  • Split infrequently occurring words into meaningful subwords.
Most Transformer models use subword-based tokenization algorithms: WordPiece is used by BERT and DistilBERT, SentencePiece by XLNet and ALBERT, and Byte-Pair Encoding by GPT-2 and RoBERTa.
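[Figure: the same Vietnamese sentence segmented at different levels; surviving fragments include the token row "Danh | sách | 180 | <unk> | nghề | hiện | nay" and the panel label "(4) Subword level (BPE)"]
Figure 2-9 Comparison of word separation at different levels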
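The trade-off between word-based and character-based tokenization discussed above can be checked directly in Python. The sketch below uses only built-in string operations on the example sentence from the text; it does not use any tokenization library.

```python
sentence = "I like you."

word_tokens = sentence.split()   # whitespace-based word tokens
char_tokens = list(sentence)     # one token per character, spaces included

# Distinct tokens needed to represent this one sentence at each level.
word_vocab = sorted(set(word_tokens))
char_vocab = sorted(set(char_tokens))

print(word_tokens, len(word_tokens))
print(char_tokens, len(char_tokens))
print(len(word_vocab), len(char_vocab))
```

On a single short sentence the character vocabulary is not yet smaller than the word vocabulary; the advantage only appears on a large corpus, where the number of distinct words keeps growing while the number of distinct characters stays nearly fixed. The longer character sequence, however, is visible immediately.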
Subword-based tokenization enables the model to keep a reasonable vocabulary size while learning meaningful, context-independent representations. Moreover, the model can handle previously unseen words, because segmentation breaks them into known subword components. These tokenization techniques emerged in response to the growing demands of natural language processing (NLP). Notably, Byte Pair Encoding (BPE) is one of the prominent methods for segmenting words into subwords. The following part explores the underlying principles and mechanisms of BPE.
• Byte Pair Encoding (BPE):
The drawback of character-level tokenization is that the tokens are not meaningful when considered independently. Therefore, applying character-level tokenization to sentiment analysis tasks may yield poorer results.
Word-level tokenization also has limitations as it cannot handle out-of-vocabulary words.
The paper "Neural Machine Translation of Rare Words with Subword Units" (2016) introduced a technique that tokenizes at a level smaller than words and larger than characters, called subwords. This method is known as Byte Pair Encoding (BPE). It starts from individual characters and repeatedly merges the most frequent pair of adjacent symbols into a new symbol. According to this method, most words can be represented by subwords, and we can significantly reduce the number of `<unk>` tokens representing previously unseen words. Subword tokenization of this kind has quickly been adopted in various modern NLP approaches, from OpenAI GPT and BERT to variants such as RoBERTa and DistilBERT, and XLNet.
Applying tokenization using this new method has improved accuracy in various tasks such as text translation, text classification, next sentence prediction, question answering, and text relationship prediction.
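The merge procedure behind BPE can be sketched in plain Python. This is a simplified illustration of the general idea (count adjacent symbol pairs, merge the most frequent pair, repeat), not the tokenizer used by any particular model. The toy word counts, the `</w>` end-of-word marker, and the function names are assumptions made for the example.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} dictionary."""
    # Start from characters, with an end-of-word marker on each word.
    word_freqs = {tuple(w) + ("</w>",): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the first pair counted
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs

# Toy word counts, chosen only for illustration.
counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(counts, num_merges=3)
print(merges)
print(segmented)
```

With these toy counts, the three learned merges are ('e', 's'), ('es', 't'), and ('est', '</w>'), so the frequent ending "est" becomes a single subword while the rarer stems remain split into characters. This matches the two principles stated earlier: frequent units are kept whole and infrequent words are decomposed.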
• Tokenization for Vietnamese:
Vietnamese is an isolating language: words do not undergo morphological changes, and word boundaries are not indicated by whitespace, because whitespace separates syllables rather than words. The grammatical meaning in Vietnamese lies outside of words, and the primary grammatical devices are word order and function words. Therefore, there are cases where a sentence can have different meanings depending on how we tokenize it, causing ambiguity in the sentence's semantics. For example, the sentence "Xoài phun thuốc sâu không ăn" can be tokenized in two different ways with completely different meanings:
- "Xoài / phun thuốc / sâu / không / ăn."
- "Xoài / phun / thuốc sâu / không / ăn."
This demonstrates that word segmentation in Vietnamese is not an easy task because different segmentations can produce completely different meanings, which affects the quality of model training. Therefore, word segmentation is crucial for processing the Vietnamese language, especially when dealing with tasks related to the semantics of the text. There are several toolkits available to support word segmentation in Vietnamese, such as PyVi, Underthesea, VnCoreNLP, etc. Each toolkit has its own advantages and disadvantages; for example, PyVi is faster, while VnCoreNLP offers higher accuracy. Generally, Vietnamese tokenization tools perform word segmentation by joining the syllables of a word with the underscore character "_" before tokenizing into individual tokens. Depending on the task, we can choose to perform word segmentation or tokenization accordingly.

In data processing, the input data for model training is a list of tokens. Having multiple tokens representing the same meaning can affect the effectiveness of model training. Therefore, the tokenization step is crucial before feeding the data into the training model, especially for Vietnamese due to the complexity of words and word structures compared to English.
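The underscore convention and the ambiguity example above can be illustrated without any toolkit. In the sketch below the two segmentations are written by hand as word lists (a real pipeline would obtain them from a segmenter such as the toolkits named above), and `join_words` is a hypothetical helper introduced only for this example.

```python
def join_words(segmentation):
    """Join the syllables of each word with '_' so whitespace splitting yields words."""
    return " ".join(word.replace(" ", "_") for word in segmentation)

# The two readings of the same sentence, written as hand-made word lists.
reading_a = ["Xoài", "phun thuốc", "sâu", "không", "ăn"]
reading_b = ["Xoài", "phun", "thuốc sâu", "không", "ăn"]

tokens_a = join_words(reading_a).split()
tokens_b = join_words(reading_b).split()

print(tokens_a)
print(tokens_b)
```

Both readings come from the same raw string, yet they produce different token lists ("phun_thuốc" and "sâu" versus "phun" and "thuốc_sâu"), so a downstream model would see different inputs depending on the segmentation.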
2.5.2 RNN, LSTM and Transformer Architecture
2.5.2.1 Recurrent Neural Network (RNN)
Recurrent Neural Networks (RNNs) are a specific type of Artificial Neural Network (ANN) characterized by interconnected nodes that simulate the behavior of neurons in the human brain. These neural connections, similar to synapses in biological brains, allow for the transmission of signals between nodes [24]. The received signals are processed by artificial neurons, which in turn transmit the processed information to other connected nodes. To facilitate the learning process, neurons and connections are assigned weights that can be adjusted accordingly. These weights control the strength of signals as they propagate from the input layers to the output layers. ANNs typically comprise hidden layers situated between the input and output layers.

In the case of RNNs, it is recommended to have a minimum of three hidden layers. The fundamental architecture of RNNs consists of input units, output units, and hidden units. The hidden units are responsible for carrying out calculations through weight adjustments, ultimately generating the desired outputs [18,25,26]. Information flows unidirectionally from the input units to the hidden units in RNNs, while a directional loop compares the error of the current hidden layer with that of the previous hidden layer, thereby enabling adjustments to the weights between the hidden layers. Figure 2-10 illustrates a simplified RNN architecture featuring two hidden layers.
[Figure: network diagram with an input layer, hidden layers, and an output layer]
Figure 2-10 A simple RNN
Recurrent Neural Networks (RNNs) are built upon traditional Neural Networks and are specifically designed to model sequential data, such as word sequences in a sentence. RNNs are called "recurrent" because they perform the same task for all elements of a sequence, with the output at each element depending on computations performed on previous elements. In other words, RNNs are designed to have the ability to remember information from preceding elements, making them suitable for capturing the sequential dependencies among elements in a sequence.
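The recurrence described above, where the same computation is applied to every element and the result depends on the previous step, can be sketched with a scalar RNN in plain Python. The update rule h_t = tanh(w_x * x_t + w_h * h_{t-1} + b) is the standard simple-RNN form; the scalar weights and the input sequence are arbitrary values chosen for illustration.

```python
import math

def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    h = h0
    states = []
    for x in inputs:
        # The same weights are reused at every step; h carries the memory.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# A single non-zero input followed by zeros (illustrative values).
states = rnn_forward([1.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
print(states)
```

Even though the second and third inputs are zero, the hidden state stays non-zero because it still carries information from the first input. It also shrinks at every step, which hints at why information from distant elements is hard to retain.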
Although RNNs are in theory capable of handling long-term dependencies, they fall short in practical applications. This problem was explored in depth by Hochreiter (1991) and Bengio et al. (1994) [1]. Remembering information over long periods requires propagating gradients between distant nodes, which involves repeated multiplication by the Jacobian matrix. The more common vanishing gradients and the less frequent exploding gradients made the performance of these models unsatisfactory. A trade-off was found between gradient-descent-based learning and the length of time over which information is held. Beyond text, RNNs are also commonly used to process other types of sequential data, such as predicting stock market trends.
However, after being applied for some time, RNNs have shown certain weaknesses:
• Slow training speed, even when using Truncated Backpropagation for training. Despite this technique, training remains slow because computation proceeds step by step through the sequence and cannot fully leverage parallel computing on GPUs.
• Inefficient processing of long sequences due to the Gradient Vanishing/Exploding problem. As the number of words increases, the number of unrolled units in the network also grows, causing the gradients to diminish gradually as they are propagated back through the units due to the chain rule of differentiation. This leads to the loss of information about long-range dependencies between units.
To address the Gradient Vanishing problem of RNNs, the Long Short-Term Memory (LSTM) model was introduced in 1997. LSTM cells have an additional memory branch (C) that allows information to flow through the cell, helping to maintain information for longer sentences.
In order to overcome this, Hochreiter and Schmidhuber (1997) [2] introduced Long Short-Term Memory networks, usually called LSTMs. LSTMs accumulate long-term relationships between distant nodes by designing weight coefficients between connections. These networks have shown remarkable results in speech processing, Natural Language Processing, and image captioning, among other applications.
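The vanishing-gradient effect that motivates LSTMs can be observed numerically. The sketch below uses a scalar RNN without inputs, h_t = tanh(w_h * h_{t-1}), and applies the chain rule step by step; the weight 0.5 and the initial state are arbitrary illustrative values, and a real network would use matrices (Jacobians) in place of the scalar factors.

```python
import math

def gradient_through_time(w_h, steps, h0=0.5):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w_h * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w_h * h)
        # Chain rule: each step multiplies by w_h * (1 - h_t^2), the local derivative.
        grad *= w_h * (1.0 - h * h)
    return grad

short_range = gradient_through_time(w_h=0.5, steps=5)
long_range = gradient_through_time(w_h=0.5, steps=50)
print(short_range, long_range)
```

Each factor in the product is at most 0.5 here, so the gradient with respect to a state 50 steps back is many orders of magnitude smaller than the gradient 5 steps back. This is the sense in which an RNN trained by gradient descent loses information about long-range dependencies.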
2.5.2.2 Long Short-Term Memory
Long Short-Term Memory units (LSTMs) are an advanced form of RNN designed to handle the vanishing gradient problem. A typical LSTM unit encompasses essential components, namely a cell, a forget gate, an input gate, and an output gate, as visually depicted in Figure 2-11. The cell serves as a repository for retaining values across arbitrary time intervals within its memory, while the three gates regulate the flow of information into and out of the cell. These gates can discern the significance of data within a sequence, enabling the LSTM to retain essential information and discard irrelevant details. Through this mechanism, pertinent information is transmitted along the extensive sequence chain to facilitate accurate predictions.

The fundamental aspect of LSTMs lies in the cell state, symbolized by the horizontal line traversing the top section of Figure 2-11. Conceptually, it can be likened to a conveyor belt that allows information to flow unhindered. The cell state in LSTMs is subject to modification, with the inclusion or removal of information being regulated by the gates. The sigmoid layer, represented by the symbol σ, produces outputs ranging between zero and one, where zero signifies the suppression of information passage, while one signifies the unimpeded flow of information. The initial gate, referred to as the forget gate, determines which information is to be discarded from the cell state. Subsequently, the input gate ascertains which novel information is to be stored in the cell state, while the output gate yields the final output based on the input and memory.
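The gate mechanism described above can be sketched as a single step of a scalar LSTM cell in plain Python. The equations follow the standard LSTM formulation, which also includes a tanh "candidate" layer alongside the three gates. The parameter values are chosen by hand for illustration and are not trained weights, and the dictionary layout of `params` is an assumption of this example.

```python
import math

def sigmoid(z):
    """Squash z into (0, 1); 0 blocks information, 1 lets it pass."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; `p` maps gate names to (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    g = math.tanh(pre("candidate"))   # candidate values for the cell state
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    c = f * c_prev + i * g            # updated cell state (the "conveyor belt")
    h = o * math.tanh(c)              # new hidden state / output
    return h, c

# Hypothetical parameters chosen by hand, not trained values.
params = {
    "forget": (0.0, 0.0, 10.0),       # large bias: forget gate close to 1
    "input": (0.0, 0.0, -10.0),       # large negative bias: input gate close to 0
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 10.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.8, p=params)
print(h, c)
```

With the forget gate near one and the input gate near zero, the cell state passes through the step almost unchanged (it stays close to 0.8), which is the unhindered flow along the cell state that the conveyor-belt analogy refers to.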