
Graduation thesis: Using NLP sentiment analysis to detect emotion through Online Reviews


DOCUMENT INFORMATION

Basic information

Title: Sentiment Analysis of Customer Reviews of Food Delivery Services
Authors: Nguyen Thi Khanh Ha, Vu Thi Quy
Supervisors: PhD Do Trong Hop, MSc Nguyen Thu Thuy
University: University of Information Technology
Major: Information Systems Engineering
Type: Graduation Thesis
Year of publication: 2023
City: Ho Chi Minh City
Number of pages: 114
File size: 62.29 MB

Content


VIET NAM NATIONAL UNIVERSITY HO CHI MINH CITY

UNIVERSITY OF INFORMATION TECHNOLOGY

FACULTY OF INFORMATION SYSTEMS

NGUYEN THI KHANH HA - 18520692

VU THI QUY - 18521317

GRADUATION THESIS

SENTIMENT ANALYSIS OF CUSTOMER REVIEWS

OF FOOD DELIVERY SERVICES

INFORMATION SYSTEMS ENGINEERING

INSTRUCTOR

PhD DO TRONG HOP

MSc NGUYEN THU THUY

HO CHI MINH CITY, 2023


In order to complete this graduation thesis, in addition to our personal efforts and unwavering commitment, we must acknowledge the indispensable support and assistance rendered by the faculty members at the University of Information Technology, VNU-HCMC. We would like to express our profound and sincere appreciation to PhD Do Trong Hop, our esteemed supervisor, who wholeheartedly aided us from the inception of our Deep Learning studies. Dr Hop's unwavering trust and encouragement during challenging times throughout the thesis composition have been invaluable. Furthermore, we are immensely grateful for his insightful contributions from the early stages of topic selection. His astute guidance has accompanied us throughout the entire research and writing process of this thesis. We could not have envisioned a more exemplary advisor and mentor for our academic pursuit.

Although we have diligently endeavored to acquire knowledge, conduct research, experiment, and achieve promising initial outcomes, the limitations of our expertise and experience mean that we welcome constructive feedback to refine and enhance the thesis.

Advisor Nguyễn Thi Khánh Hà

Vũ Thị Quý


2.3.1 Sentiment Analysis
2.3.2 Word Cloud
2.3.3 Topic Modeling
2.4 Current Approaches
2.4.1 Current Approaches in Sentiment Analysis
2.4.2 Topic Modeling Approaches
2.5 Algorithm and Fundamental Concept
2.5.1 Vietnamese Tokenization
2.5.2 RNN, LSTM and Transformer Architecture
2.5.3 Pretrained BERT Model Theory
2.5.4 The theory of PhoBERT pretraining model
2.5.5 The theory of LDA model
2.6 Problem Statement

Chapter 3 THEORETICAL DATASET AND SOLUTION APPROACH
3.3 Data Crawling
3.4 Introduction to the Training Dataset
3.5 Process and Algorithms used in Sentiment Analysis
3.5.1 Data Preprocessing
3.5.2 Tokenization and Encoding
3.5.3 Sentiment Analysis Model building

Chapter 4 SYSTEM DESIGN & IMPLEMENTATION
4.3 Overview of System Design
4.4 Deployment of System
4.4.1 Systems to be Deployed and Technologies Used
4.2.2 Database
4.2.3 Visualization of Analytical Results

Chapter 5 RESULTS AND CONCLUSIONS
5.1 Implementation Result
5.2 [title illegible in source]


LIST OF FIGURES

Figure 1-1 Statista (2022, March 29) Online food delivery in Vietnam Report [1]
Figure 1-2 General Online Review Statistics [2]
Figure 2-1 Uncover emotion: Social media sentiment analysis [3]
Figure 2-2 Word example
Figure 2-3 Visual illustration of a word cloud without stop-word removal (English)
Figure 2-4 Sentiment analysis levels
Figure 2-5 Overview of the sentiment analysis approaches
Figure 2-6 The chronological release of these proposed models over the years
Figure 2-7 Structure of words in Vietnamese
Figure 2-8 Tokenization
Figure 2-9 Compare the difference of word separation at different levels
Figure 2-10 A simple RNN
Figure 2-11 Visualization of the repeating component of an LSTM Network
Figure 2-12 RNN, LSTM and Transformer illustration
Figure 2-13 Architecture of Transformers Neural Network
Figure 2-14 BERT input representation
Figure 2-15 BERT pre-training and fine-tuning
[one entry illegible in source]
Figure 3-1 Overview of the data processing pipeline in the system to address the three tasks of Word Cloud, Topic Modeling, and Sentiment Analysis
Figure 3-2 Sentiment Analysis Processing
Figure 3-3 Example of lowercase conversion in preprocessing
Figure 3-4 Emojis on Facebook
Figure 3-5 Example of tokenization
Figure 3-6 Compare result between different tools
Figure 3-7 Number of tokens per comment
Figure 3-8 WordCloud Processing
Figure 3-9 Word cloud image result
Figure 3-10 Topic modeling step
Figure 3-12 Results of the model execution
Figure 4-1 Batch processing workflow
Figure 4-2 Overview system architecture diagram
Figure 4-3 Dashboard
Figure 4-4 Dashboard of result


LIST OF TABLES

Table 1: Comparison of Transformer models
Table 2: LDA Result file
Table 3: PostgreSQL database
Table 4: Tables of the database structure


LIST OF ACRONYMS

AI: Artificial Intelligence

API: Application Programming Interface

ANN: Artificial Neural Network

CNN: Convolutional Neural Network

DBMS: Database Management System

MLP: Multilayer Perceptron

NN: Neural Network

FC Layer: Fully-connected Layer

RNN: Recurrent Neural Network

ReLU: Rectified Linear Unit

ResNet: Residual Network


Sentiment analysis plays a crucial role in understanding and evaluating customer opinions and feedback. In the context of food delivery services, analyzing customer reviews can provide valuable insights into the overall satisfaction and sentiment of customers towards the services provided.

This study focuses on sentiment analysis of customer reviews specifically related to food delivery services. The aim is to analyze and classify customer sentiments as positive, neutral, or negative based on their reviews. By applying natural language processing techniques and machine learning algorithms, the study aims to extract meaningful information from textual data and identify the sentiment expressed by customers.

The analysis of customer reviews can help service providers gain a deeper understanding of customer preferences, identify areas of improvement, and make informed decisions to enhance the quality of their services. It also enables them to address customer concerns and complaints promptly, thereby improving customer satisfaction and loyalty.

The findings of this sentiment analysis can be used by food delivery service providers to monitor customer sentiments over time, track the effectiveness of service improvements, and make data-driven decisions to enhance their overall customer experience.

In conclusion, sentiment analysis of customer reviews of food delivery services provides valuable insights into customer sentiment, allowing service providers to make informed decisions and improvements to enhance customer satisfaction and loyalty.


Chapter 1 INTRODUCTION

1.1 Problem statement

Vietnam's culinary landscape is widely acknowledged for its diverse offerings and embodies a dynamic food culture. Food holds deep significance within Vietnamese society and has undergone transformative changes over time.

Examining the market trends, the online food delivery industry in Vietnam experienced remarkable growth from 2016 to 2020, with an annual growth rate of 96.8 percent. However, from 2021 to 2025, the growth rate gradually decreased to 35.8 percent per year. Despite this decline, it is projected that by 2025, the Vietnamese food delivery market will attain a substantial value of 2,709.7 million USD, reflecting the sustained demand and future potential of the industry.

Figure 1-1 Statista (2022, March 29) Online food delivery in Vietnam Report [1]

In recent years, a notable shift in consumer preferences and behaviors can be observed, as contemporary Vietnamese consumers increasingly gravitate towards digital modes of consumption, driven by their fast-paced lifestyles that prioritize convenience. This shift has resulted in a significant surge in the demand for online food delivery services in Vietnam.


The emergence of online platforms has given customers the power to voice their opinions, both positive and negative, at any hour of the day. According to a recent survey conducted by Podium, 93% of consumers say that an online review has impacted their purchase decisions. In more detail, consumers expect high standards from the brands they do business with, with most saying they will not engage with a business or product that has less than a 3.3-star rating. In order to uphold their reputation, enhance the guest experience, and avert negative trends from affecting sales, food brands are increasingly resorting to sentiment analysis as a valuable tool.

[Figure content: 93% of consumers say online reviews impact purchase decisions; 82% of consumers say the content of a review has convinced them to make a purchase; 80% of consumers say online reviews for local businesses are as helpful as product reviews on sites like Amazon.com; 3.3 is, on average, the minimum star rating of a business consumers would consider engaging with.]

Figure 1-2 General Online Review Statistics [2]

Nevertheless, manually scrutinizing and interpreting customer reviews of online food delivery services is an onerous task for businesses due to various factors. Firstly, there is a voluminous amount of textual data to process and interpret, which can be time-consuming and laborious for businesses. Additionally, customer reviews are often informal and lack a structured format, making it arduous to extract meaningful insights from the data. Secondly, different customers may express their opinions and sentiments in various ways, including sarcasm, irony, and humor, which may necessitate additional effort to comprehend accurately. Thirdly, businesses may encounter difficulties in detecting and categorizing different aspects of customer satisfaction, such as food quality, delivery speed, and customer service. Lastly, businesses must stay abreast of the evolving trends in customer preferences and needs, which exacerbates the complexity of manual analysis.

1.2 Problem Solution

In this era where the confidence index is paramount, businesses require effective tools and techniques to analyze and comprehend customer reviews of online food delivery services in Vietnam. Sentiment analysis, a Natural Language Processing technique, can automatically classify the polarity of textual data, providing insights into the emotions and opinions expressed in customer reviews. By leveraging sentiment analysis, businesses can promptly and proficiently process extensive volumes of customer reviews, discern patterns and trends in customer feedback, extract meaningful insights, and enhance their business strategies accordingly.

The problem of sentiment analysis can be regarded as a subtask within the field of Natural Language Processing (NLP), aiming to extract the affective states and subjective opinions expressed by individuals through textual means. Numerous research endeavors have been dedicated to developing applications based on sentiment analysis. However, the task of comprehending human sentiment is inherently intricate. In practice, it necessitates discerning the underlying intentions of commenters, such as identifying instances of irony, sarcasm, or subjectivity embedded within the text. Moreover, user-generated reviews often diverge from the formal linguistic conventions found in traditional literature or journalism, exhibiting a range of linguistic imperfections such as orthographic errors, informal vernacular, slang expressions, or abbreviations.

Furthermore, due to the target domain of natural language data processing and analysis, each language exhibits distinct characteristics that warrant diverse technological approaches and preprocessing procedures. Consequently, despite the abundance of solutions proposed for sentiment analysis, predominantly tailored for the English language, their direct applicability to sentiment analysis tasks involving Vietnamese comment data remains limited. Therefore, the application of sentiment analysis models demands adaptability and customization to address the intricacies of Vietnamese language data effectively.

1.3 Goal and Study Scope

The first important objective of this study is to analyze thoroughly the Vietnamese people's comments on the Food Finding and Reviewing website. Comments are a valuable source of opinions and feedback that users provide after using a product, service, or visiting a place, and they can be positive, negative, or neutral.

In this thesis, we will crawl food reviews for all cities and provinces in Vietnam from the Foody website, a prominent platform for reviewing and searching for food locations across most of the country.

Furthermore, a crucial aim of this research endeavor is to employ a diverse array of analytical techniques, including Word Cloud, Latent Dirichlet Allocation, as well as LSTM and PhoBERT sentiment analysis models, to visualize the outcomes comprehensively. This multifaceted visualization approach facilitates an in-depth examination of the sentiment distribution inherent in the comments, enabling the identification of pivotal patterns and trends that hold relevance for informed business decision-making. The integration of these advanced analytical methods serves to elucidate the sentiment landscape from a macroscopic overview to intricate details, thereby fostering a holistic understanding of the data.

Finally, we will build a system that can automate the entire process of data processing, analysis, and displaying the results on a web platform. This system will help us to efficiently analyze a large volume of comments and present the results in a user-friendly and interactive way, allowing users to interact with the data and gain deeper insights. Overall, this study will contribute to the field of sentiment analysis in the Vietnamese language and provide valuable insights for businesses looking to improve their products and services based on customer feedback.


Chapter 2: Background Knowledge and Current Works

This chapter presents an in-depth overview of the analysis problem, focusing on the concept of sentiment analysis and its relevance in the field. It also explores the two key techniques employed in this study, namely Word Cloud and Latent Dirichlet Allocation (LDA), discussing their principles and applications in sentiment analysis.

Chapter 3: Dataset and Solution Approach

Chapter 3 delves into the details of the dataset used in this research, outlining the process of crawling and preprocessing the data to create the training dataset. Furthermore, it provides insights into the algorithms and techniques employed in the sentiment analysis process.

Chapter 4: System Design & Implementation

Chapter 4 focuses on the system design and implementation aspects. It elaborates on the experiments conducted for each of the techniques and models employed in this study, providing a detailed account of the setup and configuration used for implementation.

Chapter 5: Results and Conclusions

Chapter 5 presents the results obtained from the implementation and experimentation process. It includes an analysis of the implementation outcomes, achievements, and limitations encountered during the study. Additionally, this chapter offers insights into potential future work and areas of improvement for further research.

Overall, this thesis structure encompasses the necessary components to provide a comprehensive understanding of the research, from the initial overview to the detailed technical aspects, and finally, concluding with the results and implications of the study.


Chapter 2 THEORETICAL FOUNDATIONS

2.3 Overview of the Analysis Problem

As mentioned in section 1.1, understanding user sentiment is a key factor in improving the service/product experience. When users perceive that their emotions are being acknowledged, they tend to trust and establish long-term connections with the business/organization. Therefore, the problem at hand is how to understand the sentiments and opinions of users. Merely labeling the sentiment of each comment as positive, neutral, or negative is not truly effective in understanding user preferences.

The analysis problem in this context involves going beyond predicting sentiment labels for each comment and instead analyzing various aspects. Specifically, the problem can be divided into three sub-problems:

• Sentiment Analysis: This task involves predicting labels for each comment. The labels consist of three categories: positive, neutral, or negative, represented as 1, 0, and -1, respectively.

• Word Cloud: The Word Cloud involves displaying the most frequently occurring meaningful words or word clusters in a text. It aims to visually represent the importance or relevance of different words based on their frequency of occurrence in the text. A word cloud provides a quick overview of the key terms or concepts present in the text, with larger and bolder words indicating higher frequency.

• Topic Modeling: This task aims to discover hidden topics within the text. A group of keywords and their respective frequencies of occurrence represent each topic.

By addressing these sub-problems, we can gain a comprehensive understanding of user sentiments, preferences, and the underlying topics discussed in the comments.


2.3.1 Sentiment Analysis

Definition

Sentiment Analysis, also known as Opinion Mining, is the task of extracting and analyzing opinions, emotions, attitudes, and perceptions of individuals regarding various entities such as topics, products, and services. The rapid development of Internet applications, such as websites, social networks, and blogs, has led users to generate a vast amount of opinions and reviews about products, services, and daily activities. Sentiment analysis is considered a powerful tool for businesses, governments, and researchers to extract and analyze the mood and viewpoints of the public, gaining deep insights into their operations and making better decisions.

[Figure content: three example comments. POSITIVE: "Great service for an affordable price. We will definitely be booking again." NEUTRAL: "Just booked two nights at this hotel." NEGATIVE: "Horrible services. The room was dirty and unpleasant. Not worth the money."]

Figure 2-1 Uncover emotion: Social media sentiment analysis [3]

Importance

Sentiment analysis is one of the crucial tasks in the field of natural language processing. It holds significance not only in academia and research but also plays a vital role in various industries and services, specifically in understanding the behavior and attitudes of customers towards products and services they use. We are living in the digital age, particularly in recent years, where social media platforms and online websites have millions of users worldwide generating a tremendous amount of information and content on a daily basis, with diverse cultural backgrounds, perspectives, and knowledge levels. Even online information and events can be collected from media sources. The wide-ranging impact of media channels on our lives has propelled the application of sentiment analysis in text across various domains of social life, including brand management, customer opinion surveys, and psychological behavior analysis.

Data sources

Sentiment analysis is used as a powerful tool to automate the process of analyzing and evaluating user opinions. In this thesis, we will crawl food review data for all cities and provinces in Vietnam from the Foody website, a prominent platform for reviewing and searching for food locations across most of the country.

In customer feedback systems, comments are often collected in the form of rating scales (e.g., 1-5 stars or 1-10 stars) or satisfaction levels (e.g., very unsatisfied, unsatisfied, neutral, satisfied, very satisfied) provided by customers. These rating scales or satisfaction levels reflect the customers' satisfaction, perspectives, and opinions on negative, neutral, and positive emotional values. In addition to recording ratings, the system also collects user opinions in the form of textual comments.
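The relation between a rating scale and the three sentiment values can be sketched as follows. This is a minimal illustration only: the function name `rating_to_label` and the cut-offs (1-2 stars negative, 3 stars neutral, 4-5 stars positive) are assumptions made for the example, not the labeling rule used in this thesis. The numeric labels follow the 1, 0, -1 convention for positive, neutral, and negative.

```python
def rating_to_label(stars: int) -> int:
    """Map a 1-5 star rating to a sentiment label: -1, 0 or 1."""
    if not 1 <= stars <= 5:
        raise ValueError("expected a rating from 1 to 5")
    if stars <= 2:   # assumed cut-off: 1-2 stars count as negative
        return -1
    if stars == 3:   # assumed cut-off: the midpoint counts as neutral
        return 0
    return 1         # 4-5 stars count as positive


# Label a few example ratings.
labels = [rating_to_label(s) for s in (1, 3, 5)]
```

In practice a rating and its accompanying comment may disagree, so such a mapping would only give a rough starting label.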

Objectives of Sentiment Analysis

The objective of sentiment analysis is to determine and classify text into different sentiments, such as positive, negative, or neutral, or specific emotions like happiness, sadness, anger, or fear. It aims to identify a person's attitude towards a particular subject or entity. In this thesis, the output of the sentiment analysis task is to assign one of three labels to each comment: positive, negative, or neutral.


2.3.2 Word Cloud

Definition:

A Word Cloud, also known as a tag cloud, is a visual representation method that allows us to describe and highlight the most frequently used keywords in a piece of text. True to its name, when visualizing the data in the given text, the data is represented in the form of a word cloud, where the words are displayed in varying sizes and intensities based on their frequency of occurrence in the original text. The more a word appears, the larger and bolder it appears in the word cloud, indicating its popularity and significance in the text.
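The frequency-to-size idea can be sketched in a few lines. This is a hedged illustration, not the implementation used in this thesis: the function `word_sizes`, the whitespace tokenization, and the font-size range of 10 to 60 are all assumptions chosen for the example.

```python
from collections import Counter


def word_sizes(text, min_size=10, max_size=60):
    """Scale each word's font size linearly with its frequency."""
    counts = Counter(text.lower().split())
    if not counts:
        return {}
    low, top = min(counts.values()), max(counts.values())
    span = max(top - low, 1)  # avoid dividing by zero when all counts match
    return {
        word: min_size + (count - low) * (max_size - min_size) / span
        for word, count in counts.items()
    }


# The most frequent word gets the largest size, the rarest the smallest.
sizes = word_sizes("good good good cheap cheap fast")
```

Here "good" (three occurrences) receives the maximum size and "fast" (one occurrence) the minimum, which mirrors how a word cloud emphasizes frequent words.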

Figure 2-2 Word example

Every text contains certain words called stop words, such as "thì," "và," "là," "mà," "to," etc. These words account for about 25% of the text. Stop words often carry little meaningful value and do not differ significantly across different texts. Imagine if the analysis result only showed words like "mình," "này," "để," etc. Clearly, this would affect the analysis result and would not provide much assistance to the intended audience reading the analysis result.

Figure 2-3 Visual illustration of a word cloud without stop-word removal (English)

In NLP, it is common practice to filter out noise words, including stop words. There are several approaches to removing stop words from a text, but two main methods are commonly used:

o Dictionary-based approach: In this method, we filter the text by removing words that appear in a predefined stop word dictionary.

o Frequency-based approach: With this method, we count the occurrences of each word in the text and then remove words that appear frequently (or infrequently). Research has shown that the most frequently appearing words often carry less meaningful information.
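The two approaches above can be sketched as follows. This is a minimal sketch under stated assumptions: the tiny `STOP_WORDS` set, the function names, and the `max_share` threshold of 0.2 are hypothetical values picked for the example, and the frequency-based variant shown here removes only overly frequent words.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word dictionary for the example.
STOP_WORDS = {"the", "is", "a", "and"}


def remove_by_dictionary(tokens, stop_words=STOP_WORDS):
    """Dictionary-based: drop tokens found in a predefined stop word set."""
    return [t for t in tokens if t.lower() not in stop_words]


def remove_by_frequency(tokens, max_share=0.2):
    """Frequency-based: drop tokens whose share of all tokens exceeds max_share."""
    counts = Counter(t.lower() for t in tokens)
    total = len(tokens)
    return [t for t in tokens if counts[t.lower()] / total <= max_share]


tokens = "the food is good and the delivery is fast".split()
by_dictionary = remove_by_dictionary(tokens)
by_frequency = remove_by_frequency(tokens)
```

On this short sentence the frequency-based filter removes "the" and "is" but keeps "and", which shows why the threshold needs tuning on a realistic corpus, whereas the dictionary-based filter depends entirely on the quality of the word list.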

Benefits of the Word Cloud task:

There are several advantages of using a Word Cloud to visualize and present data:

o Word Clouds are highly visual and provide a clear representation of the data, making it easy to understand.

o Word Clouds are user-friendly and easy to interpret.

o Word Clouds offer a more visually appealing alternative to traditional data presentations such as tables or charts.

2.3.3 Topic Modeling

The concept of "topic" in topic modeling can be understood as follows. According to the Cambridge Dictionary, a topic is a subject that is discussed, written about, or studied. In the Oxford Dictionary, a topic is a matter presented in a text, essay, or conversation. Hidden topics refer to the unidentified (unlabeled) topics in the process of generating user-generated content.

Topic modeling is a technique used to examine and explore textual data by searching and statistically analyzing relevant words related to topics in each document, thus discovering latent topics within the text. In this process, terms or words that exhibit similarity are grouped together, and topics are identified based on the statistical probability of the occurrence of those words. The topic modeling approach was initially proposed by Deerwester and colleagues in 1990, and further developed by research groups such as Hofmann in 1999 and Blei in 2003. Current approaches to topic modeling are based on the idea of estimating the probability distribution of each distinct word in the document. This distribution considers the document as a mixture of multiple topics, with each topic being a combination of multiple words accompanied by its own probability distribution. Typically, textual data is not limited to a single topic but may cover multiple topics. Therefore, the task of topic modeling is to identify the topics present in the textual data. Most topic models rely on the following assumptions:

• Each document consists of multiple topics.

• Each topic consists of multiple words.

The goal of topic modeling is to explore the hidden topics within documents by identifying the words associated with each topic.
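The two assumptions can be made concrete with a small numeric sketch. It does not fit a topic model; it only shows how, once topic-word distributions and a document's topic mixture are given, the probability of a word in that document is the mixture-weighted sum over topics. The topic names, words, and all probabilities below are invented for the example.

```python
# Hypothetical topics: each topic is a probability distribution over words.
topics = {
    "food": {"tasty": 0.6, "fresh": 0.3, "late": 0.1},
    "delivery": {"tasty": 0.1, "fresh": 0.1, "late": 0.8},
}

# Hypothetical document: a mixture of the two topics.
doc_mix = {"food": 0.7, "delivery": 0.3}


def word_probability(word, doc_mix, topics):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topics[topic].get(word, 0.0)
        for topic, weight in doc_mix.items()
    )


p_late = word_probability("late", doc_mix, topics)  # 0.7 * 0.1 + 0.3 * 0.8
```

A topic model such as LDA works in the opposite direction: it observes only the words and estimates the topic-word distributions and the per-document mixtures.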


Benefits and Applications of Topic Modeling:

Topic modeling offers several benefits and applications. Readers can easily select their preferred genre of news articles through the assigned topics. Book buyers can choose books related to their topics of interest based on the identified topics. News providers can summarize the content of news articles using topics. In summary, topic modeling helps us understand the content, issues, and characteristics of different topics, enabling readers to quickly and accurately determine the content of analyzed results.

Models for Topic Modeling:

In text mining, we often gather documents such as blog posts or news articles that we want to categorize into natural groups to understand them individually. Topic modeling is an unsupervised classification method for such documents, similar to grouping data into a predefined number of clusters, helping to identify the natural components of these groups since we don't know the exact composition of each group. Therefore, topic modeling reveals hidden semantic structures and provides profound insights into unstructured data, which is abundant on the internet. Some popular topic models include Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (pLSA), Latent Dirichlet Allocation (LDA), and more.


2.4 Current Approaches

2.4.1 Current Approaches in Sentiment Analysis

• Regarding the level:

Currently, there are multiple levels of sentiment analysis tasks. However, according to Marouane et al. [19], there are three main levels for detecting sentiment: document-level, sentence-level, and aspect-level. Figure 2-4 illustrates the levels of sentiment analysis. Marouane et al. [19] suggest that aspect-level sentiment analysis is more challenging as it involves finer-grained analysis at a detailed level.

Figure 2-4 Sentiment analysis levels

• Document Level Sentiment Analysis:

At the document level, sentiment analysis is performed on the entire document, assigning a single polarity to the entire document. This level of analysis is not commonly used but can be used to classify chapters or pages of a book as positive, negative, or neutral. Both supervised and unsupervised learning approaches can be employed for document classification. Cross-domain and cross-language sentiment analysis are significant challenges at the document level. Domain-specific sentiment analysis achieves high accuracy by utilizing domain-specific and limited feature vectors.


• Sentence Level Sentiment Analysis:

At the sentence level, each sentence is analyzed individually, and a corresponding polarity is assigned. This level of analysis is useful when a document contains a variety of sentiments. Subjective classification is associated with sentence-level analysis. The polarity of each sentence is determined independently using similar methodologies as document-level analysis, but with more training data and processing resources. The sentence polarities can be aggregated to determine the sentiment of the document or used individually. Sentence-level analysis is particularly important for working with conditional sentences or ambiguous statements.
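One possible way to aggregate sentence polarities into a document sentiment is a majority vote, sketched below. The function name `aggregate_polarities` and the rule that ties fall back to neutral are assumptions made for this illustration; other rules, such as averaging scores, are equally plausible.

```python
from collections import Counter


def aggregate_polarities(sentence_labels):
    """Majority vote over sentence labels (-1, 0, 1); ties and empty input give 0."""
    if not sentence_labels:
        return 0
    ranked = Counter(sentence_labels).most_common()
    # A tie between the two most frequent labels is treated as neutral.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return 0
    return ranked[0][0]


# Two positive sentences outweigh one negative sentence.
doc_label = aggregate_polarities([1, 1, -1])
```

A review with one positive and one negative sentence would be labeled neutral under this rule, which illustrates how aggregation can hide mixed opinions that the individual sentence labels still expose.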

• Phrase Level Sentiment Analysis:

Phrase-level sentiment analysis involves mining opinion words at the phrase level and performing classification. Each phrase may contain multiple aspects or a single aspect. This level of analysis is applicable in product reviews with multiple lines where a single aspect is expressed in a phrase. It has gained significant attention from researchers recently. While document-level analysis focuses on categorizing the entire document as subjective or objective, sentence-level analysis is more beneficial as a document usually contains both positive and negative statements. Words, as the basic unit of language, are closely related to the subjectivity of sentences or documents in which they appear. Sentences containing adjectives are highly likely to be subjective. Additionally, the choice of terms used for expression reflects demographic characteristics, desires, social standing, personality, and other psychological and social characteristics. Therefore, terms form the foundation for text sentiment analysis.


• Aspect Level Sentiment Analysis:

Aspect-level sentiment analysis is performed at the aspect level. Each sentence may contain multiple aspects, and this level of analysis focuses on all aspects present in the sentence. Polarity is assigned to each aspect, and an aggregate sentiment is calculated for the entire sentence.

Sentiment analysis is a dynamic and thriving domain of research, with diverse applications across multiple fields. Consequently, scholars consistently propose, assess, and contrast various methodologies. The primary objective is to enhance the efficacy of sentiment analysis and address the challenges inherent in this field. Moreover, the incorporation of sentiment analysis into novel domains offers substantial motivation and elevates the significance of this undertaking. However, selecting a suitable approach for sentiment analysis is critically important. Hence, this section aims to provide a comprehensive outline of the predominant methodologies employed for conducting sentiment analysis, which are widely recognized in academic circles.

The existing methodologies for sentiment analysis can be classified based on different perspectives, such as textual viewpoint and level of textual analysis depth [136]. However, in most scholarly literature, sentiment analysis approaches are typically categorized into three distinct groups: Machine Learning approaches, Lexicon-Based approaches, and Hybrid approaches [29,137,138].

Machine learning represents the predominant approach extensively employed in sentiment analysis. It relies on machine learning algorithms and linguistic features to accomplish sentiment classification. Conversely, the lexicon-based approach utilizes sentiment lexicons, which encompass a compilation of words and phrases commonly employed to convey positive or negative sentiments [139]. Hybrid approaches, on the other hand, amalgamate machine learning and lexicon-based techniques to enhance the performance of sentiment analysis. Figure 2-5 provides an overview of the sentiment analysis approaches.
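To make the lexicon-based idea concrete, the sketch below counts matches against two tiny word lists. The lists `POSITIVE_WORDS` and `NEGATIVE_WORDS` and the scoring rule are hypothetical and only for illustration; real sentiment lexicons are far larger and usually carry graded scores.

```python
# Hypothetical mini-lexicons; real lexicons contain thousands of scored entries.
POSITIVE_WORDS = {"good", "great", "fast", "delicious", "friendly"}
NEGATIVE_WORDS = {"bad", "slow", "cold", "late", "rude"}


def lexicon_polarity(text):
    """Return (score, label), where score = #positive matches - #negative matches."""
    tokens = [t.strip(".,!?;:\"'").lower() for t in text.split()]
    score = sum(t in POSITIVE_WORDS for t in tokens) - sum(t in NEGATIVE_WORDS for t in tokens)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return score, label


print(lexicon_polarity("Delicious food but slow delivery and cold soup."))
```

A hybrid approach could, for instance, feed such lexicon scores into a machine learning classifier as additional features.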

Figure 2-5 Overview of the sentiment analysis approaches

Prior to 2010, traditional methods dominated the field of text classification. These methods demonstrated clear and stable accuracy when compared to previous rule-based approaches. However, they had certain limitations. One drawback was the need for feature extraction, which was a time-consuming and expensive process. Additionally, traditional methods often overlooked the natural sequential structure and contextual information present in text data, thus hindering the learning of semantic information encoded in words. These factors presented significant challenges for text classification.

In the 2010s, there was a gradual shift from traditional models to deep learning models in the field of text classification. Deep learning methods offered several advantages over their traditional counterparts. They eliminated the requirement for manual rule design and feature engineering by leveraging automated techniques to generate semantically meaningful representations for text mining. Consequently, a majority of text classification research focused on Deep Neural Networks (DNN), which offered a data-driven approach albeit with higher computational complexity.

Text classification is the process of extracting features from raw text data and predicting their corresponding categories. Over the past few decades, numerous models have been proposed for text classification. Among the traditional models, the Naive Bayes model was the first to be used for text classification tasks. Subsequently, general classification models such as K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Random Forest (RF) emerged as popular classifiers in the domain of text classification.
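As an illustration of the traditional pipeline (feature extraction followed by classification), the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing over bag-of-words counts. The function names and the four training sentences are invented for this example; it is not the implementation evaluated in this thesis.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (text, label). Collect the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict_naive_bayes(model, text):
    """Pick the label with the highest log prior + summed log likelihood."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # words never seen in training carry no evidence
            # add-one (Laplace) smoothing avoids zero probabilities
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set.
training_data = [
    ("food was delicious", "positive"),
    ("fast delivery great food", "positive"),
    ("cold food late delivery", "negative"),
    ("delivery was slow", "negative"),
]
model = train_naive_bayes(training_data)
print(predict_naive_bayes(model, "delicious food"))
print(predict_naive_bayes(model, "slow delivery"))
```

The word counts play the role of the hand-crafted features mentioned above; deep learning models replace them with learned representations.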

In the realm of deep learning models, TextCNN garnered the highest number of references. This model introduced the Convolutional Neural Network (CNN) to tackle text classification problems effectively. Additionally, the Bidirectional Encoder Representation from Transformers (BERT), although not specifically designed for text classification tasks, found extensive usage in designing text classification models. Its effectiveness has been evaluated on various text classification datasets, particularly in the domain of emotion analysis.

As a result, several pre-trained models based on BERT have been developed, including ALBERT, RoBERTa, XLNet, DistilBERT, BART, and others. Figure 2-6 illustrates the chronological release of these proposed models over the years.

[Figure: timeline of pre-trained language models; labels recoverable from the image include ELMo, GPT, BERT, MT-DNN, XLM, and XLM-R]

Figure 2-6 The chronological release of these proposed models over the years

2.4.2 Topic Modeling Approaches

In recent years, with the advancement of technology and the internet, users can easily express their opinions and reviews about products and services online. These user-generated texts are stored as textual data, which presents a vast source for data mining. Due to the strong growth in this field and the development of e-commerce and the internet, customers now have more choices when shopping or using services. Therefore, businesses need to understand their customers in order to meet their needs in a timely manner. One method is to understand the issues that customers comment on. As a result, many studies have been conducted using various methods and models to analyze customer experiences and improve the quality of products and services.

One such study by Raut & Londhe, as presented by Van-Ho et al. (2020) [24], utilized machine learning techniques and SentiWordNet to extract opinions from hotel reviews. The project relied on the relevance between sentences to aggregate topics related to hotel reviews. The results successfully classified and summarized hotel reviews, enabling businesses to understand customer preferences. Van-Ho et al. (2020) [24] also presented a study on analyzing customer feedback in the tourism industry by proposing a text summarization technique to identify topics. Another study [24] examined content and sentiment similarity to determine the similarity between two comment sentences. The study used the k-medoids clustering algorithm to group sentences into k clusters.

In Berezina's study [24], an evaluation of 2,510 online hotel reviews collected from TripAdvisor.com for Sarasota, Florida, was conducted. The research findings revealed common "themes" used in both positive and negative reviews, including business facilities (e.g., hotel and interior design, staff, and amenities). The study also indicated that satisfied customers were more likely to recommend the hotel to others, mentioning intangible factors related to their stay, such as the staff's behavior, more frequently than unsatisfied customers. On the other hand, unsatisfied customers more frequently referred to tangible aspects of the hotel, such as the interior and financial aspects (cost, price). The research provided theoretical implications and clear managerial insights into understanding customer satisfaction and dissatisfaction through text mining and hotel ratings obtained from review websites, social media, blogs, and other online platforms.

2.5 Algorithm and Fundamental Concept

2.5.1 Vietnamese Tokenization

Tokenization, also known as word segmentation, is the process of splitting text into individual tokens or segments. In the case of this thesis, the data is in Vietnamese, so to perform effective tokenization, we need knowledge about words and word structures in Vietnamese. Before delving into the definition and types of tokenization, let's explore the characteristics of word structures in Vietnamese.

2.5.1.1 Characteristics of Vietnamese Word Structures

Each text is composed of sentences, and each sentence is composed of words. Each word is formed from individual syllables. This means:

• A word is the smallest linguistic unit used to form a sentence.

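[Figure 2-7 diagram: Từ (words, classified by structure) divide into Từ đơn (simple words) and Từ phức (complex words). Từ đơn are either đơn âm tiết (monosyllabic) or đa âm tiết (polysyllabic). Từ phức are either Từ ghép (compound words: tổng hợp, i.e. composite, or phân loại, i.e. classifying) or Từ láy (reduplicated words: toàn bộ, i.e. complete, or bộ phận, i.e. partial).]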

Figure 2-7 Structure of words in Vietnamese

In the simple-word (từ đơn) branch, a simple word has either one syllable or several:

• Single-syllable words: these simple words consist of only one syllable.

• Multi-syllable words: these simple words are composed of multiple syllables. Examples include the names of certain animals, such as “ba ba”, “chuồn chuồn”, and “châu chấu”, and words borrowed from foreign languages, such as ti vi (TV), cà phê (coffee), and in-ter-net (internet).

In the complex-word (từ phức) branch, complex words are words that consist of two or more syllables. For example: sạch sẽ (clean), sạch sành sanh (spotlessly clean), lúng ta lúng túng (confused). Complex words are classified as follows:

• Compound words: Compound words are complex words formed by combining words that have a semantic relationship. For example, cao lớn (tall and big, where both words have an equal semantic relationship) and cao vút (tall and towering, where "cao" is the main word and "vút" is the secondary word that adds meaning to the main word).

• Reduplicated words: Reduplicated words are complex words formed by repeating the same initial sound, rhyme, or both. For example, đo đỏ (reddish, where both syllables have the same initial sound and rhyme), lao xao (tumultuous, where both syllables have the same rhyme), and xôn xao (stirred up, where both syllables have the same initial sound).

• Composite compound words: For example, Trong xanh (fresh green), where "Trong" and "xanh" have an equal semantic relationship.

• Classifying compound words: For example, Xanh rì (greenish-blue), where "xanh" is the main word and "rì" is the secondary word that adds meaning to the main word.

• Complete reduplicated words: For example, Xanh xanh (greenish-green), where both syllables are completely identical.

• Partial reduplicated words: For example, Xanh xao (greenish-faint), where both syllables have the same initial sound.

After understanding the structural features of Vietnamese words, we will explore the definition of tokenization and the current approaches to tokenization.

2.5.1.2 Tokenization

Tokenization is one of the most important steps in text preprocessing. Tokenization is the process of splitting a group of words, a sentence, a paragraph, or one or more documents into smaller units. Each of these smaller units is called a token. Tokens can be considered the building blocks of NLP, and all NLP models process raw text at the token level. They are used to create a vocabulary in a corpus (a dataset in NLP). This vocabulary is then converted into numbers as an ID representation, which helps us build models. A token can be a word, a sub-word, or a character.

Different algorithms follow different procedures in performing tokenization. There are three levels of tokenization algorithms: word-based tokenization, character-based tokenization, and subword-based tokenization. The differences between these three types of tokenization are outlined below:

• Word-based tokenization: This is a commonly used tokenization technique in text analysis. It divides a text into words (e.g., in English) or syllables (e.g., in Vietnamese) based on whitespace as the separator. For example, the sentence "I like you." would be tokenized into ['I', 'like', 'you.']. Word tokenization can be easily done using the split() method in Python. There are also many Python libraries that support word tokenization, such as NLTK, spaCy, Keras, Gensim, etc. Depending on the NLP models used, appropriate tokenization methods are applied for different languages. Depending on the task, the same text can be processed under different token types. Each token typically has a unique representation and is encoded as an ID, which serves as a way to encode or identify tokens in a numerical space. Figure 2-8 illustrates how tokens are represented as numbers. One limitation of this technique is that it leads to a large vocabulary and a large number of tokens, making the model cumbersome and computationally resource-intensive. Additionally, misspelled words, after tokenization, are still treated as tokens in their own right. For example, after tokenization, the token list may include both "minh" and "mih", a misspelled version of "minh", and the model may assign the OOV (Out of Vocabulary) token to such words. To address these issues, researchers have proposed character-based tokenization techniques.

• Character-based tokenization: This technique splits raw text into individual characters. For example, the Vietnamese sentence "Tôi thích cậu" ("I like you") would be tokenized into ['t', 'ô', 'i', 't', 'h', 'í', 'c', 'h', 'c', 'ậ', 'u']. The idea behind this tokenization method is that a language may have many different words, but only a fixed number of characters. This leads to a smaller vocabulary size. For example, English text can be represented with about 256 different characters, including letters, numbers, punctuation marks, and special characters, whereas English has nearly 170,000 words in its vocabulary. Figure 2-8 provides an example of the difference when converting tokens into numerical format. Each token corresponds to a number. The figure illustrates both word-based and character-based tokenization algorithms.

[Figure 2-8: sample sentences ("... grew a pretty little fir-tree; and yet it was not happy", '"Rejoice with us," said the air and the sunlight') tokenized at the word level and at the character level, with each token mapped to a numeric ID]

This type of tokenization is simpler and can reduce memory and time. However, a character usually does not carry as much meaning or information as a word, which poses difficulties for models to learn the semantics of input representations. For example, learning the meaning of the character "t" based on context is harder than learning the meaning of the word "like" based on context. Additionally, while this technique helps reduce the vocabulary size, it increases the length of the tokenized sequence. Each word is divided into individual characters, resulting in a much longer tokenized sequence compared to the original raw text. Therefore, despite solving many challenges faced by word-based tokenization, character-based tokenization still has certain issues.

Overall, character-based tokenization has its advantages but also some limitations that need to be considered.

• Subword-based tokenization: This is another popular tokenization technique, based on subword units. Figure 2-9 provides an example of the subword-based tokenization algorithm. It is a solution that combines word-based and character-based tokenization. The main idea is to simultaneously address the issues of word-based tokenization (large vocabulary size, many OOV tokens, tokens with similar semantics) and character-based tokenization (long sequences and less meaningful individual tokens). Subword-based tokenization algorithms follow these principles:

• Do not split frequently occurring words into smaller subwords.

• Split infrequently occurring words into meaningful subwords.

Most Transformer models use subword-based tokenization algorithms, among which WordPiece is commonly used by BERT and DistilBERT, SentencePiece by XLNet and ALBERT, and Byte-Pair Encoding by GPT-2 and RoBERTa.
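The word-level and character-level schemes described above, together with the mapping from tokens to IDs and the handling of out-of-vocabulary tokens, can be contrasted in a few lines of Python. This is a minimal sketch: the helper names, the `<unk>` token string, and the choice of ID 0 for unknown tokens are assumptions made for the illustration.

```python
def word_tokenize(text):
    """Whitespace-based word (or syllable) tokenization via str.split()."""
    return text.split()


def char_tokenize(text):
    """Character-level tokenization; whitespace characters are dropped."""
    return [ch for ch in text if not ch.isspace()]


def build_vocab(tokens, unk_token="<unk>"):
    """Map each distinct token to an integer ID; ID 0 is reserved for unknown tokens."""
    vocab = {unk_token: 0}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab, unk_token="<unk>"):
    """Convert tokens to IDs, falling back to the unknown-token ID."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


sentence = "I like you."
word_tokens = word_tokenize(sentence)   # ['I', 'like', 'you.']
char_tokens = char_tokenize(sentence)   # nine single-character tokens

word_vocab = build_vocab(word_tokens)
# "pizza" was not seen when the vocabulary was built, so it maps to <unk>.
unseen_ids = encode(word_tokenize("I like pizza"), word_vocab)

print(word_tokens, unseen_ids)
print(len(word_tokens), len(char_tokens))
```

The output lengths show the trade-off discussed above: the character-level sequence is longer than the word-level one, while a word-level vocabulary cannot represent unseen words.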

Danh | sách | 180 | <unk> | nghề | hiện | nay

(4) Subword level (BPE)

Figure 2-9 Compare the difference of word separation at different levels

Subword-based tokenization enables the model to achieve an optimal vocabulary size and acquire meaningful, context-independent representations. Moreover, the model can handle previously unseen words, because the segmentation process breaks them into known subword components. Consequently, these advancements in tokenization techniques have emerged as a response to the growing demands of natural language processing (NLP), facilitating enhanced problem-solving capabilities. Notably, Byte Pair Encoding (BPE) stands as one of the prominent methods utilized for segmenting words into subwords. Subsequently, we will delve into an exploration of the underlying principles and mechanisms of BPE.

• Byte Pair Encoding (BPE):

The drawback of character-level tokenization is that the tokens are not meaningful when considered independently. Therefore, applying character-level tokenization to sentiment analysis tasks may yield poorer results.

Word-level tokenization also has limitations, as it cannot handle out-of-vocabulary words.

A method proposed in the 2016 paper "Neural Machine Translation of Rare Words with Subword Units" tokenizes text at a level smaller than words and larger than characters, called subwords. This method is known as Byte Pair Encoding (BPE). According to this method, most words can be represented by subwords, and we can significantly reduce the number of `<unk>` tokens representing previously unseen words. Subword tokenization has quickly been adopted in various modern NLP approaches: BPE is used by OpenAI GPT and RoBERTa, and closely related subword schemes are used by BERT, DistilBERT, and XLNet.

Applying tokenization using this new method has improved accuracy in various tasks such as text translation, text classification, next sentence prediction, question answering, and text relationship prediction.
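The core of BPE training is a loop that repeatedly merges the most frequent pair of adjacent symbols. The sketch below illustrates that loop on a toy word-frequency table; the function names, the `</w>` end-of-word marker, and the toy corpus are assumptions for illustration, and production tokenizers add many details (byte-level handling, special tokens, efficient data structures) that are omitted here.

```python
from collections import Counter


def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus_word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} table."""
    # Start from single characters plus an end-of-word marker.
    word_freqs = {tuple(word) + ("</w>",): freq for word, freq in corpus_word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs


# Toy corpus (hypothetical counts).
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_counts, num_merges=3)
print(merges)
print(list(segmented))
```

Frequent endings are merged into a single subword early, while rare words remain split into smaller pieces, which matches the two principles listed earlier for subword tokenization.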

• Tokenization for Vietnamese:

In the case of the Vietnamese language, which is an isolating language, words do not undergo morphological changes, and word boundaries are not indicated by whitespace. The grammatical meaning in Vietnamese lies outside of words, and the primary grammatical devices are word order and function words. Therefore, there are cases where a sentence can have different meanings depending on how we tokenize it, causing ambiguity in the sentence's semantics. For example, the sentence "Xoài phun thuốc sâu không ăn" can be tokenized in two different ways with completely different meanings:

- "Xoài / phun thuốc / sâu / không / ăn."

- "Xoài / phun / thuốc sâu / không / ăn."

This demonstrates that word segmentation in Vietnamese is not an easy task because it can generate sentences with completely different meanings, which affects the quality of model training. Therefore, word segmentation is crucial for processing the Vietnamese language, especially when dealing with tasks related to the semantics of the text. There are several toolkits available to support word segmentation in Vietnamese, such as PyVi, Underthesea, VnCoreNLP, etc. Each toolkit has its own advantages and disadvantages; for example, PyVi is faster, while VnCoreNLP offers higher accuracy. Generally, Vietnamese tokenization tools perform word segmentation by joining the syllables of a multi-syllable word with the underscore character "_" before tokenizing into individual tokens. Depending on the task, we can choose to perform word segmentation or tokenization accordingly.

In data processing, the input data for model training is a list of tokens. Having multiple tokens representing the same meaning can affect the effectiveness of model training. Therefore, the tokenization step is crucial before feeding the data into the training model, especially for Vietnamese due to the complexity of words and word structures compared to English.
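The underscore convention and the segmentation ambiguity discussed above can be illustrated with a simple dictionary-based longest-matching segmenter. This is only a sketch under stated assumptions: the `segment` function and the two mini-dictionaries are hypothetical, and toolkits such as PyVi, Underthesea, and VnCoreNLP rely on trained models rather than this rule.

```python
def segment(sentence, dictionary, max_len=3):
    """Greedy longest-match segmentation; multi-syllable words are joined with '_'."""
    syllables = sentence.split()
    words, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first, then shorter ones; a single syllable always matches.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = tuple(syllables[i:i + n])
            if n == 1 or candidate in dictionary:
                words.append("_".join(candidate))
                i += n
                break
    return words


sentence = "Xoài phun thuốc sâu không ăn"

# Two hypothetical dictionaries that lead to the two readings of the sentence.
reading_a = segment(sentence, {("thuốc", "sâu")})
reading_b = segment(sentence, {("phun", "thuốc")})

print(reading_a)
print(reading_b)
```

The same surface string yields different token lists depending on which multi-syllable words the segmenter recognizes, which is why the choice of segmentation tool affects downstream model quality.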

2.5.2 RNN, LSTM and Transformer Architecture

2.5.2.1 Recurrent Neural Network (RNN)

Recurrent Neural Networks (RNNs) are a specific type of Artificial Neural Network (ANN) characterized by interconnected nodes that simulate the behavior of neurons in the human brain. These neural connections, similar to synapses in biological brains, allow for the transmission of signals between nodes [24]. The received signals are processed by artificial neurons, which in turn transmit the processed information to other connected nodes. To facilitate the learning process, neurons and connections are assigned weights that can be adjusted accordingly. These weights control the strength of signals as they propagate from the input layers to the output layers. ANNs typically comprise hidden layers situated between the input and output layers.

In the case of RNNs, it is recommended to have a minimum of three hidden layers. The fundamental architecture of RNNs consists of input units, output units, and hidden units. The hidden units are responsible for carrying out calculations through weight adjustments, ultimately generating the desired outputs [18,25,26]. Information flows unidirectionally from the input units to the hidden units in RNNs, while a directional loop compares the error of the current hidden layer with that of the previous hidden layer, thereby enabling adjustments to the weights between the hidden layers. Figure 2-10 illustrates a simplified RNN architecture featuring two hidden layers.

Input Layer Hidden Layers Output Layer

Figure 2-10 A simple RNN

Recurrent Neural Networks (RNNs) are built upon traditional Neural Networks and are specifically designed to model sequential data, such as word sequences in a sentence. RNNs are called "recurrent" because they perform the same task for all elements of a sequence, with the output at each element depending on computations performed on previous elements. In other words, RNNs are designed to have the ability to remember information from preceding elements, making them suitable for capturing the sequential dependencies among elements in a sequence.
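The recurrence described above, where the same computation is applied at every element and the result depends on the previous hidden state, can be sketched with a scalar RNN. The update rule h_t = tanh(w_x * x_t + w_h * h_{t-1} + b) is the standard simple (Elman-style) formulation; the function name and the numeric weights are hypothetical, and real networks use weight matrices and vectors rather than scalars.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b), same weights at every step."""
    h, states = h0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# Only the first input is non-zero, yet later states stay non-zero:
# the hidden state carries information forward from earlier elements.
states = rnn_forward([1.0, 0.0, 0.0], w_x=0.5, w_h=0.8, b=0.0)
print(states)
```

The later states shrink step by step in this example, which already hints at the difficulty of retaining information over long sequences.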

Although RNNs are in theory capable of handling long-term dependencies, they fall short in practical applications. This problem was explored in depth by Hochreiter (1991) and Bengio et al. (1994) [1]. Remembering information over long periods requires propagating gradients between distant nodes, which involves repeated multiplications by the Jacobian matrix. The resulting problems, the more common vanishing gradients and the less frequent exploding gradients, made the performance of these models unsatisfactory. A trade-off was found between gradient-descent-based learning and the length of time over which information must be held.

RNNs are also commonly used to process other types of sequential data, such as predicting stock market trends.

However, after being applied for some time, RNNs have shown certain weaknesses:

• Slow training speed, even when using Truncated Backpropagation for training. Despite this technique, training remains slow because the computation proceeds sequentially across time steps and cannot fully leverage parallel computing on GPUs.

• Inefficient processing of long sequences due to the Gradient Vanishing/Exploding problem. As the number of words increases, the number of unrolled units in the network also grows, causing the gradients to diminish gradually as they are propagated back through the units by the chain rule of differentiation. This leads to the loss of information about long-range dependencies between units.

To address the Gradient Vanishing problem of RNNs, the Long Short-Term Memory (LSTM) model was introduced in 1997. LSTM cells have an additional memory branch (C) that allows information to flow through the cell, helping to maintain information for longer sentences.
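The vanishing-gradient effect can be made tangible with a small numeric experiment on a scalar tanh RNN. By the chain rule, the derivative of the final hidden state with respect to the initial one is a product of per-step factors w_h * (1 - h_t^2); when each factor has magnitude below one, the product shrinks rapidly. The function name and the parameter values below are assumptions chosen for illustration.

```python
import math


def gradient_through_time(w_h, steps, x=0.5, w_x=1.0):
    """Return |d h_t / d h_0| after each of `steps` time steps of a scalar tanh RNN."""
    h, grad, history = 0.0, 1.0, []
    for _ in range(steps):
        h = math.tanh(w_x * x + w_h * h)
        # chain rule: d h_t / d h_{t-1} = w_h * (1 - h_t^2)
        grad *= w_h * (1.0 - h * h)
        history.append(abs(grad))
    return history


history = gradient_through_time(w_h=0.5, steps=20)
print(history[0], history[9], history[19])
```

With a recurrent weight of 0.5, every factor is at most 0.5 in magnitude, so the influence of the earliest state on the latest one decays at least geometrically with the sequence length.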

To overcome this, Hochreiter and Schmidhuber (1997) [2] introduced Long Short-Term Memory networks, usually called LSTMs. LSTMs accumulate long-term relationships between distant nodes by designing weight coefficients between connections. These networks have been applied with great success in speech processing, natural language processing, and image captioning, among other applications.

2.5.2.2 Long Short-Term Memory

Long Short-Term Memory units (LSTMs) are an advanced form of RNN designed to handle the vanishing gradient problem properly. A typical LSTM unit encompasses essential components, namely a cell, a forget gate, an input gate, and an output gate, as visually depicted in Figure 2-11. The cell retains values across arbitrary time intervals within its memory, while the three gates regulate the flow of information into and out of the cell. These gates can discern the significance of data within a sequence, enabling the LSTM to retain essential information and discard irrelevant details. Through this mechanism, pertinent information is transmitted along the extensive sequence chain to facilitate accurate predictions.

The fundamental aspect of LSTMs lies in the cell state, symbolized by the horizontal line traversing the top section of Figure 2-11. Conceptually, it can be likened to a conveyor belt that allows information to flow unhindered. The cell state in LSTMs is subject to modification, with the inclusion or removal of information being meticulously regulated by the gates. The sigmoid layer, represented by the symbol σ, produces outputs ranging between zero and one, where zero signifies the suppression of information passage, while one signifies the unimpeded flow of information. The initial gate, referred to as the forget gate, determines which information is to be discarded from the cell state. Subsequently, the input gate ascertains which novel information is to be stored in the cell state, while the output gate yields the final output based on the input and memory.
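The interplay of the forget, input, and output gates can be sketched as a single step of a scalar LSTM cell, following the commonly used formulation. The parameter names (w_*, u_*, b_*) and their values are hypothetical; practical implementations operate on vectors and weight matrices and are provided by deep learning frameworks.

```python
import math


def sigmoid(z):
    """Logistic function; its output lies between zero and one."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; `p` holds hypothetical weights and biases."""
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])          # forget gate
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])          # input gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])          # output gate
    c_tilde = math.tanh(p["w_c"] * x + p["u_c"] * h_prev + p["b_c"])  # candidate values
    c = f * c_prev + i * c_tilde   # keep part of the old cell state, add new information
    h = o * math.tanh(c)           # output based on the updated cell state
    return h, c


base = {name: 0.5 for name in ("w_f", "u_f", "w_i", "u_i", "w_o", "u_o", "w_c", "u_c")}

# Forget gate near one and input gate near zero: the cell state is carried over almost unchanged.
keep = dict(base, b_f=20.0, b_i=-20.0, b_o=0.0, b_c=0.0)
# Forget gate near zero: the previous cell state is discarded.
drop = dict(base, b_f=-20.0, b_i=0.0, b_o=0.0, b_c=0.0)

h_keep, c_keep = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, p=keep)
h_drop, c_drop = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, p=drop)
print(c_keep, c_drop)
```

The two parameter settings show the "conveyor belt" behaviour of the cell state: the gates decide whether the stored value passes through or is replaced.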

Date posted: 04/10/2024, 17:00
