VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS
NGUYEN HOANG NHAT
REAL-TIME CUSTOMER REVIEWS ANALYSIS
USING MACHINE LEARNING
ON BIG DATA FRAMEWORK
BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS
HO CHI MINH CITY, 2021
VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS
NGUYEN HOANG NHAT - 17520851
REAL-TIME CUSTOMER REVIEWS ANALYSIS
USING MACHINE LEARNING
ON BIG DATA FRAMEWORK
BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS
THESIS ADVISOR
HOP DO TRONG, Ph.D.
HO CHI MINH CITY, 2021
ASSESSMENT COMMITTEE
The Assessment Committee was established under a Decision of the Rector of the University of Information Technology, Ho Chi Minh City, dated January 18.
1. Tho Quan Thanh, Ph.D. - Chairman
2. Thanh Ngo Duc, Ph.D. - Secretary
3. Nhan Cao Thi, Ph.D. - Commissary
ACKNOWLEDGMENTS
First of all, I thank God for the support and for providing me with the patience and guidance that I used in making this work.
I would like to thank my supervisor, Hop Do Trong, Ph.D., for his help, continuous encouragement, productive discussions, and valuable suggestions and comments throughout the research and thesis work.
It has been a fantastic experience studying at the University of Information Technology. I was motivated and excited about studying, and my expectations were ultimately fulfilled.
Last but not least, I owe more than thanks to my family members, especially my brother, for their support and encouragement throughout my life.
TABLE OF CONTENTS
Chapter 1 Problem Statement
1.1 Rationale
1.2 Aims
1.3 Object and range of study
Chapter 2 Related Work
Chapter 3 Methodology
3.1 Big Data
3.2 Spark framework
3.2.1 Core technologies of Spark and its components
3.2.2 Big Data Analysis Using Spark
3.3 Machine Learning
3.3.1 Machine learning process
3.3.2 Machine learning algorithms
3.4 Twitter
3.4.1 Twitter API
3.4.2 Sentiment Analysis of Twitter Data
3.5 Crawl data
3.6 Data preprocessing
3.7 Datasets
3.7.3 Our dataset
3.7.4 Mix dataset
3.8 Spark NLP
3.9 TF-IDF
Chapter 4 Experimental Results
4.1 Prepare data for the real-time problem
4.2 Dataset for train-test
4.3 Train model
4.3.1 TF-IDF vectorizer + Logistic Regression Classifier
4.3.2 TF-IDF vectorizer + NaiveBayes Classifier
4.3.3 TF-IDF vectorizer + Decision Tree Classifier
4.3.4 TF-IDF vectorizer + Random Forest Classifier
4.4 Compare models
4.5 Real-time sentiment analysis
4.6 Compare with UIT-VSMEC
4.7 Challenges
Chapter 5 Summary
Chapter 6 Future Work
REFERENCES
LIST OF FIGURES
Figure 3.1 The three Vs of big data
Figure 3.2 The ecosystem of Spark
Figure 3.3 Apache Spark data processing
Figure 3.4 A typical process of machine learning [17]
Figure 3.5 Summary of machine learning algorithms [18]
Figure 3.6 Sentiment analysis steps
Figure 3.7 Attributes available in snscrape tweet object
Figure 3.8 Code for crawling Vietnamese tweets
Figure 3.9 Dataframe of tweets
Figure 3.10 Pipeline for Spark NLP
Figure 3.11 Calculate TF
Figure 4.1 Format of data test in real-time
Figure 4.2 Separate the data set
Figure 4.3 Cleaning data
Figure 4.4 Libraries for training the model
Figure 4.5 Dictionary of emotions
Figure 4.6 TF-IDF vectorizer + Logistic Regression Classifier pipeline
Figure 4.7 TF-IDF vectorizer + NaiveBayes Classifier pipeline
Figure 4.8 TF-IDF vectorizer + Decision Tree Classifier pipeline
Figure 4.9 TF-IDF vectorizer + Random Forest Classifier pipeline
Figure 4.10 Steps for real-time sentiment analysis
Figure 4.11 Result of real-time sentiment prediction of tweets
Figure 4.12 Count of emotions in real-time
Figure 4.13 Visualizing the number of each emotion in real-time
Figure 4.14 Calculating average latency in real-time
Figure 4.15 Percentage of each emotion
Figure 4.16 Visualizing the percentage of each emotion
Figure 4.17 Count of sentences already analyzed
LIST OF TABLES
Table 3.1 Steps for data preprocessing
Table 3.2 Statistics of emotion labels of the UIT-VSMEC corpus [1]
Table 3.3 Statistics of emotion labels of our dataset
Table 3.4 Statistics of emotion labels of the mix dataset
Table 4.1 TF-IDF vectorizer + Logistic Regression Classifier
Table 4.2 TF-IDF vectorizer + Logistic Regression Classifier with my data train + UIT data test
Table 4.3 TF-IDF vectorizer + Logistic Regression Classifier with my data train + mix data test
Table 4.4 Compare accuracy using TF-IDF vectorizer + Logistic Regression Classifier with my data used for training
Table 4.5 TF-IDF vectorizer + Logistic Regression Classifier with UIT data train + my data test
Table 4.6 TF-IDF vectorizer + Logistic Regression Classifier with UIT data train + UIT data test
Table 4.7 TF-IDF vectorizer + Logistic Regression Classifier with UIT data train + mix data test
Table 4.8 Compare accuracy using TF-IDF vectorizer + Logistic Regression Classifier with UIT data used for training
Table 4.9 TF-IDF vectorizer + Logistic Regression Classifier with mix data train + my data test
Table 4.10 TF-IDF vectorizer + Logistic Regression Classifier with mix data train + UIT data test
Table 4.11 TF-IDF vectorizer + Logistic Regression Classifier with mix data train + mix data test
Table 4.12 Compare accuracy using TF-IDF vectorizer + Logistic Regression Classifier with mix data used for training
Table 4.13 TF-IDF vectorizer + NaiveBayes Classifier with my data train + my data test
Table 4.14 TF-IDF vectorizer + NaiveBayes Classifier with my data train + UIT data test
Table 4.15 TF-IDF vectorizer + NaiveBayes Classifier with my data train + mix data test
Table 4.16 TF-IDF vectorizer + NaiveBayes Classifier with UIT data train + my data test
Table 4.17 TF-IDF vectorizer + NaiveBayes Classifier with UIT data train + UIT data test
Table 4.18 TF-IDF vectorizer + NaiveBayes Classifier with UIT data train + mix data test
Table 4.19 TF-IDF vectorizer + NaiveBayes Classifier with mix data train + my data test
Table 4.20 TF-IDF vectorizer + NaiveBayes Classifier with mix data train + UIT data test
Table 4.21 TF-IDF vectorizer + NaiveBayes Classifier with mix data train + mix data test
Table 4.22 Compare accuracy using TF-IDF vectorizer + NaiveBayes Classifier
Table 4.23 TF-IDF vectorizer + Decision Tree Classifier with my data train + my data test
Table 4.24 TF-IDF vectorizer + Decision Tree Classifier with my data train + UIT data test
Table 4.25 TF-IDF vectorizer + Decision Tree Classifier with my data train + mix data test
Table 4.26 TF-IDF vectorizer + Decision Tree Classifier with UIT data train + my data test
Table 4.27 TF-IDF vectorizer + Decision Tree Classifier with UIT data train + UIT data test
Table 4.28 TF-IDF vectorizer + Decision Tree Classifier with UIT data train + mix data test
Table 4.29 TF-IDF vectorizer + Decision Tree Classifier with mix data train + my data test
Table 4.30 TF-IDF vectorizer + Decision Tree Classifier with mix data train + UIT data test
Table 4.31 TF-IDF vectorizer + Decision Tree Classifier with mix data train + mix data test
Table 4.32 Compare accuracy using TF-IDF vectorizer + Decision Tree Classifier
Table 4.33 TF-IDF vectorizer + Random Forest Classifier with my data train + my data test
Table 4.34 TF-IDF vectorizer + Random Forest Classifier with my data train + UIT data test
Table 4.35 TF-IDF vectorizer + Random Forest Classifier with my data train + mix data test
Table 4.36 TF-IDF vectorizer + Random Forest Classifier with UIT data train + my data test
ABSTRACT
Firms have many ways to monitor their brand reputations and assess their performance and public feelings about their goods, but nowadays, using automated sentiment analysis of customer reviews is the easiest.
In this study, we present our efforts to develop a machine learning-based system for sentiment analysis of tweets on Twitter and to make it analyze in real-time, using the Apache Spark Big Data framework.
Our system outperforms the Emotion Recognition for Vietnamese Social Media Text, 2019 (UIT-VSMEC) [1]. We make a dataset using a tweet crawler and label each entry with the emotion that best reflects its sentiment: sadness, enjoyment, anger, disgust, fear, or surprise. We also merge it with the UIT-VSMEC project's dataset; the result is a Vietnamese-language dataset. The system takes an arbitrary tweet as input and assigns it to the class that best reflects its sentiment. The significant result is that, out of the classification methods studied, the Logistic Regression Classifier had the best classification accuracy for this domain.
After the analysis, we create a report containing the count of each emotion, the total number of tweets, and the percentage of each emotion. Our system can analyze the sentiment of millions of tweets in pseudo real-time.
Chapter 1
Problem Statement
1.1 Rationale
With the development of technology, social networks have become a "gold mine" for collecting customer reviews. Historically, businesses gathered feedback and insight into consumers' feelings about their goods via interviews, questionnaires, and surveys. These traditional methods were often extraordinarily time-consuming, expensive, and manual.
Tweets are often used to express opinions on a wide range of topics. These opinions have an important effect on a variety of business decisions, as well as on political opinion toward a certain candidate. To be successful, a business should develop a sentiment analysis model that can extract customer reviews and detect the emotions that consumers express. From there, it may utilize this information to plan for the consolidation and development of goods and services that are more appealing to customers.
Consumers can use sentiment analysis to research products or services before making a purchase, e.g., a Kindle.
Marketers can use it to research public opinion of their company and products, or to analyze customer satisfaction, e.g., election polls.
Organizations can also use it to gather critical feedback about problems in newly released products, e.g., brand management (Nike, Adidas).
1.2 Aims
We want to extract attributes from tweets and analyze their emotions, which might be characterized as sadness, enjoyment, anger, disgust, fear, or surprise. The UIT-VSMEC project's dataset is used in conjunction with our dataset in order to perform emotion classification using the Spark framework. Following that, real-time sentiment analysis is performed. When the models are applied to the Vietnamese language, we can see how well they work and how accurate they are.
1.3 Object and range of study
We crawl tweets on Twitter and categorize the data so that it can be used as part of our collection. In order to tackle this challenge, we make use of our dataset, which contains more than 4,000 tweets, as well as the UIT-VSMEC project's dataset. The dataset represents the Vietnamese language. We used classic machine learning models in Spark (Logistic Regression, Naive Bayes, Decision Tree, and Random Forest) to compare the performance and accuracy of the models, and Vietnamese user reviews were used to determine the most appropriate model for sentiment analysis. The input data is massive, so we need a Big Data framework such as Spark for this processing. Sometimes we need to analyze the sentiment of past sentences; nevertheless, for the most part we apply sentiment analysis in real-time, because continuously updated results are valuable when we want to survey comments on new services or goods in order to make decisions.
Chapter 2
Related Work
In 2020, Mandloi and Patel [2] showed that the evolution of social media platforms drew millions of users to sites like Twitter, where users may write 280-character tweets. Tweets' low character count facilitates sentiment analysis, and with a daily average of 550 million tweets, sentiment analysis of Twitter data becomes a proxy for societal attitudes. This research used the Naive Bayes Classifier, Support Vector Machine (SVM)1, and Maximum Entropy Method2. As an outcome, Mandloi and Patel developed a sentiment analysis method and used it in real-time applications. This algorithm is suitable for use in political or other types of review systems. The research demonstrates that machine learning techniques such as Naive Bayes have the best accuracy and may be considered baseline learning methods, although Maximum Entropy approaches are also rather successful in specific circumstances.
Bouazizi and Ohtsuki [3] derived the Senta3 approach for classification. While new platforms such as Snapchat focused on video- and multimedia-based communication, Twitter kept some properties that make it a fascinating subject for data mining. Twitter creates tremendous amounts of data every day, and the number of users has increased dramatically. They offer a novel technique for sentiment analysis that categorizes tweets into seven types. The findings are promising: multi-class sentiment analysis on the data set utilized reached 60.2 percent accuracy, although the authors believe a better training set would be preferable. Throughout this study, multi-class sentiment analysis proves capable of high accuracy, although it remains a complicated process; a more intriguing job is quantifying the tweet's emotions.
1 SVM is a supervised machine learning method which is used both for classification and for regression problems.
2 Maximum Entropy is also a supervised machine learning method.
3 SENTA is a user-friendly tool they developed to extract different features from tweets, and texts in general, to perform in a later step the classification of tweets/texts into different classes.
Turney et al. [4] proposed a model called Semantic Orientation, created by studying the usefulness of Twitter for determining sentiment. The limitations of this work include the time required for queries and, for some applications, the level of accuracy that was achieved. The former difficulty will be eliminated by progress in hardware; the latter might be addressed by combining semantic orientation with other features in a supervised classification algorithm. The Supervised Learning model is used for sentiment analysis with ML to get more accurate results.
Celiktug [5] proposed a 3-way sentiment classification model. Tweets contain rich information about people's preferences; users usually discuss with each other and declare their opinions on Twitter. All in all, Twitter sentiment analysis has practical and research value in numerous applications. Besides that, Choi et al. [6] designed knowledge-based coarse-grained +/- effect implicit sentiment analysis based on word sense disambiguation. Opinions are expressed positively or negatively about events, and words carry a mixture of different positive/negative effect labels. This disambiguation is needed to classify the +/- effect of a word, or of information, for sentiment analysis. The knowledge-based coarse-grained +/- effect word sense disambiguation method is used to solve this problem.
Woldemariam et al. [7] derived a model using 600 randomly chosen sample tweets from the Zooniverse as a test dataset, and then evaluated and compared two broad categories of sentiment analysis methods, namely lexicon-based and machine learning. After processing the data, techniques such as SVM, POS tagging, data cleaning, tokenization, ML, and Naive Bayes are used for sentiment analysis. This model is more efficient and provides more accurate results. Because the training dataset is collected from movie reviews whereas the test dataset is obtained from the citizen-science domain, Zooniverse, the algorithm is challenged to recognize some unseen positive/negative phrases specific to the domain.
Mehra et al. [8] proposed a sentiment behavior model using Naive Bayes and Fuzzy Logic. Sentiment analysis is attracting rising attention from both the research and business communities. It is a moderately new area that repeatedly deals with mining user opinion. There are various methods by which data from social networks can be leveraged to give a superior understanding of user opinion; such problems are at the heart of natural language processing (NLP) and data mining research. Twitter is a social microblogging service for social networking that allows users to post real-time messages known as tweets. Tweets have many exclusive characteristics, which introduce new challenges in figuring out how to carry out sentiment analysis on them as compared to other domains. Since many people react on it, Twitter generates massive data every day, which creates many features and models for researchers to work on.
Jing et al. [9] designed a topic-adaptive sentiment analysis model (TaSL). The sentiment analysis issue is solved by considering words for higher-level classification in the topic-adaptive sentiment lexicon. Capturing sentiment opinion words on different topics is the main advantage of TaSL. To model topics and sentiments simultaneously, TaSL relies on preexisting sentiment information from documents and words. This approach generates a topic-adaptive sentiment lexicon to improve lexicon-based sentiment classification performance. The proposed TaSL-type sentiment lexicon consistently beats the state-of-the-art sentiment lexicons in four real-world experiments.
Rotovei [10] designed a multi-agent framework. For any firm that wishes to build, develop, and enhance customer value (and indirectly shareholder value), Customer Relationship Management (CRM) has become the best practice. The goal is to help a prospect become a customer by making appropriate recommendations. Retaining a customer from the first sale is critical for a business, since the first sale creates the first impression of the business.
Li et al. [11] derived a new method using sentiment-specific word embeddings (SSWE) and a weighted text feature model (WTFM). Data generated from Twitter is extensive, but the data is not always correct; it may have the wrong format or spelling mistakes, which makes it very difficult to apply sentiment analysis. Theirs is an innovative Twitter sentiment analysis method based on sentiment-specific word embeddings and a weighted text feature model; comparatively, the WTFM model is simple to design and practical.
Khalid et al. [12] propose StreamSensing for analyzing real-time data in noisy streams. This method includes six stages: tokenization, stop word removal, stemming, filtering, conversion into a Term Document Matrix (TDM), and pattern analysis. The method was evaluated and implemented using Spark, a rapid in-memory processing system, and the results are reported and examined. This paper's findings are theoretical and practical: the StreamSensing technique is introduced theoretically, but it may be used effectively to analyze any real-time text data stream. Furthermore, the proposed architecture is flexible, and its compute dimension can process distributed in-memory data structures created on the fly from the streaming data.
Coming back to the Vietnamese language, Ho et al. [1] evaluated machine learning models (SVM and Random Forest) and deep learning models (CNN and LSTM) on the UIT-VSMEC corpus, which is based on six basic human emotions: enjoyment, sadness, anger, fear, disgust, and surprise. The best overall weighted F1-score was 59.74% on the original UIT-VSMEC corpus, with a CNN using word2vec word embeddings. Building on this, our work builds a model using machine learning and applies a Big Data framework to the project: we evaluate several models in the Spark framework (Logistic Regression, Naive Bayes, Decision Tree, and Random Forest), clean the data when doing sentiment analysis in real-time, and apply the best model to the Vietnamese language. The goal of using Apache Spark's machine learning library (MLlib) is to handle an extraordinary amount of data effectively [13].
Chapter 3
Methodology
3.1 Big Data
Big Data requires a revolutionary step forward from traditional data analysis and is characterized by its three main components: variety, velocity, and volume [14], as shown in Figure 3.1.
Figure 3.1 The three Vs of big data
Volume - The size of the data is very large, in terabytes and petabytes [14]. The vastness and growth of data far outstrip conventional data storage and processing approaches. For instance, a company's marketing department may be tasked with analyzing terabytes of consumer communications in a single day to ascertain consumers' responses to new items.
Velocity - Data should be used as it streams into the enterprise in order to maximize its value to the business; the role of time is very critical here [14]. A few seconds might seem like an eternity in a time-sensitive activity such as financial fraud investigation: for fraud detection and prevention, millions of bank accounts and transactions must be analyzed in real-time.
Variety - Big Data extends beyond structured data to include unstructured data of all varieties: text, audio, video, posts, log files, etc. [14].
Given the intensity of this information, another component is verifying the data flow. It is not easy to manage vast data, hence data security must be supplied. In addition, after big data is created and processed, it should provide a positive value for the firm.
3.2 Spark framework
3.2.1 Core technologies of Spark and its components
Spark is a broad distributed computing platform based on Hadoop MapReduce algorithms. It absorbs the benefits of Hadoop MapReduce; but unlike MapReduce, the intermediate and output results of Spark tasks may be held in memory, which is called Memory Computing. Memory Computing enhances the efficiency of data computing, so Spark is better suited for iterative applications, such as Data Mining and Machine Learning.
Spark provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL, MLlib for machine learning, GraphX for graph processing, and Spark Streaming [15]. The ecosystem is described in Figure 3.2.
Figure 3.2 The ecosystem of Spark
Spark SQL supports the SQL implementation in Spark. It has seen great development in terms of data compatibility, performance optimization, and component extension [15].
Spark Streaming is a stream computing framework based on Spark; it provides a rich API and integrates streaming, batch, and interactive query applications [15].
GraphX is a parallel computation API used for processing Spark charts and graphs. GraphX was developed based on Bagel and brings great improvements in performance and in reducing memory overhead [15].
MLlib (Machine Learning library) is a scalable machine learning library for Spark; it includes relevant tests and data generators. The performance of its machine learning algorithms can be up to 100 times better than MapReduce. MLlib supports the main machine learning algorithms, such as classification, regression, clustering, collaborative filtering, and dimensionality reduction, and it supports sparse matrices [15].
In general, the key technology and foundational architecture of Spark is the RDD, and Spark SQL, MLlib, GraphX, and Spark Streaming are the fundamental members of the Spark ecosystem.
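As a minimal sketch of how this ecosystem is entered from Python (the application name and file path below are illustrative, not from the thesis), a single SparkSession gives access to Spark SQL, MLlib, and Spark Streaming alike:

from pyspark.sql import SparkSession

# Single entry point to the Spark ecosystem
spark = SparkSession.builder \
    .appName("vietnamese-tweet-sentiment") \
    .getOrCreate()

# Load a labeled dataset into a DataFrame (path and schema are illustrative)
df = spark.read.csv("tweets_labeled.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)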
A survey from the Big Data company Syncsort [16] discovered that the Big Data processing trend in 2016 moved towards Spark, at 70%, rather than its primary competitor MapReduce, at 55%. Moreover, the poll reveals that Apache Spark is presently the most active project inside the Big Data sector. Therefore, MapReduce, the default processing engine for Apache Hadoop, should be replaced with its successor, Apache Spark. Furthermore, businesses such as Cloudera, Hortonworks, and MapR, the top three providers for Big Data, offer Spark as the default processing framework over Hadoop.
3.2.2 Big Data Analysis Using Spark
Spark's different machine learning techniques have been researched in depth with respect to big data. For the goals of this study, we utilize the MLlib package of Spark to develop the Logistic Regression, Decision Tree, Random Forest, and Naive Bayes methods.
In this project, the data processing happens using the Twitter streaming API and Apache Spark, as shown in Figure 3.3 below.
Figure 3.3 Apache Spark data processing (the input data stream from the Twitter Streaming API is divided into batches of input data, which the Spark engine turns into batches of processed data)
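A minimal sketch of this batching pattern with Spark Structured Streaming, assuming tweets arrive as lines of text on a local socket (the source, host, and port are illustrative; the thesis reads crawled tweets saved as JSON):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tweet-stream").getOrCreate()

# Each line of the stream is assumed to hold one tweet
lines = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# The engine splits the stream into micro-batches; here every
# batch of processed data is simply written to the console
query = lines.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()
query.awaitTermination()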
3.3 Machine Learning
3.3.1 Machine learning process
As illustrated in Figure 3.4, the machine learning process covers several steps in practice.
Figure 3.4 A typical process of machine learning [17]
Data Acquisition: This is the initial phase, when we gather raw data for the model. For example, if we want to construct a model that forecasts the arrival time of an airline, we gather historical data on previous performance and known properties of the airline.
Data Cleaning: The raw data is then cleaned to provide an organized and tidy dataset for good analysis and modelling.
Data Split: The cleansed data is then separated into two datasets (train and test). The training set is the dataset used to develop the model, while the testing set is used to evaluate the model's performance.
Train ML Model and Test Model: The train dataset is further separated into another training set and a validation set. The validation dataset is used to verify the model developed from the train dataset before testing it with the original test dataset. If the model performance is not as intended, it may be further improved by adjusting parameters and feature engineering until it achieves its optimal performance.
Evaluate Model: After the model is tested sufficiently using the validation set, it is assessed for accuracy and performance using the original test dataset.
Deploy Model: When the result is as expected, the final model with the best performance is deployed and implemented in real-time.
To evaluate the ability of the Apache Spark MLlib library in analyzing big data sets, we focused on a set of supervised (classification) methods, including Logistic Regression, Decision Tree, Naive Bayes, and Random Forest.
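A minimal sketch of the split step on a Spark DataFrame (the 80/20 and 90/10 ratios and the df variable are illustrative; Chapter 4 documents the actual split used):

# Randomly split a DataFrame into train and test sets
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

# A slice of the training set can further serve as a validation set
train_df, val_df = train_df.randomSplit([0.9, 0.1], seed=42)
print(train_df.count(), val_df.count(), test_df.count())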
3.3.2 Machine learning algorithms
Figure 3.5 Summary of machine learning algorithms [18]
As Figure 3.5 shows, there are many machine learning algorithms, but in this work the dataset is multi-class, so the Naive Bayes, Logistic Regression, Random Forest, and Decision Tree algorithms were applied to accomplish the sentiment analysis.
Logistic regression: a regression model where the dependent variable can take one out of a fixed number of values. It utilizes a logistic function to measure the relationship between the instance class and the features extracted from the input. Although widely used for binary classification, it can be extended to solve multiclass classification problems [19]. The multi-class Logistic Regression method is proposed to identify all kinds of emotion; in this work, we use Logistic Regression in its multi-class form.
Naive Bayes: a basic multiclass classification technique based on the application of Bayes' theorem. Each instance of the problem is represented as a feature vector, and it is assumed that the value of each feature is independent of the value of every other feature. One of the benefits of this algorithm is that it can be trained highly efficiently, since it takes just a single pass over the training data. Initially, the conditional probability distribution of each feature given the class is calculated, and then Bayes' theorem is utilized to predict the class label of an instance [19].
Decision tree: a classification technique based on a tree structure whose leaves indicate class labels and whose branches represent the combinations of features that result in those classes. Essentially, it does a recursive binary split of the feature space. Each step is picked greedily, striving for the optimum decision for the present step by maximizing the information gain.
Random forests: very flexible and powerful ensemble classifiers based on decision trees, first developed by Breiman [20]. The method builds random binary trees over a subset of the observations through a bootstrapping technique: a random pick of the training data is chosen from the original dataset and used to form the model, and the data which is not included is characterized as out of bag.
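As a sketch, the four MLlib classifiers compared in this thesis can be instantiated as follows (the column names match the pipeline in Section 3.8; the tree count shown is illustrative, not the thesis's setting):

from pyspark.ml.classification import (LogisticRegression, NaiveBayes,
                                       DecisionTreeClassifier,
                                       RandomForestClassifier)

# All four consume a vector column of features and a numeric label
lr = LogisticRegression(featuresCol="features", labelCol="label")
nb = NaiveBayes(featuresCol="features", labelCol="label")
dt = DecisionTreeClassifier(featuresCol="features", labelCol="label")
rf = RandomForestClassifier(featuresCol="features", labelCol="label",
                            numTrees=100)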
3.4 Twitter
One of the most prominent social networking websites, Twitter, came into being on the 21st of March, 2006. On this website, visitors may read as well as send tweets. A tweet is just a post on Twitter within a limited character block, and this limit constrains the content that can be assessed. A tweet is not simply a plain, instant communication; instead, it combines content information and meta information linked with the tweet. These features are the highlights of tweets [21].
They convey the core of the tweet, or what the tweet is about. The metadata may be utilized to find the region of the tweet. The metadata of tweets consists of a few entities and places. These entities include client-specific hashtags, URLs, media, and users: the Twitter user ID, 'RT' signifying a retweet, '@' followed by a user identification, and '#' followed by a phrase describing a hashtag.
3.4.1 Twitter API
The Streaming APIs allow push delivery of tweets and other events, enabling real-time or low-latency applications. The Twitter API is a widely recognized source of big data and is utilized internationally in many applications with many aims. However, there are specific restrictions in the free Twitter API that should be considered when examining the data.
Twitter provides two APIs: REST and Streaming. The REST API consists of two APIs: one just called the REST API, and another called the Search API (the difference is entirely due to their history of development). The differences between the Streaming API and the REST APIs are that the Streaming API supports long-lived connections and provides data in almost real-time, while the REST APIs support short-lived connections and are rate-limited (one can download a certain amount of data per day but not more).
Both the Streaming API and the Search REST API have a language parameter that can be set to a language code, e.g., 'vi', to collect Vietnamese data. But the collected data still contained tweets in other languages, making the data very noisy.
3.4.2 Sentiment Analysis of Twitter Data
The primary goal of Twitter sentiment analysis is to categorize numerous tweets into distinct feeling categories. The various approaches to this topic on Twitter arise from training a model and assessing its effectiveness.
It is challenging: tweets are not as straightforward as they look to be. The reasons for this are as follows [21] [22]:
- Varying language origins: Different people from different cultures tweet using some words of their cultural origin. These words may be promotional, slang, and more.
- Limited character block size: With just 280 characters in scope, the amount of data content that can be recognized is minimal.
- Use of hashtags: Twitter provides hashtags to mention emotions and events, which require separate processing from the actual word-based tweet.
Sentiment analysis is mainly composed of the steps shown in Figure 3.6.
Accept input: Take any of the tweets. This tweet may be a blend of numerous moods, tags, and hashtags.
Figure 3.6 Sentiment analysis steps (raw data is crawled and cleaned; a training set feeds the classification model, which then labels new sentences)
We crawl tweets from Twitter using tweepy and snscrape, together with data from another social network. After cleaning the data (preprocessing the text), we save it in two forms: (1) a CSV file for labeling and for training the model, and (2) a JSON file used for real-time sentiment analysis. After that, the best model with the highest accuracy becomes our classification model. As a result, we can display a sentence along with its sentiment.
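A minimal sketch of this save step (the file names are illustrative; the actual cleaning rules applied beforehand are listed in Table 3.1):

import pandas as pd

# tweets_list2 holds the crawled tweet texts (see Section 3.5)
df = pd.DataFrame(tweets_list2, columns=["Sentence"])

# (1) CSV for manual labeling and model training
df.to_csv("tweets_for_labeling.csv", index=False)

# (2) JSON lines for the real-time sentiment analysis pipeline
df.to_json("tweets_stream.json", orient="records", lines=True)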
3.5 Crawl data
Snscrape is used to crawl tweets from Twitter in this work. Using snscrape, a Python package, it is possible to scrape tweets from Twitter's API without being restricted or having to provide a limit on the number of requests. This information serves as our dataset. Snscrape's tweet object already has a lot of information available. Figure 3.7 shows the data you can access with snscrape; the attributes in the figure map exactly to how they are stored inside the actual object [23].
Trang 27Attributes Available Through snscrape Tweet Object:
Attribute description left blank if purpose is unknown
Url: Permalink pointing to tweet location date: Date tweet was created
content: Text content of tweet
renderedContent: Appears to also be text content of tweet
id Id of tweet
User: User object containing the following data: username, displayname, id, description, descriptionUrs, verified, created, followersCount, friendsCount, statusesCount, favourltesCount, listedCount, medlaCount, location, protected, linkUri, profllelmageU, profleBannerUrl
ouinks tcooutlinks replyCount: Count of replies retweetCount: Count of retweets IKeCount: Count of IiKes
quoteCount: Count of users that quoted the tweet and replied
conversationid: Appears to be the same as tweet id
lang: Machine generated, assumed language of tweet source: Where tweet was posted from, ex: IPhone, Android, etc.
media: Media object, containing previewUr, full, and type retweetedTweet: Ifis a retweet, id of original tweet
quotedTweet: Ifis a quoted tweet, id of original tweet
‘mentionedUsers: User objects of any mentioned user in tweet
Figure 3.7 Attributes available in snscrape tweet object
We work on the Vietnamese language in our project, so we want to crawl Vietnamese tweets. The code for crawling is shown below.
import snscrape.modules.twitter as sntwitter

tweets_list2 = []
# Query Vietnamese tweets in the date range, skipping retweets, links, and replies
query = ('since:2020-11-01 until:2021-11-08 lang:vi '
         '-filter:retweets -filter:links -filter:replies')
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    if i > 5000:  # stop condition; the exact limit is reconstructed from the garbled source
        break
    tweets_list2.append([tweet.content])
Figure 3.8 Code for crawling Vietnamese tweets
After running this code, we get the tweets in the data frame shown in Figure 3.9. We can see that there is a lot of noise in the dataset. That is why, before doing sentiment analysis, we clean the data using the steps discussed in the section below.
Figure 3.9 Dataframe of tweets (the raw rows mix retweet markers, spam, hashtags, and other noise)
3.6 Data preprocessing
Before running the model, we perform data preprocessing in the following steps.
Table 3.1 Steps for data preprocessing

Step | Handle data | Example
1 | Remove @mentions from the text | "Ban nay dep that @HaLe" -> "Ban nay dep that"
2 | Remove hashtags, tags, and HTML from the text | "Chi iu có lên #ARMY #BTS" -> "Chi iu có lên ARMY BTS"
3 | Remove URLs from the text | "Theo dõi fb minh nha https://www.fb.com/123" -> "Theo dõi fb minh nha"
4 | Remove mail addresses |
5 | Emoji to text | ":)" -> "smile"
6 | Remove double spaces |
7 | Remove spaces at the beginning and end of the text |
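A minimal sketch of these cleaning steps in Python (the regular expressions and the emoji package are illustrative choices; the thesis does not prescribe an exact implementation):

import re
import emoji  # third-party package, an illustrative choice for step 5

def clean_tweet(text: str) -> str:
    text = re.sub(r"(?<!\S)@\w+", "", text)            # 1. remove @mentions (not emails)
    text = re.sub(r"<[^>]+>", "", text)                # 2. remove HTML tags
    text = re.sub(r"#(\w+)", r"\1", text)              # 2. keep hashtag word, drop '#'
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # 3. remove URLs
    text = re.sub(r"\S+@\S+\.\S+", "", text)           # 4. remove mail addresses
    text = emoji.demojize(text)                        # 5. emoji -> text
    text = re.sub(r"\s{2,}", " ", text)                # 6. remove double spaces
    return text.strip()                                # 7. trim leading/trailing spaces

print(clean_tweet("Ban nay dep that @HaLe 😊"))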
3.7 Datasets
Name: UIT-VSMEC corpus
Information: 6,927 human-annotated sentences, each with one of the seven emotion labels. Statistics of the emotion labels of the dataset are presented in Table 3.2.
Table 3.2 Statistics of emotion labels of the UIT-VSMEC corpus [1]

Emotion | Sentences | Percentage (%)
Enjoyment | 1,965 | 28.36
Disgust | 1,338 | 19.31
Sadness | 1,149 | 16.59
Anger | 480 | 6.93
Fear | 395 | 5.70
Surprise | 309 | 4.46
Other | 1,291 | 18.64
Total | 6,927 | 100
Based on Ekman's instructions on basic human emotions [24], we build annotation guidelines for Vietnamese text with seven emotion labels, described as follows [1].
Enjoyment: For comments with states that are triggered by feeling connection or sensory pleasure. It contains both peace and ecstasy. The intensity of these states varies from the enjoyment of helping others, a warm uplifting feeling that people experience when they see kindness and compassion, and an experience of ease and contentment, or even the enjoyment of the misfortunes of another person, to the joyful pride in accomplishments or the experience of something that is very beautiful and amazing. For example, the emotion of the Vietnamese sentence "coi BST dep lim" ("this collection looks so beautiful") is Enjoyment.
Anger: contains both annoyance and fury. The intensity of these states varies: we can feel mild or strong annoyance, but we can only feel intense fury. All states of anger are triggered by a feeling of being blocked in our progress. Anger varies from frustration, which is a response to repeated failures to overcome an obstacle, and exasperation, anger caused by a strong nuisance, through argumentativeness, to bitterness, anger after unfair treatment, and vengefulness. For example, the emotion of the Vietnamese sentence "Dem tôi ngủ d nồi vi run các bạn a dhs lúc đồng ý thì mạnh mom thé h hén ved dm dm" is Anger.
Fear: contains both anxiety and terror. The intensity of these states varies: we can feel mild or strong anxiety, but we can only feel intense terror. All states of fear are triggered by feeling a threat of harm. The intensity of these states varies from trepidation, an anticipation of the possibility of danger, through nervousness and dread, to desperation, a response to the inability to reduce danger, and panic and horror, a mixture of fear, disgust, and shock. For example, the emotion of the Vietnamese sentence "muốn di we nhưng sợ ma, đời ác vừa" ("I want to go to the bathroom but I'm afraid of ghosts, life is so cruel") is Fear.
Disgust: contains both dislike and loathing. The intensity of these states varies: we can feel mild or strong dislike, but we can only feel intense loathing. All states of disgust are triggered by the feeling that something is toxic. Their strength ranges from an inclination to avoid anything nasty, or aversion, the response to a foul taste, smell, item, or thought, through repugnance, to revulsion, which is a blend of disgust and hate, or abhorrence, a mixture of strong disgust and hatred. For example, the emotion of the Vietnamese sentence "Chết mẹ cho rồi" is Disgust.
Sadness: contains both disappointment and despair. The intensity of these states varies: we can feel mild or strong disappointment, but we can only feel intense despair. All states of sadness are triggered by a feeling of loss. The intensity of its states varies from discouragement, distraughtness, helplessness, and hopelessness to strong suffering, a feeling of distress and sadness often caused by a loss, and sorrow and anguish. For example, the emotion of the Vietnamese sentence "Chẳng ai quan tâm ca" ("nobody cares at all") is Sadness.
Surprise: For comments that express the feeling caused by unexpected events, something hard to believe that may shock you. This is the shortest emotion of all the emotions; it only takes a few seconds. It passes when we understand what is happening, and it may then become fear, anger, relief, or nothing, depending on the event that surprised us. For example, the emotion of the Vietnamese sentence "Tôi đứng hình khi thay quà tặng ấy!" ("I froze when I saw that gift!") is Surprise.
Other: For comments that show none of the emotions above, or comments that do not contain any emotion. For example, the emotion of the Vietnamese sentence "CTU..." is Other.
3.7.3 Our dataset
Table 3.3 Statistics of emotion labels of our dataset
Emotion | Sentences | Percentage (%)
Enjoyment | 763 | 22.40
Disgust | 154 | 4.52
Sadness | 498 | 14.62
Anger | 141 | 4.14
Fear | 88 | 2.58
Surprise | 121 | 3.55
Other | 1,641 | 48.18
Total | 3,406 | 100
3.7.4 Mix dataset
We merge our dataset and the UIT-VSMEC corpus to compare models, evaluating the accuracy of each dataset on each model and giving the best results.
After combining the two datasets, we get a dataset in which more words with emotional significance can be recognized, and the data available for training also increases.
Name: Mix_data
Information: 10,782 Vietnamese sentences, each with one of the seven emotion labels. Statistics of the emotion labels of the dataset are presented in Table 3.4.
Table 3.4 Statistics of emotion labels of the mix dataset

Emotion | Sentences | Percentage (%)
3.8 Spark NLP
Natural language processing (NLP) is a vital component in many data science systems that must interpret or reason about text. Common use cases include question answering, paraphrasing or summarizing, sentiment analysis, natural language BI, language modeling, and disambiguation. Nevertheless, NLP is always just a part of a bigger data processing pipeline, and due to the nontrivial steps involved in this process, there is a growing need for an all-in-one solution to ease the burden of text preprocessing at large scale and to connect the dots between the various steps of solving a data science problem with NLP. A decent NLP library should be able to appropriately turn free text into structured features and enable users to train their own NLP models that are readily fed into downstream machine learning (ML) or deep learning (DL) pipelines with no trouble [25].
Spark NLP is developed to be a single unified solution for all NLP tasks. It is the only library that can scale up for training and inference in any Spark cluster, take advantage of transfer learning, implement the latest and greatest algorithms and models in NLP research, and deliver mission-critical, enterprise-grade solutions at the same time. It is an open-source natural language processing library, developed on top of Apache Spark and Spark ML. It offers a simple API to interact with ML pipelines, and it is commercially sponsored by John Snow Labs Inc., an award-winning healthcare AI and NLP firm located in the USA.
Figure 3.10 shows our pipeline. We add one of the following models at the end of the pipeline: LogisticRegression(), NaiveBayes(), DecisionTreeClassifier(), or RandomForestClassifier().
When we call fit() on the pipeline with a Spark data frame, its Sentence column is fed into the DocumentAssembler() transformer, and a new column, document, is formed as the initial entry point into Spark NLP for every Spark data frame.
Then, the "document" column is put into Tokenizer(); each sentence is tokenized, and a new column, "token", is produced.
Then, the tokens are normalized (basic text cleaning), and StopWordsCleaner() takes the output of Normalizer() and drops all the stop words from the input sequences.
Stemmer() returns hard stems of words, with the objective of retrieving the meaningful part of each word.
Finisher() converts annotation results into a format that is easier to use; it is useful for extracting the results from Spark NLP pipelines, and it outputs the annotation values.
Figure 3.10 Pipeline for Spark NLP
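Reconstructed from the description above and from the legible fragments of Figure 3.10 (the setOutputAsArray, setCleanAnnotations, IDF minDocFreq = 5, and StringIndexer settings appear in the figure; the remaining stage wiring is a sketch, not the thesis's exact code):

from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, Normalizer, StopWordsCleaner, Stemmer
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, IDF, StringIndexer
from pyspark.ml.classification import LogisticRegression

document_assembler = DocumentAssembler() \
    .setInputCol("Sentence").setOutputCol("document")
tokenizer = Tokenizer() \
    .setInputCols(["document"]).setOutputCol("token")
normalizer = Normalizer() \
    .setInputCols(["token"]).setOutputCol("normalized")
stopwords_cleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]).setOutputCol("cleaned")
stemmer = Stemmer() \
    .setInputCols(["cleaned"]).setOutputCol("stem")
finisher = Finisher() \
    .setInputCols(["stem"]) \
    .setOutputAsArray(True) \
    .setCleanAnnotations(False)

# Term counts, then TF-IDF weighting; the vectorizer producing
# "raw_feature" is cut off in the figure, so CountVectorizer is assumed
cv = CountVectorizer(inputCol="finished_stem", outputCol="raw_feature")
idf = IDF(inputCol="raw_feature", outputCol="features", minDocFreq=5)
label_stringIdx = StringIndexer(inputCol="Emotion", outputCol="label")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[document_assembler, tokenizer, normalizer,
                            stopwords_cleaner, stemmer, finisher,
                            cv, idf, label_stringIdx, lr])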
3.9 TF-IDF
TF-IDF stands for "Term Frequency - Inverse Document Frequency". It is a technique to quantify the words in a set of documents: we generally compute a score for each word to signify its importance in the document and the corpus. This method is widely used in Information Retrieval and Text Mining [26].
TF-IDF primarily uses word frequency to estimate the significance of a word, but its underlying assumption, that high-frequency words in a text are not significant while low-frequency words are critical, evidently lacks theoretical support. A frequent word is not necessarily a meaningless term, and a low-frequency word does not always have strong discriminative ability. Furthermore, the algorithm does not adequately represent the impact of word position or part of speech.
Term Frequency: This measures the frequency of a word in a document. It highly depends on the length of the document and the generality of the word. For this exact reason, we perform normalization on the frequency value: we divide the word's frequency by the maximum word frequency in the document.
TF is individual to each document and word; hence we can formulate TF as follows:
tf(t, d) = \frac{f(t, d)}{\max\{f(w, d) : w \in d\}}
Figure 3.11 Calculate TF
Document Frequency (DF): This assesses the significance of a term across the complete set of documents in the corpus. It is highly similar to TF, but the main difference is that TF is the frequency counter for a term t in a document d, while DF is the count of documents in the set N in which the term t is present. In other words, DF is the number of documents containing the term. We count one occurrence if the word is contained in the document at least once; we do not need to know the number of times the term is present [26].
Inverse Document Frequency (IDF): This is the inverse of the document frequency, which measures the informativeness of term t. When we calculate IDF, it will be very low for the most frequently occurring words, such as stop words (because they are present in almost all documents, and N/df gives a very low value to such words). This finally gives what we want: a relative weighting [26].
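The standard formulation that this description corresponds to (a common textbook form, not copied verbatim from the thesis) is:

idf(t) = \log\frac{N}{df(t)}, \qquad tfidf(t, d) = tf(t, d) \cdot idf(t)

where N is the number of documents in the corpus and df(t) is the number of documents that contain term t.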