
Information Systems graduation thesis: Real-time customer reviews analysis using machine learning on big data framework


DOCUMENT INFORMATION

Title: Real-time Customer Reviews Analysis Using Machine Learning on Big Data Framework
Author: Nguyen Hoang Nhat
Advisor: Hop Do Trong, Ph.D
University: University of Information Technology
Major: Information Systems
Type: Thesis
Year: 2021
City: Ho Chi Minh City
Pages: 72
Size: 17.55 MB

Content


VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS

NGUYEN HOANG NHAT

REAL-TIME CUSTOMER REVIEWS ANALYSIS

USING MACHINE LEARNING

ON BIG DATA FRAMEWORK

BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS

HO CHI MINH CITY, 2021


VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS

NGUYEN HOANG NHAT - 17520851

REAL-TIME CUSTOMER REVIEWS ANALYSIS

USING MACHINE LEARNING

ON BIG DATA FRAMEWORK

BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS

THESIS ADVISOR

HOP DO TRONG, Ph.D

HO CHI MINH CITY, 2021


ASSESSMENT COMMITTEE

The Assessment Committee was established under a Decision of the Rector of the University of Information Technology, Ho Chi Minh City, dated January 18.

1. Tho Quan Thanh, Ph.D - Chairman

2. Thanh Ngo Duc, Ph.D - Secretary

3. Nhan Cao Thi, Ph.D - Commissary


ACKNOWLEDGMENTS

First of all, I thank God for the support and for providing me with the patience and guidance I used in making this work.

I would like to thank my supervisor, Hop Do Trong, Ph.D, for his help, continuous encouragement, productive discussion, and valuable suggestions and comments throughout the research and thesis work.

It has been a fantastic experience studying at the University of Information Technology. I was motivated and excited about studying, and it ultimately fulfilled my expectations.

Last but not least, I owe more than thanks to my family members. Thanks especially to my brother for his support and encouragement throughout my life.


CONTENTS

1.3 Object and range of study
Chapter 2 Related Work
Chapter 3 Methodology
3.1 Big Data
3.2 Spark framework
3.2.1 Core technologies of Spark and its components
3.2.2 Big Data Analysis Using Spark
3.3 Machine Learning
3.3.1 Machine learning process
3.3.2 Machine learning algorithms
3.4 Twitter
3.4.1 Twitter API
3.4.2 Sentiment Analysis of Twitter Data
3.5 Crawl data
3.6 Data preprocessing
3.7.3 Our dataset
3.7.4 Mix dataset
3.8 Spark NLP
3.9 TF-IDF
Chapter 4 Experimental Results
4.1 Prepare data for the real-time problem
4.2 Dataset for train-test
4.3 Train model
4.3.1 TF-IDF vectorizer + Logistic Regression Classifier
4.3.2 TF-IDF vectorizer + NaiveBayes Classifier
4.3.3 TF-IDF vectorizer + Decision Tree Classifier
4.3.4 TF-IDF vectorizer + Random Forest Classifier
4.4 Compare models
4.5 Real-time sentiment analysis
4.6 Compare with UIT-VSMEC
4.7 Challenges
Chapter 5 Summary
Chapter 6 Future Work
REFERENCES


LIST OF FIGURES

Figure 3.1 The three Vs of big data
Figure 3.2 The ecosystem of Spark
Figure 3.3 Apache Spark data processing
Figure 3.4 A typical process of machine learning [17]
Figure 3.5 Summary of machine learning algorithms [18]
Figure 3.6 Sentiment analysis steps
Figure 3.7 Attributes available in snscrape tweet object
Figure 3.8 Code for crawling Vietnamese tweets
Figure 3.9 Dataframe of tweets
Figure 3.10 Pipeline for Spark NLP
Figure 3.11 Calculating TF
Figure 4.1 Format of test data in real time
Figure 4.2 Separating the data set
Figure 4.3 Cleaning data
Figure 4.4 Libraries for training the model
Figure 4.5 Dictionary of emotions
Figure 4.6 TF-IDF vectorizer + Logistic Regression Classifier pipeline
Figure 4.7 TF-IDF vectorizer + NaiveBayes Classifier pipeline
Figure 4.8 TF-IDF vectorizer + Decision Tree Classifier pipeline
Figure 4.9 TF-IDF vectorizer + Random Forest Classifier pipeline
Figure 4.10 Steps for real-time sentiment analysis
Figure 4.11 Result of real-time sentiment prediction of tweets
Figure 4.12 Count of emotions in real time
Figure 4.13 Visualizing the number of emotions in real time
Figure 4.14 Calculating average latency in real time
Figure 4.15 Percentage of each emotion
Figure 4.16 Visualizing the percentage of each emotion
Figure 4.17 Count of sentences already analyzed


LIST OF TABLES

Table 3.1 Steps for data preprocessing
Table 3.2 Statistics of emotion labels of the UIT-VSMEC corpus [1]
Table 3.3 Statistics of emotion labels of our dataset
Table 3.4 Statistics of emotion labels of the mix dataset
Table 4.1 TF-IDF vectorizer + Logistic Regression Classifier with my data train + my data test
Table 4.2 TF-IDF vectorizer + Logistic Regression Classifier with my data train + UIT data test
Table 4.3 TF-IDF vectorizer + Logistic Regression Classifier with my data train + mix data test
Table 4.4 Compare accuracy using TF-IDF vectorizer + Logistic Regression Classifier with my data used for training
Table 4.5 TF-IDF vectorizer + Logistic Regression Classifier with UIT data train + my data test
Table 4.6 TF-IDF vectorizer + Logistic Regression Classifier with UIT data train + UIT data test
Table 4.7 TF-IDF vectorizer + Logistic Regression Classifier with UIT data train + mix data test
Table 4.8 Compare accuracy using TF-IDF vectorizer + Logistic Regression Classifier with UIT data used for training
Table 4.9 TF-IDF vectorizer + Logistic Regression Classifier with mix data train + my data test
Table 4.10 TF-IDF vectorizer + Logistic Regression Classifier with mix data train + UIT data test
Table 4.11 TF-IDF vectorizer + Logistic Regression Classifier with mix data train + mix data test
Table 4.12 Compare accuracy using TF-IDF vectorizer + Logistic Regression Classifier with mix data used for training
Table 4.13 TF-IDF vectorizer + NaiveBayes Classifier with my data train + my data test
Table 4.14 TF-IDF vectorizer + NaiveBayes Classifier with my data train + UIT data test
Table 4.15 TF-IDF vectorizer + NaiveBayes Classifier with my data train + mix data test
Table 4.16 TF-IDF vectorizer + NaiveBayes Classifier with UIT data train + my data test
Table 4.17 TF-IDF vectorizer + NaiveBayes Classifier with UIT data train + UIT data test
Table 4.18 TF-IDF vectorizer + NaiveBayes Classifier with UIT data train + mix data test
Table 4.19 TF-IDF vectorizer + NaiveBayes Classifier with mix data train + my data test
Table 4.20 TF-IDF vectorizer + NaiveBayes Classifier with mix data train + UIT data test
Table 4.21 TF-IDF vectorizer + NaiveBayes Classifier with mix data train + mix data test
Table 4.22 Compare accuracy using TF-IDF vectorizer + NaiveBayes Classifier
Table 4.23 TF-IDF vectorizer + Decision Tree Classifier with my data train + my data test
Table 4.24 TF-IDF vectorizer + Decision Tree Classifier with my data train + UIT data test
Table 4.25 TF-IDF vectorizer + Decision Tree Classifier with my data train + mix data test
Table 4.26 TF-IDF vectorizer + Decision Tree Classifier with UIT data train + my data test
Table 4.27 TF-IDF vectorizer + Decision Tree Classifier with UIT data train + UIT data test
Table 4.28 TF-IDF vectorizer + Decision Tree Classifier with UIT data train + mix data test
Table 4.29 TF-IDF vectorizer + Decision Tree Classifier with mix data train + my data test
Table 4.30 TF-IDF vectorizer + Decision Tree Classifier with mix data train + UIT data test
Table 4.31 TF-IDF vectorizer + Decision Tree Classifier with mix data train + mix data test
Table 4.32 Compare accuracy using TF-IDF vectorizer + Decision Tree Classifier
Table 4.33 TF-IDF vectorizer + Random Forest Classifier with my data train + my data test
Table 4.34 TF-IDF vectorizer + Random Forest Classifier with my data train + UIT data test
Table 4.35 TF-IDF vectorizer + Random Forest Classifier with my data train + mix data test
Table 4.36 TF-IDF vectorizer + Random Forest Classifier with UIT data train + my data test


ABSTRACT

Firms have many ways to monitor their brand reputations and assess their performance and public feelings about their goods, but nowadays, automated sentiment analysis of customer reviews is the easiest.

In this study, we present our efforts to develop a machine learning-based system for sentiment analysis of tweets on Twitter and to make it a real-time analyzer. We use Apache Spark, a big data framework.

Our system outperforms Emotion Recognition for Vietnamese Social Media Text, 2019 (UIT-VSMEC) [1]. We build a dataset using a tweet crawler and label each sentence with the emotion that best reflects its sentiment: sadness, enjoyment, anger, disgust, fear, or surprise. We also merge it with the UIT-VSMEC project's dataset; the result is a Vietnamese-language dataset. The system takes an arbitrary tweet as input and assigns it to the class that best reflects its sentiment. The significant result is that, of the classification methods studied, the Logistic Regression Classifier had the best classification accuracy for this domain.

After analysis, we create a report with the count of each emotion, the total number of tweets, and the percentage of each emotion. Our system can analyze the sentiment of millions of tweets in pseudo real-time.


Chapter 1

Problem Statement

1.1 Rationale

With the development of technology, social networks have become a "gold mine" for collecting customer reviews. Historically, businesses gathered feedback and insight into consumers' feelings about their goods via interviews, questionnaires, and surveys. These traditional methods were often extraordinarily time-consuming, expensive, and manual.

Tweets are often used to express opinions on a wide range of topics. These opinions have an important effect on a variety of business decisions, as well as on political opinions toward a certain candidate. To benefit from them, a business should develop a sentiment analysis model that can extract customer reviews and detect the emotions that consumers express. From there, it can use this information to plan the consolidation and development of goods and services that are more appealing to customers.

Consumers can use sentiment analysis to research products or services before making a purchase, e.g., a Kindle.

Marketers can use it to research public opinion of their company and products, or to analyze customer satisfaction, e.g., election polls.

Organizations can also use it to gather critical feedback about problems in newly released products, e.g., brand management (Nike, Adidas).

1.2 Aims

We want to extract attributes from tweets and analyze their emotions, which might be characterized as sadness, enjoyment, anger, disgust, fear, or surprise. The UIT-VSMEC project's dataset is used in conjunction with our dataset in order to perform emotion classification using the Spark framework. Following that, a real-time sentiment analysis is performed. When the models are applied to the Vietnamese language, we can see how well they work and how accurate they are.


1.3 Object and range of study

We crawl tweets on Twitter and categorize the data so that it can be used as part of our collection. To tackle this challenge, we make use of our dataset, which contains more than 4,000 tweets, as well as the UIT-VSMEC project's dataset. The dataset represents the Vietnamese language. We used classic machine learning models in Spark (Logistic Regression, Naive Bayes, Decision Tree, and Random Forest) to compare the performance and accuracy of the models, and we used Vietnamese user reviews to determine the most appropriate model for sentiment analysis. The input data is massive, so we need a big data framework such as Spark for this processing. Sometimes we need to analyze the sentiment of past sentences. For the most part, however, we apply sentiment analysis in real time, because continuously updated results are valuable when we want to survey opinions of new services or goods in order to make a decision.


Chapter 2

Related work

In 2020, Mandloi and Patel [2] showed that the evolution of social media platforms drew millions of users to sites like Twitter, where users may write 280-character tweets. Tweets' low character count facilitates sentiment analysis, and an average of 550 million tweets are posted daily, so sentiment analysis of Twitter data becomes a proxy for societal attitudes. Their research used the Naive Bayes Classifier, the Support Vector Machine (SVM)¹, and the Maximum Entropy Method². As an outcome, Mandloi and Patel developed a sentiment analysis method and used it in real-time applications. This algorithm is suitable for use in political or other types of review systems. The research demonstrates that machine learning techniques such as Naive Bayes have the best accuracy and may be considered baseline learning methods, although Maximum Entropy approaches are also rather successful in specific circumstances.

Bouazizi and Ohtsuki [3] derived the Senta³ approach for classification. Although new platforms such as Snapchat focus on video- and multimedia-based communication, Twitter keeps some properties that make it a fascinating subject of data mining. Twitter creates tremendous amounts of data every day, and its number of users has increased dramatically. They offer a novel technique for sentiment analysis that categorizes tweets into seven types. The findings are promising: the data set utilized for multi-class sentiment analysis reached 60.2 percent accuracy, though they think a better training set would be preferable. Throughout this study, multi-class sentiment analysis proves capable of high accuracy, although it remains a complicated process. A more intriguing job is quantifying the tweet's emotions.

¹ SVM is a supervised machine learning method which is used for both classification and regression problems.

² Maximum Entropy is also a supervised machine learning method.

³ Senta is a user-friendly tool they developed to extract different features from tweets, and texts in general, to perform in a later step the classification of tweets/texts into different classes.


Turney et al. [4] proposed a model called Semantic Orientation, created by studying the usefulness of Twitter for determining sentiment. The limitations of this work include the time required for queries and, for some applications, the level of accuracy that was achieved. The former difficulty will be eliminated by progress in hardware. The latter difficulty might be addressed by combining semantic orientation with other features in a supervised classification algorithm. Supervised learning is used for sentiment analysis with ML to get more accurate results.

Celiktug [5] proposed a 3-way sentiment classification model. Tweets contain rich information about people's preferences: users usually discuss with each other and declare their opinions on Twitter. All in all, Twitter sentiment analysis has practical and research value in numerous applications. Besides that, Choi et al. [6] designed knowledge-based coarse-grained +/- effect implicit sentiment analysis with word sense disambiguation. Opinions are expressed positively or negatively on events, and words can carry a mixture of different positive/negative effect labels. This disambiguation is needed to classify the +/- effect of a word for sentiment analysis, and the knowledge-based coarse-grained +/- effect word sense disambiguation method is used to solve this problem.

Woldemariam et al. [7] derived a model using 600 randomly chosen sample tweets from the Zooniverse as a test dataset and then evaluated and compared two broad categories of sentiment analysis methods, namely lexicon-based and machine learning. After processing, techniques such as SVM, POS tagging, data cleaning, tokenization, ML, and Naive Bayes are used for sentiment analysis. This model is more efficient and provides more accurate results. Because the training dataset is collected from movie reviews whereas the test dataset is obtained from a citizen-science domain, Zooniverse, the algorithm is challenged to recognize some unseen positive/negative phrases specific to the domain.

Mehra et al. [8] proposed a sentiment behavior model using Naive Bayes and Fuzzy Logic. Sentiment analysis is attracting rising attention from both the research and business communities. It is a relatively new area that typically deals with mining user opinion. There are various ways in which social network data can be leveraged to give a better understanding of user opinion; such problems are at the heart of natural language processing (NLP) and data mining research. Twitter is a social microblogging service that allows users to post real-time messages known as tweets. Tweets have many distinctive characteristics, which pose new challenges in figuring out how to carry out sentiment analysis on them compared to other domains. Since many people react on Twitter, massive data is generated every day, creating many features and models for researchers to work with.

Jing et al. [9] designed a topic-adaptive sentiment analysis model (TaSL). The sentiment analysis problem is solved by considering words for higher-level classification in a topic-adaptive sentiment lexicon. Capturing sentiment opinion words on different topics is the main advantage of TaSL. To model topics and sentiments simultaneously, TaSL relies on preexisting sentiment information from documents and words. This approach generates a topic-adaptive sentiment lexicon to improve lexicon-based sentiment classification performance. The proposed TaSL-type sentiment lexicon consistently beats the state-of-the-art semantic lexicons in four real-world experiments.

Rotovei [10] designed a multi-agent framework. For any firm that wishes to build, develop, and enhance customer value (and indirectly shareholder value), Customer Relationship Management (CRM) has become the best practice. The goal is to help a prospect become a customer by making appropriate recommendations. Early customer retention is critical in a business: the first sale creates the first impression of the business.

Li et al. [11] derived a new method using sentiment-specific word embeddings (SSWE) and a weighted text feature model (WTFM). Data generated from Twitter is extensive, but it is not clean; it has wrong formats or spelling mistakes, which makes it very difficult to apply sentiment analysis. They propose an innovative Twitter sentiment analysis method based on sentiment-specific word embeddings and a weighted text feature model. Comparatively, the WTFM model is simple to design and practical.

Khalid et al. [12] propose StreamSensing for analyzing real-time data in noisy streams. This method includes six stages: tokenization, stop word removal, stemming, filtering, conversion into a Term Document Matrix (TDM), and pattern analysis. The method was evaluated and implemented using Spark, a rapid in-memory processing system, and the results are reported and examined. The paper's findings are both theoretical and practical: the StreamSensing technique is introduced theoretically, but it may be used effectively to analyze any real-time text data stream. Furthermore, the proposed architecture is flexible, and its compute dimension can process distributed in-memory data structures created on the fly from the streaming data.

Back to the Vietnamese language, Ho et al. [1] evaluated machine learning models (SVM and Random Forest) and deep learning models (CNN and LSTM) on the UIT-VSMEC corpus, which is based on six basic human emotions: enjoyment, sadness, anger, fear, disgust, and surprise. The best overall weighted F1-score is 59.74% on the original UIT-VSMEC corpus, achieved with CNN using word2vec word embeddings. Building on this, our work constructs a model using machine learning and applies a big data framework to the project: we evaluate models in the Spark framework such as Logistic Regression, Naive Bayes, Decision Tree, and Random Forest, clean data when doing sentiment analysis in real time, and apply the best model to the Vietnamese language. The goal of using Apache Spark's machine learning library (MLlib) is to handle an extraordinary amount of data effectively [13].


Chapter 3

Methodology

3.1 Big Data

Big Data requires a revolutionary step beyond traditional data analysis and is characterized by its three main components: variety, velocity, and volume [14], as shown in Figure 3.1.

Figure 3.1 The three Vs of big data

Volume: The size of data is very large, in terabytes and petabytes [14]. The vastness and growth of data far outstrip conventional data storage and processing approaches. For instance, a company's marketing department may be tasked with analyzing terabytes of consumer communications in a single day to ascertain responses to new items.

Velocity: Data should be used as it streams into the enterprise in order to maximize its value to the business; the role of time is very critical here [14]. A few seconds might seem like an eternity in a time-sensitive activity such as financial fraud investigation, yet for fraud detection and prevention, millions of bank accounts and transactions must be analyzed in real time.

Variety: Data extends beyond structured data to include unstructured data of all varieties: text, audio, video, posts, log files, etc. [14]

Beyond the intensity of this information, another component is verifying the data flow. It is not easy to manage vast data, hence data security must be provided. In addition, after big data is created and processed, it should provide positive value for the firm.

3.2 Spark framework

3.2.1 Core technologies of Spark and its components

Spark is a broad distributed computing platform based on Hadoop MapReduce algorithms. It absorbs the benefits of Hadoop MapReduce, but unlike MapReduce, the intermediate and output results of Spark tasks may be held in memory, which is called memory computing. Memory computing enhances the efficiency of data computing, so Spark is better suited for iterative applications such as data mining and machine learning.

Spark provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL, MLlib for machine learning, GraphX for graph processing, and Spark Streaming [15]. The ecosystem is described in Figure 3.2.

Figure 3.2 The ecosystem of Spark


Spark SQL supports the SQL implementation in Spark. It is well developed in terms of data compatibility, performance optimization, and component extension [15].

Spark Streaming is a stream computing framework based on Spark; it provides a rich API and integrates streaming, batch, and interactive query applications [15].

GraphX is a parallel computation API used for Spark charts and graphs processing. GraphX is developed based on Bagel and greatly improves performance while reducing memory overhead [15].

MLlib (Machine Learning library) is a scalable machine learning library for Spark; it includes relevant tests and data generators. The performance of its machine learning algorithms can be up to 100 times better than MapReduce. MLlib supports the main machine learning algorithms, such as classification, regression, clustering, collaborative filtering, and dimensionality reduction, and supports sparse matrices [15].

In general, the key technology and foundational architecture of Spark is the RDD, and Spark SQL, MLlib, GraphX, and Spark Streaming are the fundamental members of the Spark ecosystem.
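As a minimal illustration of how these components are reached from code (a sketch assuming a local PySpark installation; the file name and query are illustrative, not from this thesis):

from pyspark.sql import SparkSession

# One SparkSession exposes the ecosystem components described above.
spark = SparkSession.builder.appName("reviews-analysis").getOrCreate()

df = spark.read.json("tweets_stream.json")             # DataFrames (Spark SQL)
df.createOrReplaceTempView("tweets")
spark.sql("SELECT COUNT(*) AS n FROM tweets").show()   # SQL over the same data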

A survey from the Big Data company Syncsort [16] discovered that the big data processing trend in 2016 moved towards Spark, at 70%, rather than its primary competition MapReduce, at 55%. Moreover, the poll reveals that Apache Spark is presently the most active project inside the big data sector. Therefore, MapReduce, the default processing engine for Apache Hadoop, should be replaced with its successor Apache Spark. Furthermore, businesses such as Cloudera, Hortonworks, and MapR, the top three providers for big data, offer Spark as the default processing framework over Hadoop.

3.2.2 Big Data Analysis Using Spark

Spark's different machine learning techniques have been researched in depth concerning big data. For the goal of this study, we utilize the MLlib package of Spark to develop Logistic Regression, Decision Tree, Random Forest, and Naive Bayes methods.


In this project, the data processing happens using the Twitter streaming API and Apache Spark, as shown in Figure 3.3 below.

Figure 3.3 Apache Spark data processing (the Twitter Streaming API feeds an input data stream into Spark Streaming, which passes batches of input data to the Spark Engine, which outputs batches of processed data)
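A minimal sketch of this batch flow, assuming tweets are forwarded to a local TCP socket (the host, port, and batch interval are illustrative, not the thesis's actual ingestion settings):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="tweet-stream")
ssc = StreamingContext(sc, 10)                    # 10-second micro-batches
lines = ssc.socketTextStream("localhost", 9009)   # stream of incoming tweets
lines.count().pprint()                            # tweets processed per batch
ssc.start()
ssc.awaitTermination()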

3.3 Machine Learning

3.3.1 Machine learning process

As illustrated in Figure 3.4, the machine learning process covers several steps in practice.


Figure 3.4 A typical process of machine learning [17]

Data Acquisition: This is the initial phase, when we gather raw data for the model. For example, if we want to construct a model that forecasts the arrival time of an airline, we gather historical data of previous performance and known properties of the airline.

Data Cleaning: The raw data is then cleaned to provide an organized and tidy dataset for good analysis and modelling.

Data Split: The cleansed data is then separated into two datasets (train and test). The training set is the dataset used to develop the model, while the testing set is used to evaluate the model's performance.

Train ML Model and Test Model: The train dataset is further separated into another training set and a validation set. The validation dataset is used to verify the model developed from the train dataset before testing it with the original test dataset. If the model performance is not as intended, it may be further improved by adjusting parameters and feature engineering until it achieves its optimal performance.

Evaluate Model: After the model is tested sufficiently using the validation set, it will be assessed for accuracy and performance using the original test dataset.

Deploy Model: When the result is as expected, the final model with the best performance will be deployed and implemented in real time.

To evaluate the ability of the Apache Spark MLlib library in analyzing big data sets, we focused on a set of supervised (classification) methods, including Logistic Regression, Decision Tree, Naive Bayes, and Random Forest.
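A sketch of this split-train-evaluate cycle in PySpark, assuming `data` is a labeled Spark DataFrame and `pipeline` is the feature-and-classifier pipeline (the 80/20 ratios and the seed are assumptions, not the thesis's settings):

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

train_df, test_df = data.randomSplit([0.8, 0.2], seed=42)     # hold out a test set
train_df, val_df = train_df.randomSplit([0.8, 0.2], seed=42)  # carve out validation

model = pipeline.fit(train_df)                 # train
val_predictions = model.transform(val_df)      # validate and tune, then...
test_predictions = model.transform(test_df)    # ...assess on the untouched test set
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
print("test accuracy:", evaluator.evaluate(test_predictions))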


3.3.2 Machine learning algorithms

Figure 3.5 Summary of machine learning algorithms [18]

Figure 3.5 shows many machine learning algorithms; in this work, because the dataset is multi-class, the Naive Bayes, Logistic Regression, Random Forest, and Decision Tree algorithms were applied to accomplish the sentiment analysis.

Logistic regression is a regression model where the dependent variable can take one out of a fixed number of values. It utilizes a logistic function to measure the relationship between the instance class and the features extracted from the input. Although widely used for binary classification, it can be extended to solve multiclass classification problems [19]. A multi-class logistic regression method is used to identify all kinds of emotions; in this work, we apply logistic regression to multiple classes.

Naive Bayes is a basic multiclass classification technique based on the application of Bayes' theorem. Each instance of the problem is represented as a feature vector, and it is assumed that the value of each feature is independent of the value of every other feature. One of the benefits of this algorithm is that it can be trained highly efficiently, since it takes just a single pass over the training data. Initially, the conditional probability distribution of each feature given the class is calculated, and then Bayes' theorem is utilized to predict the class label of an instance [19].

A decision tree is a classification technique based on a tree structure whose leaves indicate class labels and whose branches represent combinations of features that result in those classes. Essentially, it does a recursive binary split of the feature space. Each step is picked greedily, striving for the optimum decision for the present step by maximizing the information gain.

Random forests are very flexible and powerful ensemble classifiers based on decision trees, first developed by Breiman [20]. The method builds random binary trees on subsets of the observations through the bootstrapping technique: a random pick of the training data is chosen from the original dataset and used to form the model, and the data not included is characterized as out of bag.
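These four algorithms map directly onto Spark MLlib estimators; a sketch follows (the hyperparameters shown are illustrative defaults, not the thesis's tuned values):

from pyspark.ml.classification import (
    LogisticRegression, NaiveBayes, DecisionTreeClassifier, RandomForestClassifier)

classifiers = {
    # handles multi-class targets via multinomial softmax by default
    "logistic_regression": LogisticRegression(maxIter=100),
    # multinomial variant suits TF-IDF count-like features
    "naive_bayes": NaiveBayes(smoothing=1.0, modelType="multinomial"),
    # greedy recursive binary splits maximizing information gain
    "decision_tree": DecisionTreeClassifier(maxDepth=5),
    # an ensemble of bootstrapped decision trees
    "random_forest": RandomForestClassifier(numTrees=100),
}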

3.4 Twitter

One of the most prominent social networking websites, Twitter, came into being on the 21st of March 2006. On this website, visitors may read as well as send tweets. Tweets are posts on Twitter with a limited character block, so Twitter itself constrains the scope of the content being assessed. A tweet is not simply a short, instant communication; instead, it combines content information with metadata linked to the tweet. These features are the highlights of tweets [21].

They convey the core of the tweet, or what the tweet is about. The metadata may be utilized to find the region of the tweet. The metadata of tweets consists of entities and places. These entities include client-specific hashtags, URLs, media, and users: the Twitter user ID, RT signifying a retweet, '@' followed by a client identification, and '#' followed by a phrase denoting a hashtag.

3.4.1 Twitter API

The Streaming APIs allow push delivery of tweets and other events, enabling real-time or low-latency applications. The Twitter API is a widely recognized source of big data and is utilized internationally in many applications with many aims. However, the free Twitter API has specific restrictions that should be considered when examining the data.

Twitter provides two APIs: REST and Streaming. The REST API consists of two APIs: one just called the REST API, and another called the Search API (whose difference is entirely due to their history of development). The difference between the Streaming API and the REST APIs is that the Streaming API supports long-lived connections and provides data in almost real time, while the REST APIs support short-lived connections and are rate-limited (one can download a certain amount of data per day but not more).

Both the Streaming API and the Search REST API have a language parameter that can be set to a language code, e.g., 'vi', to collect Vietnamese data. But the collected data still contains tweets in other languages, making the data very noisy.
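For instance, with tweepy 3.x (one of the crawling libraries used in this work) the Streaming API's language parameter is set as in the sketch below; the placeholder credentials and track terms are illustrative, not from the thesis:

import tweepy

class PrintListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text)   # each pushed tweet arrives here in near real time

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
stream = tweepy.Stream(auth=auth, listener=PrintListener())
# language parameter set to 'vi'; other languages may still slip through
stream.filter(track=["không", "của"], languages=["vi"])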

3.4.2 Sentiment Analysis of Twitter Data

The primary goal of Twitter sentiment analysis is to categorize numerous tweets into distinct feeling categories. The various approaches to Twitter in this topic arise from training a model and assessing its effectiveness.

It is challenging: tweets are not as straightforward as they look to be, for the following reasons [21] [22]:

- Varying language origins: Different people from different cultures tweet using some words of their cultural origin. These words may be promotional, slang, and more.

- Limited character block size: With just 280 characters in scope, the amount of data content that can be recognized is minimal.

- Use of hashtags: Twitter provides hashtags to mention emotions and events, which require separate processing from the actual word-based tweet.

Sentiment analysis is mainly composed of the steps shown in Figure 3.6.

Accept input: Take any of the tweets. This tweet may be a blend of numerous moods, tags, and hashtags.


Figure 3.6 Sentiment analysis steps (raw data is crawled, cleaned, saved, and used to train a classification model)

We crawl tweets from Twitter using tweepy and snscrape, plus data from another social network. After cleaning the data (text preprocessing), we save it in two forms: (1) a CSV file for labeling and for training the model, and (2) a JSON file used for real-time sentiment analysis. After that, the model with the highest accuracy becomes our classification model. As a result, we can display a sentence and what its sentiment is.

3.5 Crawl data

In this work, we use Snscrape to crawl tweets from Twitter. Snscrape is a Python package that makes it possible to scrape tweets without being restricted or having to cap the number of requests. The collected tweets serve as our dataset. Snscrape's tweet object already exposes a lot of information: Figure 3.7 shows the data you can access with snscrape, and the attributes in the figure map exactly to how they are stored inside the actual object [23].


Attributes available through the snscrape tweet object (attribute description left blank if purpose is unknown):

- url: permalink pointing to the tweet location
- date: date the tweet was created
- content: text content of the tweet
- renderedContent: appears to also be the text content of the tweet
- id: id of the tweet
- user: user object containing username, displayname, id, description, descriptionUrls, verified, created, followersCount, friendsCount, statusesCount, favouritesCount, listedCount, mediaCount, location, protected, linkUrl, profileImageUrl, profileBannerUrl
- outlinks, tcooutlinks
- replyCount: count of replies
- retweetCount: count of retweets
- likeCount: count of likes
- quoteCount: count of users that quoted the tweet and replied
- conversationId: appears to be the same as the tweet id
- lang: machine-generated, assumed language of the tweet
- source: where the tweet was posted from, e.g., iPhone, Android, etc.
- media: media object containing previewUrl, fullUrl, and type
- retweetedTweet: if a retweet, id of the original tweet
- quotedTweet: if a quoted tweet, id of the original tweet
- mentionedUsers: user objects of any mentioned users in the tweet

Figure 3.7 Attributes available in snscrape tweet object

Our project focuses on the Vietnamese language, so we want to crawl Vietnamese tweets. The crawling code is shown in Figure 3.8.

import snscrape.modules.twitter as sntwitter

tweets_list2 = []
# Collect Vietnamese tweets posted between 2020-11-01 and 2021-11-08,
# excluding retweets, links, and replies.
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(
        'since:2020-11-01 until:2021-11-08 lang:vi '
        '-filter:retweets -filter:links -filter:replies').get_items()):
    if i > 5000:   # stop after ~5,000 tweets; the exact limit is illegible in the scan
        break
    tweets_list2.append([tweet.content])

Figure 3.8 Code for crawling Vietnamese tweets

After running this code, we get the tweets in the data frame shown in Figure 3.9. We can see that there is a lot of noise in the dataset. That is why, before doing sentiment analysis, we clean the data using the techniques discussed in the section below.


Figure 3.9 Dataframe of tweets

3.6 Data preprocessing

Before running the model, we perform data preprocessing in the following steps.

Table 3.1 Steps for data preprocessing

1. Remove @mentions from the text. Example: "Ban nay dep that @HaLe" -> "Ban nay dep that"
2. Remove hashtags, tags, and HTML from the text. Example: "Chi iu có lên #ARMY #BTS" -> "Chi iu có lên ARMY BTS"
3. Remove URLs from the text. Example: "Theo dõi fb minh nha https://www.fb.com/123" -> "Theo dõi fb minh nha"
4. Remove mail addresses.
5. Convert emoji to text. Example: ":)" -> "smile"
6. Remove double spaces.
7. Remove spaces at the beginning and end of the sentence.
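A sketch of these steps as a single cleaning function; the regular expressions and the `emoji` package's demojize() for step 5 are assumptions about the implementation, not the thesis's exact code:

import re
import emoji

def clean_tweet(text: str) -> str:
    # 4. remove mail addresses first, so the '@' handling below cannot mangle them
    text = re.sub(r"\S+@\S+\.\S+", "", text)
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # 3. remove URLs
    text = re.sub(r"@\w+", "", text)                    # 1. remove @mentions
    text = re.sub(r"<[^>]+>", "", text)                 # 2. remove HTML tags
    text = text.replace("#", "")                        # 2. keep hashtag words, drop '#'
    text = emoji.demojize(text, delimiters=(" ", " "))  # 5. emoji -> text
    text = re.sub(r"\s{2,}", " ", text)                 # 6. collapse double spaces
    return text.strip()                                 # 7. trim spaces at begin/end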

Information: 6,927 human-annotated sentences, each with one of the seven emotion labels. Statistics of the emotion labels of the dataset are presented in Table 3.2.

Table 3.2: Statistics of emotion labels of the UIT-VSMEC corpus [1]

Emotion      Sentences   Percentage (%)
Enjoyment    1,965       28.36
Disgust      1,338       19.31
Sadness      1,149       16.59

Based on Ekman's instructions on basic human emotions [24], we build annotation guidelines for Vietnamese text with seven emotion labels, described as follows [1].


Enjoyment: For comments with states that are triggered by feeling connection or sensory pleasure. It contains both peace and ecstasy. The intensity of these states varies, from the enjoyment of helping others, a warm uplifting feeling that people experience when they see kindness and compassion, an experience of ease and contentment, or even the enjoyment of the misfortunes of another person, to the joyful pride in accomplishments or the experience of something very beautiful and amazing. For example, the emotion of the Vietnamese sentence "coi BST dep lim" is Enjoyment.

Anger: Contains both annoyance and fury. The intensity of these states varies: we can feel mild or strong annoyance, but we can only feel intense fury. All states of anger are triggered by a feeling of being blocked in our progress. Anger varies from frustration, which is a response to repeated failures to overcome an obstacle, through exasperation (anger caused by a strong nuisance) and argumentativeness, to bitterness (anger after unfair treatment) and vengefulness. For example, the emotion of the Vietnamese sentence "Dem tôi ngủ d nồi vi run các bạn a dhs lúc đồng ý thì mạnh mom thé h hén ved dm dm" is Anger.

Fear: Contains both anxiety and terror. The intensity of these states varies: we can feel mild or strong anxiety, but we can only feel intense terror. All states of fear are triggered by feeling a threat of harm. The intensity varies from trepidation (anticipation of the possibility of danger), nervousness, and dread to desperation (a response to the inability to reduce danger), panic, and horror, a mixture of fear, disgust, and shock. For example, the emotion of the Vietnamese sentence "muốn di we nhưng sợ ma, đời ác vừa" is Fear.

Disgust: Contains both dislike and loathing. The intensity of these states varies: we can feel mild or strong dislike, but we can only feel intense loathing. All states of disgust are triggered by the feeling that something is toxic. Their strength ranges from an inclination to avoid anything nasty, or aversion (the response to a foul taste, smell, item, or thought), through repugnance, to revulsion (a blend of disgust and hate) or abhorrence (a mixture of strong disgust and hatred). For example, the emotion of the Vietnamese sentence "Chết mẹ cho rồi" is Disgust.

Sadness: Contains both disappointment and despair. The intensity of these states varies: we can feel mild or strong disappointment, but we can only feel intense despair. All states of sadness are triggered by a feeling of loss. The intensity of its states varies from discouragement, distraughtness, helplessness, and hopelessness to strong suffering, a feeling of distress and sadness often caused by a loss, sorrow, and anguish. For example, the emotion of the Vietnamese sentence "Chẳng ai quan tâm ca" is Sadness.

Surprise: For comments that express the feeling caused by unexpected events, something hard to believe that may shock you. This is the shortest emotion of all; it only takes a few seconds. It passes when we understand what is happening, and it may become fear, anger, relief, or nothing, depending on the event that surprises us. For example, the emotion of the Vietnamese sentence "Tôi đứng hình khi thay quà tặng ấy!" is Surprise.

Other: For comments that show none of the emotions above or that do not contain any emotions.

3.7.3 Our dataset

Table 3.3 Statistics of emotion labels of our dataset

Emotion      Sentences   Percentage (%)
Enjoyment    763         22.40
Disgust      154         4.52
Sadness      498         14.62
Anger        141         4.14
Fear         88          2.58
Surprise     121         3.55
Other        1,641       48.18
Total        3,406       100

3.7.4 Mix dataset

We merge our dataset and the UIT-VSMEC corpus to compare models, evaluating the accuracy of each dataset on each model to find the best results. After combining the two datasets, we get a dataset whose emotion coverage of words is broader, and the amount of data used for training also increases, as sketched below.
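A minimal sketch of the merge, assuming both corpora are loaded as Spark DataFrames with "Sentence" and "Emotion" columns (the variable names are illustrative):

mix_df = our_df.select("Sentence", "Emotion") \
               .unionByName(uit_df.select("Sentence", "Emotion"))
mix_df.groupBy("Emotion").count().show()   # per-label counts, as in Table 3.4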

Name: Mix_data

Information: 10,782 Vietnamese sentences with one of the seven emotion labels. Statistics of the emotion labels of the dataset are presented in Table 3.4.

Table 3.4 Statistics of emotion labels of the mix dataset

Emotion      Sentences   Percentage (%)

3.8 Spark NLP

Natural language processing (NLP) is a vital component in many data science systems that must interpret or reason about text. Common use cases include question answering, paraphrasing or summarizing, sentiment analysis, natural language BI, language modeling, and disambiguation. Nevertheless, NLP is always just a part of a bigger data processing pipeline, and due to the nontrivial steps involved in this process, there is a growing need for an all-in-one solution that eases the burden of text preprocessing at large scale and connects the dots between the various steps of solving a data science problem with NLP. A decent NLP library should be able to appropriately turn free text into structured features and enable users to train their own NLP models that are readily fed into downstream machine learning (ML) or deep learning (DL) pipelines with no trouble [25].

Spark NLP is developed to be a single unified solution for all NLP tasks and is the only library that can scale up for training and inference in any Spark cluster, take advantage of transfer learning, implement the latest and greatest algorithms and models in NLP research, and deliver mission-critical, enterprise-grade solutions at the same time. It is an open-source natural language processing library developed on top of Apache Spark and Spark ML. It offers a simple API to interact with ML pipelines, and it is commercially sponsored by John Snow Labs Inc., an award-winning healthcare AI and NLP firm located in the USA.

Figure 3.10 is our pipeline. We add one of the models at the end of the pipeline, such as LogisticRegression(), NaiveBayes(), DecisionTreeClassifier(), or RandomForestClassifier().

When we fit() the pipeline on a Spark data frame, its Sentence column is fed into the DocumentAssembler() transformer. A new column, document, is formed as the initial entry point to Spark NLP for every Spark data frame.

Then the "document" column is put into Tokenizer(): each sentence is tokenized, and a new column, "token", is produced.

Then tokens are normalized (basic text cleaning), and StopWordsCleaner() takes the output of Normalizer() and drops all the stop words from the input sequences.

Stemmer() returns hard stems of words, with the objective of retrieving the meaningful part of the word.


Finisher() converts annotation results into a format that is easier to use. It is useful for extracting the results from Spark NLP pipelines; the Finisher outputs the annotation values.

Figure 3.10 Pipeline for Spark NLP
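The screenshot in Figure 3.10 is only partially legible, so the following is a reconstruction of such a pipeline rather than the thesis's exact code; the CountVectorizer stage, minDocFreq=5, and the column names follow what is visible in the figure, and the rest is assumed:

from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, Normalizer, StopWordsCleaner, Stemmer
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, IDF, StringIndexer

document_assembler = DocumentAssembler() \
    .setInputCol("Sentence").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
normalizer = Normalizer().setInputCols(["token"]).setOutputCol("normalized")
stopwords_cleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]).setOutputCol("cleanTokens")
stemmer = Stemmer().setInputCols(["cleanTokens"]).setOutputCol("stem")
finisher = Finisher().setInputCols(["stem"]) \
    .setOutputAsArray(True) \
    .setCleanAnnotations(False)              # keep intermediate columns
cv = CountVectorizer(inputCol="finished_stem", outputCol="raw_feature")
idf = IDF(inputCol="raw_feature", outputCol="features", minDocFreq=5)
label_string_idx = StringIndexer(inputCol="Emotion", outputCol="label")

pipeline = Pipeline(stages=[
    document_assembler, tokenizer, normalizer, stopwords_cleaner,
    stemmer, finisher, cv, idf, label_string_idx])
# a classifier such as LogisticRegression() is appended as the final stage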


3.9 TF-IDF

TF-IDF stands for "Term Frequency - Inverse Document Frequency". It is a technique for quantifying words in a set of documents: we generally compute a score for each word to signify its importance in the document and corpus. This method is widely used in Information Retrieval and Text Mining [26].

TF-IDF estimates the significance of a word primarily from word frequency. Its underlying assumption, that high-frequency words in a text are not significant while low-frequency words are critical, evidently lacks theoretical support: a frequent word is not necessarily a meaningless term, and a low-frequency word does not always have strong discriminative power. Furthermore, the algorithm does not adequately represent the impact of the position of words or their part of speech.

Term Frequency (TF): This measures the frequency of a word in a document, which depends heavily on the length of the document and the generality of the word. For this exact reason, we normalize the frequency value, here by the maximum frequency of any word in the document. TF is individual to each document and word; hence we can formulate TF as follows:

\mathrm{tf}(t, d) = \frac{f(t, d)}{\max\{f(w, d) : w \in d\}}

Figure 3.11 Calculating TF

Document Frequency (DF): This assesses the significance of documents in the complete corpus. It is highly similar to TF, but the main difference is that TF is the frequency counter for a term t in document d, while DF is the count of occurrences of term t throughout the document set N; in other words, DF is the number of documents in which the term is present. We count one occurrence if the word is contained in the document at least once; we do not need to know the number of times the term is present [26].


Inverse Document Frequency (IDF): This is the inverse of the document frequency, which measures the informativeness of term t. When we calculate IDF, it will be very low for the most frequently occurring words, such as stop words (because they are present in almost all documents, and N/df gives a very low value to such a word). This finally gives what we want: a relative weighting [26].
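Combining the two factors gives the TF-IDF weight. In the usual formulation (one common variant; Spark's IDF implementation additionally adds one to both numerator and denominator before taking the logarithm):

\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)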
