Graduation thesis: Sentiment Analysis of User’s Vietnamese Movie Reviews Using Support Vector Machine


VIETNAM NATIONAL UNIVERSITY

UNIVERSITY OF INFORMATION TECHNOLOGY

ADVANCED PROGRAM IN INFORMATION SYSTEMS

NGUYEN THI NGA

SENTIMENT ANALYSIS OF USER’S VIETNAMESE MOVIE REVIEWS USING

SUPPORT VECTOR MACHINE

BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS

HO CHI MINH CITY, 2021


VIETNAM NATIONAL UNIVERSITY HCM CITY

UNIVERSITY OF INFORMATION TECHNOLOGY

ADVANCED PROGRAM IN INFORMATION SYSTEMS

NGUYEN THI NGA - 16520787

SENTIMENT ANALYSIS OF USER’S VIETNAMESE MOVIE REVIEWS USING

SUPPORT VECTOR MACHINE

BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS

THESIS ADVISOR

Dr. CAO THI NHAN

HO CHI MINH CITY, 2021


ASSESSMENT COMMITTEE

The Assessment Committee is established under Decision ……, dated ……, by the Rector of the University of Information Technology.

1. …… - Chairman

2. …… - Secretary

3. …… - Member

4. …… - Member

ACKNOWLEDGMENTS

The graduation thesis has been a golden opportunity for me to test and apply the knowledge I have gained during my time at the university, yet opportunities always come with difficulties. I would therefore like to express my special thanks and gratitude to the people who helped me through this challenging period.

First and foremost, I am grateful to have had Dr. Cao Thi Nhan as my thesis advisor. Were it not for her incredible patience, inspiration, and knowledge, I would probably not have completed this thesis at this level.

Secondly, I offer my highest gratitude and appreciation to Mr. Nguyen Minh Tri and Mr. Nguyen Van Vinh. Thank you for your support and for helping me keep my research on track. I would also like to thank Pham Thi Anh Minh and Nguyen Thi Thu Huong for accompanying me in the process of creating the dataset.

Last but not least, I would like to thank my parents for always being there for me.

In the course of writing this thesis, I could not completely avoid shortcomings and limitations. I therefore look forward to receiving valuable feedback and suggestions from my teachers. Once again, thank you very much!

THESIS PROPOSAL

THESIS TITLE: Sentiment Analysis of User’s Vietnamese Movie Reviews Using Support Vector Machine

Advisor: Dr. Cao Thi Nhan

Duration: Oct 5, 2020 to Dec 15, 2020

Student: Nguyen Thi Nga - 16520787

Contents:

1. Scope

• Dataset: Vietnamese movie reviews.

• Level: sentence.

• Algorithm: SVM, n-grams, TF-IDF.

• Programming language: Python.

2. Objectives

• Dataset: Vietnamese movie reviews.

• Data preprocessing algorithms, data mining.

3. Methodologies

• Survey: collect and read information from documents and textbooks related to data mining, machine learning, and sentiment analysis.

• Create the dataset.

• Use tools to crawl data from Facebook pages about movies: CGV Cinema, Lotte, etc.

• Research data preprocessing techniques.

• Research SVM models suitable for this data.

• Evaluation: use established techniques to evaluate the classification results.

4. Expected results

• A standard Vietnamese dataset of movie reviews.

• A successfully built demo.

• The SVM model achieves a result of more than 60%.

Timeline:

Phase 1 (5/9/2020 - 18/9/2020): Discuss and select the thesis topic, survey the current situation and related articles and theses, and write the thesis outline.

Phase 2 (19/9/2020 - 15/11/2020): Crawl the data, process it, and find and apply solutions for abbreviations in the data.

Phase 3 (15/11/2020 - 8/12/2020): Install the environment and run the algorithm. Use evaluation techniques such as numpy-matrix.

Phase 4 (8/12/2020 - 25/12/2020): Edit, supplement, and complete the report and slides.

TABLE OF CONTENTS

ACKNOWLEDGMENTS

THESIS PROPOSAL

TABLE OF CONTENTS

LIST OF TABLES

TABLE OF FIGURES

ABBREVIATIONS

ABSTRACT

CHAPTER 1: PROBLEM STATEMENT

1.4 Thesis Structure

CHAPTER 2: OVERVIEW

2.1 Sentiment Analysis

2.2 Related Research

2.3 Conclusion

CHAPTER 3: DATASET CREATION

3.1 Collecting and Preprocessing Data

CHAPTER 4: SENTIMENT ANALYSIS MODEL

4.1 Support Vector Machine

4.2 Vectorization

4.3 Feature N-Grams

CHAPTER 5: EXPERIMENT AND RESULTS

5.1 Preprocessing Data

5.2 Vectorization

5.3 Implement and Setting Parameter

5.3.1 Traditional Classification Model

5.3.2 Model evaluation: numpy-matrix

LIST OF TABLES

Table 3.1 Consensus among labeling members

Table 3.2 The number and proportion of label categories

Table 4.1 Feature unigrams, bigrams, trigrams for the sentence “phim vừa hay vừa cảm động phải dùng khăn giấy để lau luôn”

Table 5.1 Results for the sentiment classification task (in percent)

TABLE OF FIGURES

1 Diagram of the sentiment analysis system overview

2 Accuracy comparison diagram among features

3 The Homepage of the website

4 The Contact page of the website

5 The Demo page of the website

6 The recommended sentence feature

7 Demo results for the sentence “phimm hay qua”

8 Demo results for the sentence “công vinh đá đẹp ghê”

0 Demo results for the sentence “bộ phim không được hấp dẫn người

2 Demo results for the sentence “góc nhìn từ cửa sổ thật là đẹp”

3 Demo results for the sentence “phim cũng ổn áp! anh nhé?”

4 Demo results for the sentence “phim có vẻ không oke tí nào”

5 Demo results for the sentence “diễn viên chính xấu ghê mà diễn cũng

6 Demo results for the sentence “trời lạnh buồn ngủ quá”

ABSTRACT

In this thesis, we focus on building a standard sentence-level Vietnamese dataset of user comments in the movie domain. The dataset consists of sentences annotated for sentiment analysis. We also provide a set of labeling rules to support community research and language development.

The work was planned as follows:

• Create a dataset of movie reviews.

• Create the labeling guidelines for the dataset.

• Research and choose a model to solve sentiment analysis on the movie dataset.

• Evaluate the model.

• Build a website that predicts the sentiment of input sentences.

In addition, we research and test the Support Vector Machine (SVM) model on the dataset we built. On the sentiment detection task, the SVM model achieved 92.23% with 2-grams.

However, this thesis has several limitations:

• Some keywords still confuse the model, such as “không được, khá, gu, có vẻ”.

• The dataset is unbalanced across labels, which influences the test results. This limitation will have a great impact on bringing the application into practice.

• The dataset has 5,194 rows: 544 positive sentences, 4,326 neutral sentences, and 324 negative sentences. Such a large difference between labels causes an imbalance in the data and affects the results of later testing. The model is likely overfitted, because the test set is very similar to the training set.

CHAPTER 1: PROBLEM STATEMENT

1.1 Introduction

In recent years, the Internet has grown strongly and rapidly, and as young people's demand for entertainment increases, so does the need to consult the feedback of previous customers. Websites are therefore being developed to let users share experiences, reviews, comments, and feedback on different types of services and on movies shown in cinemas. When users decide to see a certain movie, they consider not only information about the actors, trailer, and director but also the feedback of other users.

By reading the reviews and feedback of other users, customers can choose a more suitable movie with more confidence. Likewise, businesses, services, and organizations collect user feedback about their movies to guide their decisions. However, the volume of feedback about movies is large, so it is difficult for users, businesses, and organizations to follow all of it. To solve this problem, they need a system that automatically analyzes all reviews and summarizes the feedback, so that customers and businesses can consult it and make quick decisions.

Currently, systems that analyze user feedback on websites usually consider only the scores that users give to products and services. However, rating scales do not objectively express the level of user satisfaction conveyed in comment sentences and paragraphs.

Sentiment analysis, and in particular sentiment analysis in the movie domain, attracts strong interest from the research community both worldwide and in Vietnam. Most datasets and algorithms have been built and tested for languages such as English and Chinese. For Vietnamese, however, few datasets have been built to serve the research community. We therefore decided to build a standard sentence-level dataset for Vietnamese and to implement a system that uses SVM to analyze comments automatically.

A number of systems have been built to analyze user comments, but there is no such system for movie comments in the Vietnamese language.

1.2 Objectives and Scope

1.2.1 Objectives

The main subject of this thesis is user reviews. These reviews are taken from user feedback about movies on the CGV Vietnam fanpage, and they are the basis for building and developing the dataset for the problem studied in this thesis.

In this thesis, we focus on researching and implementing a traditional machine learning model, the Support Vector Machine, with n-gram features to solve sentiment analysis on the movie dataset.

1.2.2 Scope

• The data studied in this thesis are user reviews about movies on the CGV Facebook page.

• The analysis is at the sentence level.

• We use three sentiment labels: negative, positive, and neutral.

• The algorithms used are SVM and n-grams.

1.2.3 Goals

In this thesis, we pursue four main goals:

• Set a standard for building a sentence-level Vietnamese dataset in the movie domain for sentiment analysis.

• Build the sentence-level movie dataset.

• Preprocess and mine data in the movie domain.

• Implement, test, and compare different approaches based on the traditional SVM machine learning model and n-gram feature extraction.
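To make the n-gram features named in the goals concrete, the following is a minimal sketch in Python, the language listed in the thesis proposal. It assumes plain whitespace tokenization. The function name `word_ngrams` and the sample sentence are illustrative and are not taken from the thesis code.

```python
def word_ngrams(sentence, n):
    """Return the list of word-level n-grams of a sentence.

    Assumption: tokens come from splitting on whitespace; the thesis
    may use a different tokenizer.
    """
    tokens = sentence.split()
    # A sentence with fewer than n tokens yields an empty list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "phim hay quá"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)
```

Each n-gram would then become one feature dimension for the classifier, so a larger n captures more word order at the cost of a sparser feature space.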

1.3 Results

From the research in this thesis, I have achieved the following results:

• Built a standard Vietnamese dataset of 5,194 sentences labeled for sentiment analysis (SA): 324 negative sentences, 544 positive sentences, and 4,326 neutral sentences. We then split the data into a training set and a test set with a 70-30 ratio for research purposes.

• Built a guideline for the labeling process.

• The expected result for the SVM model was more than 60%; we achieved 92.23% with the SVM method and 2-grams.
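The label counts above can be checked with a few lines of Python. The sketch below computes each label's share, the score of a trivial classifier that always answers "neutral", and a stratified 70-30 split. It assumes the larger part is the training set and that the split keeps label proportions; this section does not state either point, and the function names are hypothetical.

```python
import random
from collections import Counter

# Label counts reported in this section.
LABEL_COUNTS = {"negative": 324, "positive": 544, "neutral": 4326}


def label_shares(counts):
    """Return each label's share of the dataset, in percent."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


def stratified_split(labels, train_ratio=0.7, seed=0):
    """Split row indices into train/test while keeping label proportions.

    Assumption: stratification is one common choice, not necessarily
    the procedure used in the thesis.
    """
    rng = random.Random(seed)
    by_label = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_ratio)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


shares = label_shares(LABEL_COUNTS)
# A classifier that always predicts the majority label reaches this accuracy.
majority_baseline = max(shares.values())

labels = [label for label, n in LABEL_COUNTS.items() for _ in range(n)]
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
```

With the reported counts, the neutral label covers about 83.3% of the sentences. If the 92.23% figure is an accuracy, it is best read against that baseline rather than against the 60% target.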

1.4 Thesis Structure

This thesis is divided into 6 chapters as follows:

• Chapter 1: Problem Statement. This chapter presents the reasons for choosing the topic, our objectives, scope, and contributions, and an analysis of the problem.

• Chapter 2: Overview. This chapter introduces the concept of sentiment analysis on the movie dataset, analyzes the research directions pursued in Vietnam and abroad, and presents the problem researched and implemented in this thesis.

• Chapter 3: Dataset Creation. This chapter describes the data labeling process, from gathering information and building labeling rules to the stages of the labeling process, and analyzes the dataset.

• Chapter 4: Sentiment Analysis Model. This chapter presents our approach to sentiment analysis on the movie dataset, followed by the theoretical basis of the methods and features used in the experiments.

• Chapter 5: Experiment and Results. This chapter presents the experimental setup and parameter tables, and analyzes the results across trials.

• Chapter 6: Conclusion and Future Work. In the last chapter, we summarize and conclude our work and propose options for its future development.

CHAPTER 2: OVERVIEW

… made public on the Internet. This might explain why sentiment analysis and opinion mining are often used as synonyms, although we think it is more accurate to view sentiments as emotionally loaded opinions [1].

We have seen a massive increase in the number of papers focusing on sentiment analysis and opinion mining in recent years. Figure 2.1 shows the increase in searches for the string “sentiment analysis” in the Google search engine.

Figure 2.1 Google Trends data showing the relative popularity of the search strings “sentiment analysis” and “customer feedback”. Source: www.google.com/trends

To enrich Vietnamese sentiment analysis studies, we decided to address sentence-level sentiment analysis of movie comments in Vietnamese. To that end, we created a dataset crawled from the CGV Vietnam fanpage.

For example, given the user comment "chị diễn viên chính diễn quá đỉnh", the sentiment analysis system returns a positive label for the review.

2.1 Sentiment Analysis

Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews and survey responses, online and social media, and healthcare materials, for applications that range from marketing to customer service to clinical medicine.

The purpose of sentiment analysis is to identify, analyze, and evaluate the opinions, emotions, and attitudes expressed in people's reviews of products, services, teams, organizations, individuals, and events. Research on sentiment analysis and opinion mining dates back to the 2000s [2]. Since then, the field has attracted great interest from researchers.

There are two main reasons for this interest in research and development. First, sentiment analysis is applied in many practical areas of life. Second, the growth of social networks provides a huge amount of data for research on the problem. The study of opinion analysis not only has a significant impact on natural language processing but can also have a profound impact on governance, politics, economics, and the social sciences, as they are all influenced by human opinion.

Currently, opinion analysis is studied at three levels: sentence level, document (text) level, and aspect level [3]. Recent studies mainly focus on determining whether feedback, expressed directly or implicitly, is positive, negative, or neutral.

2.2 Related Research

From the 2000s until now, sentiment analysis and opinion analysis have attracted researchers' attention and have been developed and put into practice. The term sentiment analysis first appeared in the work of Nasukawa and Yi [4], and the term opinion mining first appeared in the work of Dave, Lawrence, and Pennock [5]. However, the study considered the first to lay the foundation for opinion analysis is that of Pang et al. [2]. Since then, the field has attracted growing interest and development.

The work [2] studies opinion analysis of user feedback in the movie domain with two classes, positive and negative. Three machine learning methods (Naive Bayes, maximum entropy classification, and support vector machines) were used to classify opinions in that research.

In addition to research worldwide, sentiment analysis has also attracted the domestic research community in various domains such as restaurants, hotels, electronics, and education. As far as we know, the first research on opinion analysis in Vietnamese was done by Kieu & Pham [6] at the sentence level. They built a rule-based system using the GATE platform and evaluated it on data from the computer domain, reaching an F1 of 62.84%.

Domestic and international studies on sentiment analysis show that the need for datasets to support research on the problem is huge. Worldwide, many datasets have been built and used to solve many problems.

However, Vietnamese data for the movie domain is lacking, so we decided to build a standard Vietnamese dataset about movies.

Recognizing the importance and necessity of the problem, we build a moderately sized, sentence-level dataset for the movie domain to support and motivate research and the development of processing and testing methods. Our dataset targets sentiment identification, and it is built at the sentence level rather than the document level.

Trang 21

2.3 Conclusion

Sentiment analysis is one of the fastest growing research areas in computer science, making it challenging to keep track of all the activities in the area. The roots of sentiment analysis are in the studies on public opinion analysis at the beginning of the 20th century and in the text subjectivity analysis performed by the computational linguistics community in the 1990s. In recent years, sentiment analysis has shifted from analyzing online product reviews to social media texts from Twitter and Facebook. Many topics beyond product reviews, such as stock markets, elections, hotels, movies, medicine, and software engineering, now make use of sentiment analysis.

However, there is not much research on Vietnamese sentiment analysis: many domains have not been explored yet, and datasets for the Vietnamese language are still limited or do not keep up with world trends on social networks.

We therefore chose to solve the sentiment analysis problem in the movie domain at the sentence level in the Vietnamese language.

The Vietnamese movie review dataset was created by us and is used throughout the experiments. User comments were crawled automatically from the Vietnamese CGV fanpage https://www.facebook.com/CJCGV with the Get Comment Facebook tool, version 1.1.3.3, from the iClick company. We then preprocessed the data and labeled the dataset.

The task is to determine sentiment from user comments, that is, to identify the corresponding sentiment for each comment sentence. The sentiment labels of interest in this thesis are positive, neutral, and negative.

The goal of our thesis is to implement a traditional machine learning approach that automatically analyzes user comments. The problem we address can be summarized as follows:

• Input: a user comment on a movie.

• Output: the sentiment of the sentence, one of 3 labels (positive, negative, or neutral).

• Input sentence: nội dung phim oke, diễn rất hayyyy

• Preprocessed sentence: nội dung phim ok, diễn rất hay

• Output: the sentence is positive
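The preprocessing step in this example (from “oke … hayyyy” to “ok … hay”) can be sketched as follows. This is only an illustration under stated assumptions: lowercasing, collapsing repeated letters to a single letter, and the small `SLANG` table are all hypothetical choices, since the thesis does not list its normalization rules at this point.

```python
import re
import unicodedata

# Hypothetical slang table; the actual replacement list is not given here.
SLANG = {"oke": "ok", "okie": "ok"}


def normalize_comment(text):
    """Lowercase a comment, collapse elongated letters, and map slang."""
    # NFC keeps each Vietnamese letter with its diacritics as one code point.
    text = unicodedata.normalize("NFC", text).lower()
    # Collapse runs of the same letter: "hayyyy" -> "hay".
    # Assumption: doubled letters are rare enough in Vietnamese words
    # that collapsing them to one letter loses little information.
    text = re.sub(r"([^\W\d_])\1+", r"\1", text)
    # Replace whole words found in the slang table, leaving punctuation as is.
    return re.sub(r"\w+", lambda m: SLANG.get(m.group(0), m.group(0)), text)


cleaned = normalize_comment("nội dung phim oke, diễn rất hayyyy")
```

Collapsing repeated letters would also alter legitimate words with doubled letters (for example English loanwords), so a production version would likely need an exception list.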


CHAPTER 3: DATASET CREATION

In this chapter, we present the process of constructing a dataset for determining sentiment from user comments in the movie domain.

Most current prediction systems are built on supervised learning algorithms [7]. A supervised learning algorithm needs a labeled sample dataset for its experiments, and the quality of that dataset determines the quality of the system. The construction of standard datasets is therefore of the utmost concern and takes up much of the time needed to build a system with supervised learning methods.

Figure 3.1, taken from the data science report [8], shows that 88% of the time spent building a data science system involves the data: 9% analyzing data, 19% collecting data, and 60% cleaning and reorganizing data.

What data scientists spend the most time doing:

• Building training sets: 3%

• Cleaning and organizing data: 60%

• Collecting data sets: 19%

• Mining data for patterns: 9%

• Refining algorithms: 4%

• Other: 5%

Figure 3.1 Percentage of time spent in the stages of building a data science system. Source: 2016 data science report [8]

3.1 Collecting and Preprocessing Data

3.1.1 Collecting Data

Our dataset is collected from user feedback about movies on the CGV Facebook page. We collect only the text of user comments; attached images and videos are not collected. Our choice of source is based on the following findings:

• According to CGV representatives, total box office revenue in Vietnam in 2018 was 3,252 billion VND (equivalent to 143.3 million USD), with 47.2 million cinema admissions, an average ticket price of 68.9 thousand VND (about 3.04 USD), and 275 movies released. This shows that the movie market in Vietnam is large and lively, and competition between movie theaters is correspondingly fierce. Businesses that understand user needs and improve from the user's point of view will gain a large share of a promising market.

Figure 3.2 The number of discussions about movie theaters generated on social media, from 01/01/2014 to 31/12/2014. Source: buzzmetrics.com

• Vietnam ranks seventh in the world with 70 million Facebook users, up 5% in the first quarter of the year and 16% compared to the same period of the previous year. As Figure 3.3 shows, users aged 18-34 form the largest group. They are the main customers of cinemas.

Figure 3.3 Facebook users in Vietnam, 2020. Source: NapoleonCat.com

• Based on this information, we chose movie comments on Facebook to create the dataset.

In this thesis, we use the Get Comment Facebook tool, version 1.1.3.3, to collect data from the CGV Cinemas Vietnam page (Figure 3.4).

Figure 3.4 The Get Comment Facebook version 1.1.3.3 tool

We then export the result to Excel, keep the “Nội Dung” (content) column, and delete the other columns. Figure 3.4 shows the file produced by the crawl, and Figure 3.5 shows the resulting Excel file.

Figure 3.5 The data after crawling, saved in an Excel file

We crawled 40,038 rows of comments from movie posts on the CGV fanpage. The raw data contains other languages, advertising links, icons, acronyms, and tagged names. The average sentence length is 8 words.

3.1.2 Preprocessing Data

Our dataset is built at the sentence level, but the raw user comments collected from Facebook are mostly paragraphs, and some passages add noise to the data. We therefore clean and organize the data to facilitate later labeling. Figure 3.1 shows the importance of this step: cleaning and organizing data accounts for 60% of the total time spent building a data science system. We carry out the preprocessing and cleaning through the following work.

First, since our dataset is in Vietnamese, all comments that are not in Vietnamese are removed. Comments written in Vietnamese without diacritics are also removed.

Second, user comments that contain only advertising content or links to other Facebook pages are removed.

Next, we use the UETSegmentation library to separate user comments into single sentences. However, because the data is collected from Facebook, many comments are not grammatically correct: punctuation is missing or used incorrectly. We therefore add and adjust punctuation in comment paragraphs before splitting them into single sentences:

• If a lowercase letter is immediately followed by a capital letter, we separate the words and add punctuation between them. For example, “chán lắmĐừng đixem” is processed into “chán lắm. Đừng đi xem”.

• If an icon or a lowercase word is followed by a word starting with a capital letter, we add a punctuation mark in between. For example, “tình yêu của em diễn xuất quá tuyệt [icon] Em yêu chị” is processed into “tình yêu của em diễn xuất quá tuyệt [icon]. Em yêu chị.”

• We remove tagged names from the sentence. For example, “Nga Nguyễn Vinh Nguyễn các cao nhân nào xem rồi giải thích giùm mình chi tiết ai đó quay video lúc chung cư bị cháy để làm gì vậy” is processed into “các cao nhân nào xem rồi giải thích giùm mình chi tiết ai đó quay video lúc chung cư bị cháy để làm gì vậy”.

• Finally, we remove duplicates from the dataset.
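Some of the cleaning rules above can be sketched in Python. The code below covers three of them: detecting link-only comments, inserting punctuation where a lowercase letter is glued to a capital letter, and removing duplicates. All function names and the URL pattern are hypothetical, and the sketch does not reproduce the thesis implementation; tag-name removal and language filtering are left out because they need external resources such as name lists.

```python
import re
import unicodedata

# Assumption: a comment is "link-only" if nothing remains after removing URLs.
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def is_link_only(text):
    """True if the comment contains links and nothing else."""
    stripped = text.strip()
    return stripped != "" and URL_PATTERN.sub("", stripped).strip() == ""


def split_glued_sentences(text):
    """Insert '. ' where a lowercase letter is directly followed by a capital."""
    text = unicodedata.normalize("NFC", text)
    pieces = []
    # Pair every character with its predecessor (a space for the first one).
    for previous, current in zip(" " + text, text):
        if previous.islower() and current.isupper():
            pieces.append(". ")
        pieces.append(current)
    return "".join(pieces)


def deduplicate(sentences):
    """Drop case-insensitive duplicates, keeping the first occurrence."""
    seen = set()
    unique = []
    for sentence in sentences:
        key = sentence.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(sentence.strip())
    return unique


fixed = split_glued_sentences("chán lắmĐừng đixem")
```

This rule only repairs the sentence boundary. The missing space in “đixem” from the thesis example needs a separate step, for instance a word segmenter, and names written in mixed case would be split incorrectly.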

After the data is processed and cleaned, we organize it in Excel format to facilitate later labeling. Figure 3.6 shows a data file for a comment segment containing the user's comment sentences.

Figure 3.6 The data after preprocessing, in an Excel file

After preprocessing, the dataset has 6,967 rows. Sentences shorter than 8 words make up 86.8% of the data, so the dataset is simple. As a result, the model will not handle long, complex sentences well.

3.2 Guidelines

• Positive: the sentence clearly or implicitly shows that the speaker's sentiment is positive, with keywords such as “thích thú, nổi, mặn, hấp dẫn, xuất sắc, dễ thương, cưng, đáng để xem, mong muốn xem lại, khen diễn viên, âm thanh, cảnh quay, bài hát, đạt giải thưởng, …”. It also covers sentences that suggest how a movie could be better, show a positive attitude or emotion, or tell other readers that this is a good movie to go and see.

Example:

o đáng xem đó mọi người ơi nhiều khúc cảm động muốn xỉu luôn á

• Neutral: the sentence neither clearly nor implicitly expresses the speaker's sentiment; it neither praises nor disparages, and the user only comments on the movie in general terms.

Example:

• Negative: the sentence clearly or implicitly shows that the speaker has a negative sentiment toward the movie: anger, disparagement, disdain, drowsiness, or expressions of criticism, judgment, or a negative attitude about the movie.

During the linguistic labeling process, we meet to discuss and update the labeling rules to fit the language, so that the labeling achieves high quality. Our labeling process is divided into two phases.

Phase 1: We label data to assess the consensus among the three members.

First, we divide the collected data into files of 200 to 300 comments each. We pick a file at random and give it to the three members, who label it independently of each other. Next, to evaluate the quality of the dataset, we gather the three independently labeled copies and calculate the consensus. If the consensus has not reached a certain threshold, we repeat the division and labeling to re-evaluate the consensus. The consensus assessment process is described in Figure 3.7.

Figure 3.7 Consensus assessment process for linguistic labeling. Flowchart steps: Excel file with 200-300 sentences; set labels; result of consensus; discuss to resolve conflicting sentences and update labeling rules.

During the labeling process, complex cases, comments not covered by the labeling rule set, and cases of disagreement between annotators are brought to a meeting, where we update the labeling rules and make the final decisions on the disagreements.

In phase 1, the consensus among members passed the threshold (80%) on the third labeling round. The consensus results among the three annotators are shown in Table 3.1. Once consensus was reached, we proceeded to phase 2 of the process.
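The consensus check described above can be illustrated with a short Python sketch. This section does not state which agreement measure is used, so the code assumes the simplest one, the average pairwise percent agreement between annotators, compared against the 80% threshold. The function names and the toy labels are made up for illustration; Table 3.1 may rely on a different measure, such as a kappa statistic.

```python
from itertools import combinations


def pairwise_agreement(annotations):
    """Average share (in percent) of sentences on which two annotators agree.

    `annotations` maps an annotator name to that annotator's labels,
    one label per sentence, with all lists in the same sentence order.
    """
    scores = []
    for first, second in combinations(annotations.values(), 2):
        if len(first) != len(second):
            raise ValueError("annotators must label the same sentences")
        matches = sum(a == b for a, b in zip(first, second))
        scores.append(100.0 * matches / len(first))
    return sum(scores) / len(scores)


def passes_threshold(annotations, threshold=80.0):
    """True if the average pairwise agreement reaches the threshold."""
    return pairwise_agreement(annotations) >= threshold


# Toy example: three annotators, five sentences, made-up labels.
toy = {
    "annotator_1": ["pos", "neu", "neu", "neg", "neu"],
    "annotator_2": ["pos", "neu", "neu", "neg", "pos"],
    "annotator_3": ["pos", "neu", "neg", "neg", "neu"],
}
agreement = pairwise_agreement(toy)
```

In this toy case the agreement falls below 80%, which in the process above would trigger a discussion, a rule update, and another labeling round.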


Title	Sentiment Analysis of User's Vietnamese Movie Reviews Using Support Vector Machine
Author	Nguyen Thi Nga
Advisor	Dr. Cao Thi Nhan
University	University of Information Technology
Major	Information Systems
Type	Thesis
Year of publication	2021
City	Ho Chi Minh City

Format
Pages	65
File size	16.78 MB