VIETNAM NATIONAL UNIVERSITY HCM CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS
NGUYEN THI NGA
SENTIMENT ANALYSIS OF USER’S VIETNAMESE MOVIE REVIEWS USING
SUPPORT VECTOR MACHINE
BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS
HO CHI MINH CITY, 2021
VIETNAM NATIONAL UNIVERSITY HCM CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS
NGUYEN THI NGA - 16520787
SENTIMENT ANALYSIS OF USER’S VIETNAMESE MOVIE REVIEWS USING
SUPPORT VECTOR MACHINE
BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS
THESIS ADVISOR
Dr. CAO THI NHAN
HO CHI MINH CITY, 2021
ASSESSMENT COMMITTEE
The Assessment Committee is established under the Decision ............ dated ............ by the Rector of the University of Information Technology.
............ - Chairman
............ - Secretary
............ - Member
............ - Member
ACKNOWLEDGMENTS
The graduation thesis has been a golden opportunity for me to test and apply the knowledge that I have gained during my time at the university, yet opportunities always come with difficulties. Thus, I would like to express my special thanks and gratitude to those who have helped me overcome this challenging period.
First and foremost, I am grateful to have had Dr. Cao Thi Nhan as my thesis advisor. Were it not for her being incredibly patient, inspiring, and knowledgeable, I probably would not have accomplished this thesis at this level.
Secondly, to Mr. Nguyen Minh Tri and Mr. Nguyen Van Vinh, with my highest gratitude and appreciation, I am so thankful for your support. Thank you for helping me keep my research on track. I would also like to thank Pham Thi Anh Minh and Nguyen Thi Thu Huong for accompanying me in the process of creating the dataset.
Last but not least, I would like to thank my parents for having always been there for me.
In the course of writing the thesis, I could not completely avoid shortcomings and limitations. Therefore, I look forward to receiving valuable feedback and suggestions from my teachers. Once again, thank you very much!
THESIS PROPOSAL
THESIS TITLE: Sentiment Analysis of User’s Vietnamese Movie Reviews using Support Vector Machine
Advisor: Dr. Cao Thi Nhan
Duration: Oct 5, 2020 to Dec 15, 2020
Student: Nguyen Thi Nga - 16520787
Contents:
1. Scope
• Dataset: Vietnamese movie reviews.
• Level: sentence.
• Algorithm: SVM, N-gram, TF-IDF.
• Programming language: Python.
2. Objectives
• Dataset: Vietnamese movie reviews.
• Data preprocessing algorithms, data mining.
3. Methodologies
• Survey: collect and read information from documents and textbooks related to data mining, machine learning, and sentiment analysis.
• Create the dataset.
• Use tools to crawl data from Facebook pages about movies: CGV Cinema, Lotte, etc.
• Research data preprocessing techniques.
• Research SVM models to use with this data.
• Evaluation: use standard techniques to evaluate the classification results.
4. Expected results
• A standard Vietnamese dataset of movie reviews.
• A successfully built demo.
• The SVM model is expected to achieve a result of more than 60%.
Timeline:
Phase 1 (5/9/2020 - 18/9/2020): Discuss and select the thesis topic, survey the current situation and related articles and theses, and write the thesis outline.
Phase 2 (19/9/2020 - 15/11/2020): Crawl the data, process the data, and find and apply solutions for abbreviations in the data.
Phase 3 (15/11/2020 - 8/12/2020): Install the environment and run the algorithm. Use evaluation techniques such as numpy-matrix.
Phase 4 (8/12/2020 - 25/12/2020): Edit, supplement, and complete the report and slides.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
THESIS PROPOSAL
TABLE OF CONTENTS
LIST OF TABLES
TABLE OF FIGURES
ABBREVIATIONS
ABSTRACT
CHAPTER 1: PROBLEM STATEMENT
1.4 Thesis Structure
CHAPTER 2: OVERVIEW
2.1 Sentiment Analysis
2.2 Research Related
2.3 Conclusion
CHAPTER 3: DATASET CREATION
3.1 Collecting and Preprocessing Data
CHAPTER 4: SENTIMENT ANALYSIS MODEL
4.1 Support Vector Machine
4.2 Vectorization
4.3 Feature N-Grams
CHAPTER 5: EXPERIMENT AND RESULTS
5.1 Preprocessing Data
5.2 Vectorization
5.3 Implement And Setting Parameter
5.3.1 Traditional Classification Model
5.3.2 Model evaluation: numpy-matrix
LIST OF TABLES
Table 3.1 Consensus among labeling members
Table 3.2 The number and proportion of label categories
Table 4.1 Feature unigrams, bigrams, trigrams for the sentence “phim vừa hay vừa cảm động phải dùng khăn giấy để lau luôn”
Table 5.1 The results for the sentiment analysis task, in percent
TABLE OF FIGURES
Diagram of the Sentiment Analysis system overview
The accuracy comparison diagram among features
The Homepage of the website
The Contact page of the website
The Demo page of the website
The recommend sentence feature
The demo results for the sentence “phimm hay qua”
The demo results for the sentence “công vinh đá đẹp ghê”
The demo results for the sentence “bộ phim không được hấp dẫn người ...
The demo results for the sentence “góc nhìn từ cửa sổ thật là đẹp”
The demo results for the sentence “phim cũng ổn áp! anh nhé?”
The demo results for the sentence “phim có vẻ không oke tí nào”
The demo results for the sentence “diễn viên chính xấu ghê mà diễn cũng ...
The demo results for the sentence “trời lạnh buồn ngủ quá”
ABSTRACT
In this thesis, we focus on building a standard sentence-level Vietnamese corpus of user comments in the movie domain. The corpus consists of sentences labeled for the sentiment analysis task. We also provide a set of labeling guidelines to serve community research and language development.
The work was planned as follows:
• Create a dataset of movie reviews.
• Create the guidelines for the dataset.
• Research and choose a model to solve the sentiment analysis task on the movie dataset.
• Evaluate the model.
• Build a website to predict the sentiment of sentences.
In addition, we experiment with the Support Vector Machine (SVM) model on the dataset we built. On the sentiment detection task, the SVM model achieved 92.23% with 2-grams.
However, this thesis has several limitations:
• Some keywords still confuse the model, such as “không được, khá, gu, có vẻ”.
• The dataset is unevenly distributed between labels, which influences the test results. This limitation will have a great impact when bringing the application into practice.
• The dataset has 5,194 rows for the sentiment analysis task: 544 positive sentences, 4,326 neutral sentences, and 324 negative sentences. This large difference between the labels causes class imbalance and affects the results during later testing. The model is also likely overfitted, because the test dataset is very similar to the training dataset.
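To make the imbalance concrete, the following minimal sketch computes the share of each label from the counts reported above, together with the accuracy of a hypothetical baseline that always predicts the most frequent label. The baseline is an illustration added here and is not part of the thesis experiments.

```python
# Label counts as reported in the abstract.
label_counts = {"positive": 544, "neutral": 4326, "negative": 324}

total = sum(label_counts.values())

# Share of each label in the dataset, in percent.
shares = {label: 100 * count / total for label, count in label_counts.items()}

# A classifier that always predicts the most frequent label
# (a majority-class baseline) reaches this accuracy on data
# with the same label distribution.
majority_label = max(label_counts, key=label_counts.get)
majority_baseline = shares[majority_label]

for label, share in shares.items():
    print(f"{label:>8}: {label_counts[label]:>5} sentences ({share:.2f}%)")
print(f"majority baseline ({majority_label}): {majority_baseline:.2f}%")
```

Because the neutral label dominates, any reported score is easier to interpret when it is read against this baseline and, where possible, against per-label scores.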
CHAPTER 1: PROBLEM STATEMENT
1.1 Introduction
In recent years, with the strong and rapid growth of the Internet, the need to consult the feedback of previous customers has grown along with young people's increasing demand for entertainment. Therefore, websites are now being developed to allow users to share experiences, reviews, comments, and feedback on different types of services and on movies from cinemas. When users decide to see a certain movie, they not only consider information about the actors, trailer, and director, but also tend to be interested in the feedback of other users.
By reading the reviews and feedback of other users, customers can choose a more suitable and reliable movie. Along with that, businesses, services, and organizations also collect feedback from users about their movies to set better directions. However, with a large amount of feedback from users about movies, it is difficult for users, businesses, and organizations to follow all of it. To solve this problem, they need a system that can automatically analyze all reviews and summarize the feedback so that customers and businesses can consult it and make quick decisions.
Currently, systems that analyze user feedback on websites usually consider only the scores that users give to products and services. However, these rating scales do not objectively express the level of user satisfaction conveyed in the sentences and paragraphs of the comments.
The sentiment analysis problem, particularly sentiment analysis in the movie domain, is very attractive to the research community both worldwide and in Vietnam. Most corpora and algorithms are built and tested in languages such as English and Chinese. For Vietnamese, however, not many corpora have been built to serve the research community. Therefore, we decided to build a standard sentence-level dataset for Vietnamese to serve this problem and to implement a system using SVM to automatically analyze comments.
A number of systems have been built to analyze user comments. However, there is no system that analyzes comments about movies in the Vietnamese language.
1.2 Objectives and Scope
1.2.1 Objectives
The main object of study in this thesis is user reviews. These reviews are extracted from user feedback about movies on the CGV Vietnam fanpage. This is the basis for building and developing the dataset for the problem addressed in this thesis.
In this thesis, we focus on researching and implementing a traditional machine learning model, the Support Vector Machine, together with n-gram features to solve the sentiment analysis problem on the movie dataset.
1.2.2 Scope
The scope of this thesis is user reviews about movies on the CGV Facebook page.
The thesis works at the sentence level.
We use three labels for sentiment analysis: negative, positive, and neutral.
1.2.3 Goals
The scope of this thesis is user reviews about movies on the CGV Facebook page.
• The thesis works at the sentence level.
• The algorithms used are SVM and n-grams.
In this thesis, we pursue four main goals:
• Establish a standard for building a sentence-level Vietnamese corpus in the movie domain for the sentiment analysis task.
• Build the movie dataset at the sentence level.
• Preprocess and mine the data in the movie domain.
• Implement, test, and compare different approaches to the problem based on the traditional SVM machine learning model and n-gram feature extraction.
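As an illustration of the n-gram feature extraction named in the goals above, the sketch below lists the unigrams, bigrams, and trigrams of a short review. It assumes simple whitespace tokenization, which splits Vietnamese text into syllables rather than words; the tokenization actually used in the thesis may differ.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace tokenization is an assumption made for this sketch.
tokens = "phim hay quá".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

for name, grams in [("1-grams", unigrams), ("2-grams", bigrams), ("3-grams", trigrams)]:
    print(name, [" ".join(g) for g in grams])
```

Bigrams such as "không hay" keep a negation attached to the word it modifies, which is one plausible reason why 2-gram features can help on short reviews.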
1.3 Results
From the research in this thesis, I have achieved the following results:
• Built a standard Vietnamese dataset with 5,194 sentences labeled for the sentiment analysis (SA) task, of which 324 sentences are negative, 544 are positive, and 4,326 are neutral. We then divide the data into a training dataset and a test dataset with a ratio of 70-30 for research purposes.
• Built a guideline for the labeling process.
• The expected result for the SVM model was more than 60%. We achieved 92.23% with the SVM method and 2-grams.
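The 70-30 split can be sketched as follows. This version is stratified, meaning each label is split separately so that both parts keep the same label proportions; whether the thesis split was stratified is not stated, so this is an assumption, and the sentences here are placeholders that only reproduce the reported label counts.

```python
import random

def stratified_split(samples, train_ratio=0.7, seed=42):
    """Split (sentence, label) pairs so each label keeps the same ratio."""
    by_label = {}
    for sentence, label in samples:
        by_label.setdefault(label, []).append((sentence, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        cut = round(len(items) * train_ratio)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test

# Placeholder sentences; only the label counts come from the thesis.
counts = {"positive": 544, "neutral": 4326, "negative": 324}
samples = [(f"{label}-{i}", label) for label, n in counts.items() for i in range(n)]

train, test = stratified_split(samples)
print(len(train), len(test))
```

With labels as unbalanced as these, a purely random split could leave very few negative sentences in the test part, which is why a per-label split is shown.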
1.4 Thesis Structure
This thesis is divided into 6 chapters as follows:
• Chapter 1: Introduction. This chapter presents the reasons for choosing the topic, our objectives, the scope, and our contributions, together with the problem statement.
• Chapter 2: Overview. Introduces the concept of sentiment analysis on the movie dataset, analyzes the research directions that have been pursued in Vietnam and abroad related to this problem, and presents the problems researched and implemented in this thesis.
• Chapter 3: Create Dataset. Describes the data labeling process, from gathering information and building labeling rules to the stages of the labeling process, and analyzes the dataset.
• Chapter 4: Sentiment Analysis Model. This chapter presents the approach to the sentiment analysis problem on the movie dataset, then presents the theoretical basis of the methods used in the experiments on the dataset with the corresponding features.
• Chapter 5: Experiment and Result. This chapter presents the experimental setup, the parameter tables, and an analysis of the results across trials.
• Chapter 6: Conclusion and Future Work. In the last chapter, we summarize and conclude our work and propose options for its future development.
CHAPTER 2: OVERVIEW
... made public on the Internet. This might explain why sentiment analysis and opinion mining are often used as synonyms, although we think it is more accurate to view sentiments as emotionally loaded opinions [1].
We have seen a massive increase in the number of papers focusing on sentiment analysis and opinion mining in recent years. Figure 2.1 shows the increase in searches made with the search string “sentiment analysis” in the Google search engine.
Figure 2.1 Google Trends data showing the relative popularity of the search strings “sentiment analysis” and “customer feedback”. Source: www.google.com/trends
In order to enrich Vietnamese sentiment analysis studies, we decided to address sentence-level sentiment analysis for movies in the Vietnamese language. To this end, we create a dataset drawn from the CGV Vietnam fanpage.
For example, given the user response "chị diễn viên chính diễn quá đỉnh", the sentiment analysis system for the movie dataset returns a positive label for the review.
2.1 Sentiment Analysis
Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews and survey responses, online and social media, and healthcare materials, for applications that range from marketing to customer service to clinical medicine.
The purpose of sentiment analysis is to identify, analyze, and evaluate opinions, emotions, attitudes, etc. in people's reviews of products, services, teams, organizations, individuals, and events. Research on Sentiment Analysis and Opinion Mining dates back to the 2000s [2]. Since then, this problem has become a research topic of great interest to scientists.
There are two main reasons for this interest. First, the problem is applied in many practical areas of life. Second, the development of social networks provides a huge amount of data for research on it. The study of opinion analysis not only has a significant impact on natural language processing, but can also have a profound impact on governance, politics, economics, and the social sciences, as they are all influenced by human opinion.
Currently, opinion analysis is studied at three different levels, including the sentence level and the text (document) level [3]. Recent research mainly focuses on determining whether feedback, expressed directly or implied, is positive, negative, or neutral.
2.2 Research Related
From the 2000s up to now, sentiment analysis and opinion analysis have been attracting researchers’ attention and have been developed and put into practice. The concept of sentiment analysis appeared for the first time in the work of Nasukawa and Yi [4]. The concept of opinion mining appeared for the first time in the work of Dave, Lawrence, and Pennock [5]. However, the study that is considered the first to lay the foundation for opinion analysis is that of Pang et al. [2]. Since then, research on this problem has attracted increasing interest and development.
The work [2] studies opinion analysis of user feedback in the movie domain with two classes of interest, positive and negative. Three machine learning methods (Naive Bayes, maximum entropy classification, and support vector machine) were used to classify opinions in this research.
In addition to research worldwide, the sentiment analysis problem has also attracted the domestic research community in various data domains such as restaurants, hotels, electronics, and education. As far as we know, the first research on opinion analysis in Vietnamese was done by Kieu & Pham [6] at the sentence level; they built a rule-based system using the GATE platform and conducted experiments on the computer data domain, reaching an F1 of 62.84%.
From domestic and international studies related to sentiment analysis, we can see that the need for datasets to support research on this problem is huge. Worldwide, many datasets have been built and used to solve many problems.
However, there is a lack of Vietnamese data in the movie domain, so we decided to build a standard Vietnamese dataset about movies.
Realizing the importance of and the essential need for this problem, we build a medium-sized, sentence-level dataset for the movie domain to serve and motivate research as well as the development of processing and testing methods for this problem. Our dataset is built for identifying sentiment, and it is built at the sentence level rather than the text (document) level.
2.3 Conclusion
Sentiment analysis is one of the fastest growing research areas in computer science, making it challenging to keep track of all the activities in the area. We find that the roots of sentiment analysis are in the studies on public opinion analysis at the beginning of the 20th century and in the text subjectivity analysis performed by the computational linguistics community in the 1990s. In recent years, sentiment analysis has shifted from analyzing online product reviews to social media texts from Twitter and Facebook. Many topics beyond product reviews, such as stock markets, elections, hotels, movies, medicine, and software engineering, extend the utilization of sentiment analysis.
However, there is not much research on sentiment analysis for Vietnamese: many domains have not been explored yet, and the datasets for the Vietnamese language are still limited or do not keep up with world trends on social networks.
So we decided to address the sentiment analysis problem in the movie domain at the sentence level in the Vietnamese language.
The Vietnamese movie review dataset was created by us and is the dataset used throughout the experiments. User comments are crawled automatically from the Vietnamese CGV fanpage https://www.facebook.com/CJCGV with the Get Comment Facebook tool, version 1.1.3.3, from the iClick company. We then preprocess the data and label the dataset.
The task is to determine the sentiment of users' comments: identifying the corresponding sentiment for each comment sentence. The sentiment labels of interest in this thesis are positive, neutral, and negative.
The goal of our thesis is to implement a traditional machine learning approach that can automatically analyze user comments. An overview of the problem we address is as follows:
• Input: User comments on movies.
• Output: The sentiment of the sentence, as one of 3 labels (positive, negative, and neutral).
• Input sentence: nội dung phim oke, diễn rất hayyyy
• Preprocessed data: nội dung phim ok, diễn rất hay
• Output: The sentence is positive
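The preprocessing step in this example can be sketched as below. The sketch collapses stretched letters and maps a few informal spellings to a standard form; the slang map is hypothetical and only covers the example above, since the thesis does not list its normalization rules at this point.

```python
import re
import unicodedata

# Hypothetical slang map for this sketch; the thesis does not list its rules here.
SLANG = {"oke": "ok", "okie": "ok"}

def normalize(comment):
    """Lowercase, collapse stretched letters, and map slang to a standard form."""
    # NFC keeps each accented Vietnamese letter as a single code point.
    text = unicodedata.normalize("NFC", comment).lower()
    # Collapse any run of the same letter ("hayyyy" -> "hay").
    text = re.sub(r"([^\W\d_])\1+", r"\1", text)
    # Replace whole words that appear in the slang map.
    return re.sub(r"\w+", lambda m: SLANG.get(m.group(0), m.group(0)), text)

print(normalize("nội dung phim oke, diễn rất hayyyy"))
```

Collapsing every repeated letter is a deliberately simple rule: it would also shorten the few Vietnamese words that legitimately contain a doubled letter, so a production version would need an exception list.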
CHAPTER 3: DATASET CREATION
In this chapter, we present the process of constructing a dataset for the task of determining sentiment from users' comments in the movie domain.
Most current prediction systems are built with supervised learning algorithms [7]. A supervised learning algorithm needs a labeled sample dataset for experiments, and this dataset determines the quality of the system. Therefore, the construction of standard datasets is currently of the utmost concern and takes up a lot of time in the process of building a system using supervised learning methods.
Figure 3.1, extracted from the data science report [8], shows that almost all of the time spent building a data science system involves the data: 88% of the time (9% analyzing data, 19% collecting data, and 60% cleaning and reorganizing data).
What data scientists spend the most time doing:
• Building training sets: 3%
• Cleaning and organizing data: 60%
• Collecting data sets: 19%
• Mining data for patterns: 9%
• Refining algorithms: 4%
• Other: 5%
Figure 3.1 Percentage of time spent in the stages of building a data science system. Source: 2016 data science report [8]
3.1 Collecting and Preprocessing Data
3.1.1 Collecting Data
Our dataset is collected from user feedback about movies on the CGV Facebook page. We collect only the text of user comments; attached images and videos are not collected. Our choice of data source is based on the following:
• According to CGV representatives, the total box office revenue in Vietnam in 2018 was 3,252 billion VND (equivalent to 143.3 million USD), the number of people coming to the theater was 47.2 million, the average ticket price was 68.9 thousand VND per ticket (about 3.04 USD), and 275 movies were released. From this information, we can see that the movie market in Vietnam is large and bustling. In such a market, the competition between movie theaters is enormous. If businesses understand users' needs and improve from the user's point of view, they will gain a large share of a promising market.
Figure 3.2 The number of discussions about movie theaters created on social media, from 01/01/2014 to 31/12/2014. Source: buzzmetrics.com
• Worldwide, Vietnam ranks seventh with 70 million users, up 5% in the first quarter of the year and 16% higher than in the same period last year. According to Figure 3.2, users aged 18-34 form the largest group. They are the main customers of the cinemas.
Figure 3.3 Facebook users in Vietnam, 2020. Source: NapoleonCat.com
• Based on this information, we choose movie comments on Facebook to create the dataset.
In this thesis, we use the Get Comment Facebook tool, version 1.1.3.3, to collect data from CGV Cinemas Vietnam (Figure 3.3).
[Screenshot: the Get Comment Facebook version 1.1.3.3 tool with crawled comments]
Then, we export the data to Excel, keep the “Nội Dung” (content) column, and delete the other columns. Figure 3.4 shows the file produced after the crawl, and Figure 3.5 shows the resulting Excel file.
Figure 3.5 The data after crawling, saved in an Excel file
We crawled 40,038 rows of comments from the movie links on the CGV fanpage. The data contains other languages, advertising links, icons, acronyms, tagged names, etc. The average sentence length is 8 words.
3.1.2 Preprocessing Data
Our dataset is built at the sentence level, but the raw user comments collected from Facebook are mostly at the paragraph level, and some passages add noise to the data. We therefore clean and organize the data to facilitate later labeling. As Figure 3.1 shows, cleaning and reorganizing raw data accounts for 60% of the overall time spent building a data science system, which illustrates the importance of data preprocessing. We carry out preprocessing and data cleaning through the following steps:
First, since our dataset is for Vietnamese, all comments that are not in Vietnamese are removed. User comments written in Vietnamese without diacritics are also removed.
Second, user comments that contain only advertising content or links to other Facebook pages are also removed.
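These two filters can be approximated with the short sketch below. It is a rough heuristic written for illustration, not the procedure used in the thesis: it treats any comment without diacritics as not usable, drops comments that are nothing but a link, and does not attempt to detect advertising text or languages other than Vietnamese that also use diacritics.

```python
import re
import unicodedata

URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)

def has_vietnamese_diacritics(text):
    """True if any letter carries a combining mark or is the letter đ/Đ."""
    decomposed = unicodedata.normalize("NFD", text)
    has_mark = any(unicodedata.combining(ch) for ch in decomposed)
    return has_mark or "đ" in text.lower()

def keep_comment(text):
    """Keep a comment only if it has diacritics and is more than a link."""
    if not has_vietnamese_diacritics(text):
        return False
    without_links = URL_PATTERN.sub("", text).strip()
    return bool(without_links)

comments = [
    "phim hay quá",                    # Vietnamese with diacritics: kept
    "phim hay qua",                    # Vietnamese without diacritics: removed
    "https://www.facebook.com/CJCGV",  # link only: removed
    "good movie",                      # not Vietnamese: removed
]
kept = [c for c in comments if keep_comment(c)]
print(kept)
```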
Next, we use the UETSegmentation library to separate user comments into single sentences. However, because the data is collected from Facebook, many user comments are not grammatically correct, for example missing or incorrect punctuation. So we add and adjust the punctuation of comment paragraphs before splitting them into single sentences.
• If a lowercase letter is immediately followed by a capital letter in the comment, we separate them and add punctuation between them. For example, “chán lắmĐừng đixem” is processed into “chán lắm. Đừng đi xem”.
• If an icon or a lowercase word in the comment is followed by a capital letter, we add a punctuation mark between them. For example, “tình yêu của em diễn xuất quá tuyệt em yêu chị.” is processed into “tình yêu của em diễn xuất quá tuyệt [icon] Em yêu chị.”
• We remove tagged names from the sentence. For example, “Nga Nguyễn Vinh Nguyễn các cao nhân nào xem rồi giải thích giùm mình chi tiết ai đó quay video lúc chung cư bị cháy để làm gì vậy” is processed into “các cao nhân nào xem rồi giải thích giùm mình chi tiết ai đó quay video lúc chung cư bị cháy để làm gì vậy”.
• Finally, we remove duplicates from the dataset.
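Three of the rules above (splitting at a lowercase-to-capital boundary, removing tagged names, and removing duplicates) can be sketched as follows. The function names and the list of tagged names are hypothetical; in practice the names would have to come from the crawled data, and sentence splitting itself is left to the segmentation library.

```python
import unicodedata

def split_fused_sentences(text):
    """Insert '. ' wherever a lowercase letter is directly followed by a capital."""
    text = unicodedata.normalize("NFC", text)
    pieces = []
    # Pair every character with the one before it (a space for the first).
    for prev, ch in zip(" " + text, text):
        if prev.islower() and ch.isupper():
            pieces.append(". ")
        pieces.append(ch)
    return "".join(pieces)

def remove_tagged_names(sentence, names):
    """Remove known tagged names, longest first, and tidy the spacing."""
    for name in sorted(names, key=len, reverse=True):
        sentence = sentence.replace(name, " ")
    return " ".join(sentence.split())

def drop_duplicates(sentences):
    """Remove exact duplicates while keeping the first occurrence order."""
    return list(dict.fromkeys(sentences))

print(split_fused_sentences("chán lắmĐừng đi xem"))
print(remove_tagged_names("Nga Nguyễn Vinh Nguyễn các cao nhân nào xem rồi",
                          ["Nga Nguyễn", "Vinh Nguyễn"]))
print(drop_duplicates(["phim hay", "phim dở", "phim hay"]))
```

The boundary rule is intentionally naive: it would also split mixed-case names or brand names, so its output still needs a manual check.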
After the data is processed and cleaned, we organize it in Excel format to facilitate later labeling. Figure 3.6 shows the data file for a comment segment containing the user's comment sentences.
Figure 3.6 The data after preprocessing, in an Excel file
After preprocessing, the dataset has 6,967 rows. Sentences shorter than 8 words make up 86.8% of the dataset, so the dataset is simple, and a model trained on it will not handle long, complex sentences well.
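A statistic of this kind can be computed with a few lines. The sketch below counts whitespace-separated tokens, which is an assumption about how sentence length was measured, and the sample sentences are only placeholders.

```python
def share_shorter_than(sentences, max_words=8):
    """Percentage of sentences with fewer than max_words whitespace tokens."""
    short = sum(1 for s in sentences if len(s.split()) < max_words)
    return 100 * short / len(sentences)

# Placeholder sentences; the real input would be the 6,967 preprocessed rows.
sentences = [
    "phim hay quá",
    "anh đi xem phim không",
    "trời lạnh buồn ngủ quá",
    "các cao nhân nào xem rồi giải thích giùm mình chi tiết",
]
print(f"{share_shorter_than(sentences):.1f}% of sentences have fewer than 8 words")
```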
3.2 Guidelines
• Positive: a sentence that clearly or implicitly shows that the speaker's sentiment is positive, with keywords such as “thích thú, nổi, mặn, hấp dẫn, xuất sắc, dễ thương, cưng, đáng để xem, mong muốn xem lại, khen diễn viên, âm thanh, cảnh quay, bài hát, đạt giải thưởng, ...”. It also covers sentences that suggest the movie is good, show a positive attitude or emotion, or tell others that this is a good movie to go and see.
Example:
o chân thực đến từng thước phim diễn xuất quá xuất sắc
o công nhận cách quay phim quá đỉnh luôn
o đáng xem đó mọi người ơi nhiều khúc cảm động muốn xỉu luôn á
• Neutral: a sentence that does not clearly or implicitly express the speaker's sentiment, neither praising nor disparaging; the user is only commenting generally on the movie. This label also covers:
o Question sentences.
o Sentences that are not related to the movie.
Example:
o ai ở bình dương làm quen đi coi chung không
o anh xem nội dung có hay không
o bán phụ tùng còn đóng phim tiền đâu để hết nữa ông chủ
• Negative: a sentence that clearly or implicitly suggests that the speaker has a negative sentiment towards the movie: angry, disparaging, disdainful, or drowsy, or expressing criticism, judgment, or a negative attitude about the movie.
Example:
o bỏ đi mà làm người phim không hay
o tên phim quê mùa
o cả cái đoàn làm phim không nghĩ ra được cái tên phim nào hay
... linguistic labeling process, we meet to discuss and update the labeling rules to match the language so that the labeling can achieve high results. Our labeling process is divided into two phases.
Phase 1: We label data to assess the consensus among the three members:
First, we divide the collected data into files that each contain 200 to 300 comments. We take a random file and give it to the three members, who label it independently of each other. Next, to evaluate the quality of the dataset, we gather the three independently labeled copies of the file and calculate the consensus. If the consensus among the three members has not reached a certain threshold, we restart the division and labeling to evaluate the consensus again. The consensus assessment process is described in Figure 3.7.
[Flowchart with the steps: Excel file with 200-300 sentences; set labels; discuss to resolve conflicting sentences and update labeling rules; the result of consensus]
Figure 3.7 Consensus assessment process for linguistic labeling
During the labeling process, complex cases, comments not covered by the labeling rule set, and cases of disagreement between the labeling members are brought to a meeting, where we update the labeling rules and make the final decisions on disagreements.
In phase 1, the consensus between members passed the threshold (80%) on the third labeling round; the consensus results between the three labeling members are shown in Table 3.1. Once the consensus is in place, we proceed to phase 2 of the process.
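The text does not state which consensus measure was used, so the sketch below shows two simple candidates as an assumption: the share of sentences on which all three members agree, and the percent agreement for each pair of members. The labels are made up for illustration.

```python
from itertools import combinations

def full_agreement(labels_by_annotator):
    """Percentage of sentences on which all annotators chose the same label."""
    rows = list(zip(*labels_by_annotator))
    same = sum(1 for row in rows if len(set(row)) == 1)
    return 100 * same / len(rows)

def pairwise_agreement(labels_by_annotator):
    """Percent agreement for each pair of annotators, keyed by index pair."""
    result = {}
    for (i, a), (j, b) in combinations(enumerate(labels_by_annotator), 2):
        matches = sum(1 for x, y in zip(a, b) if x == y)
        result[(i, j)] = 100 * matches / len(a)
    return result

# Made-up labels for five sentences from three annotators.
annotations = [
    ["pos", "neu", "neu", "neg", "neu"],
    ["pos", "neu", "neu", "neg", "pos"],
    ["pos", "neu", "neg", "neg", "neu"],
]
THRESHOLD = 80  # percent, the threshold mentioned in the text

pairs = pairwise_agreement(annotations)
print("all three agree:", full_agreement(annotations))
print("pairwise:", pairs)
print("every pair passes:", all(score >= THRESHOLD for score in pairs.values()))
```

Raw percent agreement does not correct for agreement that happens by chance, which matters when one label dominates; a chance-corrected coefficient such as Fleiss' kappa is a common alternative.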