
Graduation thesis: Applying Text Mining in Restaurant Recommendation Based on Customer Reviews


DOCUMENT INFORMATION

Title: Applying Text Mining in Restaurant Recommendation Based on Customer Reviews
Authors: Trinh Thi Thu Ha, Nguyen Minh Quan
Advisor: Dr. Cao Thi Nhan
University: University of Information Technology
Major: Information Systems
Type: Graduation Project
Year: 2020
City: Ho Chi Minh City
Pages: 113

VIETNAM NATIONAL UNIVERSITY HOCHIMINH CITY

UNIVERSITY OF INFORMATION TECHNOLOGY

ADVANCED PROGRAM IN INFORMATION SYSTEMS

TRINH THI THU HA - 16520323

NGUYEN MINH QUAN - 16521574

BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS

THESIS ADVISOR

Dr CAO THI NHAN

HO CHI MINH CITY, 2020


ASSESSMENT COMMITTEE

The Assessment Committee is established under Decision ________ by the Rector of the University of Information Technology.

________ - Chairman

________ - Secretary

________ - Member

________ - Member

VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

Ho Chi Minh City, ________ 2021

ADVISOR'S ASSESSMENT OF THE GRADUATION THESIS

Thesis title:

APPLYING TEXT MINING IN RESTAURANT RECOMMENDATION BASED ON CUSTOMER REVIEWS

Students:

Trịnh Thị Thu Hà - 16520323

Nguyễn Minh Quân - 16521574

Advisor: Dr. Cao Thị Nhạn

Thesis evaluation

1. On the report:

Number of pages: ____ Number of chapters: ____

Number of data tables: ____ Number of figures: ____

Number of references: ____ Product: ____

Comments on the form of the report:

2. On the research content:

Score for each student:

Trịnh Thị Thu Hà: ____/10

Nguyễn Minh Quân: ____/10

Reviewer

(Signature and full name)

First of all, we would like to express our sincere thanks to the lecturers of the University of Information Technology, especially the members of the Faculty of Information Systems, who shared their knowledge with us enthusiastically during our time at the university. The knowledge they imparted to us is an important foundation that helped us complete this topic.

In particular, we would like to express our special appreciation and deep gratitude to Dr. Cao Thi Nhan for her enthusiastic guidance and for always making it possible for us to complete this topic. Her sincere words of encouragement and her suggestions were a valuable motivation for us to gain useful knowledge and to overcome difficulties in learning and in implementing the thesis.

During the implementation of the topic, we tried to apply our background knowledge and to research and learn new technologies to build this graduation thesis. However, due to our limited knowledge and experience, shortcomings are difficult to avoid. Therefore, we hope to receive comments from our teachers so that we can improve the necessary knowledge and skills.

Thank you so much!

Authors

THESIS TITLE: APPLYING TEXT MINING IN RESTAURANT RECOMMENDATION

BASED ON CUSTOMER REVIEWS

Advisor: Dr Cao Thi Nhan

Duration: August 15th, 2020 to December 31st, 2020

Students:

1. Trinh Thi Thu Ha - 16520323

2. Nguyen Minh Quan - 16521574

Contents:

1. Descriptions

A real-time system that enables users to find dining options around a desired travel destination. The project tackles problems that many tourists face on a daily basis, such as wasting time planning what to eat and struggling to find suitable restaurants. Travel websites, blogs, and reviews lack simplicity because of the large amount of data they hold, which makes such decisions time-consuming.

2. Scope

- Dataset of tourists' reviews
- Learn how to use sentiment analysis to build a restaurant sentiment scoring subsystem, and how to use machine learning to extract keywords
- Learn how to build a recommendation system that generates an optimal automated day plan from the available data
- Build a web application to analyze data and display results
- RAKE algorithm: weighted text cloud by frequency
- KeyGraph algorithm: interconnected text cloud by relationship
- Random Forests: clustered text cloud by similarity
- Recommender system: recommends suitable restaurants (and also outputs the overall satisfaction score)

Expected results

- Understand the fundamental algorithms and methodologies used in recommendation systems, sentiment analysis, and keyword extraction.
- Successfully build a sentiment scoring subsystem that scores each entity (restaurants and touristic events) and keeps the stored scores up to date.
- Form a text cloud from each entity's keywords, which are generated by a machine learning approach.
- The system successfully generates suitable restaurants by using the stored sentiment scores and the constraints defined by the user.
- Successfully build the web application.

Each entity's keywords are generated by the machine learning approach. Later, all of these keywords are gathered to form a text cloud (a text cloud may take several forms: weighted by frequency, clustered by similarity, or interconnected by relationship).

Phase 3 (23/09/2020 to 23/10/2020): Making a recommendation system

The recommendation system will generate the suitable restaurants, with restaurants and touristic events spanning the day's timeline.

Phase 4 (24/10/2020 to 31/12/2020): Develop a web app to interact with the user

Build a web app for the user to enter data and to display results.

Approved by the advisor(s)

Signature(s) of advisor(s)

Dr Cao Thi Nhan

Ho Chi Minh city, / /2020

Signature(s) of student(s)

Trinh Thi Thu Ha Nguyen Minh Quan


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ACRONYMS AND ABBREVIATIONS
Chapter 1 INTRODUCTION
1.2 Purpose
1.3 Objectives and scope
1.3.1 Objectives
1.3.2 Scope
2.2.2 Text Mining techniques for recommender system
2.2.3 Methods and Techniques
2.5.3 Confusion Matrix
Chapter 3 SYSTEM ANALYSIS AND DESIGN
3.2 Data collection
3.3 Restaurant sentiment scoring
3.3.1 Data [illegible]
3.6.4 The result of recommender system
3.7.1 Use case diagram
3.7.2 Activity diagram
3.7.4 Sequence diagram
Chapter 4 SYSTEM IMPLEMENTATION
4.1 Overview of the system implementation
4.2 Data collection service
4.3 Restaurant sentiment scoring service
4.4 Entity’s keyword extraction and word-cloud visualization
Chapter 5 CONCLUSIONS AND FUTURE WORKS

LIST OF TABLES

Matrix of word co-occurrences
The summary of collected data
The short sample list of scored restaurants
The distribution of restaurant by review quantity
The distribution of restaurant that scored
Use case descriptions
Illustrate the meaning of data fields in table reviewitem

LIST OF FIGURES

Figure 2.1 Classification model
Figure 2.2 General framework: using text mining in recommender systems
Figure 2.3 Support vectors delimiting the widest margin between classes
Figure 2.4 Random Forest Algorithm in predicting
Figure 2.6 Precision and Recall calculation
Figure 2.7 Confusion Matrix
Figure 3.2 Illustrate the process for data collection
Figure 3.4 Illustrate an overview of restaurant in TripAdvisor
Figure 3.5 Illustrate an overview of restaurant in TripAdvisor
Figure 3.7 Illustrate the running time in second of the trained models
Figure 3.8 Illustrate the metrics result of each trained classification models
Figure 3.9 Illustrate the word-cloud for extracted keywords
Figure 3.11 Recommend restaurant activity diagram of restaurant recommender
Figure 3.12 View restaurant information activity diagram of restaurant recommender
Figure 3.13 Illustrate class diagram of restaurant recommender
Figure 3.14 Recommend restaurant sequence diagram of restaurant recommender
Figure 3.15 View restaurant information sequence diagram of restaurant recommender
Figure 3.16 Illustrate the database of restaurant recommender
Figure 4.1 Illustrate the table restinfor in MySQL database
Figure 4.2 Illustrate the table reviewitem in MySQL database
Figure 4.3 Illustrate the quote displaying in recommended restaurants
Figure 4.4 Illustrate the word-cloud of restaurant descriptive tags
Figure 4.5 Illustrate the inputting section in Homepage
Figure 4.6 Illustrate the recommended restaurant displaying in the Homepage
Figure 4.7 Illustrate the recommended restaurant displaying after changing the review
Figure 4.8 Illustrate the restaurant information 1
Figure 4.9 Illustrate the restaurant information 2
Figure 4.10 Illustrate the restaurant information 3
Figure 4.12 Illustrate the web page of restaurant on TripAdvisor

LIST OF ACRONYMS AND ABBREVIATIONS

No Acronyms Meaning

1 IE Information Extraction

2 NB Naive Bayes

3 NLP Natural Language Processing

4 RAKE Rapid Automatic Keyword Extraction

5 SVM Support vector machine

6 TF-IDF Term Frequency - Inverse Document Frequency

7 UI User Interface


Chapter 1 INTRODUCTION

1.1 Context

The rapid spread of the Internet has provided people with a new way of getting information. It has become one of the largest sources of information, and people can do more searches on the Web than ever before. Social networking has also attracted the attention of Internet users. Social media can take many different forms, one of which is service or product review websites. These sites provide a platform for consumers to share their experiences and opinions about the products or services they have purchased and used, hence providing other consumers with information about the pros and cons of these products or services.

With so many restaurants to choose from, how can you find the best one when you are looking for a new place to eat? Traditionally, we can ask someone who has been there. If you do not have someone to ask, you can now always turn to online reviews. Customers consider many factors when deciding where to eat: not just how good the food tastes, but also how good the service is, how polite the staff are, how pleasant the atmosphere is, and how reasonable the price is. Consumers place less trust in advertising and tend to turn to reviews to find out what dining at a restaurant is really like. People now look at customer reviews first when deciding where to eat.

As a result, online reviews today have the power to connect potential consumers directly with a restaurant even before they visit. Furthermore, the popularity of online review sites (Tripadvisor.com, Yelp.com, ...) has increased in recent years, and therefore more reviews have been created for a wide variety of products and services. Have you ever had trouble finding dining options? Searching each restaurant's web page, reading the reviews, comparing them, and then picking a suitable restaurant is difficult and time-consuming. With our system, users can easily find suitable restaurants based on a few basic inputs.

1.2 Purpose

The main aim is to understand the core concepts and algorithms used in recommendation systems, sentiment analysis, and keyword extraction, and then to implement the algorithms and compare and evaluate the results. First, we collect the datasets. All datasets are generated by collecting user reviews from TripAdvisor. The text then goes through a process with multiple tasks: data review and preprocessing (trimming, lowercasing, punctuation removal, stop word removal). Then, we use sentiment analysis to identify the subjectivity of each user review and to determine its class as neutral, positive, or negative. From that, we create a ranked list of restaurants based on the percentage of positive reviews. Finally, we determine the characteristics of a particular restaurant by using keyword extraction. The extracted keywords become the descriptive tags of each restaurant, and these keywords are gathered to form a text cloud.
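The preprocessing and ranking steps described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis implementation: the stop-word list, the function names `preprocess` and `rank_by_positive_share`, and the label strings are all hypothetical.

```python
import string
from collections import defaultdict

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "to", "of"}


def preprocess(review):
    """Trim, lowercase, strip punctuation, and drop stop words."""
    text = review.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def rank_by_positive_share(labelled_reviews):
    """Rank restaurants by the share of their reviews labelled 'positive'.

    labelled_reviews: iterable of (restaurant_name, label) pairs, where
    label is 'positive', 'neutral', or 'negative' (assumed label names).
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for name, label in labelled_reviews:
        totals[name] += 1
        if label == "positive":
            positives[name] += 1
    shares = {name: positives[name] / totals[name] for name in totals}
    # Highest share of positive reviews first.
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)
```

In this sketch the sentiment label is taken as given; in the pipeline described above it would come from the sentiment classifier.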

1.3 Objectives and scope

1.3.1 Objectives

- All datasets are generated by collecting restaurant information and restaurants’

reviews from TripAdvisor

- Text mining techniques include sentiment analysis and keyword extraction

algorithms

1.3.2 Scope

- Dataset about tourist’s reviews

- Recommendation system


1.5 Report Outline

Chapter 1: Introduction

This chapter introduces and summarizes the context, purpose, scope, and significance of the project, the works related to it, and the motivation for this graduation thesis.

Chapter 2: Background and theory

This chapter covers the background and theory that we needed to research in order to finish this thesis.

Chapter 3: System analysis and design

This chapter describes the design of our system, the components inside it, and the design choices.

Chapter 4: System implementation

This chapter explains the implementation details, including the descriptions, inputs, and outputs of the services, and the graphical user interface.

Chapter 5: Conclusions and future works

This chapter presents the conclusions of this graduation thesis and the future developments that we plan.

Chapter 2 BACKGROUND AND THEORY

2.1 Recommender System

2.1.1 Overview

On the Internet, where users face many options, it is necessary to filter, prioritize, and efficiently deliver relevant information to reduce the information overload problem. Recommender systems address this problem by searching through large volumes of dynamically generated information to provide users with content and services [1].

A recommender system is designed to recommend items to the user based on many different factors. These systems predict the products that users are most likely to purchase or be interested in. They deal with a large volume of information by filtering the most important information based on the data provided by the user and on other factors that reflect the user's preferences and interests. They find matches between users and items and use the similarities between them to make suggestions.

A common architecture for a recommender system includes the following components:

- Candidate generation: the system generates a much smaller subset of candidates from the potentially huge corpus.
- Scoring: the system scores the candidates and selects the set of items to display to the user. Because the model at this stage usually deals with a relatively small subset of items, the system can use a more precise model relying on additional queries.
- Re-ranking: finally, the system must consider additional constraints for the final ranking. For example, the system removes items that the user disliked. Re-ranking can also help ensure diversity, freshness, and fairness.
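The three components above can be sketched as a small pipeline. This is a minimal illustration, assuming each item is a dictionary with hypothetical `name`, `city`, and `score` fields; the filtering by city and the "disliked" constraint are stand-ins chosen for the example, not the design used later in this thesis.

```python
def candidate_generation(corpus, city):
    """Keep only the items in the requested city (a much smaller subset)."""
    return [item for item in corpus if item["city"] == city]


def scoring(candidates, top_k):
    """Order candidates by a precomputed score and keep the best top_k."""
    ranked = sorted(candidates, key=lambda item: item["score"], reverse=True)
    return ranked[:top_k]


def re_ranking(scored, disliked):
    """Apply a final constraint: drop items the user has disliked."""
    return [item for item in scored if item["name"] not in disliked]


def recommend(corpus, city, disliked, top_k=3):
    """Chain the three stages: generate, score, then re-rank."""
    candidates = candidate_generation(corpus, city)
    return re_ranking(scoring(candidates, top_k), disliked)
```

Keeping the stages as separate functions mirrors the architecture: each stage only sees the output of the previous one, so a more expensive scoring model only runs on the reduced candidate set.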

Recommendation systems support users in finding items of interest [2]. They help item providers deliver their items to the right users and identify the products that are most relevant to users. Additionally, they help websites improve user engagement.

Nowadays, recommendation systems are used in more and more fields, such as movies, books, news, hotels, and jobs. Many streaming services, such as Netflix, Apple TV, and Disney, use a recommender system to recommend movies and web series to their users. YouTube also applies one to recommend videos [2].

Depending on the data and method used, there are many different types of recommendation system:

- Popularity-based: works on the principle of popularity, recommending whatever is trending. Example: recommending the trending videos on YouTube.
- Classification model: uses features of both products and users to predict whether a user will like a product or not.
- Content-based: works on the principle of similar content.
- Collaborative filtering: works on the similarity between different users; it is widely used on e-commerce websites and online movie sites.

Because our study focuses on analyzing the sentiment of restaurant reviews and evaluating how much customers prefer a restaurant, we choose the classification model as the basis for building the recommendation model.

2.1.2 Classification-based recommendation system

As its name suggests, this recommendation system is built on a classification model. When a new user arrives, the classifier provides a binary value indicating whether the user will like the product, and that way we can recommend the product to the user.

Figure 2.1 Classification model¹

In the example shown in Figure 2.1, the classifier takes user features such as age and gender, and product features such as cost, quality, and product history. Based on this input, the classifier outputs a binary value that represents the user's like or dislike, and based on that Boolean (1 or 0) we can recommend the product to a customer. In this case, we build and train one model based on user features to try to answer the question "What is the probability that each user likes this item?" [3]

¹ Source: https://medium.com/@madasamy/introduction-to-recommendation-systems-and-how-to-design-recommendation-system-that-resembling-the-9ac167e30e95

2.2 Text Mining

2.2.1 Overview

According to Wikipedia², “Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text.”

In other words, text mining can be defined as the automated retrieval of meaningful information from textual data using techniques rooted in machine learning, data mining, and statistics.

Unlike data stored in databases, text is unstructured, ambiguous, and challenging to process [4]. Since user opinions are expressed in natural language, they cannot be exploited directly to understand what users like and to recommend other items that they are expected to like. Therefore, a common use of text mining for recommender systems is to turn textual user reviews into scores (in a range of 0 to 5) or predefined categories such as positive, neutral, or negative opinion. These can be used to build the user-item matrices that recommender systems use in the recommendation process; specifically, sentiment analysis or opinion mining techniques can be applied.

2.2.2 Text Mining techniques for recommender system

Nowadays, recommender systems are becoming more popular because they help users cope with the overwhelming amount of information on the Internet. According to the preferences and tastes of users, these systems provide recommendations that allow them to filter among many items (e.g., movies, books, news, restaurants, services).

Text mining techniques can be used to develop a recommendation system. The popular tasks of text mining include text classification, clustering, information extraction, and sentiment analysis. For instance, sentiment analysis could be applied to determine user preferences based on their reviews in natural language, or even to identify inconsistency (spam reviews) when both textual reviews and ratings from the users are available.

² Source: https://en.wikipedia.org/wiki/Text_mining

Sometimes, users give their opinions about the services or items they consume in text reviews on blogs, review websites, forums, or social networks. These opinions include valuable information for detecting their satisfaction and feelings, which can be used for

³ Source: “Use of Text Mining Techniques for Recommender Systems”, 2020 [29]

2.2.3 Methods and Techniques

To create restaurant recommendations that help consumers choose a suitable meal, we use two major text mining techniques: sentiment analysis and keyword extraction.

2.3 Sentiment Analysis

2.3.1 Overview

Sentiment

Sentiment can be defined as an attitude, thought, or judgment prompted by feelings, or a specific view or opinion [5].

Sentiment analysis

Sentiment analysis is the process of determining whether a piece of writing is positive, neutral, or negative. For example, if your restaurant collects customer feedback, sentiment analysis measures the attitude of the customer towards the aspects of a service or product that they describe in text. This typically involves taking a piece of text, whether it is a sentence, a comment, or an entire document, and returning a “score” that measures how positive, neutral, or negative the feedback is [6]. For example:

“I really like the new dish of your restaurant!” → Positive

“I’m not sure if I like the new dish” → Neutral

“The new dish is awful!” → Negative

Sentiment analysis is used in many industries to extract customers’ knowledge, feelings, and opinions. Extracting customer emotions plays an important role in making decisions and business strategies. These decisions range from buying a product online to choosing a food service; every contact and opinion greatly affects everyday life.

Extraction of opinion and emotional information is a research area of language processing. The task is to extract information from comments and quotes to determine the user's opinions and feelings about a particular topic, often by trying to decide whether the emotion of the entire document is positive or negative. Therefore, sentiment analysis research not only has an important impact in the field of natural language processing, but also has a deep impact on management science, political science, economics, and social science [7]. Human language is very complex. Thus, interpreting language so that computers can understand and analyze grammar, context, slang, and errors is a difficult process. Intonation combined with context can change the meaning, which makes it even more difficult to describe.

Types of Sentiment analysis

Sentiment analysis focuses not only on polarity (classifying a text or an opinion as positive, neutral, or negative) but also on emotions and feelings [7].

Depending on what you want to do with customer feedback, you can define categories to meet your sentiment analysis purposes. There are some common types of sentiment analysis:

Standard sentiment analysis: involves determining the polarity of the opinion. It can be a simple binary positive/negative classification, or it can use a finer scale (very positive, positive, neutral, negative, very negative) [8]. For example,

“Really good atmosphere and amazing tacos” → Positive

“I would never come to this restaurant again!” → Negative

Emotion detection: is used to identify signs of specific emotional states presented in the text. A combination of lexicons (associations of words and emotions) and machine learning algorithms is used to detect emotions. For example,

“This was such a wonderful experience for our time in Saigon” → Happiness

“The customer service was so bad” → Anger

Aspect-based sentiment analysis: focuses on the aspects or features that are mentioned in opinions. Product reviews, for instance, cover many characteristics, such as price and quality.

“The price of beef steak is quite expensive”.

[Entity]: beef steak, [Aspect]: price, [Opinion]: expensive → Negative, price

Intent detection: is used to find the intention behind a given opinion. It is used in customer service support or for marketing and advertising.

Sentiment analysis algorithms

Sentiment analysis is based on Natural Language Processing and Machine Learning [7]. Many different algorithms can be used to implement sentiment analysis models. There are three major families of sentiment analysis approaches:

- Rule-based: systems that rely on manually defined rules to perform sentiment analysis automatically.
- Automatic: systems that rely on machine learning algorithms to learn from data.
- Hybrid: systems that combine both rule-based and automatic approaches.

Rule-based approach

A rule-based system identifies the polarity, subjectivity, or opinion holder by relying on sets of manually crafted rules. This approach involves basic NLP techniques, such as the following operations:

- Stemming (removing the suffix of a word to bring it to a base form).
- Tokenization (breaking the raw text into units called tokens).
- Part-of-speech tagging (assigning a tag/category to each word/token).
- Parsing (determining the syntactic structure of a text by analyzing its constituent words based on an underlying grammar)⁴.
- Lexicon analysis (lists of words and emotions).

A rule-based system works in the following steps:

- Step 1: Define two lists of words: positive words and negative words.
- Step 2: The algorithm goes through the text and counts the number of positive and negative words that appear in it. If there are more positive words, the text is considered to have positive polarity, and vice versa.
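The two steps above can be sketched directly in Python. The word lists below are hypothetical and far smaller than a real lexicon, and returning "neutral" on a tie is an assumption of this sketch rather than something the text specifies.

```python
import string

# Step 1: hypothetical word lists; a real lexicon would be much larger.
POSITIVE_WORDS = {"good", "great", "amazing", "wonderful", "like"}
NEGATIVE_WORDS = {"bad", "awful", "expensive", "never"}


def rule_based_polarity(text):
    """Step 2: count lexicon hits and compare the two counts."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    pos = sum(tok in POSITIVE_WORDS for tok in tokens)
    neg = sum(tok in NEGATIVE_WORDS for tok in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # tie or no lexicon words (assumption of this sketch)
```

With these lists, "Really good atmosphere and amazing tacos" is classified as positive, but "I'm not sure if I like the new dish" is also classified as positive, because simple word counting does not see the hedging around "like".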

The rule-based algorithms deliver some results but lack the flexibility and accuracy that would make them truly usable.

Automatic approach [7]

This type of sentiment analysis uses machine learning to identify the gist of the message instead of using clearly defined rules. It involves supervised machine learning classification algorithms. Sentiment analysis tasks are treated as classification problems: a text is provided to the classifier, which returns a category (positive, negative, neutral). Sentiment analysis involves the following classification algorithms:

- Linear regression: an algorithm in statistics used to predict some value (Y) given a set of features (X).
- Naive Bayes: a family of simple "probabilistic classifiers" based on applying Bayes' theorem to predict the category of the text⁵.
- Support vector machines: supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis⁶.
- Deep learning: a diverse set of algorithms, using artificial neural networks to process

⁴ Source: https://forum.huawei.com/enterprise/en/what-is-parsing-in-nlp/thread/571685-100429#:~:text=Summary%3A, grammar%20(of %20the%20language).

- TF-IDF Approach
- Boolean Multinomial Naive Bayes
- Gaussian Naive Bayes
- Bernoulli Naive Bayes

TF-IDF Approach

TF-IDF stands for Term Frequency - Inverse Document Frequency, a statistical method commonly used in information retrieval and text mining to evaluate how important a term is to a particular document in a collection of many documents [9]. The concept appeared early in various fields of study, such as linguistics and information architecture, because it supports processing large numbers of documents in a short period of time.

Search engines often use different variants of the TF-IDF algorithm as part of their ranking mechanism. By assigning documents a relevance score, they can return relevant search results in just milliseconds.

Before the TF-IDF weight is calculated, the corpus has to go through a tokenization step. In this step, each individual sentence is broken into words, and the documents are then converted into a feature matrix that consists of a row for each document and a column for each tokenized word.

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t, C) \tag{2-1}$$

tf: term frequency, the number of occurrences of the term in a document.

idf: inverse document frequency, how much information the term provides in corpus $C$.

$$\text{idf}(t, C) = \log \frac{|C|}{|C_t|} \tag{2-2}$$

where:

$|C|$: the number of documents in the corpus.

$|C_t| = |\{d \in C : t \in d\}|$: the number of documents containing term $t$.

The more documents contain term $t$, the less information it provides ($\text{idf} \to 0$).

Example: Assume that we have 2 documents:

Document 1: The goal is to turn data into information, and information into insight

Document 2: You can have data without information, but you cannot have information without data

To calculate the TF-IDF weight, we first need to compute the term frequency (TF). To do this, we create a feature matrix consisting of a row for each document and a column for each tokenized word. At this step, we do not calculate the term frequency of common words (stop words) such as "the", "and", and "are", which are considered unimportant and carry little meaning in the sentence; they usually appear in almost all documents ($\text{idf} \to 0$). The table below is the term frequency matrix.

Table 2-1 Term frequency (tf)

Terms | Goal | Data | Information | Insight | You
Doc1 | 1 | 1 | 2 | 1 | 0
Doc2 | 0 | 2 | 2 | 0 | 1

To get the term frequency, we count the number of occurrences of term t in the document. For example, Goal appears once in document 1, therefore tf(Goal, Doc1) = 1. Goal does not appear in document 2, therefore tf(Goal, Doc2) = 0.

The next step is to count the number of documents that contain term t (the document frequency). The document frequency (DF) is the value $|C_t|$ in equation (2-2).

Table 2-2 Document frequency (df)

As in the table above, df(Goal) = 1 because the term Goal appears in Doc1 only. The same applies to the other terms.

We then calculate the value of IDF, as in Table 2-3 below.

Table 2-3 Inverse document frequency (idf)

Terms | Goal | Data | Information | Insight | You
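The document-frequency and IDF steps of the example can be reproduced with a short Python sketch that follows equations (2-1) and (2-2) on the two example documents. The natural logarithm and the simple lowercase-and-split tokenizer are assumptions of this sketch; the text does not state the base of the logarithm.

```python
import math
import string

docs = {
    "Doc1": "The goal is to turn data into information, and information into insight",
    "Doc2": "You can have data without information, but you cannot have information without data",
}
terms = ["goal", "data", "information", "insight", "you"]


def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()


tokens = {name: tokenize(text) for name, text in docs.items()}

# Document frequency |C_t|: number of documents that contain the term.
df = {t: sum(t in toks for toks in tokens.values()) for t in terms}

# Inverse document frequency, idf(t, C) = log(|C| / |C_t|), natural log assumed.
idf = {t: math.log(len(docs) / df[t]) for t in terms}


def tf_idf(term, doc_name):
    """tf-idf(t, d) = tf(t, d) * idf(t, C)."""
    return tokens[doc_name].count(term) * idf[term]
```

Under these assumptions, terms that occur in both documents (data, information) get an IDF of 0 and therefore a TF-IDF weight of 0, while terms that occur in only one document (goal, insight, you) get an IDF of log 2.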

Naive Bayes and Boolean Multinomial Naive Bayes

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable [10] [11].

The steps of Naive Bayes are briefly illustrated mathematically below.

Note: $C$: class, $D$: document

1. Objective function: $\arg\max_{C} P(C \mid D)$, taken over all classes $C$.

Log scaling is used to prevent floating-point underflow and to prevent excessive weights on frequently used words.

$$\arg\max_{C} \left[ \log P(C) + \sum_{i} \log P(w_i \mid C) \right] \tag{2-7}$$
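The log-scaled decision rule in equation (2-7) can be sketched as follows. This is a minimal multinomial-style illustration, not the thesis implementation: the function names are hypothetical, and the add-one (Laplace) smoothing used to estimate $P(w_i \mid C)$ is an assumption introduced here so that unseen word and class combinations do not produce $\log 0$.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """labelled_docs: list of (tokens, class_label) pairs."""
    class_counts = Counter(label for _, label in labelled_docs)
    word_counts = defaultdict(Counter)
    for tokens, label in labelled_docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict(tokens, class_counts, word_counts, vocab):
    """Return the class maximizing log P(C) + sum_i log P(w_i | C)."""
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / n_docs)  # log prior, log P(C)
        total = sum(word_counts[label].values())
        for w in tokens:
            if w not in vocab:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothing: an assumption, not from the text.
            p = (word_counts[label][w] + 1) / (total + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Summing logarithms instead of multiplying probabilities keeps the score in a safe numeric range even for long documents, which is the purpose of the log scaling described above.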


Boolean Multinomial Naive Bayes is a special case of Naive Bayes with the following steps:

1. Preprocess the text:

- Remove punctuation and numbers.
- Remove stop words.
- Tokenize the texts.

2. Remove all duplicate words in each document.

Where f is frequency of word w in class C
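The steps above can be sketched as follows. The stop-word list, the toy reviews, and the use of add-one (Laplace) smoothing are assumptions made for illustration; the smoothing formula used in this work is not reproduced here, so the scoring should be read as one common choice rather than the exact method.

```python
import math
import re
from collections import Counter, defaultdict

STOP_WORDS = {"the", "and", "are", "is", "a", "was"}  # illustrative list only

def preprocess(text):
    """Step 1: drop punctuation/numbers, tokenize, remove stop words.
    Step 2: keep each remaining word at most once per document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}

def train(labeled_docs):
    """Count, per class, how many documents contain each word."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    for text, label in labeled_docs:
        class_docs[label] += 1
        word_counts[label].update(preprocess(text))
    vocab = set().union(*word_counts.values())
    return class_docs, word_counts, vocab

def predict(text, class_docs, word_counts, vocab):
    """Score each class in log space with add-one smoothing (an assumption)."""
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for w in preprocess(text) & vocab:  # words unseen in training are skipped
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training reviews.
docs = [
    ("The food was great, great service!", "pos"),
    ("Great place and tasty food", "pos"),
    ("Terrible service and cold food", "neg"),
    ("The soup was cold, cold, cold", "neg"),
]
model = train(docs)
```

Because duplicates are removed per document, the repeated "cold" in the last review contributes a count of one, which is what distinguishes the Boolean variant from the plain multinomial one.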

Gaussian Naive Bayes

This version of NB mainly deals with continuous data [10]. The probability distribution for a class, p(x = v|c), can be computed by plugging v into the equation for a Normal distribution parameterized by $\mu_c$ and $\sigma_c^2$. That is,

$P(x = v \mid c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} \exp\left(-\frac{(v - \mu_c)^2}{2\sigma_c^2}\right)$ (2-10)

where $\mu_c$ is the mean of the values of x associated with class c and $\sigma_c^2$ is the variance of the values of x associated with class c.

Bernoulli Naive Bayes

This version of NB is used where there are multiple features, and each one is assumed to be a binary-valued variable [12]. In text classification, word occurrence vectors are used for training and then for classification. The decision rule for Bernoulli NB is as follows:

$P(x_i \mid y) = P(i \mid y)\, x_i + (1 - P(i \mid y))(1 - x_i)$ (2-11)

where P(i|y) is the probability that the word i appears in the documents of class y, and x_i is 1 if word i occurs in the document and 0 otherwise.

The Bernoulli NB classifier explicitly penalizes the non-occurrence of a feature i that is an indicator for class y, whereas the multinomial variant would simply ignore a non-occurring feature.
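The penalty for a missing indicator word can be seen in a small sketch of equation (2-11), summed in log space over all features. The vocabulary and the occurrence probabilities are hypothetical values chosen for illustration.

```python
import math

def bernoulli_log_likelihood(x, p):
    """Sum over features of log[ p_i * x_i + (1 - p_i) * (1 - x_i) ]."""
    return sum(math.log(p_i * x_i + (1.0 - p_i) * (1 - x_i)) for x_i, p_i in zip(x, p))

# Hypothetical per-word occurrence probabilities P(i | y) for one class,
# for the vocabulary ["delicious", "friendly", "slow"].
p_positive = [0.8, 0.6, 0.1]

# Same document with and without the indicator word "delicious".
with_word = bernoulli_log_likelihood([1, 1, 0], p_positive)
without_word = bernoulli_log_likelihood([0, 1, 0], p_positive)
```

Here `without_word` is lower than `with_word` because the absent word contributes the factor 1 - P(i|y) = 0.2 instead of being ignored.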

Logistic Regression

In general, there are two types of classification models: generative models (Naive Bayes, the hidden Markov model, etc.) and discriminative models (Logistic Regression, SVM, etc.). Both try to compute P(class|features), or P(y|x). The main difference is that a generative model first models the joint probability distribution P(x, y) and then computes the conditional probability P(y|x) using Bayes' theorem, whereas a discriminative model models P(y|x) directly [13].

Logistic Regression is categorized as a classification algorithm. It is mostly used to predict a binary outcome (such as 0/1, False/True, No/Yes, Wrong/Right) given a set of independent variables.

In sentiment analysis, the objective is to predict whether a body of text, say an opinion, has a positive or a negative sentiment. For example, suppose you have 1,000 reviews. Sentiment analysis lets you build a system that automatically goes through all of these reviews to determine what fraction of them are positive or negative.

Building a logistic regression classifier that performs sentiment analysis on reviews can be done in 3 steps:

- Extract features: process the raw reviews in the training set and extract useful features [14]. Reviews with a positive sentiment have a label of 1, while those with a negative sentiment have a label of 0.

- Train: train the logistic regression classifier while minimizing the cost [14].

- Predict: come up with predictions using the learned model [14].
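The three steps can be sketched in plain Python as below. This is a toy illustration under stated assumptions: the vocabulary, the four reviews, the word-count features, and the learning rate are all hypothetical, and batch gradient descent on the log-loss stands in for whatever optimizer a library implementation would use.

```python
import math

def extract_features(review, vocab):
    """Step 1: a bias term plus a count for each vocabulary word."""
    tokens = review.lower().split()
    return [1.0] + [float(tokens.count(w)) for w in vocab]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=500):
    """Step 2: batch gradient descent on the log-loss (cross-entropy) cost."""
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(t * f for t, f in zip(theta, xi))) - yi
            for j, f in enumerate(xi):
                grad[j] += err * f / len(X)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

def predict(review, theta, vocab):
    """Step 3: label 1 (positive) when the predicted probability exceeds 0.5."""
    x = extract_features(review, vocab)
    return int(sigmoid(sum(t * f for t, f in zip(theta, x))) > 0.5)

# Hypothetical vocabulary and labeled reviews (1 = positive, 0 = negative).
vocab = ["good", "great", "bad", "slow"]
reviews = ["good food great staff", "great place", "bad food", "slow and bad service"]
labels = [1, 1, 0, 0]
theta = train([extract_features(r, vocab) for r in reviews], labels)
```

Calling `predict("great good pizza", theta, vocab)` then returns the label for an unseen review; words outside the vocabulary simply contribute nothing to the score.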

Support Vector Machine (SVM)

SVM is a supervised machine learning algorithm that can be used for both classification and regression problems [15]. Classification predicts a label or group, and regression predicts a continuous value. SVM performs classification by finding the hyperplane that separates the classes plotted in n-dimensional space [16].

In this work, SVM is applied to classify a set of opinions as positive or negative. SVM is a product of applied complexity theory developed by Vapnik (1995). Some years later, Joachims (1998) proposed SVM for text categorization tasks, to profit from its robustness in high-dimensional spaces [17]. The main purpose of this algorithm is to find the samples (support vectors) that delimit the widest frontier between positive and negative samples in the feature space, as shown in Figure 2.3 below:

Figure 2.3 Support vectors delimiting the widest margin between classes⁸

In more detail, given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. An SVM maps training examples to points in space so as to maximize the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall [17].
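The idea of a maximum-margin linear classifier can be illustrated with a small sketch. It is not the training procedure used in this work: it assumes a soft-margin linear SVM trained by per-sample sub-gradient descent on the regularized hinge loss, and the 2-D points, the regularization strength `lam`, and the learning rate are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Per-sample sub-gradient descent on lam/2 * ||w||^2 + hinge loss.
    Labels in y must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin or misclassified: move the hyperplane.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Correct with enough margin: only apply the regularizer.
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(x, w, b):
    """Assign a class by the side of the hyperplane the point falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical 2-D points: positives in the upper right, negatives in the lower left.
X = [[2.0, 2.0], [3.0, 3.0], [2.5, 3.5], [-2.0, -2.0], [-3.0, -1.5], [-1.5, -3.0]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
```

Only the points closest to the boundary trigger updates once training settles, which mirrors the role of the support vectors described above.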

8 Source: https://www.researchgate.net/figure/Two-possible-margins-that-linearly-separate-positive-and-negative-samples-s-i-would-be_fig2_245536079

Date posted: 02/10/2024, 03:03
