VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS
TRINH THI THU HA - 16520323
NGUYEN MINH QUAN - 16521574
BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS
THESIS ADVISOR
Dr CAO THI NHAN
HO CHI MINH CITY, 2020
ASSESSMENT COMMITTEE
The Assessment Committee is established under the Decision ........................, dated ........................, by the Rector of the University of Information Technology.
1. ........................................ - Chairman
2. ........................................ - Secretary
3. ........................................ - Member
4. ........................................ - Member
VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY          SOCIALIST REPUBLIC OF VIETNAM
UNIVERSITY OF INFORMATION TECHNOLOGY                  Independence - Freedom - Happiness
Ho Chi Minh City,    /    /2021
ADVISOR'S REVIEW OF THE GRADUATION THESIS
Thesis title:
APPLYING TEXT MINING IN RESTAURANT RECOMMENDATION
BASED ON CUSTOMER REVIEWS
Students:                          Advisor:
Trịnh Thị Thu Hà - 16520323        Dr. Cao Thị Nhạn
Nguyễn Minh Quân - 16521574
Thesis evaluation
1. About the report:
Number of pages:              Number of chapters:
Number of data tables:        Number of figures:
Number of references:         Product:
Some comments on the presentation of the report:
2. About the research content:
Score for each student:
Trịnh Thị Thu Hà:      /10
Nguyễn Minh Quân:      /10
Reviewer
(Signature and full name)
First of all, we would like to express our sincere thanks to the lecturers of the University of Information Technology, especially the members of the Faculty of Information Systems, who shared their knowledge with enthusiasm during our time at the university. The knowledge they imparted is an important foundation that helped us complete this thesis.
In particular, we would like to express our special appreciation and deep gratitude to Dr. Cao Thi Nhan for her enthusiastic guidance and for always making it possible for us to complete this thesis. Her sincere words of encouragement and her suggestions were a valuable motivation for us to gain useful knowledge and to overcome difficulties in studying and carrying out the thesis.
During the implementation, we tried to apply our background knowledge and to research and learn new technologies to build this graduation thesis. However, due to our limited knowledge and experience, shortcomings are difficult to avoid. We therefore hope to receive comments from our teachers so that we can improve the necessary knowledge and skills.
Thank you so much!
Authors
THESIS TITLE: APPLYING TEXT MINING IN RESTAURANT RECOMMENDATION BASED ON CUSTOMER REVIEWS
Advisor: Dr. Cao Thi Nhan
Duration: August 15th, 2020 - December 31st, 2020
Students:
1. Trinh Thi Thu Ha - 16520323
2. Nguyen Minh Quan - 16521574
Contents:
1. Description
A real-time system that enables users to find dining options around a desired travel destination. The project tackles problems that many tourists face daily, such as wasting time planning what to eat and struggling to find suitable restaurants. The lack of travel websites, blogs, and reviews, and their lack of simplicity due to the large amount of data, make such decisions time-consuming.
2. Scope
- Dataset of tourists' reviews
- Learn how to use sentiment analysis to build a restaurant sentiment scoring subsystem, and use machine learning to extract keywords
- Learn how to build a recommendation system that generates an optimal automated day plan from the available data
- Build a web application to analyze data and display results
- RAKE algorithm: weighted text cloud by frequency
- Key-Graph algorithm: interconnected text cloud by relationship
- Random Forests: clustered text cloud by similarity
- Recommender system: recommend suitable restaurants (and output the overall satisfaction score)
Expected results
- Understand the fundamental algorithms and methodologies used in recommendation systems, sentiment analysis, and keyword extraction
- Successfully build a sentiment scoring subsystem that scores each entity (restaurants and touristic events), with the scores stored and kept up to date
- Form a text cloud from each entity's keywords, which are generated by a machine learning approach
- The system successfully generates suitable restaurants using the stored sentiment scores and the constraints defined by the user
- Successfully build the web application
Each entity's keywords are generated by the machine learning approach. All of these keywords are later gathered to form a text cloud (the text cloud may take several forms: weighted by frequency, clustered by similarity, or interconnected by relationship).
Phase 3 (23/09/2020 - 23/10/2020): Making a recommendation system
The recommendation system will generate suitable suggestions, consisting of restaurants and touristic events spanning the day's timeline.
Phase 4 (24/10/2020 - 31/12/2020): Develop a web app to interact with the user
Build a web app for the user to enter data and display the results.
Approved by the advisor(s)
Signature(s) of advisor(s)
Dr Cao Thi Nhan
Ho Chi Minh city, / /2020
Signature(s) of student(s)
Trinh Thi Thu Ha Nguyen Minh Quan
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ACRONYMS AND ABBREVIATIONS
Chapter 1 INTRODUCTION
1.2 Purpose
1.3 Objectives and scope
1.3.1 Objectives
1.3.2 Scope
2.2.2 Text Mining techniques for recommender system
2.2.3 Methods and Techniques
2.5.3 Confusion Matrix
Chapter 3 SYSTEM ANALYSIS AND DESIGN
3.2 Data collection
3.3 Restaurant sentiment scoring
3.3.1 Data labelling
3.6.4 The result of recommender system
3.7.1 Use case diagram
3.7.2 Activity diagram
3.7.4 Sequence diagram
Chapter 4 SYSTEM IMPLEMENTATION
4.1 Overview of the system implementation
4.2 Data collection service
4.3 Restaurant sentiment scoring service
4.4 Entity's keyword extraction and word-cloud visualization
Chapter 5 CONCLUSIONS AND FUTURE WORKS
LIST OF TABLES
Matrix of word co-occurrences
The summary of collected data
The short sample list of scored restaurants
The distribution of restaurant by review quantity
The distribution of restaurant that scored
Use case descriptions
Illustrate the meaning of data fields in table reviewitem
LIST OF FIGURES
Figure 2.1 Classification model
Figure 2.2 General framework: using text mining in recommender systems
Figure 2.3 Support vectors delimiting the widest margin between classes
Figure 2.4 Random Forest Algorithm in predicting
Figure 2.6 Precision and Recall calculation
Figure 2.7 Confusion Matrix
Figure 3.2 Illustrate the process for data collection
Figure 3.4 Illustrate an overview of restaurant in TripAdvisor
Figure 3.5 Illustrate an overview of restaurant in TripAdvisor
Figure 3.7 Illustrate the running time in second of the trained models
Figure 3.8 Illustrate the metrics result of each trained classification models
Figure 3.9 Illustrate the word-cloud for extracted keywords
Figure 3.11 Recommend restaurant activity diagram of restaurant recommender
Figure 3.12 View restaurant information activity diagram of restaurant recommender
Figure 3.13 Illustrate class diagram of restaurant recommender
Figure 3.14 Recommend restaurant sequence diagram of restaurant recommender
Figure 3.15 View restaurant information sequence diagram of restaurant recommender
Figure 3.16 Illustrate the database of restaurant recommender
Figure 4.1 Illustrate the table restinfor in MySQL database
Figure 4.2 Illustrate the table reviewitem in MySQL database
Figure 4.3 Illustrate the quote displaying in recommended restaurants
Figure 4.4 Illustrate the word-cloud of restaurant descriptive tags
Figure 4.5 Illustrate the inputting section in Homepage
Figure 4.6 Illustrate the recommended restaurant displaying in the Homepage
Figure 4.7 Illustrate the recommended restaurant displaying after changing the review
Figure 4.8 Illustrate the restaurant information 1
Figure 4.9 Illustrate the restaurant information 2
Figure 4.10 Illustrate the restaurant information 3
Figure 4.12 Illustrate the web page of restaurant on TripAdvisor
LIST OF ACRONYMS AND ABBREVIATIONS
No.  Acronym  Meaning
1    IE       Information Extraction
2    NB       Naive Bayes
3    NLP      Natural Language Processing
4    RAKE     Rapid Automatic Keyword Extraction
5    SVM      Support Vector Machine
6    TF-IDF   Term Frequency - Inverse Document Frequency
7    UI       User Interface
Chapter 1 INTRODUCTION
1.1 Context
The rapid spread of the Internet has given people a new way of getting information. It has become one of the largest sources of information, and people can now run more searches on the Web than ever before. Social networking has also attracted the attention of Internet users. Social media takes many forms, one of which is service or product review websites. These sites provide a platform for consumers to share their experiences and opinions about the products or services they have purchased and used, thereby giving other consumers information about the pros and cons of these products or services.
With so many restaurants to choose from when looking for a new place to eat, how can you find the best one? Traditionally, we could ask someone who has been there. If there is no one to ask, we can now always turn to online reviews. Customers consider many factors when deciding where to eat: not just how good the food tastes, but also how good the service is, how polite the staff are, how pleasant the atmosphere is, and how reasonable the price is. Consumers are placing less trust in advertising and tend to turn to reviews to find out what dining at a restaurant is really like, so customer reviews are now the first thing many people consult when deciding where to eat.
As a result, online reviews today have the power to connect potential customers directly with a restaurant even before they visit. Furthermore, the popularity of online review sites (Tripadvisor.com, Yelp.com, etc.) has increased in recent years, and therefore more reviews have been created for a wide variety of products and services. Have you ever had trouble finding dining options? Searching each restaurant's web page, reading reviews, comparing them, and finally picking a suitable restaurant is difficult and wastes a lot of time. With our system, users can easily find suitable restaurants based on a few basic inputs.
1.2 Purpose
The main aim is to understand the core concepts and algorithms used in recommendation systems, sentiment analysis, and keyword extraction, then to implement these algorithms and compare and evaluate the results. First, we collect the datasets; all of them are generated by collecting user reviews from TripAdvisor. The text then goes through a process with multiple tasks: data review and preprocessing (trimming, lowercasing, punctuation removal, stop-word removal). Next, we use sentiment analysis to identify the subjectivity of each user review and then determine its class as neutral, positive, or negative. From that, we create a ranked list of restaurants based on the percentage of positive reviews. Finally, we determine the characteristics of a particular restaurant by using keyword extraction. The extracted keywords become the descriptive tags of each restaurant, and these keywords are gathered to form a text cloud.
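The preprocessing and ranking steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the thesis: the stop-word list, the function names `preprocess` and `rank_by_positive_share`, and the sample labels are all assumptions made for the example.

```python
import string

# Hypothetical, tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "to", "of", "in", "it"}

def preprocess(review):
    """Trim, lowercase, strip punctuation, and drop stop words."""
    text = review.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def rank_by_positive_share(labels_by_restaurant):
    """Rank restaurants by the percentage of reviews labelled 'positive'."""
    scores = {
        name: 100.0 * labels.count("positive") / len(labels)
        for name, labels in labels_by_restaurant.items()
        if labels  # skip restaurants without any classified review
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

tokens = preprocess("  The food was GREAT, and cheap!  ")
ranking = rank_by_positive_share({
    "A": ["positive", "negative"],
    "B": ["positive", "positive", "neutral", "positive"],
})
```

Here the class labels are assumed to come from the sentiment classifier; the ranking itself only needs the share of positive labels per restaurant.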
1.3 Objectives and scope
1.3.1 Objectives
- All datasets are generated by collecting restaurant information and restaurants’
reviews from TripAdvisor
- Text mining techniques include sentiment analysis and keyword extraction
algorithms
1.3.2 Scope
- Dataset of tourists' reviews
- Recommendation system
1.5 Report Outline
Chapter 1: Introduction
This chapter introduces and summarizes the context, the purpose, the scope and its significance, the works related to this project, and the motivation of this graduation thesis.
Chapter 2: Background and theory
This chapter covers the background and theory that we needed to research in order to finish this thesis.
Chapter 3: System analysis and design
This chapter describes the design of our system, the components inside it, and the design choices.
Chapter 4: System implementation
This chapter explains the implementation details, including the descriptions, inputs, and outputs of the services, and the graphical user interface.
Chapter 5: Conclusions and future works
This chapter presents the conclusions of this graduation thesis and the future developments that we plan.
Chapter 2 BACKGROUND AND THEORY
2.1 Recommender System
2.1.1 Overview
On the Internet, where users face many options, it is necessary to filter, prioritize, and efficiently deliver relevant information in order to reduce information overload. A recommender system addresses this problem by searching through a large volume of dynamically generated information to provide users with content and services [1].
A recommender system is a system designed to recommend items to the user based on many different factors. These systems predict the products that users are most likely to purchase or be interested in. They deal with a large volume of information by filtering out the most important information based on the data provided by the user and on other factors that reflect the user's preferences and interests. The system finds matches between the user and the items and uses the similarities between them to make suggestions.
A common architecture for a recommender system includes the following components:
- Candidate generation: the system generates a much smaller subset of candidates from the potentially huge corpus.
- Scoring: the system scores the candidates and selects the set of items to display to the user. Because the model at this stage usually deals with a relatively small subset of items, the system can use a more precise model relying on additional queries.
- Re-ranking: finally, the system must consider additional constraints for the final ranking. For example, the system removes items that the user has disliked. Re-ranking can also help ensure diversity, freshness, and fairness.
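The three components above can be illustrated with a small Python sketch. It is only a schematic under stated assumptions: the function `recommend`, its parameters, and the sample restaurant records (including the `positive_share` field) are hypothetical and are not taken from the system described in this thesis.

```python
def recommend(items, is_candidate, score, disliked, top_k=3):
    """Return up to top_k items after candidate generation, scoring and re-ranking."""
    # 1. Candidate generation: shrink the corpus to a small candidate set.
    candidates = [item for item in items if is_candidate(item)]
    # 2. Scoring: order the candidates with a (possibly expensive) scoring model.
    scored = sorted(candidates, key=score, reverse=True)
    # 3. Re-ranking: apply extra constraints, here dropping disliked items.
    return [item for item in scored if item["name"] not in disliked][:top_k]

# Hypothetical sample data for illustration only.
restaurants = [
    {"name": "A", "city": "Saigon", "positive_share": 0.91},
    {"name": "B", "city": "Hanoi", "positive_share": 0.97},
    {"name": "C", "city": "Saigon", "positive_share": 0.78},
    {"name": "D", "city": "Saigon", "positive_share": 0.85},
]
picks = recommend(
    restaurants,
    is_candidate=lambda r: r["city"] == "Saigon",
    score=lambda r: r["positive_share"],
    disliked={"D"},
)
```

In this toy run, restaurant B is filtered out at the candidate stage and restaurant D at the re-ranking stage, so only A and C are returned, in that order.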
A recommendation system supports users in finding items of interest [2]. It helps item providers deliver their items to the right users and identify the products that are most relevant to them. It also helps websites improve user engagement.
Nowadays, recommendation systems are used in more and more fields, such as movies, books, news, hotels, and jobs. Many streaming services, such as Netflix, Apple TV, and Disney, use a recommender system to recommend movies and web series to their users. YouTube also applies one to recommend videos [2].
Depending on the data and method used, there are many different types of recommendation system:
- Popularity-based recommendation system: works on the principle of popularity, recommending anything that is trending. Example: the trending videos on YouTube.
- Classification model: uses features of both products and users to predict whether a user will like a product or not.
- Content-based recommendation system: works on the principle of similar content.
- Collaborative filtering: works based on the similarity between different users and is widely used in categories such as e-commerce websites and online movie sites.
Because our study focuses on analyzing the sentiment of restaurant reviews and evaluating how much customers prefer each restaurant, we choose the classification model as the basis for building the recommendation model.
2.1.2 Classification-based recommendation system
As its name suggests, this recommendation system is built on a classification model. When a new user arrives, the classifier outputs a binary value indicating whether this user likes the product, and in that way we can recommend the product to the user.
Figure 2.1 Classification model¹
In the example shown in Figure 2.1, the classifier takes user features (such as age and gender) and product features (such as cost, quality, and product history) as input and outputs a binary value representing the user's like or dislike. Based on that Boolean value (1 or 0), we can recommend the product to a customer. In this case, we build and train one model based on user features to answer the question "What is the probability that each user likes this item?" [3]
2.2 Text Mining
2.2.1 Overview
1 Source: https://medium.com/@madasamy/introduction-to-recommendation-systems-and-how-to-design-recommendation-system-that-resembling-the-9ac167e30e95
According to Wikipedia², "Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text."
In other words, "text mining" can be defined as the automated retrieval of meaningful information from textual data using techniques rooted in machine learning, data mining, and statistics.
Unlike data stored in databases, text is unstructured, ambiguous, and challenging to process [4]. Since user opinions are expressed in natural language, they cannot be exploited directly to understand users' preferences and to recommend other items that they are expected to like. Therefore, a common use of text mining for recommender systems is to turn textual user reviews into scores (in a range of 0 to 5) or predefined categories, such as positive, neutral, or negative opinion, that can be used to build the user-item matrices that recommender systems rely on; specifically, sentiment analysis or opinion mining techniques can be applied.
2.2.2 Text Mining techniques for recommender system
Nowadays, recommender systems are becoming more popular because they help users cope with the overwhelming amount of information on the Internet. Based on the preferences and tastes of users, these systems provide recommendations that let them filter among many items (e.g., movies, books, news, restaurants, services).
Text mining techniques can be used to develop a recommendation system. Popular text mining tasks include text classification, clustering, information extraction, and sentiment analysis. For instance, sentiment analysis can be applied to determine user preferences from their natural-language reviews, or even to identify inconsistencies (spam reviews) when both textual reviews and ratings from the users are available.
Sometimes users give their opinions about the services or items they consume through text reviews on blogs, review websites, forums, or social networks. These opinions contain valuable information for detecting their satisfaction and feelings, which can be used for recommendation.
Figure 2.2 General framework: using text mining in recommender systems³
2 Source: https://en.wikipedia.org/wiki/Text_mining
2.2.3 Methods and Techniques
3 Source: "Use of Text Mining Techniques for Recommender Systems", 2020 [29]
To build a restaurant recommender that helps consumers choose a suitable meal, we use two major text mining techniques: sentiment analysis and keyword extraction.
2.3 Sentiment Analysis
2.3.1 Overview
Sentiment
Sentiment can be defined as an attitude, thought, or judgment prompted by feelings, or as a specific view or opinion [5].
Sentiment analysis
Sentiment analysis is the process of determining whether a piece of writing is positive, neutral, or negative. For example, if your restaurant collects customer feedback, sentiment analysis measures the attitude of the customers towards the aspects of a service or product that they describe in text. This typically involves taking a piece of text, whether a sentence, a comment, or an entire document, and returning a "score" that measures how positive, neutral, or negative the feedback is [6]. For example:
"I really like the new dish of your restaurant!" → Positive
"I'm not sure if I like the new dish" → Neutral
"The new dish is awful!" → Negative
Sentiment analysis is used in many industries to extract customers' knowledge, feelings, and opinions. Extracting customer emotions plays an important role in making decisions and shaping business strategies. These decisions may concern buying a product online or using a food service; every contact and opinion greatly affects everyday life.
The extraction of opinions and emotional information is a research area of language processing. Its task is to extract information from comments and quotes in order to determine the user's opinions and feelings about a particular topic, often by judging whether the emotion expressed in the entire document is positive or negative. Therefore, sentiment analysis research not only has an important impact on the field of natural language processing but also has a deep impact on management science, political science, economics, and social science [7]. Human language is very complex, so getting computers to interpret language and analyze its grammar, context, slang, and errors is a difficult process. Intonation combined with context can change the meaning, which makes it even more difficult to describe.
Types of Sentiment analysis
Sentiment analysis focuses on polarity (classifying a text or an opinion as positive, neutral, or negative) but also on emotions and feelings [7].
Depending on what you want to do with customer feedback, you can define categories that meet your sentiment analysis purposes. Some common types of sentiment analysis are:
Standard sentiment analysis: determines the polarity of the opinion. It can be a simple binary positive/negative classification, or it can use a finer scale (very positive, positive, neutral, negative, very negative) [8]. For example:
"Really good atmosphere and amazing tacos" → Positive
"I would never come to this restaurant again!" → Negative
Emotion detection: identifies signs of specific emotional states presented in the text. Combinations of lexicons (associations between words and emotions) and machine learning algorithms are used to detect emotions. For example:
"This was such a wonderful experience for our time in Saigon" → Happiness
"The customer service was so bad" → Anger
Aspect-based sentiment analysis: focuses on the aspects or features mentioned in opinions. Product reviews, for instance, cover many characteristics, such as price and quality.
"The price of beef steak is quite expensive".
[Entity]: beef steak, [Aspect]: price, [Opinion]: expensive → Negative, price
Intent detection: finds the kind of intention behind a given opinion. It is used in customer service support and in marketing and advertising.
Sentiment analysis algorithms
Sentiment analysis is based on Natural Language Processing and Machine Learning [7]. Many different algorithms can be used to implement sentiment analysis models. They fall into three major groups:
- Rule-based: systems that perform sentiment analysis automatically based on manually defined rules.
- Automatic: systems that rely on machine learning algorithms to learn from data.
- Hybrid: systems that combine both rule-based and automatic approaches.
Rule-based approach
A rule-based system identifies the polarity, the subjectivity, or the opinion holder by relying on sets of manually crafted rules. This approach involves basic NLP techniques, such as the following operations:
- Stemming (removing the suffix of a word to reduce it to a base word).
- Tokenization (breaking the raw text into units called tokens).
- Part-of-speech tagging (assigning a tag/category to each word/token).
- Parsing (determining the syntactic structure of a text by analyzing its constituent words based on an underlying grammar)⁴.
- Lexicon analysis (lists of words and emotions).
A rule-based system works in the following steps:
- Step 1: Define two lists of words: positive words and negative words.
- Step 2: The algorithm goes through the text and counts the positive and negative words that appear in it. If there are more positive words, the text is considered to have positive polarity, and vice versa.
Rule-based algorithms deliver some results but lack the flexibility and accuracy that would make them truly usable.
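The two steps of the rule-based approach can be sketched as follows. The word lists are small hypothetical lexicons chosen for this example, and returning "neutral" on a tie is an assumption, since the text only describes the positive and negative cases.

```python
# Step 1: two manually defined word lists (hypothetical, for illustration).
POSITIVE_WORDS = {"good", "great", "amazing", "wonderful", "like"}
NEGATIVE_WORDS = {"bad", "awful", "never", "expensive"}

def rule_based_polarity(text):
    """Step 2: count positive and negative words and compare the counts."""
    tokens = [tok.strip(".,!?;:\"'") for tok in text.lower().split()]
    positives = sum(tok in POSITIVE_WORDS for tok in tokens)
    negatives = sum(tok in NEGATIVE_WORDS for tok in tokens)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"  # assumption: ties are treated as neutral

example_positive = rule_based_polarity("Really good atmosphere and amazing tacos")
example_negative = rule_based_polarity("The new dish is awful!")
example_negation = rule_based_polarity("The food was not good")
```

The last call illustrates the lack of accuracy mentioned above: because the rule only counts words, "not good" is still scored as positive.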
Automatic approach [7]
This type of sentiment analysis uses machine learning to identify the gist of the message instead of clearly defined rules. It relies on supervised machine learning classification algorithms. Sentiment analysis tasks are treated as classification problems: a text is provided to the classifier, which returns a category (positive, negative, or neutral). Sentiment analysis involves the following classification algorithms:
- Linear regression: a statistical algorithm used to predict some value (Y) given a set of features (X).
- Naive Bayes: a family of simple "probabilistic classifiers" based on applying Bayes' theorem to predict the category of the text.
- Support vector machines: supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.
- Deep learning: a diverse set of algorithms that use artificial neural networks to process data.
4 Source: https://forum.huawei.com/enterprise/en/what-is-parsing-in-nlp/thread/571685-100429#:~:text=Summary%3A, grammar%20(of %20the%20language).
- TF-IDF Approach
- Boolean Multinomial Naive Bayes
- Gaussian Naive Bayes
- Bernoulli Naive Bayes
TF-IDF Approach
TF-IDF stands for Term Frequency - Inverse Document Frequency, a statistical method commonly used in information retrieval and text mining to evaluate how important a term is to a particular document in a collection of many documents [9]. The concept appeared early in various fields of study, such as linguistics and information architecture, because it supports processing large volumes of documents in a short period of time.
Search engines often use variants of the TF-IDF algorithm as part of their ranking mechanism. By assigning each document a relevance score, they can return relevant search results in just milliseconds.
Before the TF-IDF weights are calculated, the corpus has to go through a tokenization step. In this step, each sentence is broken into words, and the documents are then converted into a feature matrix with one row per document and one column per tokenized word.
$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t, C) \tag{2-1}$$
tf: term frequency, the number of occurrences of the term in a document.
idf: inverse document frequency, how much information the term provides in corpus C.
$$\text{idf}(t, C) = \log \frac{|C|}{|C_t|} \tag{2-2}$$
where:
$|C|$: the number of documents in the corpus.
$|C_t| = |\{d \in C : t \in d\}|$: the number of documents containing term $t$.
The more documents contain term $t$, the less information it provides (idf → 0).
Example: Assume that we have 2 documents:
Document 1: The goal is to turn data into information, and information into insight
Document 2: You can have data without information, but you cannot have information without data
To calculate the TF-IDF weights, we first need to compute the term frequency (TF). To do this, we create a feature matrix with one row per document and one column per tokenized word. At this step, we do not calculate the term frequency of common words (stop words), which are considered unimportant and carry little meaning in the sentence, such as "the", "and", and "are"; they usually appear in almost all documents (idf → 0). The table below is the term frequency matrix.
Table 2-1 Term frequency (tf)
Terms | Goal | Data | Information | Insight | You
Doc1  | 1    | 1    | 2           | 1       | 0
Doc2  | 0    | 2    | 2           | 0       | 2
To get the term frequency, we count the occurrences of term t in each document. For example, Goal appears once in document 1, so tf(Goal, Doc1) = 1, whereas Goal does not appear in document 2, so tf(Goal, Doc2) = 0.
The next step is to count the number of documents that contain term t (the document frequency). The document frequency (DF) is the quantity $|C_t|$ in equation (2-2).
Table 2-2 Document frequency (df)
As the table above shows, df(Goal) = 1 because the term Goal appears in Doc1 only. The same applies to the other terms.
Then we calculate the IDF values, as in Table 2-3 below.
Table 2-3 Inverse document frequency (idf)
Terms | Goal | Data | Information | Insight | You
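The worked example can be reproduced with a short Python sketch that follows equations (2-1) and (2-2). Two points are assumptions made for this illustration: the stop-word list, chosen so that only the five terms of Table 2-1 remain, and the natural logarithm, since the base of the logarithm is not stated in the text.

```python
import math
from collections import Counter

# Assumed stop-word list, chosen so that only the terms of Table 2-1 survive.
STOP_WORDS = {"the", "is", "to", "and", "but", "can", "cannot",
              "have", "into", "turn", "without"}

docs = {
    "Doc1": "The goal is to turn data into information, and information into insight",
    "Doc2": "You can have data without information, but you cannot have information without data",
}

def tokenize(text):
    """Lowercase, strip simple punctuation, and drop stop words."""
    words = (w.strip(".,").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

# Term frequency: occurrences of each term per document.
tf = {name: Counter(tokenize(text)) for name, text in docs.items()}
vocab = sorted(set().union(*tf.values()))

# Document frequency |C_t| and inverse document frequency log(|C| / |C_t|).
df = {t: sum(1 for counts in tf.values() if t in counts) for t in vocab}
idf = {t: math.log(len(docs) / df[t]) for t in vocab}

# TF-IDF weight of every term in every document, equation (2-1).
tfidf = {name: {t: counts[t] * idf[t] for t in vocab}
         for name, counts in tf.items()}
```

With only two documents, terms that occur in both of them (data, information) get idf = log(2/2) = 0, while terms that occur in a single document get log 2, about 0.693, so only the latter contribute to the TF-IDF weights.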
Naive Bayes and Boolean Multinomial Naive Bayes
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable [10] [11].
The steps of Naive Bayes are briefly illustrated mathematically below.
Note: C: class, D: document
1. Objective function: $\arg\max_{C} P(C \mid D)$, evaluated over all classes C
Log scaling is used to prevent floating-point underflow and to prevent excessive weights on frequently used words.
$$\arg\max_{C} \left[ \log P(C) + \sum_{i} \log P(w_i \mid C) \right] \tag{2-7}$$
where $w_i$ is the i-th word of document D.
Boolean Multinomial Naive Bayes is a special case of Naive Bayes with the following steps:
1. Preprocess the text:
- Remove punctuation and numbers
- Remove stop words
- Tokenize the texts
2. Remove all duplicate words in each document
where f is the frequency of word w in class C
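A minimal Python sketch of Boolean Multinomial Naive Bayes with the log-scaled objective of equation (2-7) is given below. It assumes already tokenized documents and add-one (Laplace) smoothing for the word probabilities; the smoothing, the function names, and the toy training data are assumptions for illustration and are not taken from the thesis.

```python
import math
from collections import Counter, defaultdict

def train_boolean_nb(docs, labels):
    """docs: list of token lists; labels: list of class names."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        # Boolean variant: duplicate words inside a document count once.
        word_counts[label].update(set(tokens))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab

def predict_boolean_nb(model, tokens):
    """Return the class maximizing log P(C) + sum of log P(w|C)."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n in class_docs.items():
        total = sum(word_counts[label].values())
        score = math.log(n / n_docs)
        for w in set(tokens):
            if w in vocab:  # words never seen in training are skipped
                # Assumed add-one smoothing to avoid log(0).
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (hypothetical).
train_docs = [["great", "food", "great", "service"], ["awful", "food"],
              ["great", "place"], ["awful", "awful", "service"]]
train_labels = ["pos", "neg", "pos", "neg"]
model = train_boolean_nb(train_docs, train_labels)
```

Note that "great" is counted twice for the positive class, once per document containing it, even though it occurs three times in the training tokens; this is the effect of removing duplicates per document.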
Gaussian Naive Bayes
This version of NB mainly deals with continuous data [10]. The probability distribution of a value v for a class c, $p(x = v \mid c)$, can be computed by plugging v into the equation for a normal distribution parameterized by $\mu_c$ and $\sigma_c^2$. That is,
$$p(x = v \mid c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} \, e^{-\frac{(v - \mu_c)^2}{2\sigma_c^2}} \tag{2-10}$$
where $\mu_c$ is the mean of the values of x associated with class c and $\sigma_c^2$ is the variance of the values of x associated with class c.
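Equation (2-10) translates directly into code. The sketch below is illustrative only; the helper names are hypothetical, and using the population variance (dividing by n) when estimating the class statistics is an assumption, since the text does not say which estimator is used.

```python
import math

def fit_class_stats(values):
    """Estimate the mean and variance of a feature for one class."""
    mean = sum(values) / len(values)
    # Assumption: population variance (divide by n).
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    return mean, variance

def gaussian_likelihood(v, mean, variance):
    """p(x = v | c) for a normal distribution with the given mean and variance."""
    exponent = -((v - mean) ** 2) / (2 * variance)
    return math.exp(exponent) / math.sqrt(2 * math.pi * variance)

mean_c, var_c = fit_class_stats([1.0, 2.0, 3.0])
```

A value close to the class mean receives a higher likelihood than a value far from it, which is what drives the class decision for continuous features.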
Bernoulli Naive Bayes
This version of NB is used where there are multiple features, and each one is assumed to be a binary-valued variable [12]. In text classification, word occurrence vectors are used for training and then for classification. The decision rule for Bernoulli NB is as follows:
$$P(x_i \mid y) = P(i \mid y)\, x_i + \bigl(1 - P(i \mid y)\bigr)(1 - x_i) \tag{2-11}$$
where $P(i \mid y)$ is the probability that word i appears in the documents of class y, and $x_i$ is 1 if word i occurs in the document and 0 otherwise.
The Bernoulli NB classifier explicitly penalizes the non-occurrence of a feature i that is an indicator for class y, whereas the multinomial variant would simply ignore a non-occurring feature.
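The decision rule of equation (2-11) can be written as a small function. This is a sketch under the assumption that every probability P(i|y) lies strictly between 0 and 1 (for example after smoothing); the function name and the sample numbers are hypothetical.

```python
import math

def bernoulli_log_likelihood(x, p):
    """Sum of log P(x_i | y) over all features.

    x: list of 0/1 word-occurrence indicators for one document.
    p: list of P(i | y), the probability that word i appears in class y.
    """
    return sum(math.log(pi * xi + (1 - pi) * (1 - xi)) for xi, pi in zip(x, p))

# Word 0 occurs, word 1 does not (hypothetical probabilities for one class).
log_lik = bernoulli_log_likelihood([1, 0], [0.8, 0.4])
```

The absent word contributes the factor 1 - 0.4 = 0.6, so the likelihood is 0.8 × 0.6 = 0.48; this is how a non-occurring feature is penalized instead of being ignored.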
Logistic Regression
In general, there are two types of classification models: generative models (Naive Bayes, hidden Markov models, etc.) and discriminative models (Logistic Regression, SVM, etc.). Both try to compute P(class|features), or P(y|x). The main difference is that a generative model first models the joint probability distribution P(x, y) and then computes the conditional probability P(y|x) using Bayes' theorem, whereas a discriminative model directly models P(y|x) [13].
Logistic Regression is a classification algorithm. It is mostly used to predict a binary outcome (such as 0/1, False/True, No/Yes, Wrong/Right) given a set of independent variables.
In sentiment analysis, the objective is to predict whether a body of text, say an opinion, has a positive or a negative sentiment. For example, suppose you have 1,000 reviews. Sentiment analysis lets you build a system that automatically goes through all of these reviews to figure out what fraction of them are positive or negative.
Building a logistic regression classifier that performs sentiment analysis on reviews can be done in 3 steps:
- Extract features: process the raw reviews in the training set and extract useful features [14]. Reviews with a positive sentiment have a label of 1, while those with a negative sentiment have a label of 0.
- Train: train the logistic regression classifier while minimizing the cost [14].
- Predict: make predictions using the learned model [14].
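The three steps can be sketched in plain Python. Everything specific here is an assumption for illustration: the features (a bias term plus counts of words from two small hypothetical lexicons), batch gradient descent on the log-loss as the training procedure, and the toy reviews. The thesis may use different features and a library implementation.

```python
import math

# Hypothetical lexicons used only to build simple numeric features.
POSITIVE = {"good", "great", "amazing"}
NEGATIVE = {"bad", "awful", "rude"}

def extract_features(review):
    """Step 1: map a raw review to [bias, #positive words, #negative words]."""
    tokens = review.lower().split()
    return [1.0,
            float(sum(t in POSITIVE for t in tokens)),
            float(sum(t in NEGATIVE for t in tokens))]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=200):
    """Step 2: minimize the log-loss with batch gradient descent."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj / len(X)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def predict(w, review):
    """Step 3: label 1 (positive) if the predicted probability is at least 0.5."""
    z = sum(wj * xj for wj, xj in zip(w, extract_features(review)))
    return 1 if sigmoid(z) >= 0.5 else 0

reviews = ["great food and good service", "amazing tacos",
           "awful food", "bad and rude staff"]
labels = [1, 1, 0, 0]
weights = train([extract_features(r) for r in reviews], labels)
```

After training on this toy set, the weight of the positive-word count is positive and the weight of the negative-word count is negative, so a review such as "good food" is labelled 1 and "rude waiter" is labelled 0.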
Support Vector Machine (SVM)
SVM is a supervised machine learning algorithm that can be used for both classification and regression problems [15]. Classification predicts a label/group, and regression predicts a continuous value. SVM performs classification by finding the hyperplane that separates the classes plotted in n-dimensional space [16].
In this work, SVM is applied to classify a set of opinions as positive or negative. SVM is a product of applied complexity theory developed by Vapnik (1995). Joachims (1998) proposed SVM for text categorization tasks, to profit from its robustness in high-dimensional spaces [17]. The main purpose of the algorithm is to find the samples (support vectors) that delimit the widest frontier between positive and negative samples in the feature space, as shown in Figure 2.3 below:
Figure 2.3 Support vectors delimiting the widest margin between classes⁸
In more detail, given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. An SVM maps training examples to points in space so as to maximize the width of the gap between the two categories. New examples are then mapped into the same space and predicted to belong to a category based on which side of the gap they fall on [17].
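How a trained linear SVM assigns a new example to one side of the gap can be sketched as follows. The sketch does not train anything: the weight vector `w` and bias `b` are hypothetical values standing in for the result of training, and the margin formula 2/||w|| assumes the usual canonical scaling in which the support vectors satisfy |w·x + b| = 1.

```python
import math

def decision_value(w, b, x):
    """Signed score of x relative to the separating hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict_side(w, b, x):
    """+1 for the positive side of the hyperplane, -1 for the negative side."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Width of the gap between the two classes, 2 / ||w|| (canonical scaling)."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane parameters, as if produced by training.
w, b = [1.0, 1.0], -3.0
```

Training an SVM amounts to choosing `w` and `b` so that all training examples fall on the correct side while `margin_width(w)` is as large as possible.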
8 Source: https://www.researchgate.net/figure/Two-possible-margins-that-linearly-separate-positive-and-negative-samples-s-i-would-be_fig2_245536079