VIET NAM NATIONAL UNIVERSITY HCMC
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS
AN ARTIFICIAL INTELLIGENCE-BASED
CHATBOT FOR UIT FACEBOOK PAGE
BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS
HO CHI MINH CITY, 2020
TRAN VAN HOÀNG - 16520449
VÕ HONG NHAT - 16520887
THESIS ADVISOR
PH.D Nguyén Thanh Binh
ASSESSMENT COMMITTEE
The Assessment Committee is established under the Decision ......, dated ......, by the Rector of the University of Information Technology.
- Chairman
First and foremost, we would like to express our gratitude to the entire staff of the Information Systems faculty for helping us since we first set foot in this school. From our first year until now, we have received a great deal of help, not only from teachers but also from the infrastructure staff.
In particular, we would like to thank Dr Nguyén Thanh Binh for his help throughout the realisation of this thesis, especially with its writing, as well as for initially presenting this thesis’ subject during one of his academic lessons.
We would also like to personally thank the Chatbot Big Data team at FPT Telecom for their incredible support during the whole development of this thesis, as well as for providing the tools necessary to complete this work. In particular, we would like to thank Mr Trần Xuân Hậu, our leader, and Mr Dang Minh Chương for their help.
Finally, a big thank you goes to our friends and family, who were an incredible support during the writing of this thesis.
Once again, we sincerely thank you.
1.6 Natural Language Processing in chatbot
1.7 Chatbot building process
2 OBJECTIVES AND GOALS
2.1 Problem description
2.3 Goals
3 TECHNICAL BACKGROUND
3.1 Programming language
4.2 Text modelling and features
4.3 Intent classification
4.4 Named Entity Recognition
5.2 Data flow
6 CHATBOT IMPLEMENTATION
6.3 Rasa NLU model
6.4.2 By course name
6.6 Installation and deployment
7.2 Summary
7.3 Possible improvements and future works
REFERENCE
LIST OF FIGURES
Figure 1.1: UIT’s students needs for chatbot
Figure 1.2: UIT’s students needs for chatbot’s functionality
Figure 1.3: UIT’s students needs for chatbot’s platform
Figure 1.4: Popular types of chatbot
Figure 4.1: High-level of Rasa architecture
Figure 4.2: FPT.AI Conversation Platform working mechanism
Figure 5.1: Architecture of chatbot application
Figure 5.2: Data flow diagram
Figure 5.3: Business Function Diagram
Figure 5.4: Welcome carousel
Figure 5.5: Overview of timetable feature
Figure 5.6: Conversation flow of timetable feature
Figure 5.7: Timetable feature activity diagram
Figure 5.8: Step 1 Checking user’s history chatlogs for Student ID
Figure 5.9: Step 2 Returning timetable information based on provided Student ID
Figure 5.10: Report feature
Figure 5.11: Overview of all Departments in report function
Figure 5.12: “Phong dao tao” and “Phong thiét bi” Department in the context of the chatbot application
Figure 5.13: Activity diagram of report feature
Figure 5.14: Sequence diagram of report function
Figure 6.1: Json API Card in FPT.AI Conversation Platform
Figure 6.2: Example of training format for intent ask_schedule
Figure 6.3: Training data intent distribution
Figure 6.4: Timetable data as JSON format
Figure 6.5: Rasa Component Lifecycle
Figure 6.6: Full Pipeline of Rasa NLU model
Figure 6.7: Training result over 300 epochs
Figure 6.8: Dimension supported by duckling
Figure 6.9: Example request of Broadcasting API
Figure 7.1: High-level overview of DIET model
Figure 7.2: In-depth overview of the DIET architecture
Figure 7.3: DIET architecture Pipeline
LIST OF TABLES
Table 5.1: Table mssv description
Table 5.2: Table tkb description
Table 5.3: Table report description
Table 5.4: All Departments alongside their websites at UIT
Table 6.1: Backend routes implemented in the chatbot application
Table 6.2: Rasa NLU Components used in Pipeline
Table 6.3: Parameters of Broadcasting API request
UIT     University of Information Technology
NLP     Natural Language Processing
NLU     Natural Language Understanding
DIET    Dual Intent Entity Transformers
BoW     Bag-of-words
SVM     Support Vector Machine
Nowadays, chatbots are a trend for automating communication between users and servers. Whoever is at the forefront of using chatbots will have more opportunities in their hands regardless of profession, and the University of Information Technology (UIT) is no exception.
Since UIT has many types of information coming from many portals, it is difficult for students, parents or even lecturers to find information when facing problems. Information is scattered across forums, websites and social networks. When they have problems while studying or working at UIT, they are confused because they do not know where to find information or which Department to ask for help.
Another issue is that every time students need timetable information, they have to go to the Office of Academic Affairs website (https://daa.uit.edu.vn/) and log in with their student account (solving a CAPTCHA as well). This is a complex and time-consuming process.
Being aware of these issues, we decided to design and implement a chatbot system and integrate it into the most popular social network today, Facebook, to serve as a communication tool between the school and students as well as others. UIT's chatbot will provide the most basic and diverse information to users, such as timetables and general information about the school and its faculties. The chatbot can also provide timetable information with just one click.
In conclusion, we think this project will help students, parents and lecturers considerably, both by making information easier to find and by putting them in contact with the suitable departments when facing problems.
1 INTRODUCTION
This chapter briefly introduces the context in which this work takes place in section 1.1. The results of the survey we conducted are given in section 1.2, followed by a definition of chatbot in section 1.3. Applications of conversational agents are recalled in section 1.4, and some popular types of chatbot are presented in section 1.5. Finally, some Natural Language Processing (NLP) keywords and specific knowledge are given in section 1.6, and the chatbot building process in section 1.7.
1.1 Context
After over 10 years of founding and development, the University of Information Technology (UIT) includes 8 Faculties, 10 Administrative Offices, 7 Centers/Laboratories and 3 Unions. Every unit has its own website for storing information. Traditionally, students or lecturers can post their questions or problems in the UIT forum to get answers, or they can go directly to the suitable Department for their problem. But sometimes there is too much information and they do not know exactly where to go or where to post their problems; this is especially true for parents and high school students who are looking for admission information. We therefore need a portal, or a channel, capable of connecting all the information that is spread across the websites of UIT, especially online, where most users are extremely demanding in terms of both response time and quality of the answers given.
In order to provide students, parents and lecturers with a portal to get information quickly and correctly, we would like to deploy a dialogue system solution, also known as a chatbot, that would be integrated seamlessly into the UIT Facebook page. The goal of this work is to design such a system.
1.2 Survey
Before designing and implementing the chatbot application, we conducted a two-week survey to collect students’ needs and feedback for the chatbot application.
[Survey chart: "Do you think the UIT Fanpage should support a chatbot?" (18 responses). Options: Yes, No, Either is fine.]
Figure 1.1: UIT’s students needs for chatbot
[Survey chart: "Which features do you want the chatbot to have?"]
Figure 1.2: UIT’s students needs for chatbot’s functionality
[Survey chart: "Which platform do you want the chatbot to be supported on?"]
Figure 1.3: UIT’s students needs for chatbot’s platform
In the scientific literature, chatbots are more formally referred to as conversational agents. In this document, the terms chatbot and conversational agent are used interchangeably.
The underlying principle of every chatbot is to interact with a human user (in most cases) via text messages and to behave as though it were capable of understanding the conversation and replying to the user appropriately. The idea of computers conversing with humans is as old as the field of Computer Science itself. Indeed, in 1950 Alan Turing defined a simple test, now referred to as the Turing test, in which a human judge has to decide whether the entity they are communicating with via text is a computer program or not. However, this test’s ambition is much greater than the usual use case of chatbots; the main difference is that the domain knowledge of a chatbot is narrow, whereas the Turing test assumes one can talk about any topic with the agent. This helps during the design of conversational agents, as they are not required to have (potentially) infinite domain knowledge and can instead focus on very specific topics, such as helping users book a table at a restaurant.
Furthermore, another general assumption chatbot designers bear in mind is that users typically have a goal they want to achieve by the end of the conversation when they initiate an interaction with a chatbot. This goal influences the conversation’s flow and topics. Developers can exploit this, since certain patterns of behavior tend to arise as a result.
Therefore, the definition of a chatbot adopted for this document is a computer program that communicates by text in a human-like manner and provides services to human users in order to accomplish a well-defined goal.
1.4 Chatbot applications
Chatbots are created to support humans in the customer service field at the most basic level, with repetitive, mundane tasks. Businesses can therefore reduce the pressure on their human resources, and their consultant teams can focus on solving more complicated and urgent tasks.
Chatbots are applied in different ways for different types of businesses. Their major purposes include:
• Consult and answer frequently asked questions from customers 24/7
• Support marketing campaigns (send information on promotions, discounts, new products, ...)
• Suggest, search, and report prices for products and services based on customer demands
• Book appointments, tables, rooms, air tickets
• Receive declarations and information for opening cards, bank accounts
• Make payments for orders
1.5 Popular types of Chatbot
There are 3 major types of chatbot: Clicking Bot, NLP Bot, and NLP & Dialogue Management Bot. Figure 1.4 describes all 3 types of chatbot.
[Figure: the three chatbot types: Clicking, NLP, Dialogue]
Figure 1.4: Popular types of chatbot
• Clicking Bot is a type of chatbot which allows users to interact by clicking pre-designed cards inside the bot, which lead to the topics they want to ask about. This type of chatbot cannot understand messages sent by customers.
• NLP Bot is a chatbot which uses Machine Learning and Natural Language Processing (NLP) technologies. With this type of bot, users can type their own questions, and the bot will infer their intentions based on pre-trained data. The data for this kind of bot includes samples, intents, and entities.
• NLP & Dialogue Management Bot is the most comprehensive type of chatbot, with diverse processing methods (the tabs of a Clicking Bot and the ability of an NLP Bot to understand users’ intents). The NLP & Dialogue Management Bot is also special in that it can remember the context of conversations with users.
The project will design and implement all of the above chatbot types, with the NLP & Dialogue Management Bot being the most popular.
1.6 Natural Language Processing in chatbot
Natural Language Processing (NLP) is a branch of artificial intelligence focused on enabling computers to understand and interpret human language. NLP is a core AI feature of a chatbot. For a chatbot to be able to understand user sentences, bot builders need to teach it specific knowledge, including:
• Sample: Users’ utterances to inquire about topics of interest.
• Intent: Purposes of users’ utterances.
• Keyword: Important and necessary information in utterances, which helps the bot understand topics and give suitable responses.
• Entity type: Represents the meaning of Keywords.
• Dictionary: Alternative words and synonyms that help the bot better identify intents.
1.7 Chatbot building process
Bot building is done in 3 steps as follows:
Step 1: Analysis and design
First, the bot designer will need to identify the roles, targets, and channels of the
it can answer all inquiries from end users.
Correct identification of target customers saves time and effort in chatbot building, allowing the builders to focus on solving the matters that are necessary for the customers.
applications with separate policies, the chatbot will be connected in a different way.
• Identifying Chatbot topics
Conversation history can also be utilized to pinpoint areas where customers require support.
Example: In finance and banking, customers may care about opening cards, saving packages, and consumer loans.
• Creating Chatbot scenarios
In each large topic, users will have multiple interests and many different questions. Therefore, the chatbot builders need to divide one big topic into smaller problems, based on the conversation history. Then, they need to prepare a set of relevant samples for each of these problems for bot training.
For example, card is a smaller topic of banking, and may include scenarios like: How to open a card, Card functions, Date of Card issuance.
To design your scenarios in the most logical way, describe the smaller ones and divide them into a conversational stream in the form of a tree map.
Step 2: Bot building deployment
Prepare the data
First, identify user intents in each small topic.
For example, in phone information, the user intents can be: asking for prices, functions, warranty policies.
Next, list the Samples. These are questions that users will direct at the chatbot.
• Intent what_price: how much does the iphone cost, for how much do you sell iphone10, what is the price for iphone at your shop
• Intent what_function: does the iphone10 have fingerprint lock, how many cameras in iphone11
• Intent warranty_policy: how long do warranties last for iphones at your shop, do you offer manufacturer’s warranty or store’s warranty for iphone11, tell me about your warranty policies
Identify the Entities in your Samples, and tag them.
Example:
• Intent what_price: how much does the iphone (tag product) cost, what is your current price for iphone
• Intent what_function: does the iphone10 (tag product) have fingerprint lock, how long does iphone11 (tag product)’s battery last
• Intent warranty_policy: how long do warranties last for iphones (tag product) at your shop, do you offer manufacturer’s warranty or store’s warranty for iphone 11 (tag product)
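The tagged samples above can also be written in a machine-readable form. The sketch below assumes the Markdown training format used by Rasa 1.x (`## intent:<name>` headers and `[text](entity)` annotations). The helper names `annotate` and `to_rasa_markdown` and the sample data are illustrative and are not taken from the thesis data set.

```python
from typing import Dict, List, Tuple

# (sample text, [(entity text, entity type), ...]); illustrative data only
Sample = Tuple[str, List[Tuple[str, str]]]

TRAINING_DATA: Dict[str, List[Sample]] = {
    "what_price": [
        ("how much does the iphone cost", [("iphone", "product")]),
        ("what is your current price for iphone", [("iphone", "product")]),
    ],
    "what_function": [
        ("does the iphone10 have fingerprint lock", [("iphone10", "product")]),
    ],
}


def annotate(text: str, entities: List[Tuple[str, str]]) -> str:
    """Wrap the first occurrence of each entity value as [value](type)."""
    for value, entity_type in entities:
        text = text.replace(value, f"[{value}]({entity_type})", 1)
    return text


def to_rasa_markdown(data: Dict[str, List[Sample]]) -> str:
    """Render intents and tagged samples as Markdown training data."""
    lines: List[str] = []
    for intent, samples in data.items():
        lines.append(f"## intent:{intent}")
        for text, entities in samples:
            lines.append(f"- {annotate(text, entities)}")
        lines.append("")  # blank line between intents
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_rasa_markdown(TRAINING_DATA))
```

Keeping the samples as plain data and rendering them on demand makes it easy to add new Samples per Intent later without editing the training file by hand.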
Building chatbot
Build the chatbot based on the scenarios and data defined in the previous steps. This step can be done by using third-party platforms like Dialogflow or FPT.AI Conversation Platform, or by using frameworks like Rasa, Chatterbot, ...
Integrating conversational channels
After finishing chatbot building on one channel, it is easy to connect the chatbot to various other messaging platforms such as Facebook Messenger, Zalo, Viber, and Website Live Chat.
Step 3: Chatbot update and monitoring
• Content update
A chatbot can automatically answer thousands of users at a time. Each user, however, adopts different wordings and questions, even though they have the same Intent. Therefore, before launching your customer service chatbot, you have to prepare as many Samples as possible. Also, update, edit, and add Samples frequently via History to further improve your bot’s understanding.
You should pay close attention to Intents regularly used by customers, in order to provide more Samples for those Intents, so that your chatbot can best satisfy customer demands.
• Live Support
Chatbots are growing smarter; however, they still cannot replace humans completely. This is why FPT.AI also introduces the Live Support function, where agents can converse with and directly support customers should the need arise.
2 OBJECTIVES AND GOALS
2.1 Problem description
Currently, the school has many types of information, and this information is related to different departments (the training office, the equipment office, the information systems department, ...); the people who use it also differ. The information the school holds is usually provided to students, but what is available to parents and other stakeholders is very limited, so the process of providing and finding information is fragmented and scattered.
2.2 Evaluate and find solutions
• Information on the websites is difficult to find, especially for parents, who cannot easily remember the web addresses of departments. For example, the Information Systems Department has the address https://httt.uit.edu.vn, and the addresses of other Departments are hard to remember.
• We wish to build a one-stop communication system between the school and students (and parents).
2.3 Goals
• Build a chatbot system on a social network (Facebook).
• Handle a large number of requests responsively and respond intelligently (the chatbot understands the meaning of questions and answers).
• Let students communicate with departments in the school through a single chatbot.
• Reach a 60% user support rate over all conversations.
3 TECHNICAL BACKGROUND
3.1 Programming language
Considering the context of the project, Python was chosen as the main programming language to implement the server for the chatbot, for the following reasons:
• Simple syntax and transparent semantics
• Excellent support for integration with other languages and tools that come in handy for techniques like machine learning
• Extensive collection of NLP tools and libraries
The system uses a client-server model; the client and server communicate with each other through a REST API. In the backend (server), to simplify the process of using machine learning models, we use the Flask framework to create APIs for the client. Flask is a lightweight WSGI web application framework. It is designed to make getting started quick and easy, with the ability to scale up to complex applications. It began as a simple wrapper around Werkzeug and Jinja and has become one of the most popular Python web application frameworks.
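As an illustration of how small a Flask REST endpoint can be, here is a minimal sketch. The route `/timetable/<student_id>`, the in-memory `FAKE_TIMETABLES` dictionary and the sample student ID are assumptions made for this example; they are not the routes or data of the actual chatbot backend.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical in-memory stand-in for timetable data; a real
# application would query its database instead.
FAKE_TIMETABLES = {
    "16520000": [{"course": "IS000", "day": "Monday", "period": "1-3"}],
}


@app.route("/timetable/<student_id>", methods=["GET"])
def get_timetable(student_id: str):
    """Return the timetable of one student as JSON, or 404 if unknown."""
    timetable = FAKE_TIMETABLES.get(student_id)
    if timetable is None:
        return jsonify({"error": "student not found"}), 404
    return jsonify({"student_id": student_id, "timetable": timetable})


if __name__ == "__main__":
    # Development server only; a production deployment would sit
    # behind a WSGI server.
    app.run(port=5000)
```

A client can then call `GET /timetable/16520000` and receive a JSON document, which is the kind of response a chatbot platform can turn into a message.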
The server will be deployed on Heroku for production. Heroku is a platform as a service (PaaS) that enables developers to build, run, and operate applications entirely in the cloud. Heroku also provides PostgreSQL as an SQL database service. PostgreSQL is one of the world's most popular relational database management systems. For that reason, PostgreSQL will be used as the database to store all the data of the chatbot application.
3.2 Rasa Framework
Rasa is an open source machine learning framework for building AI assistants and chatbots. Rasa’s API uses ideas from scikit-learn (focus on consistent APIs over strict inheritance) and Keras (consistent APIs with different backend implementations), and indeed both of these libraries are (optional) components of a Rasa application.
As with many other conversational systems, Rasa is split into natural language understanding (Rasa NLU) and dialogue management (Rasa Core). Rasa’s architecture is modular by design. This allows easy integration with other systems. For example, Rasa Core can be used as a dialogue manager in conjunction with NLU services other than Rasa NLU. While the code is implemented in Python, both services can expose HTTP APIs so they can be used easily by projects using other programming languages.
3.2.1 Rasa Architecture
Dialogue state is saved in a tracker object. There is one tracker object per conversation session, and this is the only stateful component in the system. A tracker stores slots, as well as a log of all the events that led to that state and have occurred within a conversation. The state of a conversation can be reconstructed by replaying all the events. When a user message is received, Rasa takes a set of steps as described in Figure 4.1. Step 1 is performed by Rasa NLU; all subsequent steps are handled by Rasa Core.
Figure 4.1: High-level of Rasa architecture
Step 1. A message is received and passed to an Interpreter (e.g., Rasa NLU) to extract the intent, entities, and any other structured information.
Step 2. The Tracker maintains conversation state. It receives a notification that a new message has been received.
Step 3. The policy receives the current state of the tracker.
Step 4. The policy chooses which action to take next.
Step 5. The chosen action is logged by the tracker.
Step 6. The action is executed (this may include sending a message to the user).
Step 7. If the predicted action is not ‘listening’, go back to step 3.
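The seven steps above can be sketched as a small loop. This is a toy model written for illustration and is not Rasa's implementation: the `Tracker` class, the keyword-based `interpret` function, the rule-based `policy` and the `ACTIONS` table are all assumptions, and only the intent name ask_schedule is borrowed from this thesis.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Tracker:
    """Conversation state kept as an append-only event log."""
    events: List[dict] = field(default_factory=list)

    def update(self, event: dict) -> None:
        self.events.append(event)


def interpret(text: str) -> dict:
    """Step 1: toy interpreter (a keyword rule standing in for an NLU model)."""
    intent = "ask_schedule" if "timetable" in text.lower() else "greet"
    return {"type": "user", "text": text, "intent": intent}


def policy(tracker: Tracker) -> str:
    """Steps 3-4: choose the next action from the current tracker state."""
    last = tracker.events[-1]
    if last["type"] == "user":
        return "utter_" + last["intent"]
    return "action_listen"  # after answering, wait for the next message


ACTIONS: Dict[str, Callable[[], str]] = {
    "utter_greet": lambda: "Hello!",
    "utter_ask_schedule": lambda: "Please send your student ID.",
}


def handle_message(tracker: Tracker, text: str) -> List[str]:
    """Run one user turn and return the bot replies."""
    replies: List[str] = []
    tracker.update(interpret(text))                          # steps 1-2
    while True:
        action = policy(tracker)                             # steps 3-4
        tracker.update({"type": "action", "name": action})   # step 5
        if action == "action_listen":                        # step 7: stop looping
            return replies
        replies.append(ACTIONS[action]())                    # step 6
```

Because every user message and every chosen action is appended to `tracker.events`, the state of the conversation can be rebuilt by replaying that list, which mirrors the event-log idea described above.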
3.2.2 Actions
Dialogue management is framed as a classification problem. At each iteration, Rasa Core predicts which action to take from a predefined list. An action can be a simple utterance, i.e. sending a message to the user, or it can be an arbitrary function to execute. When an action is executed, it is passed a tracker instance, and so can make use of any relevant information collected over the history of the dialogue: slots, previous utterances, and the results of previous actions.
Actions cannot directly mutate the tracker, but when executed may return a list of events. The tracker consumes these events to update its state.
3.2.3 Natural Language Understanding
Rasa NLU is the natural language understanding module. It comprises loosely coupled modules combining a number of natural language processing and machine learning libraries in a consistent API.
3.2.3.1 Intent classification
Rasa uses the concept of intents to describe how user messages should be categorized. Rasa NLU classifies each user message into one or multiple user intents.
3.2.3.2 Entity Extraction
Understanding the user’s intent is only part of the problem. It is equally important to extract relevant information from a user’s message, such as dates and addresses. This process of extracting the different required pieces of information is called Entity Extraction.
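To make the two NLU outputs concrete, the sketch below shows a parse result as a Python dictionary and two small helpers that read it. The field names follow the general shape of Rasa NLU output (intent with confidence, ranked alternatives, entities with character offsets), but the values, the `threshold` default and the helper names are made up for this example.

```python
from typing import Dict, List, Optional

# Illustrative parse result; the values are invented for this sketch.
PARSED: Dict = {
    "text": "what is my timetable tomorrow",
    "intent": {"name": "ask_schedule", "confidence": 0.92},
    "intent_ranking": [
        {"name": "ask_schedule", "confidence": 0.92},
        {"name": "report", "confidence": 0.05},
    ],
    "entities": [
        {"entity": "time", "value": "tomorrow", "start": 21, "end": 29},
    ],
}


def top_intent(parsed: Dict, threshold: float = 0.5) -> Optional[str]:
    """Return the predicted intent name, or None if confidence is too low."""
    intent = parsed["intent"]
    return intent["name"] if intent["confidence"] >= threshold else None


def entity_values(parsed: Dict, entity_type: str) -> List[str]:
    """Collect the values of all extracted entities of one type."""
    return [e["value"] for e in parsed["entities"] if e["entity"] == entity_type]
```

The character offsets let downstream code point back at the exact span of the user message that produced each entity.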
3.2.4 Policies
The job of a policy is to select the next action to execute given the tracker object. A policy is instantiated along with a featurizer, which creates a vector representation of the current dialogue state given the tracker.
The standard featurizer concatenates features describing:
• what the last action was
• the intent and entities in the most recent user message
• which slots are currently defined
The featurization of a slot may vary. In the simplest case, a slot is represented by a single binary vector element indicating whether it is filled. Slots which are categorical variables are encoded as a one-of-k binary vector; those which take on continuous values can specify thresholds which affect their featurization, or simply be passed to the featurizer as a float.
There is a hyperparameter max_history which specifies the number of previous states to include in the featurization. By default, the states are stacked to form a two-dimensional array, which can be processed by a recurrent neural network or similar sequence model.
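The featurization described above can be sketched in a few lines. This is a simplified illustration and not Rasa's featurizer: the vocabularies `ACTIONS`, `INTENTS` and `SLOTS`, the state dictionaries and the zero-padding of missing history are assumptions made for the example.

```python
from typing import Dict, List, Optional

# Hypothetical vocabularies for a tiny bot
ACTIONS = ["action_listen", "utter_greet", "action_timetable"]
INTENTS = ["greet", "ask_schedule"]
SLOTS = ["student_id"]


def one_hot(value: Optional[str], vocabulary: List[str]) -> List[int]:
    """One-of-k binary encoding of a categorical value."""
    return [1 if value == item else 0 for item in vocabulary]


def featurize_state(state: Dict) -> List[int]:
    """Concatenate last action, latest intent, and slot-filled flags."""
    slots = state.get("slots", {})
    return (
        one_hot(state.get("last_action"), ACTIONS)
        + one_hot(state.get("intent"), INTENTS)
        + [1 if slots.get(name) is not None else 0 for name in SLOTS]
    )


def featurize_history(states: List[Dict], max_history: int) -> List[List[int]]:
    """Stack the last max_history states into a 2-D array, zero-padded in front."""
    width = len(ACTIONS) + len(INTENTS) + len(SLOTS)
    recent = states[-max_history:]
    padding = [[0] * width for _ in range(max_history - len(recent))]
    return padding + [featurize_state(s) for s in recent]
```

Each row is one dialogue state and each column one binary feature, so the result has shape `max_history` by feature width and can be fed to a sequence model.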
3.3 FPT.AI Conversation Platform
FPT.AI Conversation provides a platform to build and manage chatbots via a user interface. Equipped with the best natural language processing technology for the Vietnamese language, as well as an optimized conversation management system, FPT.AI Conversation provides a comprehensive chatbot builder solution for businesses.
• Automation in sales and marketing: Easily build and manage customer conversations, interact with customers along their journey of using products/services, and automatically send promotional information to customers.
• Improve customer service experience: Ready to support customers at any time, from anywhere, ensuring no waiting time even during rush hours.
• Understand customers: Equipped with Machine Learning and NLP, FPT.AI’s chatbot can understand the intentions and requests of customers. Bot managers can track all conversations in History, and quickly train the bot with new information to provide customers with the most accurate advice.
• Easy to integrate in business systems: FPT.AI’s chatbot can be easily integrated in business systems via APIs.
• Flexible scale expansion: The cloud platform allows FPT.AI’s chatbot to easily expand with the businesses’ growth, and can offer simultaneous support for up to millions of customers.
• Build once, deploy anywhere: The chatbot can be integrated on popular messaging channels like Facebook Messenger, Zalo, Viber, Live Chat on a website, or any other chat interface your business has.
• Multi-channel deployment: FPT.AI’s chatbot can be integrated in popular messaging channels like Live Chat on websites, Facebook Messenger, Zalo, Viber or any others utilized by businesses. Therefore, businesses only need to build the chatbot once and can then easily deploy it on multiple channels.
The system supports two types of users with their respective processes:
Bot creator: This is someone who can perform the following actions:
• Provide data to train bots, and add data over time.
• Edit old scripts and add new scripts as needed.
• Configure settings and integration with messaging channels (Facebook, Viber, Zalo, Livechat, ...).
• Track bot history and adjust the bot’s learning through real conversations with bot users.
• View statistics on customer insights.
Bot user (Customer): This is the person who can perform the following actions:
• Chat with bots to get support from bots.
• Make conversations in several ways: ask input questions or click buttons,
Figure 4.2: FPT.AI Conversation Platform working mechanism
After receiving messages from users, the chatbot responds using the following mechanism:
Step 1: The user sends a message to the chatbot.
Step 2: The message is received and redirected to the FPT.AI system for processing.
Step 3: Based on data trained beforehand in the FPT.AI NLP Core engine (including samples, intents, and entities), the chatbot identifies the intention of the message and estimates a confidence level. Then, based on the bot's intent confidence scale, the intent is listed as either matched or unmatched.
Step 4: After identifying the user’s intent, the bot either triggers a scenario (if the intent was recognized) or gives a default answer.
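Steps 3 and 4 amount to a threshold decision on the classifier's confidence. The following sketch shows that decision in isolation; the threshold value, the `SCENARIOS` table and the reply texts are assumptions for illustration and do not describe FPT.AI's internal logic or configuration.

```python
from typing import Dict, Tuple

CONFIDENCE_THRESHOLD = 0.5  # assumed value, for illustration only

# Hypothetical scenarios keyed by intent name
SCENARIOS: Dict[str, str] = {
    "ask_schedule": "Please send your student ID.",
    "greet": "Hello! How can I help you?",
}
DEFAULT_ANSWER = "Sorry, I did not understand. Could you rephrase?"


def respond(intent: str, confidence: float) -> Tuple[str, str]:
    """Label the intent as matched/unmatched and pick the reply."""
    matched = confidence >= CONFIDENCE_THRESHOLD and intent in SCENARIOS
    if matched:
        return "matched", SCENARIOS[intent]   # trigger the scenario
    return "unmatched", DEFAULT_ANSWER        # fall back to the default answer
```

Raising the threshold makes the bot fall back to the default answer more often, while lowering it makes the bot answer more often at the risk of triggering the wrong scenario.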
In the context of the chatbot application, FPT.AI Conversation will be used as the main platform for building the chatbot for the UIT Facebook Page.
4 THEORETICAL BACKGROUND
This chapter formally introduces the theory behind the main problems in the conversational agents field, as well as the machine learning and artificial intelligence techniques used in practice for chatbots.
4.1 Word segmentation
Lexical analysis, syntactic analysis, semantic analysis, discourse analysis and pragmatic analysis are the five main steps in natural language processing. While morphology is a basic task in lexical analysis of English, word segmentation is considered a basic task in lexical analysis of Vietnamese and other East Asian languages [1]. This task is to determine the borders between words in a sentence. In other words, it segments a list of tokens into a list of words such that the words are meaningful. Word segmentation is the primary step prior to other natural language processing tasks, i.e., term extraction and linguistic analysis. It identifies the basic meaningful units in input texts, which will be processed in the next steps of several applications. For Named Entity Recognition, word segmentation chunks sentences in input documents into sequences of words before they are further classified into named entity classes. For the Vietnamese language, words and candidate terms can be extracted from Vietnamese corpora (such as books, novels, news, and so on) by using a word segmentation tool. Conformed features and the context of these words and terms are used to identify named entity tags, topics of documents, or function words. For linguistic analysis, several linguistic features from dictionaries can be used either to annotate POS tags or to identify the answer sentences. Moreover, language models can be trained by using machine learning approaches and be used in tagging systems.
We will add Vietnamese word segmentation (VietnameseTokenizer) as a custom component to the Rasa NLU Pipeline, which will be discussed further in Section 6.3. This task can be done with the help of the underthesea toolkit. Underthesea is a suite of open source Python modules, data sets and tutorials supporting research and development in Vietnamese Natural Language Processing. It provides an extremely easy API to quickly apply pretrained NLP models to Vietnamese text, for tasks such as word segmentation, part-of-speech (PoS) tagging, Named Entity Recognition (NER), text classification and dependency parsing.
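To show what word segmentation produces, here is a toy segmenter based on greedy longest matching against a small lexicon. It is a deliberately simple stand-in and not the pretrained model behind underthesea's `word_tokenize`; the lexicon, the `MAX_WORD_SYLLABLES` limit and the underscore-joining convention are assumptions for this example.

```python
from typing import List, Set

# Tiny illustrative lexicon of multi-syllable Vietnamese words
LEXICON: Set[str] = {"sinh viên", "đại học", "công nghệ", "thông tin"}
MAX_WORD_SYLLABLES = 3


def segment(sentence: str, lexicon: Set[str] = LEXICON) -> List[str]:
    """Greedy longest-match segmentation over whitespace-separated syllables."""
    syllables = sentence.split()
    words: List[str] = []
    i = 0
    while i < len(syllables):
        longest = min(MAX_WORD_SYLLABLES, len(syllables) - i)
        # Try the longest candidate first, fall back to a single syllable.
        for size in range(longest, 0, -1):
            candidate = " ".join(syllables[i:i + size])
            if size == 1 or candidate.lower() in lexicon:
                words.append(candidate.replace(" ", "_"))
                i += size
                break
    return words
```

In Vietnamese, spaces separate syllables rather than words, so a segmenter has to decide that "sinh viên" (student) is one word made of two syllables. Joining the syllables with an underscore lets later components treat the word as a single token.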
4.2 Text modelling and features
To understand natural language and analyze documents and text, computers need to represent natural languages as linguistic models. These models can be generated by using machine learning methods.
Bag-of-words model
A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms. The approach is very simple and flexible, and can be used in a myriad of ways for extracting features from documents [2].
A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
• A vocabulary of known words.
• A measure of the presence of known words.
It is called a “bag” of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
The intuition is that documents are similar if they have similar content. Further, from the content alone we can learn something about the meaning of the document.
The bag-of-words can be as simple or as complex as needed. The complexity comes both in deciding how to design the vocabulary of known words (or tokens) and how to score the presence of known words.
The bag-of-words model is commonly used in methods of document classification, where the frequency of occurrence of each word is used as a feature for training a classifier.
To address text modelling, we use the bag-of-words model. We will add CountVectorsFeaturizer as one of the components of the Rasa NLU Pipeline to create features for intent classification. The component creates a bag-of-words representation of the user message, intent, and response using sklearn's CountVectorizer. All tokens which consist only of digits (e.g. 123 and 99 but not a123d) are assigned to the same feature.
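The bag-of-words idea, including the shared feature for digit-only tokens, can be sketched with the standard library alone. This is an illustration of the technique and not the code of CountVectorsFeaturizer; the whitespace tokenizer and the placeholder name `__NUMBER__` are choices made for this sketch.

```python
import re
from collections import Counter
from typing import Dict, List

NUMBER_TOKEN = "__NUMBER__"  # placeholder name chosen for this sketch


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, map digit-only tokens to one feature."""
    return [
        NUMBER_TOKEN if re.fullmatch(r"\d+", token) else token
        for token in text.lower().split()
    ]


def build_vocabulary(documents: List[str]) -> Dict[str, int]:
    """Assign a column index to every known token."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: index for index, tok in enumerate(tokens)}


def vectorize(text: str, vocabulary: Dict[str, int]) -> List[int]:
    """Count occurrences of known tokens; unknown tokens are ignored."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector
```

Two messages with the same words in a different order receive the same vector, which is exactly the loss of word order that gives the model its name.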
4.3 Intent classification
Upon receiving a new message, the conversational agent has to be able to identify the goal the user is trying to accomplish. This is usually modelled as a multi-class classification problem whose labels are the names of the possible user intentions. In the report function, we need a classifier to assign the problems that users submit to one of about 10 suitable departments at UIT. Techniques to solve this problem vary from simple keyword extraction to Bayesian inference that determines the user’s request based on multiple messages.
Support Vector Machine
The objective of the Support Vector Machine (SVM) algorithm is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points.
To separate the two classes of data points, there are many possible hyperplanes that could be chosen. The objective is to find the plane that has the maximum margin, i.e. the maximum distance to the nearest data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.
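To make the notion of margin concrete, the sketch below compares a few hand-picked hyperplanes on a toy two-dimensional data set and keeps the one with the largest margin. It only evaluates given candidates and does not train an SVM; the points, labels, and candidate hyperplanes are invented for illustration.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, points, labels):
    """Smallest label-adjusted distance; positive only if all points are on the correct side."""
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))


# Toy data: two classes labelled -1 and +1.
points = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (5.0, 5.0)]
labels = [-1, -1, 1, 1]

# Candidate hyperplanes given as (w, b); all of them separate the two classes.
candidates = {
    "x1 = 2.5": ((1.0, 0.0), -2.5),
    "x1 + x2 = 6": ((1.0, 1.0), -6.0),
    "x1 + x2 = 5": ((1.0, 1.0), -5.0),
}
margins = {name: margin(w, b, points, labels) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
```

All three candidates classify the toy data correctly, but only one of them lies midway between the closest points of the two classes, and that is the one an SVM would prefer.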
Rasa NLU provides an EmbeddingIntentClassifier component in its pipeline to fulfill this task. The EmbeddingIntentClassifier trains a linear Support Vector Machine (SVM), which is optimized using a grid search. The algorithm determines the best decision boundary between vectors that belong to a given group (or category) and vectors that do not belong to it. A linear kernel is computationally very cheap (as opposed to many other kernels) and usually works well for text classification problems. During the training of the SVM, a hyperparameter search is run to find the best parameter set. The classifier also provides rankings of the labels that did not “win”. The EmbeddingIntentClassifier needs to be preceded by a dense featurizer in the pipeline; this featurizer creates the features used for the classification.
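One way to picture the ranking of labels that did not “win” is the sketch below, which scores a message against one linear decision function per intent and sorts the results. It is a sketch under stated assumptions, not the Rasa component: the intent names, the three-word vocabulary, and the weights are invented rather than trained.

```python
def decision_score(weights, bias, features):
    """Linear decision function w.x + b for one intent."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def rank_intents(models, features):
    """Return (intent, score) pairs sorted from most to least likely."""
    scores = {
        intent: decision_score(w, b, features) for intent, (w, b) in models.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical one-vs-rest models over the vocabulary ["fee", "timetable", "tuition"].
models = {
    "ask_timetable": ((-1.0, 2.0, -1.0), 0.0),
    "ask_tuition": ((1.5, -1.0, 2.0), -0.5),
    "report_issue": ((-0.5, -0.5, -0.5), 0.2),
}

# Bag-of-words vector of a message that mentions "fee" and "tuition".
ranking = rank_intents(models, (1, 0, 1))
```

The first entry of `ranking` is the predicted intent; the remaining entries show how close the other labels came, which is useful when the agent has to decide whether to ask the user for clarification.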
4.4 Named Entity Recognition
Named entities are phrases that contain the names of persons, organizations, locations, times and quantities, monetary values, percentages, etc. Named Entity Recognition (NER) is the task of recognizing named entities in documents. NER is an important subtask of Information Extraction, which has attracted researchers all over the world since the 1990s.
Conditional Random Fields
Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without considering "neighboring" samples, a CRF can take context into account [3]. To do so, the prediction is modeled as a graphical model, which implements dependencies between the predictions. What kind of graph is used depends on the application. For example, in natural language processing, linear chain CRFs are popular, which implement sequential dependencies in the predictions. In image processing, the graph typically connects locations to nearby and/or similar locations to enforce that they receive similar predictions.
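The effect of sequential dependencies in a linear chain can be sketched with Viterbi decoding, the standard way to find the highest-scoring label sequence in such a model. The example below is a minimal sketch: the label set, the sample tokens, and all scores are invented by hand, whereas a real CRF would learn them from annotated data.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence for a linear chain.

    emissions:   one dict per token, mapping label -> score
    transitions: dict mapping (previous_label, label) -> score
    """
    # best[i][label] is the best score of any path ending in `label` at token i.
    best = [{lab: emissions[0][lab] for lab in labels}]
    back = []
    for i in range(1, len(emissions)):
        scores, pointers = {}, {}
        for lab in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[(p, lab)])
            scores[lab] = best[-1][prev] + transitions[(prev, lab)] + emissions[i][lab]
            pointers[lab] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers starting from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))


labels = ["O", "COURSE"]
tokens = ["timetable", "for", "data", "mining"]  # hypothetical user message
emissions = [
    {"O": 2.0, "COURSE": 0.0},
    {"O": 2.0, "COURSE": 0.0},
    {"O": 1.0, "COURSE": 0.8},  # "data" alone looks slightly more like a plain word
    {"O": 0.0, "COURSE": 2.0},
]
transitions = {
    ("O", "O"): 0.5,
    ("O", "COURSE"): 0.0,
    ("COURSE", "COURSE"): 1.0,
    ("COURSE", "O"): -1.0,
}

decoded = viterbi(emissions, transitions, labels)
# Per-token choice that ignores neighbours, for comparison.
greedy = [max(labels, key=lambda lab: e[lab]) for e in emissions]
```

Labelling each token independently tags "data" as O, while the chain model tags it as COURSE because the transition scores favour a course name that continues into "mining".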
In the chatbot application, we need to recognize named entities of three types: student ID, course name, and datetime information. The first two (student ID and course name) can be extracted using the Rasa NLU component CRFEntityExtractor. This component implements a conditional random field (CRF) to do named entity recognition. CRFs can be thought of as an undirected Markov chain where the time steps are words and the states are entity classes. Features of the words (capitalization, POS tagging, etc.) give probabilities to certain entity classes, as do transitions between neighbouring entity tags; the most likely set of tags is then calculated and returned. For datetime information, we use Duckling. Duckling is a Haskell library developed by Facebook that parses text into structured data. It allows us to recognize dates, numbers, distances and other structured entities and normalizes them. Duckling supports the Vietnamese language, and in the context of the chatbot application we only use the time dimension to extract datetime information when students request timetable information by datetime.
5 SYSTEM DESIGN AND ANALYSIS
This chapter covers the system design and analysis of the conversational agent. More specifically, section 5.1 describes the agent’s architecture in detail, section 5.2 presents the chatbot’s dataflow, and section 5.3 describes the agent’s database design, including what purpose its parts fulfil and how they accomplish it. Finally, section 5.4 describes all of the chatbot’s functionalities.
5.1 Architecture
FPT.AI Conversation Platform
Figure 5.1: Architecture of chatbot application
The chatbot will be integrated with Facebook Fanpage messenger, more
specifically, the UIT Facebook page (Trường Đại học Công nghệ Thông tin - Đại