

HA MINH DUC

SUPPORTING VOICE COMMUNICATION IN CHATBOT

Major: COMPUTER SCIENCE Major code: 8480101

MASTER’S THESIS

HO CHI MINH CITY, January 2024


THIS THESIS IS COMPLETED AT

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU-HCM

Supervisor: Le Thanh Van, Ph.D.

Examiner 1: Ton Long Phuoc, Ph.D.

Examiner 2: Vo Dang Khoa, Ph.D.

This master's thesis was defended at Ho Chi Minh City University of Technology (HCMUT) – VNU-HCM on January 23, 2024.

Master’s Thesis Committee:

1. Assoc. Prof. Tran Van Hoai, Ph.D. – Chairman
2. Ton Long Phuoc, Ph.D. – Examiner 1
3. Vo Dang Khoa, Ph.D. – Examiner 2
4. Le Thanh Van, Ph.D. – Commissioner
5. Assoc. Prof. Tran Ngoc Thinh, Ph.D. – Secretary

Approval of the Chairman of the Master's Thesis Committee and the Dean of the Faculty of Computer Science and Engineering after the thesis has been corrected (if any).

CHAIRMAN OF THESIS COMMITTEE

DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING


VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

Independence – Freedom - Happiness

THE TASK SHEET OF MASTER’S THESIS

Full name: Ha Minh Duc
Student code: 2270348
Date of birth: 20/03/1985
Place of birth: Kien Giang
Major: Computer Science
Major code: 8480101

I. THESIS TITLE: Supporting voice communication in chatbot.

(Hỗ trợ giao tiếp bằng giọng nói trong phần mềm chatbot)

II. TASKS AND CONTENTS:

• Task 1: Research and Experimentation for Sequence-to-Sequence Model Development

The primary objective is to create a powerful Sequence-to-Sequence model tailored for chatbot applications. The Sequence-to-Sequence model is a neural network architecture known for its success in natural language processing tasks, and this task focuses on exploring, experimenting with, and optimizing it to enhance its performance in the context of chatbot interactions.

• Task 2: Research and Experimentation for Automatic Speech Recognition Model Development

During this phase, the primary emphasis is on thorough research and experimentation with diverse methods to craft high-performance automatic speech recognition models. Exploring various techniques is essential to achieving precise conversion from audio to text. The objective is to pinpoint the most effective model that aligns with the project's requirements.

• Task 3: Sequence-to-Sequence Model and Automatic Speech Recognition Evaluation and Future Work

After developing the Sequence-to-Sequence model for Automatic Speech Recognition, a comprehensive evaluation process will be conducted. The achieved results will be analyzed in detail using appropriate metrics and techniques to assess accuracy and performance. The strengths and weaknesses of each Sequence-to-Sequence model will be identified and assessed meticulously. Based on this analysis, recommendations for future work will be provided, addressing potential improvements and further developments in Automatic Speech Recognition technology.

III. THESIS START DAY: Feb-06-2023

IV. THESIS COMPLETION DAY: Dec-10-2023

V. SUPERVISOR: Le Thanh Van, Ph.D.

Ho Chi Minh City, January 22, 2024

SUPERVISOR

(Full name and signature)

CHAIR OF PROGRAM COMMITTEE

(Full name and signature)

DEAN OF FACULTY OF

COMPUTER SCIENCE AND ENGINEERING

(Full name and signature)


ACKNOWLEDGEMENT

First of all, I would like to express my appreciation for the dedicated guidance and support of my lecturers during my work. They not only enthusiastically discussed my plans with me but also suggested many great ideas that made my thesis much more fruitful and interesting. I am especially grateful to Dr. Le Thanh Van, who has always been willing to assist me in any way he could during my thesis.

Furthermore, I would also like to acknowledge the Ho Chi Minh City University of Technology for its engagement and the valuable learning experiences it has given to me and my classmates in Vietnam. Thanks to HCMUT's personalized programs, I have been able to complete an ideal Master of Computer Science course despite my busy working schedule.

Next, I am eternally grateful for the network of my friends Pham Thanh Huu, Nguyen Thi Ty, Vo Thi Kim Nguyet, Le Duc Huy, Nguyen Tan Sang, and Pham Dien Khoa. They have not only experienced the course with me but have also shared life and working tips that are meaningful to a person like me. I am so lucky to have them in my life.

Lastly, words cannot describe how thankful I am for the IMP Academic Team's kind support. Without their companionship, I could not have completed my Master of Computer Science course.

Sincerely, Ha Minh Duc

Ho Chi Minh City, Jan 2024


ABSTRACT

This master's thesis delves into the improvement of voice-based communication in healthcare chatbots through the integration of cutting-edge natural language processing and automatic speech recognition technologies. The research centers on leveraging a GPT-3-based sequence-to-sequence architecture for enhancing natural language understanding and generation. Additionally, it incorporates the Wav2vec 2.0 model to empower robust Automatic Speech Recognition capabilities. The GPT-3 architecture is chosen for its adeptness in comprehending medical contexts, generating contextually relevant responses, and handling dynamic healthcare-related conversational flows. The integration of Wav2vec 2.0 ensures precise and context-aware transcription of voice inputs, enhancing the accuracy of healthcare-related information capture. This research contributes to the field of healthcare technology by presenting a novel approach to improving patient engagement and satisfaction through voice interactions. The combination of GPT-3 and Wav2vec 2.0 not only strengthens the chatbot's ability to understand and generate natural language responses but also extends this proficiency to healthcare-focused voice interactions, thereby widening the applicability and accessibility of the chatbot system in the aspect of speech recognition.


TÓM TẮT LUẬN VĂN THẠC SĨ

Luận văn thạc sĩ này tập trung vào việc cải thiện giao tiếp dựa trên giọng nói trong chatbot chăm sóc sức khỏe thông qua sự kết hợp của các công nghệ xử lý ngôn ngữ tự nhiên và nhận dạng giọng nói tự động tiên tiến. Nghiên cứu tập trung vào việc sử dụng kiến trúc dựa trên GPT-3 cho quá trình nâng cao hiểu biết và tạo ra ngôn ngữ tự nhiên. Ngoài ra, nó kết hợp mô hình Wav2vec 2.0 để cung cấp khả năng nhận dạng giọng nói tự động mạnh mẽ. Kiến trúc GPT-3 được chọn vì khả năng hiểu biết về ngữ cảnh y tế, tạo ra các phản hồi liên quan đến ngữ cảnh và xử lý các luồng trò chuyện y tế động. Sự tích hợp của Wav2vec 2.0 đảm bảo việc chuyển đổi chính xác và nhận thức ngữ cảnh của đầu vào giọng nói, từ đó nâng cao độ chính xác của việc thu thập thông tin liên quan đến sức khỏe. Nghiên cứu này đóng góp cho lĩnh vực công nghệ chăm sóc sức khỏe bằng cách trình bày một cách tiếp cận mới để cải thiện sự tương tác và sự hài lòng của bệnh nhân thông qua giao tiếp giọng nói. Sự kết hợp giữa GPT-3 và Wav2vec 2.0 không chỉ củng cố khả năng của chatbot trong việc hiểu và tạo ra phản hồi tự nhiên bằng ngôn ngữ, mà còn mở rộng khả năng áp dụng và tiếp cận của hệ thống chatbot trong phương diện nhận dạng tiếng nói.


DECLARATION OF AUTHORSHIP

I hereby declare that this thesis was carried out by myself under the guidance and supervision of Le Thanh Van, Ph.D., and that the work and the results contained in it are my own and have not violated research ethics. The data and figures presented in this thesis are for analysis, comments, and evaluations gathered from various resources through my own work and have been duly acknowledged in the reference part.

In addition, other comments, reviews, and data used by other authors and organizations have been acknowledged and explicitly cited.

I will take full responsibility for any fraud detected in my thesis. Ho Chi Minh City University of Technology (HCMUT) – VNU-HCM is unrelated to any copyright infringement caused by my work (if any).

Ho Chi Minh City, Jan 2024 Author

Ha Minh Duc


1.3 Target of the Thesis 4

1.4 Scope of the Thesis 5

1.5 Contribution 6

1.6 Thesis Structure 7

CHAPTER 2 BACKGROUND 9

2.1 Hidden Markov Model (HMM) 9

2.2 Deep Neural Networks 10

2.3 Artificial Neural Networks 11

2.4 Convolutional Neural Network 13

2.7 Recurrent Neural Networks 19

2.8 Long Short-Term Memory 21

2.8.1 The Long-Term Dependency Problem 22


2.11 The History of Chatbots 33

2.12 Using Luong’s Attention for Sequence 2 Sequence Model 35

2.12.1 Sequence to Sequence Model 35

2.12.2 Encoder 36

2.12.3 Decoder 38

2.13 Automatic Speech Recognition 41

2.13.1 Speech Recognition Using Recurrent Neural Networks 43

2.13.2 Speech-to-Text Using Deep Learning 43

CHAPTER 3 PROPOSED MODEL 47


LIST OF FIGURES

Figure 1.1 The History of IVR [17] 3

Figure 2.1 HMM-based phone model [1] 10

Figure 2.2 A Deep Neural Network 11

Figure 2.3 Neuron Anatomy [53] 12

Figure 2.4 A Simple Example of the Structure of a Neural Network 13

Figure 2.5 The McCulloch-Pitts Neuron [4] 13

Figure 2.6 The architecture of CNN [57] 14

Figure 2.7 Sigmoid Function and its Derivative by [43] 17

Figure 2.8 Underfitting, Optimal and Overfitting 18

Figure 2.9 Underfitting, Optimal weight decay and Overfitting 19

Figure 2.10 The Recurrent Neural Network [5] 20

Figure 2.11 LSTM Network Architecture [6] 22

Figure 2.12 RNN and Short-Term Dependencies [7] 22

Figure 2.13 RNN and Long-Term Dependencies [7] 23

Figure 2.14 The Repeating Modules in an RNN Contains One Layer [7] 24

Figure 2.15 The Repeating Modules of an LSTM Contain Four Layers [7] 24

Figure 2.16 The Architecture of GRU [9] 25

Figure 2.17 The CBOW Model with One Input [14] 30

Figure 2.18 The Skip-gram Model [14] 32

Figure 2.19 ELIZA – The First Chatbot in the World at MIT by Joseph Weizenbaum 34

Figure 2.20 The Conversational between Elize and Parry [19] 34

Figure 2.21 Seq2Seq Model with GRUs [15] 36


Figure 2.22 Bidirectional GRU [18] 37

Figure 2.23 Attention Mechanism in Seq2Seq Model [20] 39

Figure 2.24 Luong’s Global Attention [21] 40

Figure 2.25 Basic Voicebot Architecture [56] 43

Figure 2.26 Graph showing how the loss function changes depending on the size of the trained network [50] 45

Figure 2.27 Graph showing how the loss function changes depending on the size of the training set [50] 46

Figure 3.1 Taxonomy of Sequence to Sequence Models 47

Figure 3.2 The Transformer and GPT Architecture [22][23] 48

Figure 3.3 Input Embeddings 50

Figure 3.4 Multi-headed Attention 51

Figure 3.5 Dot Product of Query and Key 52

Figure 3.6 Scaling Down the Attention Scores 52

Figure 3.7 SoftMax of the Scaled Scores 53

Figure 3.8 Multiply SoftMax Output with Value Vector 54

Figure 3.9 Computing Multi-headed Attention 54

Figure 3.10 Multi-headed Attention Output 55

Figure 3.11 Residual Connection of the Input and Output 56

Figure 3.12 Decoder First Multi-Headed Attention 58

Figure 3.13 Adding Mask to Scaled Matrix 59

Figure 3.14 Applying SoftMax function to Attention Score 60

Figure 3.15 The Process Flow of Multi-headed Attention 60

Figure 3.16 Final Stage of Transformer’s Decoder 62


Figure 3.17 GPT’s Architecture 63

Figure 3.18 Transformer Architecture and Training Objectives [24] 65

Figure 3.19 Taxonomy of Speech Recognition 66

Figure 3.20 Wav2vec 2.0 Architecture 67

Figure 3.21 Wav2vec 2.0 Latent Feature Encoder 68

Figure 3.22 Wav2vec 2.0 Quantization Module 69

Figure 3.23 Wav2vec 2.0 Context Network (Transformer Encoder) 69

Figure 3.24 Wav2vec 2.0 Contrastive Loss 70

Figure 4.1 Example of datasets for Seq2seq model 75

Figure 4.2 Example of datasets for ASR model 76


LIST OF TABLES

Table 2.1 Example of Co-occurrence Matrix 27

Table 4.1 Intent Recognition Results 76

Table 4.2 Entity Recognition Results 77

Table 4.3 Chat Handoff and Fallback 77

Table 4.4 Conversation Log with Chatbot 78

Table 4.5 Result of Scenarios 81

Table 4.6 Result of Wav2vec 2.0 on Vietnamese Audio Files 81


ACRONYMS


CHAPTER 1 THESIS INTRODUCTION

1.1 Overview

The idea of human-computer interaction through natural language has long been portrayed in Hollywood movies. C-3PO is one of the legends of the Rebel Alliance in the Star Wars universe. This robot has served several generations of Skywalkers and is one of the most distinctive personality robots in the galaxy. Throughout the series, C-3PO not only displays gestures and communication very similar to humans, but sometimes offers great advice to its owner. This is a cinematic product that was ahead of its era in predicting the future of Artificial Intelligence (AI). The Star Wars fictional universe is set in a galaxy where humans and alien creatures live in harmony with droids, robots capable of assisting people in daily life or traveling across other planets. In the movie Iron Man (2010), Tony Stark used his supercomputer assistant, JARVIS, to get support in everyday life and on trips to save the world with other superheroes. In particular, the film A.I. (2001), adapted from the short story "Supertoys Last All Summer Long", depicts a 22nd century in which rising sea levels have washed away coastal cities, causing a serious decline in population density. The Mecha robot line simulates real people and is designed to integrate with humans; they possess the ability to think well but do not know how to express emotions.

The history of Interactive Voice Response (IVR) systems began in the 1930s, when the Voder machine was created. The technology was the first to analyze the English language and produce human-like sounds. The original speech recognition systems were rudimentary, understanding only numerals, because engineers thought human language was too complex. In 1952, Bell Laboratories designed "Audrey", a system for recognizing spoken digits. Ten years later, at the World's Fair, IBM demonstrated the "Shoebox" system, which could recognize 16 different English words. The vision behind these projects was that users could communicate with computers through natural language and therefore would not have to learn any specific language or prompts. However, it turns out to be quite complicated to understand spoken language. It can be argued that only entities (humans) living in the real world can effectively understand language, because without context the meaning of a word is incomprehensible.

IVRs were widely used by businesses in the 1990s, and call queuing and automated call routing became popular in the mid-1990s. In the late 1990s, the move of multimedia into call centers led companies to invest in IVR systems with computer telephony integration (CTI). This integration allows businesses to connect their call centers to their marketing campaigns. Moreover, continuous improvements in IVR have made it cheaper for companies to deploy. Contemporary platforms emerged in the 2010s, with an evident emphasis on integrating IVR with comprehensive analytics, automated SMS messaging, and advanced call monitoring features. Modern IVR systems are now part of a larger solution and enable seamless integration of customer communications across channels. Unlike bulky and expensive standalone systems, these advanced all-inclusive platforms now offer options, giving customers the opportunity to choose their preferred method. Today, IVR has been integrated into the overall customer experience. It now comes with a personalized brand voice, protects customer data, and detects fraud and spam. In addition to routing to the best department to address customer needs, the tool is now integrated into marketing efforts. The self-service model has evolved significantly with the arrival of conversational IVR. These AI-enabled technologies replicate the experience of talking to a live agent. Today's IVR systems provide solutions to customers faster, even without a direct operator connection. IVR is useful in many industries and use cases: it can help manage hotel reservations, pay bills, conduct market research, buy tickets, and present information about products and services. These latest functions meet the needs of the market. Zendesk's research shows that 69% of people try to solve their own problems before contacting customer service. However, businesses must ensure that they are implementing IVR self-service best practices to increase customer satisfaction. A poorly designed automated system can hurt a business, especially if it wastes consumers' time without solving their problems.


Figure 1.1 The History of IVR [17]

The basic and big problem with most IVR systems is that, no matter how useful they may be in providing all possible options to fit a wide range of customer queries and issues, most customers really just want to talk to a person directly. Nothing can beat direct human-to-human communication. This may seem counterintuitive if you are using an IVR system to reduce human resources and staffing costs, but adding the option to talk to an advisor on the main menu, without forcing customers to search the verbal maze of menu options, will make most customers more satisfied and less frustrated. Automatic speech recognition (ASR) is a classic feature of many IVR systems and allows users to communicate with the IVR system by voice instead of pressing phone keys or typing on a laptop, which can be difficult for users to do. But if ASR is incapable of recognizing what humans are saying, it will make the system frustrating and, worse, useless. Therefore, ensuring accuracy as well as optimizing interface design in ASR is essential to meeting the high demands of users.

1.2 Research Problem

There are several challenges associated with the current implementation of the voice-to-text chatbot. The first is the quality and coherence of generative-based chatbots. They generate responses based on statistical patterns learned from large datasets of text. While they can produce more diverse and flexible responses compared to retrieval-based models, the quality and coherence of their output can vary widely. They may generate nonsensical or contextually inappropriate responses, especially when faced with input they have not encountered during training.

The second is the accuracy of ASR models. ASR is a newer feature of many IVR systems; it allows customers to communicate with the IVR system by voice instead of clicking on phone or laptop keyboards. The accuracy of an ASR system needs to be high if it is to create any value. However, achieving a high level of accuracy can be a challenge. According to a recent survey [39], 73% of respondents cited accuracy as the biggest obstacle to adopting speech recognition technology. Before diving into the barriers to accuracy, it is worth mentioning that the Word Error Rate (WER) is the metric commonly used to measure the accuracy and performance of speech recognition systems.

Lastly, ASR models often exhibit domain-specific performance. This implies that if a model is trained exclusively on a particular dataset or within a specific domain (such as healthcare, finance, or tourism), it might encounter challenges in recognizing and processing inputs beyond that training domain. Consequently, this limitation can result in recognition errors and diminished accuracy when confronted with new or unfamiliar data.

1.3 Target of the Thesis

The objective of this master's thesis is to develop and train an intelligent chatbot using freely available data sources from online forums, FAQs, and videos on YouTube. The specific goals include:


Data Collection and Processing: Research and develop effective methods for collecting dialogue data from free online sources. This involves identifying appropriate data sources, and filtering and cleaning the data to ensure quality and reliability.

Analysis and Modeling: Analyze the characteristics of dialogue data, such as structure, context, and linguistic diversity. Develop suitable machine learning or deep learning models for training the chatbot, focusing on researching and constructing algorithms for the chatbot using deep learning methods and large language models in a sequence-to-sequence format.

Chatbot Training: Apply advanced techniques in artificial intelligence and machine learning to train the chatbot to understand and respond accurately and naturally.

Evaluation and Improvement: Evaluate the performance of the chatbot through testing methods and user feedback. Use the evaluation results for continuous improvement of the chatbot model.

Practical Application: Explore the potential application of chatbots in the healthcare sector, emphasizing the integration of ASR. Evaluate how this integration impacts user access and interaction to understand the changes in healthcare service delivery.

1.4 Scope of the Thesis

The scope of this master's thesis encompasses several key areas. Firstly, the primary focus of this thesis is on the healthcare sector, utilizing datasets gathered from FAQs on hospital websites in Vietnam. Despite the potential applicability of the methodologies and technologies in other fields, the primary emphasis remains on healthcare. This approach ensures specialized attention to the unique requirements of the healthcare industry but limits the immediate applicability to other domains.

Secondly, in terms of methodology, the thesis employs an advanced Sequence-to-Sequence (Seq2Seq) model, integrating deep learning techniques such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). This model is structured with an encoder Recurrent Neural Network (RNN), which processes the input sequence, and a decoder RNN, responsible for generating the output sequence. Additionally, Luong's attention mechanism is employed to enhance the model's ability to focus on relevant parts of the input while generating each part of the output. To streamline the training process, a greedy decoding strategy is utilized, where the most probable next output is selected at each step. While this combination is effective for complex dialogue modeling, it poses challenges in terms of technical complexity and computational resources.
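As a small illustration of the greedy strategy described above, the following Python sketch selects the arg-max token at every step until an end-of-sequence symbol is produced; the decoder_step stand-in, the vocabulary size, and the token ids are hypothetical placeholders rather than the thesis's actual implementation.

```python
import numpy as np

VOCAB_SIZE, EOS_ID = 100, 2          # hypothetical vocabulary size and end-of-sequence id
rng = np.random.default_rng(0)

def decoder_step(prev_token, hidden):
    """Stand-in for one decoder step: returns next-token scores and a new hidden state."""
    scores = rng.normal(size=VOCAB_SIZE)
    return scores, hidden

def greedy_decode(hidden, max_len=20, start_id=1):
    tokens = [start_id]
    for _ in range(max_len):
        scores, hidden = decoder_step(tokens[-1], hidden)
        next_token = int(np.argmax(scores))      # greedy: most probable next output
        tokens.append(next_token)
        if next_token == EOS_ID:                 # stop when end-of-sequence is produced
            break
    return tokens

print(greedy_decode(hidden=None))
```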

Thirdly, the foundational model used is the Generative Pretrained Transformer-3 (GPT-3), based on the Transformer architecture. This choice represents the cutting edge in AI technology but restricts flexibility in adapting to new methods and technologies.

Lastly, the performance of the text-to-text model is evaluated using a combination of F1-score, precision, recall, and the Bilingual Evaluation Understudy (BLEU) score, while the speech-to-text model's performance is measured with the Word Error Rate (WER) and Perplexity (PER), providing a more comprehensive and widely accepted set of metrics. However, relying solely on these may not fully capture all performance aspects in practical scenarios, as each metric emphasizes different aspects of model performance.

1.5 Contribution

The contributions of this master's thesis are threefold. Firstly, it compares the performance of traditional Seq2Seq models combined with Luong's attention mechanism against the decoder-only approach of GPT-3, customized for the Vietnamese language. This comparative analysis provides insights into the strengths and weaknesses of each model architecture in the context of structured dialogue processing and language understanding, particularly in the healthcare domain.

Secondly, it enhances the accuracy of the ASR model by selecting and evaluating the experiments with the best results for the dataset. This contribution aims to address the challenges of speech recognition in noisy environments and varying acoustic conditions, thereby improving the overall performance and usability of the ASR technology.

Lastly, it integrates the two models, text-to-text and speech-to-text, to develop a chatbot supporting voice interactions in the Vietnamese language in the healthcare domain. This integration expands the chatbot's functionality to accommodate users who prefer or require voice-based interactions, thereby enhancing accessibility and user experience in healthcare services.

1.6 Thesis Structure

Thesis "Supporting Voice Communication in Chatbot" includes Five Chapters with the following main content:

Chapter 1, INTRODUCTION: presents an overview of the topic, the reasons for carrying out the research, and the practical significance of the problem, as well as the scope and limitations of the topic. Finally, the tasks and structure of the thesis are described.

Chapter 2, BACKGROUND: synthesizes the most relevant academic issues to be applied to solve the problem, focusing mainly on deep learning, the basics of HMM, and the path from Artificial Neural Networks to Recurrent Neural Networks, LSTM, and GRU in the Seq2Seq model and the ASR model. This chapter also provides a general overview of related research that has been and is being conducted, as well as the current general trends in solving the problem (Luong's attention mechanism, the theory of encoders and decoders, and particularly GPT-3's architecture). This section also discusses and evaluates these methods, as they form an important basis for the student's research during the thesis process.

Chapter 3, PROPOSED MODEL: introduces the proposed model for the chatbot. At the same time, it presents the improvements and the motivations for those proposals. Finally, the student presents the steps to conduct experiments on the dataset and evaluates the results of the improvements compared to the chosen model.

Chapter 4, IMPLEMENTATION: involves the selection, training, evaluation, and integration of models to develop a robust and effective voice-to-text chatbot tailored for the Vietnamese language and healthcare domain.


Chapter 5, CONCLUSION: synthesizes the results achieved during the thesis process, from the research and hypothesis construction to the experimental deployment. This section also discusses the limitations and outstanding issues, and finally proposes solutions for future improvements. The Table of Contents, List of Figures, List of Tables, and Acronyms are provided at the beginning of the thesis. The references are presented at the end of the thesis.


CHAPTER 2 BACKGROUND

In this chapter, we explore fundamental theories crucial for the implementation of both the Seq2Seq model and the ASR model. Our discussion spans key topics, including the integration of LSTM and GRU with word embedding models like Word2Vec and Global Vectors (GloVe), specifically applied in the Seq2Seq model. Simultaneously, we delve into the realm of the ASR model, covering diverse content ranging from the Hidden Markov Model to theories involving Deep Neural Networks (DNN), Artificial Neural Networks (ANN), Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), the multilayer perceptron, the vanishing gradient problem, and various regularization techniques. This comprehensive exploration sheds light on the theoretical foundations essential for effectively deploying these models in practical applications.

2.1 Hidden Markov Model (HMM)

The HMM is a statistical model. In speech recognition, the HMM provides a statistical representation of the sounds of words [1]. The architecture of an HMM in speech recognition is given in Figure 2.1. The HMM consists of a sequence of states. In an HMM, the current state is hidden and only the output from each state can be observed. Each state in the HMM corresponds to a frame in the audio input. The model parameters estimated in acoustic training are θ = [{aij}, {bj()}], where {aij} corresponds to the transition probabilities and {bj()} to the output observation distributions. The transition probability aij is the probability of changing from state i to state j.


Figure 2.1 HMM-based phone model [1]

An important feature of the HMM is that the self-loops aii make it possible for the HMM to model the varying duration of a phone. When performing a transition and entering a new state in the HMM, a feature vector is created using the distribution associated with that particular state. The first and last states in the HMM are called non-emitting states. For example, in Figure 2.1, s1 is the entry state and s5 is the exit state. They are used as entrances and exits of the model and simplify the concatenation of HMM phone models to form words.
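To illustrate how the transition probabilities {aij} and output distributions {bj()} are used in practice, the following Python sketch computes the likelihood of a short observation sequence with the standard forward algorithm; the toy probabilities below are invented purely for illustration.

```python
import numpy as np

# Toy HMM: 3 emitting states, discrete observations (illustrative values only).
a = np.array([[0.6, 0.4, 0.0],   # a[i, j] = P(state j at t+1 | state i at t)
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
b = np.array([[0.7, 0.3],        # b[j, o] = P(observation o | state j)
              [0.4, 0.6],
              [0.5, 0.5]])
pi = np.array([1.0, 0.0, 0.0])   # start in the first emitting state

def forward_likelihood(obs):
    """Standard forward algorithm: P(observation sequence | model)."""
    alpha = pi * b[:, obs[0]]                 # initialisation
    for o in obs[1:]:
        alpha = (alpha @ a) * b[:, o]         # induction step
    return alpha.sum()                        # termination

print(forward_likelihood([0, 1, 1]))          # likelihood of a short sequence
```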

2.2 Deep Neural Networks

An alternative to the Gaussian mixture models in speech recognition is to use a DNN [2]. A DNN is a feed-forward artificial neural network with more than one hidden layer between the input layer and the output layer, as illustrated in Figure 2.2. The connections between nodes carry weights, and the output at every node is calculated by an activation function. Typically, the input of a node in a layer of the DNN is calculated from the layer below:

x_j = b_j + Σ_i y_i w_ij

where b_j is the bias of unit j, i is an index over units in the layer below, and w_ij is the weight on the connection to unit j from unit i in the layer below. The output of the unit for the upper layer is then calculated as

y_j = f(x_j)

where f is the activation function. The hidden layers make the DNN able to model non-linear and complex relationships in the data. For multiclass classification, output unit j converts its total input x_j into a probability using a SoftMax function [2]. In Kaldi, the function used to estimate the posterior probabilities for the HMM states is

y_ut(s) ≜ P(s | o_ut) = exp{a_ut(s)} / Σ_s' exp{a_ut(s')}

where o_ut denotes the observation at time t in utterance u and a_ut(s) is the activation at the output layer corresponding to state s [3].
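To make the layer computation and the SoftMax posterior concrete, here is a minimal NumPy sketch of one forward pass through a small DNN; the layer sizes, random weights, and the use of a sigmoid hidden layer are illustrative assumptions, not the configuration used later in this thesis.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())        # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)

# One acoustic feature vector (e.g., 13 MFCC coefficients) and two layers.
x = rng.normal(size=13)
W1, b1 = rng.normal(size=(13, 32)), np.zeros(32)   # input -> hidden
W2, b2 = rng.normal(size=(32, 5)), np.zeros(5)     # hidden -> 5 HMM states

h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))   # x_j = b_j + sum_i y_i w_ij, then sigmoid
a_out = h @ W2 + b2                        # activations a(s) at the output layer
posteriors = softmax(a_out)                # y(s) = P(s | o), the SoftMax posterior

print(posteriors, posteriors.sum())        # the posteriors sum to 1 over the states
```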


The goal in training is to optimize an objective function and update the weights of the internal nodes based on the information transmitted to the model. An important parameter in training is the learning rate: the greater the learning rate, the faster but less accurate the training.

Figure 2.2 A Deep Neural Network

2.3 Artificial Neural Networks

Basically, an ANN is a computational model built on the structure and function of biological neural networks (although the structure of the ANN is affected by the flow of information through it). This neural network therefore changes depending on its inputs and outputs. We can think of an ANN as a nonlinear statistical data model: a complex relationship is defined between input and output, and as a result we obtain many different patterns.

The ANN takes its ideas from how the human brain works – making the right connections. Instead of biological tissue, an ANN uses silicon and wires to play the role of neurons and dendrites. In the human body, the brain is composed of about 86 billion neurons, and they are connected to thousands of other cells through axons. Because humans receive so many different information inputs from the senses, the body also has many dendrites to help transmit this information. They generate electrical impulses that transmit information through this neural network. The same goes for an ANN: when different problems need to be dealt with, a neuron sends a message to another neuron.

Figure 2.3 Neuron Anatomy [53]

Therefore, we can say that an ANN consists of many internal nodes that mimic the biological neurons inside the human brain. The ANN connects these neurons by links, and they interact with each other. Nodes in an ANN are used to receive input data, and the operations performed on the data are very simple; after these operations, the results are transferred to other neurons. The output at each node is called its activation value or node value. Every link in the ANN is associated with a weight. In addition, ANNs have the ability to learn, which takes place by changing the weight values. Figure 2.4 is an illustration of a simple ANN.


Figure 2.4 A Simple Example of the Structure of a Neural Network

A classic but simple type of node in Neural Networks (NN) is the McCulloch-Pitts node [4]. An illustration of this node, or neuron as McCulloch and Pitts liked to call it, can be seen in Figure 2.5. The calculation performed in the McCulloch-Pitts node is essentially a weighted sum followed by a threshold function: the inputs are added and, if the sum is above a certain threshold, the node outputs 1; otherwise it outputs 0. There are much more complex representations of nodes, but the McCulloch-Pitts neuron is a good starting point for understanding the basics of NNs.

Figure 2.5 The McCulloch-Pitts Neuron [4]
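A minimal sketch of such a threshold unit, with weights and a threshold made up purely for illustration, is shown below.

```python
import numpy as np

def mcculloch_pitts(inputs, weights, threshold):
    """Fire (output 1) if the weighted sum of the inputs reaches the threshold."""
    return 1 if np.dot(inputs, weights) >= threshold else 0

# A 2-input unit that behaves like a logical AND gate.
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, mcculloch_pitts(x, weights=[1, 1], threshold=2))
```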

Along with the nodes, a neural network is also formed by so-called "edges", which carry the network weights. What they do is essentially multiply the output of a node by their weight before transferring it to the next neuron to which they are connected. By updating these weights, depending on the output generated for an example, researchers can teach the network to distinguish which input data should produce which output. The procedure for updating weights is called backpropagation and will be described in a later section.

2.4 Convolutional Neural Network

The CNN is one of the most advanced Deep Learning models. CNNs allow researchers to build intelligent systems with extremely high accuracy. Figure 2.6 shows a basic architecture of a CNN.


Figure 2.6 The architecture of CNN [57]

A convolution is a type of sliding window applied over a matrix. Convolutional layers have learned parameters that adjust to retrieve the most relevant information without manual feature selection. Convolution is the element-wise multiplication of matrices. The sliding window is also known as a kernel, filter, or feature detector, and is a small matrix.

A CNN compares images piece by piece, and these pieces are called features. Instead of matching whole images together, the CNN looks for similarity by searching for raw features that match each other across two images. Each feature is like a mini-image, i.e., a small 2-dimensional array. These features correspond to certain aspects of the image, and they can fit together. The following are the basic layers of a CNN.

Convolutional layer. This is the most important layer of a CNN and is tasked with doing all the computation. The important elements of a convolutional layer are: stride, padding, filter map, and feature map.

• Stride is the number of pixels by which the filter map is moved, from left to right, at each step.

• Padding consists of 0 values added around the input.

• A CNN uses filters applied to regions of the image. These filter maps are 3-dimensional matrices of numbers, and these numbers are parameters.

• The feature map represents the result of each filter map's scan over the input; after each scan, the computation takes place.


Rectified Linear Unit (ReLU) layer. The ReLU layer applies the activation function of the network. The activation function simulates the firing rate of a neuron along its axon. Common activation functions include ReLU, Leaky ReLU, Tanh, Sigmoid, and Maxout. Currently, the ReLU function is the most widely used for training neural networks because it brings outstanding advantages such as much faster computation. When using ReLU, we need to pay attention to customizing the learning rate and tracking dead units. The ReLU layer is applied after the filter maps have been computed, applying the ReLU function to the values of the filter map.

Pooling layer. When the input is too large, pooling layers are placed between the convolutional layers to reduce the number of parameters. Currently, there are two main types of pooling layer: max pooling and average pooling.

Fully connected layer. This layer is responsible for producing the result after the image has passed through the convolutional and pooling layers. At this point, the model has read the information of the image; to link this information together and produce the output, fully connected layers are used. The fully connected layer flattens its input into a single vector and produces a score for each class, similar to votes cast for the highest-quality prediction.
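As a concrete illustration of these layers stacked together, the following PyTorch sketch builds a small image classifier (convolution → ReLU → pooling → fully connected); the channel counts and input size are arbitrary examples, not the networks discussed later in the thesis.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(in_channels=1, out_channels=8,
                              kernel_size=3, stride=1, padding=1)  # convolutional layer
        self.relu = nn.ReLU()                                      # ReLU layer
        self.pool = nn.MaxPool2d(kernel_size=2)                    # pooling layer
        self.fc = nn.Linear(8 * 14 * 14, num_classes)              # fully connected layer

    def forward(self, x):
        x = self.pool(self.relu(self.conv(x)))  # feature maps: 8 x 14 x 14 for a 28x28 input
        x = x.flatten(start_dim=1)              # flatten before the fully connected layer
        return self.fc(x)                       # class scores ("votes")

scores = TinyCNN()(torch.randn(1, 1, 28, 28))   # one 28x28 grayscale image
print(scores.shape)                             # torch.Size([1, 10])
```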

2.5 Multilayer Perceptron

2.5.1 Backpropagation

The name backpropagation comes from the term used by Rosenblatt (1962) for attempts to generalize the perceptron learning algorithm to several layers, even though none of the many attempts to do so in the 1960s and 1970s were particularly successful [40]. The backpropagation algorithm is one of the most important tools of artificial neural networks; it is the part that deals with the training of the network, i.e., where it learns. During this process, the network updates the weights of all edges so that it produces the correct output for a particular input.


Stochastic Gradient Descent (SGD) is used to train the NN in order to minimize the cost function iteratively. The loss function mainly follows the principle of maximum likelihood: the goal is to reduce the cross-entropy error between the target output and the predicted output. The gradient update for a weight w is

w ← w − η · ∂C/∂w

where η is the learning rate and C is the cost function. The activation function is represented by φ, and the error of the j-th neuron in layer l is denoted by the term δ_j^l. For the output layer L it is

δ_j^L = (∂C/∂a_j^L) · φ'(z_j^L)

where a_j^L = φ(z_j^L) is the neuron's activation. According to the chain rule, δ_j^l for a hidden layer can be written as

δ_j^l = ( Σ_k w_kj^(l+1) · δ_k^(l+1) ) · φ'(z_j^l)

Hence, the errors of the hidden layers' neurons can be obtained layer by layer, and the gradients with respect to the model weights are calculated as

∂C/∂w_jk^l = a_k^(l−1) · δ_j^l
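The following NumPy sketch applies these delta and weight-update equations to a tiny one-hidden-layer network trained with SGD on made-up data; it is only meant to make the formulas above concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
x, t = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))     # toy inputs and targets

W1 = rng.normal(size=(3, 5)) * 0.1      # input -> hidden weights
W2 = rng.normal(size=(5, 1)) * 0.1      # hidden -> output weights
eta = 0.1                               # learning rate

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(500):
    # Forward pass.
    a1 = sigmoid(x @ W1)                                # hidden activations
    y = a1 @ W2                                         # linear output
    # Backward pass: deltas for the output and hidden layers.
    delta_out = (y - t)                                 # dC/dy for a squared-error cost
    delta_hidden = (delta_out @ W2.T) * a1 * (1 - a1)   # chain rule through the sigmoid
    # SGD updates: w <- w - eta * dC/dw.
    W2 -= eta * a1.T @ delta_out / len(x)
    W1 -= eta * x.T @ delta_hidden / len(x)

print(float(((sigmoid(x @ W1) @ W2 - t) ** 2).mean()))  # final training error
```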


2.5.2 The Vanishing Gradient Problem

The problem arises when many layers use certain activation functions, such as the sigmoid function. This function maps a large input space into a small output range between 0 and 1, so a significant change in the input results in only a small change in the output; thus, its derivative becomes small.

Figure 2.7 Sigmoid Function and its Derivative by [43]

This is caused by the chain rule used in backpropagation, which calculates gradients by moving layer by layer from the last layer back to the first. According to the chain rule, each subsequent derivative is multiplied by the previously calculated value. Therefore, when there are n hidden layers using sigmoid as the activation function, n small derivatives are multiplied together. As a result, the gradient value decreases exponentially as the backpropagation algorithm progresses toward the first layers [43].
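A quick numerical illustration of this effect: the sigmoid derivative is at most 0.25, so multiplying it across many layers shrinks the gradient exponentially (the depth and input value below are arbitrary).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)         # maximum value is 0.25, at z = 0

grad = 1.0
for layer in range(20):          # pretend the signal passes through 20 sigmoid layers
    grad *= sigmoid_derivative(0.5)

print(grad)                      # on the order of 1e-13: the gradient has vanished
```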

The following are solutions that help avoid this problem.

• The first solution is to use another activation function, such as ReLU, which does not cause a small derivative.

• The next solution is residual networks (ResNets). They provide residual connections straight to later layers, effectively bypassing the activation functions. This results in larger derivative values and makes it possible to train much deeper networks [44].

• The last one is batch normalization. The batch normalization method normalizes the input to a predefined scale, where the sigmoid derivative is not small.

2.6 Regularization

It is important to consider the problem of overfitting when building NNs or machine learning algorithms. Overfitting is when the model begins to learn features that are too specific to the training set. Basically, the model learns not only the general rules that lead from input to output, but also additional rules that may describe the training set well but are not necessarily valid at the general level. This process reduces the training error but increases the evaluation error. As a result, the model performs worse on unseen data because of the overly specific rules it has learned from the training set. If overfitting occurs when the model fits the training set too well, the opposite phenomenon is called underfitting, i.e., when the model learns rules that are too general. An illustration of these phenomena can be found in Figure 2.8.

Figure 2.8 Underfitting, Optimal and Overfitting

2.6.1 Dropout

Dropout is the skipping of units (i.e., network nodes) during training in a random way. By being omitted, a unit is not considered during the forward and backward passes. Accordingly, p is the probability of retaining a network node in each training phase, so the probability of it being dropped is (1 − p). This method helps to avoid overfitting [45]. If a fully connected layer has too many parameters and takes up most of the parameters, the network nodes in that layer become too interdependent during training, which limits the power of each node and leads to excessive co-adaptation.
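A minimal sketch of (inverted) dropout applied to a layer's activations is given below, assuming a retention probability p; dividing by p keeps the expected activation unchanged between training and inference.

```python
import numpy as np

def dropout(activations, p=0.8, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: keep each unit with probability p during training."""
    if not training:
        return activations                      # no dropout at inference time
    mask = rng.random(activations.shape) < p    # 1 with probability p, else 0
    return activations * mask / p               # rescale so the expectation matches

h = np.ones(10)
print(dropout(h, p=0.8))   # roughly 80% of units kept, survivors scaled by 1/0.8
```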

2.6.2 Weight Decay

L2 regularization, or weight decay, is a regularization technique applied to the weights of a neural network. The technique adds a penalty on the weight norm to the loss function, so that training minimizes both the original loss and this penalty. Weight decay is therefore simply an addition to the loss function of the network and can be described by the following equation:

L_new(w) = L_original(w) + λ·wᵀw

where λ is a value that determines the strength of the penalty and L(w) is the chosen loss function. If the λ value is very small, weight decay will not help regularize the network. In contrast, if λ is too large, the penalty term dominates and the network will only aim to keep its weights near 0. This effect can be seen in Figure 2.9.
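A small sketch of the penalized loss and its gradient for a linear model follows; λ and the data are arbitrary illustrative values (in frameworks such as PyTorch, the same effect is typically obtained by passing a weight_decay argument to the optimizer).

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
w, lam, eta = np.zeros(3), 0.01, 0.1     # weights, penalty strength, learning rate

def loss(w):
    residual = X @ w - y
    return (residual ** 2).mean() + lam * w @ w     # L_new = L_original + lambda * w^T w

for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y) + 2 * lam * w   # gradient includes the decay term
    w -= eta * grad

print(loss(w), np.linalg.norm(w))   # the penalty keeps the weight norm small
```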

Figure 2.9 Underfitting, Optimal weight decay and Overfitting

2.7 Recurrent Neural Networks

RNNs [12] have revolutionized the field of customer service by enabling the creation of chatbots that can engage in more natural and effective dialogues. Unlike traditional NNs that process inputs in isolation, RNNs are designed to recognize and remember patterns over sequences of words, making them ideal for parsing customer queries and maintaining the context of a conversation. This sequential memory allows chatbots to provide more accurate and contextually relevant responses, improving the customer experience. RNNs can be trained on vast datasets of customer interactions, allowing them to understand a wide range of requests and issues. However, they do face limitations with longer sequences, where they may struggle to maintain context over extensive dialogues.

Figure 2.10 The Recurrent Neural Network [5]

The image shows an RNN and how it unfolds through time. At the heart of an RNN is a loop that allows information to persist. In the diagram, the loop is unrolled to show the entire sequence of operations over time. The following steps describe how it works:

Input (x): At each time step t, the RNN takes an input xt and the previous hidden state st−1.

Hidden State (s): The hidden state st at time t is updated by applying a weight matrix U to the input xt and another weight matrix W to the previous hidden state st−1. The function of the hidden state is to capture and carry information forward through time.

Output (o): The output ot at time t is then computed using the current hidden state st and a weight matrix V. In many applications, this output is then passed through a function, such as SoftMax, to make decisions or predictions based on the learned patterns.

Weights: There are three sets of weights: U for input to hidden, W for hidden to hidden (the recurrent connection), and V for hidden to output. These weights are shared across all time steps, which allows the RNN to generalize across sequences of varying lengths. A minimal sketch of this unrolled computation is given below.
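The sketch below unrolls these steps over a short random sequence; the dimensions and weights are placeholders chosen only to illustrate st = tanh(U·xt + W·st−1) and ot = SoftMax(V·st), with the same U, W, and V reused at every time step.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 4, 8, 3

U = rng.normal(size=(hidden_dim, input_dim))    # input -> hidden
W = rng.normal(size=(hidden_dim, hidden_dim))   # hidden -> hidden (recurrent connection)
V = rng.normal(size=(output_dim, hidden_dim))   # hidden -> output

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

xs = rng.normal(size=(5, input_dim))            # a sequence of 5 input vectors
s = np.zeros(hidden_dim)                        # initial hidden state s_0

for t, x in enumerate(xs):
    s = np.tanh(U @ x + W @ s)                  # s_t from x_t and s_{t-1}
    o = softmax(V @ s)                          # o_t from s_t
    print(f"t={t}", o.round(3))                 # the same U, W, V are reused every step
```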


RNNs are powerful because they can theoretically use their internal state (memory) to process sequences of inputs of any length. However, they can be difficult to train due to issues like vanishing and exploding gradients, which can occur during backpropagation through the many layers of the network. As a result, RNNs face limitations with longer sequences, where they may struggle to maintain context over extensive dialogues. This is often addressed by using advanced RNN architectures like LSTM or GRU, which are better at capturing long-term dependencies and can maintain context over longer conversations, a critical aspect of customer service interactions.

2.8 Long Short-Term Memory

LSTM networks, a specialized form of RNNs, are crafted to capture long-term dependencies within data sequences effectively. The architecture of an LSTM features a cell state, akin to a conveyor belt, which traverses the length of the network with minimal linear interaction, ensuring the preservation of information over time. This cell state is modulated by a series of gates: the forget gate uses a sigmoid function to decide which information the cell state should discard, the input gate decides which values to update and integrates new candidate values created by a tanh layer, and the output gate determines the next hidden state by filtering the cell state through a tanh layer and applying the sigmoid function's output. The hidden state, responsible for predictions, is updated with information from the cell state, providing the LSTM with the capability to maintain relevance in sequential data over long periods. This quality is particularly beneficial for language modeling, where understanding context from extensive data sequences is paramount. These cells have three gates that control the flow of information, as described below [6].

Input Gate: Decides the degree to which new input should affect the memory.

Forget Gate: Determines what portions of the existing memory should be forgotten.

Output Gate: Selects the parts of the memory to be used in the output.
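A brief PyTorch sketch shows how such gated cells are used in practice through the built-in nn.LSTM module; the dimensions and the random input are arbitrary and serve only to show the shapes of the hidden and cell states.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

x = torch.randn(2, 10, 16)          # batch of 2 sequences, 10 time steps, 16 features
outputs, (h_n, c_n) = lstm(x)       # the three gates are applied internally at every step

print(outputs.shape)                # torch.Size([2, 10, 32]) – hidden state at each step
print(h_n.shape, c_n.shape)         # final hidden state and cell state: [1, 2, 32]
```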


Figure 2.11 LSTM Network Architecture [6]

2.8.1 The Long-Term Dependency Problem

A key feature of RNNs is the idea of using preceding information to make predictions for the present, similar to how one uses previous scenes in a movie to understand the current scene. If RNNs could do this effectively, they would be incredibly useful; however, whether they can accomplish this depends on the specific case. Sometimes, just revisiting recently obtained information is sufficient to understand the current situation. For example, in the sentence "Cách điều trị hen suyễn", once we have read "Cách điều trị hen", it is enough to predict that the next word will be "suyễn". In this scenario, the distance to the information needed for making the prediction is short, so RNNs are entirely capable of learning it.

Figure 2.12 RNN and Short-Term Dependencies [7]

But in many situations, we need more context to infer. For example, consider predicting the last word in the phrase "Tôi bị bệnh hen suyễn… hen suyễn dị ứng, từ phấn hoa." Clearly, the recent information tells us that what follows will be the name of some disease, but it is impossible to know exactly which disease it is. To understand what it is, we need the additional context of "Hen suyễn dị ứng, từ phấn hoa" further back in the sequence to make the inference. Obviously, the information distance here may already be quite far. Unfortunately, as the distance grows, RNNs start to struggle to remember and learn.

Theoretically, RNNs are fully capable of handling "long-term dependencies", meaning that current information can be derived from a long sequence of previous information. However, in practice, RNNs lack this capability. This issue has been highlighted by Hochreiter and Bengio, among others, as a fundamental challenge for the RNN. LSTM networks, in contrast, are explicitly designed to avoid the long-term dependency problem: remembering information over long periods is practically their default behavior, allowing them to retain memories without any external intervention.

All recurrent networks have the form of a chain of repeating neural network modules. In standard RNNs, these repeating modules have a very simple structure, typically a single tanh layer.


Figure 2.14 The Repeating Modules in an RNN Contains One Layer [7]

Similarly, the LSTM also has a chain architecture, but the modules within it have a different structure compared to standard RNNs. Instead of just a single neural network layer, they have four layers that interact in a very special way.

Figure 2.15 The Repeating Modules of an LSTM Contain Four Layers [7]

2.9 GRU

The GRU is a newer generation of RNN and is quite similar to the LSTM [8]. GRUs are designed to solve the vanishing gradient problem that can occur in standard RNNs. They do this by using gating mechanisms to control the flow of information. The architecture consists of two gates: a reset gate and an update gate.

Reset Gate: This gate determines how much of the past information needs to be forgotten. It can be thought of as a way to decide how much past information to discard, which helps the model drop irrelevant information from the past.

Update Gate: This gate decides how much of the past information will carry over to the current state. It is akin to a combination of the forget and input gates in an LSTM, allowing the model to determine how much of the past information should influence the current state.

Figure 2.16 The Architecture of GRU [9]

During its operation, the GRU first takes in the input and the previous hidden state to inform its gates. The reset gate uses this information to decide which parts of the past data should be forgotten, while the update gate determines how much of the previous hidden state should be carried forward. The old information is blended with the new input to form a candidate hidden state, which is then combined with the old state, modulated by the update gate, to produce the final hidden state for the current time step.
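The following PyTorch sketch steps through a sequence with the built-in nn.GRUCell, which applies the reset and update gates internally at each step; the sizes and random input are illustrative only.

```python
import torch
import torch.nn as nn

cell = nn.GRUCell(input_size=16, hidden_size=32)

xs = torch.randn(10, 1, 16)          # 10 time steps, batch of 1, 16 features
h = torch.zeros(1, 32)               # initial hidden state

for x in xs:
    h = cell(x, h)                   # reset and update gates are applied inside the cell

print(h.shape)                       # torch.Size([1, 32]) – final hidden state
```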

This structure allows GRUs to keep relevant backpropagation error signals alive, making them capable of learning over many time steps, which is particularly useful for tasks that require the understanding of long-term dependencies, such as language modeling and time-series analysis. GRUs offer a simpler and more computationally efficient alternative to LSTMs while providing similar benefits.

2.10 Word Embedding Model

Word Embedding is a general term for language models and feature-learning methods in Natural Language Processing (NLP), where words or phrases are mapped to numerical vectors (usually real numbers) This tool plays a crucial role in most
