THESIS INTRODUCTION
Overview
The idea of human-computer interaction through natural language was popularized by Hollywood movies. C-3PO is one of the legendary droids of the Rebel Alliance in the Star Wars universe. This robot has served several generations of Skywalkers and is one of the most distinctive personality droids in the saga. Throughout the series, C-3PO not only gestures and communicates in a very human-like way, but sometimes gives its owners valuable guidance. Star Wars was a cinematic product far ahead of its era in predicting the future of Artificial Intelligence (AI). Its fictional universe is set in a galaxy where humans and alien creatures live in harmony with droids, robots capable of assisting people in daily life or on journeys to other planets. Similarly, in the movie Iron Man (2008), Tony Stark relied on his supercomputer assistant, JARVIS, for support in everyday life and on missions to save the world alongside other superheroes.
A notable example is the film A.I. Artificial Intelligence (2001), adapted from the short story Supertoys Last All Summer Long, which depicts the 22nd century, when rising sea levels have washed away coastal cities and caused a serious decline in population. The Mecha line of robots simulates real people and is designed to integrate with humans; they can reason well but do not know how to express emotions.
The history of Interactive Voice Response (IVR) systems began in the 1930s, when the Voder machine was created. The technology was the first to analyze the English language and produce human-like sounds. The original speech recognition systems were rudimentary and understood only numerals, because engineers thought human language was too complex. In 1952, Bell Laboratories designed "Audrey," a system for recognizing spoken digits from a given voice. Ten years later, at the World's Fair, IBM demonstrated its "Shoebox" system, which could recognize 16 different English words. The vision behind these projects was that users could communicate with computers through natural language and therefore would not have to learn any specific commands or prompts. However, understanding spoken language turned out to be quite complicated. It can be argued that only entities (humans) living in the real world can effectively understand language, because without context the meaning of a word is incomprehensible.
IVRs became widely used in business in the 1990s, and call queuing and automated call routing became popular in the mid-1990s. In the late 1990s, the move of multimedia into call centers led companies to invest in IVR systems with computer telephony integration (CTI). This integration allows businesses to connect their call centers to their marketing campaigns. Moreover, continuous improvements in IVR have made it cheaper for companies to deploy. Contemporary platforms emerged in the 2010s, a period that emphasized integrating IVR with comprehensive analytics, automated SMS messaging, and advanced call monitoring features. Modern IVR systems are now part of a larger solution and enable seamless integration of customer communications across channels. Unlike bulky and expensive standalone systems, these advanced all-inclusive platforms now offer options, giving customers the opportunity to choose their preferred method of contact. Today, IVR has been integrated into the overall customer experience. It now comes with a personalized brand voice, protects customer data, and detects fraud and spam. In addition to routing calls to the department best able to address customer needs, the tool is now integrated into marketing efforts. The self-service model has evolved significantly with the arrival of conversational IVR. These AI-enabled technologies replicate the experience of talking to a live agent. Today's IVR systems provide solutions to customers faster, even without a direct operator connection, and are useful across many industries and use cases.
IVR self-service empowers customers to manage tasks like reservations, payments, and information retrieval This aligns with consumer preferences, as Zendesk research indicates 69% attempt self-resolution before contacting support To maximize customer satisfaction, businesses should ensure their IVR systems adhere to best practices Poorly designed automation can lead to customer frustration if it fails to resolve issues or wastes time.
Figure 1.1 The History of IVR [17]
The basic and significant problem with most IVR systems is that, no matter how useful they may be in providing options that fit many customer queries and issues, most customers really just want to talk to a real person. Nothing beats direct human-to-human communication. This may seem counterintuitive if an IVR system is being used to reduce staffing costs, but adding the option to talk to an advisor on the main menu, without forcing customers to search the verbal maze of menu options, will make most customers more satisfied and less frustrated. Automatic speech recognition (ASR) is a classic feature of many IVR systems and allows users to communicate with the IVR system by voice instead of pressing phone keys, which can be difficult for users to do. But if ASR is incapable of recognizing what people are saying, it makes the system frustrating and, worse, useless. Therefore, ensuring accuracy as well as optimizing the design of ASR interfaces is essential to meeting the high demands of users.
Research Problem
There are several challenges associated with the current implementation of the voice-to-text chatbot. The first is the quality and coherence of generative chatbots. They generate responses based on statistical patterns learned from large datasets of text. While they can produce more diverse and flexible responses than retrieval-based models, the quality and coherence of their output can vary widely. They may generate nonsensical or contextually inappropriate responses, especially when faced with input they have not encountered during training.
Accuracy is crucial for Automatic Speech Recognition (ASR) models in IVR systems, allowing customers to interact verbally instead of using keypads. However, achieving high accuracy can be challenging, with 73% of respondents in a recent survey citing it as the primary obstacle to adopting speech recognition technology. WER (Word Error Rate) serves as a standard metric for evaluating ASR system accuracy and performance.
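To make the WER metric concrete, the following sketch computes it as the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and an ASR hypothesis, divided by the number of reference words; the example strings are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical example: one substituted word out of five reference words -> WER = 0.2
print(word_error_rate("cách điều trị hen suyễn", "cách điều trị hen xuyễn"))
```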
Lastly, ASR models often exhibit domain-specific performance. This implies that if a model is trained exclusively on a particular dataset or within a specific domain (such as healthcare, finance, or tourism), it might encounter challenges in recognizing and processing inputs beyond that training domain. Consequently, this limitation can result in recognition errors and diminished accuracy when confronted with new or unfamiliar data.
Target of the Thesis
The objective of this master's thesis is to develop and train an intelligent chatbot using freely available data sources from online forums, FAQs, and YouTube videos. The specific goals include:
Data Collection and Processing: Research and develop effective methods for collecting dialogue data from free online sources. This involves identifying appropriate data sources, filtering, and cleaning data to ensure quality and reliability.
Analysis and Modeling: Analyze the characteristics of dialogue data, such as structure, context, and linguistic diversity. Develop suitable machine learning or deep learning models for training the chatbot, focusing on researching and constructing algorithms for the chatbot using deep learning methods and large language models in a sequence-to-sequence format.
Chatbot Training: Apply advanced techniques in artificial intelligence and machine learning to train the chatbot to understand and respond accurately and naturally.
Evaluation and Improvement: Evaluate the performance of the chatbot through testing methods and user feedback. Use the evaluation results for continuous improvement of the chatbot model.
Practical Application: Explore the potential application of chatbots in the healthcare sector, emphasizing the integration of ASR Evaluate how this integration impacts user access and interaction to understand the changes in healthcare service delivery.
Scope of the Thesis
The scope of this master's thesis encompasses several key areas. Firstly, the primary focus of this thesis is on the healthcare sector, utilizing datasets gathered from FAQs on hospital websites in Vietnam. Despite the potential applicability of the methodologies and technologies in other fields, the primary emphasis remains on healthcare. This approach ensures specialized attention to the unique requirements of the healthcare industry but limits the immediate applicability to other domains.
Secondly, in terms of methodology, the thesis employs an advanced Sequence-to-Sequence (Seq2Seq) model, integrating deep learning techniques such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). This model is structured with an encoder Recurrent Neural Network (RNN), which processes the input sequence, and a decoder RNN, responsible for generating the output sequence. Additionally, Luong's attention mechanism is employed to enhance the model's ability to focus on relevant parts of the input while generating each part of the output.
To streamline the training process, a greedy decoding strategy is utilized, where the most probable next output is selected at each step. While this combination is effective for complex dialogue modeling, it poses challenges in terms of technical complexity and computational resources.
Thirdly, the foundational model used is the Generative Pre-trained Transformer 3 (GPT-3), based on the Transformer architecture. This choice represents the cutting edge in AI technology but restricts flexibility in adapting to new methods and technologies.
To evaluate the performance of text-to-text and speech-to-text models, a combination of metrics is employed: F1-score, precision, recall, and BLEU are used for text-to-text models, while WER and PER are utilized for speech-to-text models. These metrics provide a comprehensive assessment, capturing various aspects of model performance. However, it is important to note that relying solely on these metrics may not fully reflect real-world performance, as each metric emphasizes specific aspects of model behavior.
Contribution
The contributions of this master's thesis are threefold. Firstly, it compares the performance of traditional Seq2Seq models combined with Luong's attention mechanism against the decoder-only Transformer approach of GPT-3, customized for the Vietnamese language. This comparative analysis provides insights into the strengths and weaknesses of each model architecture in the context of structured dialogue processing and language understanding, particularly in the healthcare domain.
Secondly, it enhances the accuracy of the ASR model by selecting and evaluating the experiments with the best results for the dataset. This contribution aims to address the challenges of speech recognition in noisy environments and varying acoustic conditions, thereby improving the overall performance and usability of the ASR technology.
Lastly, it integrates the two models, text-to-text and speech-to-text, to develop a chatbot supporting voice interactions in the Vietnamese language in the healthcare domain. This integration expands the chatbot's functionality to accommodate users who prefer or require voice-based interactions, thereby enhancing accessibility and user experience in healthcare services.
Thesis Structure
The thesis "Supporting Voice Communication in Chatbot" consists of five chapters with the following main content:
Chapter 1 of the thesis provides a comprehensive introduction to the research topic, outlining its relevance, theoretical framework, and practical implications It establishes the scope and limitations of the study and describes the research objectives and structure This introduction serves as the foundation for the subsequent chapters, guiding the reader through the research process and its findings.
Chapter 2, BACKGROUND: synthesizes the academic topics most relevant to solving the problem, focusing mainly on deep learning, the basics of HMMs, and the path from Artificial Neural Networks to Recurrent Neural Networks, LSTM, and GRU in the Seq2Seq model and the ASR model. This chapter also provides a general overview of related research that has been and is being conducted, as well as the current general trends in solving the problem (Luong's attention mechanism, the theory of encoders and decoders, and particularly the GPT-3 architecture). This section also offers discussions and evaluations of these methods, as they form an important basis for the student's research during the thesis process.
Chapter 3, PROPOSED MODEL: introduces the proposed model for the chatbot. At the same time, it presents the improvements and the motivations for those proposals. Finally, the student presents the steps taken to conduct experiments on the dataset and evaluates the results of the improvements compared to the chosen model.
Chapter 4, IMPLEMENTATION: involves selection, training, evaluation, and integration of models to develop a robust and effective voice-to-text chatbot tailored for the Vietnamese language and healthcare domain
Chapter 5, CONCLUSION: synthesizes the results achieved during the thesis process, from the research and hypothesis construction to the experimental deployment. This section also discusses the limitations and outstanding issues, and finally proposes solutions for future improvements. The Table of Contents, List of Figures, List of Tables, and Acronyms are provided at the beginning of the thesis. The references are presented at the end of the thesis.
BACKGROUND
Hidden Markov Model (HMM)
Hidden Markov Models (HMMs) are statistical models that offer a statistical representation of the sounds of words in speech recognition. HMMs in speech recognition comprise a chain of states, as depicted in Figure 2.1.
In Hidden Markov Models (HMMs), the underlying state of the system is unknown (hidden), and only the output observations are available. Each state in the HMM represents a frame in the audio input. The HMM is characterized by two sets of parameters: the transition probabilities (a_ij) and the output probability distributions (b_j(.)). The transition probability a_ij specifies the likelihood of moving from state i to state j, while the output probability distribution b_j(.) describes the probability of observing the output given the current state.
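To make these two parameter sets concrete, the following NumPy sketch defines a small left-to-right HMM and scores an observation sequence with the forward algorithm. The three-state topology, the Gaussian output densities, and all parameter values are illustrative assumptions, not the models used in this thesis.

```python
import numpy as np

# a[i, j]: transition probability from state i to state j (self-loops model duration).
a = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
means, variances = np.array([0.0, 1.0, 2.0]), np.array([1.0, 1.0, 1.0])

def b(j, o):
    """Output probability b_j(o): likelihood of observation o in state j (1-D Gaussian)."""
    return np.exp(-0.5 * (o - means[j]) ** 2 / variances[j]) / np.sqrt(2 * np.pi * variances[j])

def forward(observations):
    """Forward algorithm: total likelihood of the observation sequence under the HMM."""
    alpha = np.array([b(j, observations[0]) * (1.0 if j == 0 else 0.0) for j in range(3)])
    for o in observations[1:]:
        alpha = np.array([b(j, o) * np.sum(alpha * a[:, j]) for j in range(3)])
    return alpha.sum()

print(forward([0.1, 0.3, 1.2, 1.9, 2.1]))   # likelihood of a short feature sequence
```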
Figure 2.1 HMM-based phone model [1]
An important feature of HMMs is that the self-loop a_ii makes it possible to model the varying duration of a phone. When a transition is performed and a new state is entered, a feature vector is generated using the output distribution associated with that particular state. The first and last states in the HMM are non-emitting states. For example, in Figure 2.1, s1 is the entry state and s5 is the exit state. They are used as the entrance and exit of the model and simplify the concatenation of HMM phone models to form words.
Deep Neural Networks
An alternative to Gaussian mixture models in speech recognition is to use a DNN [2]. A DNN is a feed-forward artificial neural network with more than one hidden layer between the input layer and the output layer, as illustrated in Figure 2.3.1. Weights are attached to the connections between nodes, and the output at every node is calculated by an activation function. Typically, the total input x_j to a unit j in a DNN layer is calculated from the layer below as

x_j = b_j + Σ_i y_i w_ij

where b_j is the bias of unit j, i is an index over units in the layer below, and w_ij is the weight on the connection to unit j from unit i in the layer below. The output of the unit in the upper layer is then calculated as

y_j = logistic(x_j) = 1 / (1 + e^(−x_j))

The hidden layers make the DNN able to model non-linear and complex relationships in the data. For multiclass classification, output unit j converts its total input x_j into a probability using a softmax function [2]. In Kaldi, the function used to estimate the posterior probabilities for the HMM states is

p(s | a_ut) = exp{a_ut(s)} / Σ_s' exp{a_ut(s')}

where a_ut denotes the observation at time t in utterance u and a_ut(s) is the activation at the output layer corresponding to state s [3].
The goal in training is to optimize an objective function and update the weights of the internal nodes based on the information propagated through the model. An important parameter in training is the learning rate: the greater the learning rate, the faster but less accurate the training.
Artificial Neural Networks
Basically, an ANN is a computational model built on the structure and function of biological neural networks (although the structure of an ANN can change as information flows through it). The network changes depending on its inputs and outputs. We can think of an ANN as a nonlinear statistical model, meaning that it captures a complex relationship between input and output, and as a result many different patterns can be represented.
ANNs take their inspiration from how the human brain works by making the right connections, but use silicon and wires in place of living neurons and dendrites. The human brain is composed of about 86 billion neurons, connected to thousands of other cells through axons. Because humans receive so many different kinds of sensory input, the body also has many dendrites that help transmit this information. Neurons generate electrical impulses that travel through the neural network to pass information along. The same applies to an ANN: when different problems need to be dealt with, one neuron sends a message to another neuron.
Therefore, an ANN consists of many internal nodes that mimic the biological neurons of the human brain. The network connects these neurons by links, and they interact with each other. Nodes in an ANN are used to take in input data, and the operations performed on the data are simple. After these operations, the results are passed on to other neurons. The output at each node is called its activation value or node value. Every link in the ANN is associated with a weight, and the network has the ability to learn, which takes place by changing the weight values. Figure 2.4 is an illustration of a simple ANN.
Figure 2.4 A Simple Example of the Structure of a Neural Network
A classic but simple type of node in a neural network (NN) is the McCulloch-Pitts node [4]. An illustration of this node, or neuron as McCulloch and Pitts preferred to call it, can be seen in Figure 2.5. The calculation performed in the McCulloch-Pitts node is essentially a thresholded sum: the inputs are added and, if the total is above a certain threshold, the node outputs 1; otherwise it outputs 0. There are also much more complex representations of these nodes, but the McCulloch-Pitts neuron is a good starting point for understanding the basics of NNs.
Figure 2.5 The McCulloch-Pitts Neuron [4]
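A minimal sketch of the McCulloch-Pitts behaviour just described: sum the weighted inputs and compare against a threshold. The weights, threshold, and inputs are illustrative.

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Fire (output 1) if the weighted sum of the inputs reaches the threshold, else 0."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# Illustrative example: a two-input node behaving like a logical AND gate.
print(mcculloch_pitts([1, 1], [1, 1], threshold=2))   # -> 1
print(mcculloch_pitts([1, 0], [1, 1], threshold=2))   # -> 0
```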
Neural networks comprise both nodes and "edges," which assign weightings to network connections. These edges multiply node outputs by their weights, transmitting the result to the next connected neuron. By adjusting these weights based on output analysis, researchers train the network to map input data to the desired outputs. This weight adjustment process, known as backpropagation, will be detailed in a subsequent section.
Convolutional Neural Network
The CNN is one of the most advanced Deep Learning models. CNNs allow researchers to build intelligent systems with very high accuracy. Figure 2.6 shows a basic architecture of a CNN network.
Figure 2.6 The architecture of CNN [57]
A convolution is a sliding-window operation applied over a matrix. Convolutional layers have learned parameters that are adjusted to retrieve the most accurate information without manual feature selection. The convolution itself is an element-wise multiplication of matrix entries, and the sliding window is also known as a kernel, filter, or feature detector and is itself a small matrix.
A CNN compares images piece by piece, and these pieces are called features. Instead of matching whole images as a whole, a CNN finds the similarity between two images more effectively by searching for raw features that match each other. Each feature can be considered a mini-image, that is, a small two-dimensional array. These features correspond to certain aspects of the image, and they can fit together. The following are the basic layers of a CNN.
Convolutional layer. This is the most important layer of a CNN, and it performs all of the heavy computation. The important elements of a convolutional layer are: stride, padding, filter map, and feature map (a small numerical sketch of these operations follows the pooling layer description below).
Stride is the number of pixels by which the filter map is shifted across the input, from left to right.
Padding consists of zero values added around the border of the input.
A CNN uses filters applied to regions of the image. These filter maps are three-dimensional matrices of numbers, and those numbers are the learned parameters.
The feature map represents the result of each filter map scanning through the input; the convolution is computed after each scan.
Rectified Linear Unit (ReLU) Layer. The ReLU layer applies an activation function, which simulates the firing rate of a neuron along its axon. Common activation functions include ReLU, Leaky ReLU, Tanh, Sigmoid, Maxout, etc. Currently, the ReLU function is extremely common and widely used for training neural networks because it brings outstanding advantages, for example, much faster computation. When using ReLU, attention must be paid to tuning the learning rate and monitoring dead units. ReLU layers are applied after the filter maps have been computed, applying the ReLU function to the values of the filter map.
Pooling layers play a crucial role in neural networks when handling large input data They are positioned between convolutional layers to reduce the number of parameters, making the model more efficient The two primary types of pooling layers are max pooling and average pooling Max pooling selects the maximum value from a region of the input, while average pooling computes the average value Both methods effectively reduce the dimensionality of the input, leading to a more manageable and less computationally intensive model.
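To make the stride, padding, ReLU, and pooling operations described above concrete, the following NumPy sketch applies a single 3x3 filter to a small input, then ReLU, then 2x2 max pooling. The filter values and the 8x8 input are illustrative assumptions.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=1):
    """2-D convolution (cross-correlation) with zero padding and a configurable stride."""
    image = np.pad(image, padding)
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # element-wise multiply, then sum
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling that reduces each spatial dimension by `size`."""
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    return feature_map[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.random.rand(8, 8)                              # illustrative 8x8 single-channel input
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])   # simple edge-detecting filter
feature_map = np.maximum(conv2d(image, kernel), 0)        # convolution followed by ReLU
print(max_pool(feature_map).shape)                        # (4, 4)
```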
Fully Connected Layer. This layer is responsible for producing the final result after the convolutional and pooling layers have processed the image. At this point, the model has read the information of the image, and fully connected layers are used to link that information together and produce the final output.
In addition, the fully connected layer flattens its input into a single vector, and its outputs can be thought of as votes that are evaluated to select the highest-scoring class.
Multilayer Perceptron
The name backpropagation comes from the term used by Rosenblatt (1962) for attempts to generalize the perceptron learning algorithm to several layers, even if none of the many attempts to do so in the 1960s and 1970s were particularly successful [40]. The backpropagation algorithm is one of the most important tools of artificial neural networks; it is the part that deals with the training of the network, i.e., where it learns. During this process, the network updates the weights of all edges so that it produces the correct output for a particular input.
Stochastic Gradient Descent (SGD) is used to train NNs in order to iteratively reduce the cost function. The loss function primarily follows the principle of maximum likelihood; the goal is to reduce the cross-entropy error between the target output and the predicted output. The gradient update equation is

θ ← θ − η ∇θ L_n

where the learning rate is denoted by η and ∇θ L_n is the gradient of the loss function with respect to the parameters θ, calculated on the n-th batch of the dataset.
Many variants of SGD have been developed in recent times that allow for better and faster convergence. For MLP parameters, the gradient of the objective function is calculated using the backpropagation algorithm [41]. At each layer, the algorithm uses the chain rule for the partial derivatives with respect to the parameters. The backpropagation algorithm is explained below in a non-vectorized manner, i.e., the derivatives are calculated for scalars. Initially, a forward pass is performed by calculating the activation of each node and predicting the output ŷ. Next, the loss L is calculated from the target output y and the predicted output ŷ. The backward pass is initiated by taking the derivatives of the loss function [42].
In the following, the activation of node k in layer l is denoted a_k^l, its total input is x_k^l, and the error of the j-th neuron in layer l is denoted δ_j^l. For the output layer,

δ_j = (∂L/∂a_j) · f'(x_j)

where a_j and x_j are the activation and total input of output neuron j. According to the chain rule, δ_j^l for a hidden layer l can be written as

δ_j^l = ( Σ_k w_kj^(l+1) δ_k^(l+1) ) · f'(x_j^l)

Hence the errors of the hidden layers' neurons can be obtained, and the gradients with respect to the model weights are calculated as

∂L/∂w_jk^l = a_k^(l−1) δ_j^l

which are then used in the SGD update above.
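The following NumPy sketch ties these equations together for a one-hidden-layer MLP trained with the SGD rule above. The layer sizes, learning rate, and toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((16, 4))                                   # toy batch: 16 samples, 4 features
y = (X.sum(axis=1, keepdims=True) > 2).astype(float)      # toy binary targets

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
W1, b1 = rng.normal(0, 0.5, (4, 8)), np.zeros(8)          # input -> hidden weights
W2, b2 = rng.normal(0, 0.5, (8, 1)), np.zeros(1)          # hidden -> output weights
eta = 0.5                                                 # learning rate

for epoch in range(500):
    # Forward pass: hidden activations and predicted output.
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    # Backward pass: output error (sigmoid + cross-entropy gives y_hat - y),
    # then the hidden-layer errors via the chain rule.
    delta_out = y_hat - y
    delta_hidden = (delta_out @ W2.T) * h * (1 - h)
    # SGD update: theta <- theta - eta * gradient, averaged over the batch.
    W2 -= eta * h.T @ delta_out / len(X)
    b2 -= eta * delta_out.mean(axis=0)
    W1 -= eta * X.T @ delta_hidden / len(X)
    b1 -= eta * delta_hidden.mean(axis=0)

print(float(np.mean((y_hat > 0.5) == y)))                 # training accuracy on the toy data
```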
The sigmoid activation function, commonly employed in neural networks with multiple layers, compresses a wide range of input values into a narrow output range between 0 and 1. This compression results in diminished changes in output for significant changes in input, leading to the small derivatives characteristic of the sigmoid function.
Figure 2.7 Sigmoid Function and its Derivative by [43]
This is caused by the chain rule used in backpropagation, which calculates gradients by moving from the last layer back to the first. According to the chain rule, each subsequent derivative is multiplied by the previously calculated value. Therefore, when there are n hidden layers using sigmoid activation functions, n small derivatives are multiplied together. As a result, the gradient value decreases exponentially as the backpropagation algorithm progresses toward the initial layers [43].
The following are solutions to avoid this problem.
The first solution is to use another activation function, such as ReLU, which does not produce a small derivative.
The next solution is residual networks (ResNets). They provide residual connections straight to subsequent layers, effectively bypassing the activation functions. That results in larger derivatives and makes it possible to train much deeper networks [44].
The last one is batch normalization. The batch normalization method normalizes the input to a predefined range where the sigmoid derivative is not small.
Regularization
It is important to consider the problem of overfitting when building NNs or machine learning algorithms. Overfitting occurs when the model begins to learn features that are too specific to the training set. Basically, the model learns not only the general rules that lead from input to output, but also additional rules that describe the training set yet are not necessarily valid in general. This process reduces the training error but increases the evaluation error. As a result, the model performs worse on unseen data because of the overly specific rules it has learned from the training set. If overfitting occurs when the model fits the training set too closely, the opposite phenomenon is called underfitting, i.e., when the model learns rules that are too general. An illustration of these phenomena can be found in Figure 2.8.
Figure 2.8 Underfitting, Optimal and Overfitting
Dropout is the random skipping of units (i.e., network nodes) during training. An omitted unit is not considered in the forward and backward passes. Accordingly, p is the probability of retaining a network node in each training phase, so the probability of it being dropped is (1 − p). This method helps to avoid overfitting [45]. If a fully connected layer has too many parameters and takes up most of the network's parameters, the nodes in that layer become too interdependent during training, which limits the power of each node and leads to excessive co-adaptation.
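A minimal sketch of (inverted) dropout during training, assuming a retention probability p; the mask is resampled for every forward pass and dropout is disabled at inference time.

```python
import numpy as np

def dropout(activations, p=0.8, training=True):
    """Inverted dropout: keep each unit with probability p and rescale by 1/p."""
    if not training:
        return activations                               # no dropout at inference time
    mask = (np.random.rand(*activations.shape) < p) / p  # 0 for dropped units, 1/p otherwise
    return activations * mask

h = np.ones((2, 5))
print(dropout(h, p=0.8))    # roughly 20% of the units are zeroed out
```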
L2 regularization, or weight decay, is a regularization technique applied to the weights of a neural network. The technique minimizes a loss function composed of the original loss and a penalty on the weight norm. Therefore, weight decay is simply an addition to the loss function of the network and can be described by the following equation
L_new(w) = L_original(w) + λ wᵀw

where λ is a value that determines the strength of the penalty and L(w) is the chosen loss function. If the value of λ is very small, weight decay will not help regularize the network. In contrast, if λ is too large, the penalty term dominates and the network will mainly try to keep its weights close to 0. This effect can be seen in Figure 2.9.
Figure 2.9 Underfitting, Optimal weight decay and Overfitting
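A small sketch of how the weight-decay penalty above changes the loss and its gradient; the value of λ (lam) and the quadratic base loss are illustrative assumptions.

```python
import numpy as np

def loss_with_weight_decay(w, base_loss, base_grad, lam=1e-3):
    """L_new(w) = L_original(w) + lam * w^T w, together with its gradient."""
    loss = base_loss(w) + lam * np.dot(w, w)
    grad = base_grad(w) + 2 * lam * w        # the penalty adds 2*lam*w to the gradient
    return loss, grad

w = np.array([1.0, -2.0, 0.5])
base_loss = lambda w: 0.5 * np.sum(w ** 2)   # illustrative original loss
base_grad = lambda w: w
print(loss_with_weight_decay(w, base_loss, base_grad))
```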
Recurrent Neural Networks
RNNs [12] have revolutionized the field of customer service by enabling the creation of chatbots that can engage in more natural and effective dialogues. Unlike traditional NNs that process inputs in isolation, RNNs are designed to recognize and remember patterns over sequences of words, making them ideal for parsing customer queries and maintaining the context of a conversation. This sequential memory allows chatbots to provide more accurate and contextually relevant responses, improving the customer experience. RNNs can be trained on vast datasets of customer interactions, allowing them to understand a wide range of requests and issues. However, they do face limitations with longer sequences, where they may struggle to maintain context over extensive dialogues.
Figure 2.10 The Recurrent Neural Network [5]
The image shows an RNN and how it unfolds through time. At the heart of an RNN is a loop that allows information to persist. In the diagram, the loop is unrolled to show the entire sequence of operations over time. It works according to the following steps:
Input (x): At each time step t, the RNN takes an input x_t and the previous hidden state s_(t−1).
Hidden State (s): The hidden state s_t at time t is updated by applying a weight matrix U to the input x_t and another weight matrix W to the previous hidden state s_(t−1), i.e., s_t = f(U x_t + W s_(t−1)). The function of the hidden state is to capture and carry forward information through time.
Output (o): The output o_t at time t is then computed using the current hidden state s_t and a weight matrix V. In many applications, this output is then passed through a function such as softmax, o_t = softmax(V s_t), to make decisions or predictions based on the learned patterns.
Weights: There are three sets of weights: U for input to hidden, W for hidden to hidden (the recurrent connection), and V for hidden to output. These weights are shared across all time steps, which allows the RNN to generalize across sequences of varying lengths.
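A minimal NumPy sketch of the unrolled computation described above, reusing the shared weights U, W, and V at every time step; the dimensions and the random toy sequence are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 5, 8, 3
U = rng.normal(0, 0.1, (hidden_dim, input_dim))    # input  -> hidden
W = rng.normal(0, 0.1, (hidden_dim, hidden_dim))   # hidden -> hidden (recurrent connection)
V = rng.normal(0, 0.1, (output_dim, hidden_dim))   # hidden -> output

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rnn_forward(inputs):
    """Unroll the RNN over the sequence, producing an output at every time step."""
    s = np.zeros(hidden_dim)                  # initial hidden state s_0
    outputs = []
    for x_t in inputs:
        s = np.tanh(U @ x_t + W @ s)          # s_t = tanh(U x_t + W s_(t-1))
        outputs.append(softmax(V @ s))        # o_t = softmax(V s_t)
    return outputs, s

sequence = [rng.normal(size=input_dim) for _ in range(4)]   # a length-4 toy sequence
outputs, final_state = rnn_forward(sequence)
print(len(outputs), outputs[-1].round(3))
```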
RNNs are powerful because they can theoretically use their internal state, or memory, to process sequences of inputs of any length. However, they can be difficult to train due to issues like vanishing and exploding gradients, which can occur during backpropagation through the many layers of the network. As a result, RNNs do face limitations with longer sequences, where they may struggle to maintain context over extensive dialogues. This is often addressed by using advanced RNN architectures like LSTM or GRU, which are better at capturing long-term dependencies and can maintain context over longer conversations, a critical aspect of customer service interactions.
Long Short-Term Memory
LSTM networks, a specialized form of RNNs, are crafted to capture long-term dependencies within data sequences effectively. The architecture of an LSTM features a cell state, akin to a conveyor belt, which traverses the length of the network with minimal linear interaction, ensuring the preservation of information over time. This cell state is modulated by a series of gates: the forget gate uses a sigmoid function to decide which information the cell state should discard, the input gate decides which values to update and integrates new candidate values created by a tanh layer, and the output gate determines the next hidden state by filtering the cell state through a tanh layer and applying the sigmoid function's output. The hidden state, responsible for predictions, is updated with information from the cell state, providing the LSTM with the capability to maintain relevance in sequential data over long periods. This quality is particularly beneficial for language modeling, where understanding context from extensive data sequences is paramount. These cells have three gates that control the flow of information, as described below [6]:
Input Gate: Decides the degree to which new input should affect the memory
Forget Gate: Determines what portions of the existing memory should be forgotten
Output Gate: Selects the parts of the memory to be used in the output
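A minimal NumPy sketch of a single LSTM step using the three gates listed above; the weight shapes, initialization, and toy inputs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 6
# One weight matrix per gate / candidate, acting on the concatenation [h_(t-1), x_t].
Wf, Wi, Wo, Wc = (rng.normal(0, 0.1, (hidden_dim, hidden_dim + input_dim)) for _ in range(4))
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev):
    """One LSTM time step: update the cell state c_t and the hidden state h_t."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z)                  # forget gate: what to discard from c_(t-1)
    i = sigmoid(Wi @ z)                  # input gate: which candidate values to add
    c_tilde = np.tanh(Wc @ z)            # candidate cell state
    c = f * c_prev + i * c_tilde         # new cell state
    o = sigmoid(Wo @ z)                  # output gate: what to expose as h_t
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):   # a toy sequence of 5 steps
    h, c = lstm_step(x_t, h, c)
print(h.round(3))
```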
2.8.1 The Long-Term Dependency Problem
RNNs' strength lies in leveraging past information to make present predictions, akin to how movie scenes build continuity. Their efficacy depends on the context; sometimes recent information suffices, as in the Vietnamese phrase "Cách điều trị hen suyễn" ("how to treat asthma"), where "suyễn" can be predicted after "Cách điều trị hen". In such cases, the proximity of relevant information enables RNNs to learn effectively.
Figure 2.12 RNN and Short-Term Dependencies [7]
But in many situations, we need more context to make an inference, for example when predicting the last word in the passage "Tôi bị bệnh hen suyễn… hen suyễn dị ứng, từ phấn hoa." ("I have asthma… allergic asthma, from pollen."). The recent information ("hen suyễn dị ứng, từ phấn hoa") tells us that what follows relates to some disease, but it is impossible to know exactly which one. To understand it, we need the additional, earlier context of the passage to make an inference, and that relevant information may already be quite far away. Unfortunately, as the distance grows, RNNs start to struggle to remember and learn.
Theoretically, RNNs are fully capable of handling such "long-term dependencies," meaning that current information can be derived from a sequence of previous information. In practice, however, RNNs lack this capability. This issue has been highlighted by Hochreiter and Bengio, among others, as a challenge for the RNN model.
Figure 2.13 RNN and Long-Term Dependencies [7]
LSTMs are a special kind of RNN capable of learning long-term dependencies. LSTMs were introduced by Hochreiter & Schmidhuber and have since been refined and popularized by many researchers in the field. They are incredibly effective across a variety of problems, which has led to their widespread popularity. LSTMs are designed to avoid the long-term dependency problem. Their ability to remember information for long periods is intrinsic to their design, meaning they do not require special training to have this capability. In other words, their inherent structure allows them to retain memories without any external intervention.
All recurrent networks have the form of a sequence of repeated neural network modules. In standard RNN networks, these modules have a very simple structure, typically a single tanh layer.
Figure 2.14 The Repeating Modules in an RNN Contains One Layer [7]
Similarly, LSTM also has a sequential architecture, but the modules within it have a different structure compared to standard RNN networks. Instead of just having a single neural network layer, they have four layers that interact in a very special way.
GRU
The GRU is the newer generation of RNN and is quite similar to an LSTM [8]. GRUs are designed to solve the vanishing gradient problem that can occur in standard RNNs. They do this by using gating mechanisms to control the flow of information. The architecture consists of two gates: a reset gate and an update gate.
Reset Gate: This gate determines how much of the past information the model should forget. By regulating the amount of historical information to discard, the reset gate lets the model shed irrelevant or outdated information and focus on the most relevant data, allowing the network to adapt to changing input sequences over time.
Update Gate: This gate decides how much of the past information will carry over to the current state. It is akin to a combination of the forget and input gates in an LSTM, allowing the model to determine how much of the past information should influence the current state.
Figure 2.16 The Architecture of GRU [9]
During its operation, the GRU first takes the input and the previous hidden state to inform its gates: the reset gate uses this information to decide which parts of the past data should be forgotten, while the update gate determines how much of the previous hidden state should be carried forward. The old information is blended with the new input to form a candidate hidden state, which is then combined with the old state, modulated by the update gate, to produce the final hidden state for the current time step.
This structure allows GRUs to keep relevant backpropagation error signals alive, making them capable of learning over many time steps, which is particularly useful for tasks that require the understanding of long-term dependencies, such as language modeling and time-series analysis. GRUs offer a simpler and more computationally efficient alternative to LSTMs while providing similar benefits.
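A minimal NumPy sketch of one GRU step with the reset and update gates described above; weight shapes and toy inputs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 6
Wz, Wr, Wh = (rng.normal(0, 0.1, (hidden_dim, hidden_dim + input_dim)) for _ in range(3))
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev):
    """One GRU time step producing the new hidden state h_t."""
    z_in = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ z_in)                                     # update gate
    r = sigmoid(Wr @ z_in)                                     # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))  # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde                      # blend old state with candidate

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):    # a toy sequence of 5 steps
    h = gru_step(x_t, h)
print(h.round(3))
```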
Word Embedding Model
Word Embedding is a general term for language models and feature-learning methods in Natural Language Processing (NLP), where words or phrases are mapped to numerical vectors (usually real numbers). This tool plays a crucial role in most algorithms and architectures in machine learning and deep learning for processing text-based input data, as these systems can only understand inputs in numerical form, enabling them to perform tasks like classification, regression, etc. Word embeddings are primarily categorized into two types:
Frequency-based Embedding [13] relies on the frequency of word occurrences to create word vectors, with the three most common types being:
Count Vector
tf-idf Vector (Term Frequency-Inverse Document Frequency)
Co-occurrence Matrix
These methods enable the capture of word significance and relationships based on how often words appear, offering a foundational approach to representing text in numerical form for various NLP tasks.
Count Vector is a basic form of Frequency-based Embedding. For a vocabulary of length N and a corpus of D documents (d1, d2, ..., dD), a word's vector representation is an integer vector of length D. Each element at position i reflects the word's frequency in document di. Low-frequency words may be excluded, or vector entries may be modified (e.g., using binary values for presence/absence) to suit specific requirements.
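A tiny sketch of the count-vector representation just described: for each word, one integer per document giving its frequency there. The three-document Vietnamese corpus is an illustrative assumption.

```python
docs = [
    "cách điều trị hen suyễn".split(),
    "hen suyễn dị ứng từ phấn hoa".split(),
    "đặt lịch khám hen suyễn".split(),
]

def count_vector(word, documents):
    """Integer vector of length D: element i is the word's frequency in document d_i."""
    return [doc.count(word) for doc in documents]

print(count_vector("hen", docs))    # -> [1, 1, 1]
print(count_vector("phấn", docs))   # -> [0, 1, 0]
```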
Unlike count vectors, which only consider the frequency of words within a single document, tf-idf vectors take into account the frequency of words across the entire dataset. This characteristic gives tf-idf vectors a higher discriminative power compared to count vectors. A tf-idf vector is a real-valued vector, also of length D, where D is the number of documents. It is calculated by multiplying two parts, tf and idf. Each element of the vector is computed as follows:

tf_i = n_i / N_i

where:
i ranges from 1 to D, where D is the number of documents
n_i is the frequency of occurrence of the word in document i
N_i is the total number of words in document i

idf_i = log2(D / d)

where:
D represents the total number of documents in the dataset
d is the number of documents in which the word appears

tfidf_i = tf_i × idf_i [10]

tf-idf vectors have a higher discriminative power than count vectors primarily because they are adjusted by the idf weight. According to its formula, if a word appears in many documents, its value decreases, diminishing its impact on the final result. However, a limitation of both these methods is their focus solely on the frequency of a word's occurrence, which often fails to capture contextual meaning. Co-occurrence matrices partially address this issue [11]. They preserve semantic relationships between words and are constructed from the frequency with which word pairs appear within a Context Window. A Context Window is defined by its size and direction, allowing the matrix to capture more nuanced word associations based on their proximity and co-occurrence in the text. Taking the sentence "Dương chỉ không mập Dương chỉ gầy" as an example of a Context Window, we obtain the co-occurrence matrix below.
Table 2.1 Example of Co-occurrence Matrix
          Dương   chỉ   không   mập   gầy
Dương       0      2      0      1     0
chỉ         2      0      1      0     1
không       0      1      0      1     0
mập         1      0      1      0     0
gầy         0      1      0      0     0
Due to the extensive vocabulary size, word vector dimension reduction techniques, such as omitting stop words or employing Singular Value Decomposition (SVD), are commonly used. These approaches enhance word representations while conserving the memory needed to store the co-occurrence matrix, which can be very large.
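A brief sketch of how such a co-occurrence matrix can be built; with a symmetric context window of size 1, it reproduces the counts shown in Table 2.1.

```python
from collections import defaultdict

def cooccurrence_matrix(tokens, window=1):
    """Count how often each ordered pair of words appears within `window` positions."""
    counts = defaultdict(int)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                counts[(w, tokens[j])] += 1
    return counts

tokens = "Dương chỉ không mập Dương chỉ gầy".split()
matrix = cooccurrence_matrix(tokens, window=1)
print(matrix[("Dương", "chỉ")])   # -> 2, matching Table 2.1
```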
GloVe, introduced in 2014, is one of the newer methods for building word vectors and is essentially based on co-occurrence matrices. GloVe fundamentally operates on probabilities, and its development idea stems from the ratio P_ik / P_jk, where P_ik = X_ik / X_i is the probability that word k appears in the context of word i, and:
X_ik represents the number of times word k appears in the context of word i
X_i is the total number of times word i appears in the context of all other words except i
The fundamental concept of GloVe is centered on the idea that a word's meaning can be discerned through its co-occurrence probabilities with other words. Essentially, the GloVe model posits that the ratio of these co-occurrence probabilities can reveal deep semantic relationships between words. For example, words like "cây kem" (ice cream) and "lạnh" (cold) would have a high co-occurrence probability, indicating a strong semantic connection. In contrast, "cây kem" and a less related word like "âm nhạc" (music) would show a lower co-occurrence probability. GloVe leverages this principle by transforming word co-occurrence counts into a matrix and then applying dimensionality reduction, effectively condensing complex semantic relationships into a more compact vector space. The outcome is a set of vectors where the distances and directions between them can be used to deduce the similarities and differences in meaning among words.
GloVe learns the word vectors by minimizing the following weighted least-squares cost function:

J = Σ_{i,j=1..V} f(X_ij) (W_i^T W_j + b_i + b_j − log X_ij)^2

where:
W_i, W_j are the word vectors of words i and j
b_i, b_j are the corresponding biases (added during the simplification and optimization steps)
X_ij is the entry corresponding to the word pair i, j in the co-occurrence matrix
The function f is known as the weighting function, introduced to mitigate the influence of word pairs that appear too frequently. This function satisfies three properties:
f(0) = 0, so that word pairs that never co-occur do not contribute to the cost
It is a non-decreasing function
It takes relatively small values when x is very large
In practice, many functions satisfy these properties, but the following one is often chosen:

f(x) = (x / x_max)^α if x < x_max, and f(x) = 1 otherwise, with α = 3/4

The optimization of the cost function J to find the word vectors W_i, W_j can be achieved through various methods, with the most standard approach being the use of Gradient Descent. This technique involves iteratively adjusting the parameters W_i, W_j in the direction that most reduces the cost function, gradually converging to the set of parameters that minimizes J. Gradient Descent is widely used due to its effectiveness and simplicity in handling large datasets, which is common in natural language processing tasks like training GloVe models [13].
Prediction-based Embedding builds word vectors based on predictive models, with Word2vec being the most notable example. Word2vec combines two models: Continuous Bag of Words (CBOW) and Skip-gram. Both of these models are built on a three-layer neural network architecture comprising an input layer, a hidden layer, and an output layer, with the primary goal of learning the weights that represent word vectors.
The CBOW model operates on the principle of predicting the probability of a word given its context, which can include one or more words. The input to the model is one or more one-hot vectors of the context words, each with a length of V, where V is the size of the vocabulary. In the single-word-context case, the model produces a probability vector of length V for the target word, representing the likelihood of each word in the vocabulary being the missing word. The CBOW model employs a hidden layer of length N, where N is also the size of the word representation vector.
Figure 2.17 The CBOW Model with One Input [14]
When training a model like CBOW (Continuous Bag of Words) in Word2vec, the dataset used for training plays a crucial role. The input to the model consists of two main components: context one-hot vectors and one-hot vectors of the target word.
Context One-hot Vectors: These are binary vectors that represent the context words surrounding a target word. Each context word is assigned a vector in which only the index corresponding to that word in the vocabulary is set to 1, while all other indices are set to 0. Consequently, the size of these vectors is equal to the size of the vocabulary (V).
One-hot Vectors of the Target Word: Similar to the context vectors, the target word (the word that the model aims to predict) is also represented as a one-hot vector. In the CBOW model, the goal is to predict this target word based on the context provided by the context words.
During training, the model takes these one-hot vectors as input and learns to predict the target word from its context. This learning process involves adjusting the weights of the neural network such that the output probabilities match the actual distribution of words in the given context. Over time, this leads to the development of word vectors that capture semantic and syntactic relationships between words, making Word2vec a powerful tool for natural language processing tasks.
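A compact NumPy sketch of a CBOW-style forward pass and one gradient step, with a single hidden layer of size N as described above; the vocabulary, dimensions, and training pair are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cách", "điều", "trị", "hen", "suyễn"]
V, N = len(vocab), 3                       # vocabulary size and embedding dimension
W_in = rng.normal(0, 0.1, (V, N))          # input -> hidden weights (the word vectors)
W_out = rng.normal(0, 0.1, (N, V))         # hidden -> output weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cbow_step(context_ids, target_id, lr=0.1):
    """Predict the target word from the averaged context vectors, then update the weights."""
    h = W_in[context_ids].mean(axis=0)         # hidden layer: average of the context embeddings
    y_hat = softmax(W_out.T @ h)               # probability over the whole vocabulary
    error = y_hat.copy()
    error[target_id] -= 1.0                    # cross-entropy gradient at the output
    grad_h = W_out @ error                     # gradient flowing back into the hidden layer
    W_out[:] -= lr * np.outer(h, error)
    W_in[context_ids] -= lr * grad_h / len(context_ids)
    return y_hat[target_id]

# Predict "trị" from its neighbours "điều" and "hen"; its probability rises with training.
for _ in range(50):
    p = cbow_step(context_ids=[1, 3], target_id=2)
print(round(float(p), 3))
```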
The History of Chatbots
In 1966, computer scientist Joseph Weizenbaum created the first chatbot at the MIT AI lab and named it ELIZA. ELIZA was created to simulate conversations with basic pattern matching and substitution. DOCTOR, the most famous script of the ELIZA chatbot, could interact with the user as a psychotherapist. ELIZA's responses were so convincing that some test users trusted and developed an emotional attachment to the program, forgetting that they were just chatting with a chatbot. In fact, ELIZA was programmed to react to keywords in natural language and then turn the conversation back to the user.
Figure 2.19 ELIZA – The First Chatbot in the World at MIT by Joseph Weizenbaum
Another chatbot, named PARRY, was developed by Stanford psychiatrist Kenneth Colby in 1972. It was designed to simulate as accurately as possible the thought patterns of a person with paranoid schizophrenia. In September 1972, ELIZA and PARRY met, and Figure 2.20 shows part of their interesting conversation.
Figure 2.20 The Conversation between ELIZA and PARRY [19]
More than half a century after the first chatbot was born, chatbots have become an integral part of modern life. Not only do they provide 24/7 customer service, but they also help businesses grow revenue and automate parts of the sales and customer care process. However, even at the present time (December 2023), people still complain that chatbots are not smart enough. People always want to develop ever more advanced chatbots, but part of the problem lies with users, who expect interacting with a chatbot to be as satisfying as talking to a real person. Chatbots should instead be treated like children: it takes time for machine learning algorithms to train chatbots to learn, practice, and improve over time.
Effective customer service requires extensive training for new employees, encompassing product knowledge, service protocols, and brand identity. Similar to human agents, chatbots undergo a gradual learning process to become competent assistants. Through continuous interaction with users, chatbots acquire a deeper understanding of the brand's offerings and enhance their ability to resolve customer inquiries. This learning process mimics the natural progression of human agents and ensures that chatbots deliver a high level of service.
Regardless of what today's businesses use chatbots for, the first step is to analyze chat data from previous customers to create sample chatbot scenarios and to continuously experiment with new customers. This helps chatbots respond to requests and answer more customer questions. The chatbot is one of the hottest areas of AI today and will remain so for years to come. Chatbot AI will surely become more and more widespread in everyday life and more invisible in online services. Looking back at the evolution of AI chatbots over the past 60 years, the biggest success is that algorithms have been able to open up a new world of "artificial intelligence." In the past few years we have become familiar with chatbot scenarios; in other words, we have gradually become accustomed to them and consider them an advanced technology solution.
Using Luong’s Attention for the Sequence-to-Sequence Model
The Seq2Seq model operates on the principle of transforming sequences via dual RNNs, as identified by Sutskever and colleagues. This process enables the chatbot to maintain a dialogue flow by considering the context of the conversation.
Figure 2.21 Seq2Seq Model with GRUs [15]
Encoder: The encoder is a series of GRU (Gated Recurrent Unit) layers. Each GRU processes an input element X_i (such as a word in a sentence), captures its information, and passes it along to the next GRU. The final GRU in the encoder sequence outputs a context vector representing the entire input sequence's information. This context vector aims to encapsulate the semantic meaning of the input sequence in a fixed-length format [16].
Context Vector: This is the encoder's final hidden state and serves as the condensed representation of the input sequence. It is used to initialize the decoder and theoretically contains all the semantic information needed to generate the output sequence.
Decoder: Mirroring the encoder, the decoder is also a series of GRU layers. It takes the context vector and begins the sequence generation process. Starting with an initial input (typically a sequence-start token), the decoder GRU predicts the next element in the sequence (e.g., the next word in a sentence) [16].
Predictions: At each time step, the decoder outputs a prediction for the next sequence element and a hidden state, which is passed on to the next GRU in the decoder. This process continues until a sequence-end token is generated or a predefined sequence length is reached.
The encoder RNN processes the input sentence token by token, generating an "output" vector and a "hidden state" vector at each time step. The hidden state vector is then passed to the next time step, while the output vector is stored. This encoding step transforms the context observed at each point in the sequence into a collection of points in a high-dimensional space, which the decoder will later utilize to produce a meaningful output for the specific task at hand.
The encoder employs a multi-layered GRU, originally introduced by Cho et al. in 2014 [8]. In the context of this model, a bidirectional variant of the GRU is utilized, which combines two independent RNNs. One of these RNNs processes the input sequence in its standard sequential order, while the other processes it in reverse order. The outputs from both of these networks are merged at each time step. By employing a bidirectional GRU, the advantage gained is the encoding of information from both past and future contexts, thereby enhancing the model's understanding of the input sequence.
Input Sequence: The image represents an input sequence, which could be a sequence of words in a sentence or any sequential data.
Embedding Layer: Typically, input tokens are converted into numerical vectors using an embedding layer. This step transforms each token into a dense vector representation suitable for neural network processing.
Bidirectional GRU: Instead of a simple RNN, a bidirectional GRU employs two independent GRU layers. One processes the input sequence in the standard left-to-right order (the forward direction), while the other processes it in reverse (the backward direction). Each GRU layer is illustrated as a stack of cells or nodes connected horizontally.
o Forward GRU: This layer processes the sequence from left to right. At each time step, it takes an input token and its hidden state from the previous time step in the forward direction, computes a new hidden state, and passes it to the next cell. This process continues until the entire sequence is processed.
o Backward GRU: In contrast, the backward GRU processes the sequence in reverse order, from right to left. It follows the same principles as the forward GRU but operates in the opposite direction.
Concatenation: After both the forward and backward GRUs have processed the input sequence, their hidden states at each time step are concatenated (combined). This results in a new representation for each time step that encodes information from both the past and the future context.
Output Layer: The concatenated hidden states can then be fed into an output layer, which typically includes one or more fully connected (dense) layers, depending on the specific task. The output layer's role is to make predictions or classifications based on the enriched representations obtained from the bidirectional GRU layers.
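A short PyTorch sketch of the bidirectional GRU encoder described above, in which the forward and backward outputs are combined (here by summation) so that the output size stays equal to the hidden size. The vocabulary size and dimensions are illustrative assumptions, not the exact configuration used in this thesis.

```python
import torch
import torch.nn as nn

class EncoderRNN(nn.Module):
    """Embeds input tokens and encodes them with a bidirectional GRU."""
    def __init__(self, vocab_size, hidden_size, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, num_layers,
                          batch_first=True, bidirectional=True)

    def forward(self, input_ids):
        embedded = self.embedding(input_ids)      # (batch, seq_len, hidden)
        outputs, hidden = self.gru(embedded)      # outputs: (batch, seq_len, 2 * hidden)
        # Sum the forward and backward directions at every time step.
        outputs = outputs[:, :, :self.gru.hidden_size] + outputs[:, :, self.gru.hidden_size:]
        return outputs, hidden

encoder = EncoderRNN(vocab_size=1000, hidden_size=256)    # hypothetical vocabulary of 1000 tokens
tokens = torch.randint(0, 1000, (2, 7))                   # batch of 2 sentences, 7 tokens each
outputs, hidden = encoder(tokens)
print(outputs.shape)                                      # torch.Size([2, 7, 256])
```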
The Seq2Seq model consists of two parts: an encoder and a decoder. The encoder processes the input sequence and generates a series of context vectors, which represent the encoded information at each time step of the input. In a vanilla Seq2Seq model without attention, only the final context vector is used to initialize the decoder, and at each step the decoder generates a token and updates its hidden state. This approach has limitations because the final context vector may not contain all the necessary information from the input sequence, especially if the sequence is long, leading to information loss.
To address this issue, the attention mechanism introduced by Bahdanau et al. allows the decoder to generate each token by selectively focusing on different parts of the input sequence. The decoder does this by computing a set of attention weights that reflect the importance of each input token for the current decoder output. For each token that the decoder generates, it calculates attention weights by comparing the decoder's current hidden state with each of the encoder's context vectors. These attention weights are then applied to the encoder's context vectors to create a weighted combination, which results in an "attended" context vector that is specific to the current token being generated by the decoder.
During sequence-to-sequence learning, the attended context vector and the decoder's hidden state collaborate to generate the next token in the sequence. This iterative process continues until the decoder emits an end-of-sentence token, indicating the completion of the output sequence.
Figure 2.23 Attention Mechanism in Seq2Seq Model [20]
Sean Robertson's figure likely demonstrates these steps visually, showing how attention weights are applied to the encoder's outputs to focus on relevant parts of the input sequence, thereby producing a more informed and accurate output sequence. This mechanism is particularly effective for handling long input sequences, where the relevance of each part of the sequence can vary significantly for different parts of the output.
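A brief PyTorch sketch of Luong-style "dot" attention as used in this kind of decoder: the decoder's current hidden state is scored against every encoder output, the scores are normalized with softmax into attention weights, and their weighted sum gives the attended context vector. The shapes follow the encoder sketch above and are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def luong_dot_attention(decoder_hidden, encoder_outputs):
    """decoder_hidden: (batch, hidden); encoder_outputs: (batch, seq_len, hidden)."""
    # Dot-product score between the decoder state and every encoder output.
    scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)   # (batch, seq_len)
    attn_weights = F.softmax(scores, dim=1)                                       # attention weights
    # Weighted sum of the encoder outputs -> attended context vector.
    context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs).squeeze(1)    # (batch, hidden)
    return context, attn_weights

decoder_hidden = torch.randn(2, 256)         # hypothetical current decoder hidden state
encoder_outputs = torch.randn(2, 7, 256)     # outputs from the bidirectional encoder above
context, weights = luong_dot_attention(decoder_hidden, encoder_outputs)
print(context.shape, weights.shape)          # torch.Size([2, 256]) torch.Size([2, 7])
```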
Automatic Speech Recognition
ASR has been undergoing active research for over fifty years It is an essential milestone in human-to-human and human-machine communication Due to the insufficient performance of past technologies, ASR has not become a desirable part of human-machine communication That was because the lack of computing power did not allow to cross the usability bar for real users, and other means of communication, such as keyboards and mice, are significantly superior to speech in most aspects of computer communication efficiency [46]
All this has changed in recent years Speech technology began to change the way we live and work, and some devices became the primary means of interacting with them By [46], there are several key areas in which progress allows for this trend
The first area is Moore's law, which states that the number of transistors in a dense integrated circuit doubles approximately every two years [47]. As a result, the computing power of CPU/GPU clusters also roughly doubles every two years. This makes training more complex and powerful models possible, and therefore the error rate of ASR systems is lower.
The second area is data availability: the advent of the Internet and cloud computing has led to an exponential increase in available data. This vast amount of data enables the training of models on larger datasets, resulting in more robust and reliable models that are not merely hypothetical but grounded in real-world data and insights. This data-driven approach allows models to capture complex relationships and patterns, improving their accuracy and overall effectiveness.
The third is that portable, wearable, and smart living-room devices and in-car infotainment systems [48] have become quite popular. Since alternative means of interaction, such as a keyboard and mouse, are almost impossible to use in these settings, verbal communication, which is natural for humans, becomes more convenient.
There are several ASR approaches based on different models, such as the Gaussian mixture model or the HMM. Since this thesis concerns Neural Networks (NN), the next section goes through RNN models in ASR.
2.13.1 Speech Recognition Using Recurrent Neural Networks
In this section, the author discusses the research by Aditya Amerkar, Gaurav Deshmukh, Parikshit Awasarmol and Piyush Dave on "Speech Recognition using RNNs" [49]. This work clarifies the differences between the two types of networks and their relative performance.
RNNs possess the ability to process information sequentially, unlike CNNs, which only consider the current state of a neuron This is due to the presence of memory in RNN neurons, enabling them to analyze individual input while retaining information from previous inputs Consequently, while CNNs excel in tasks such as speech recognition of single or few words, RNNs become essential for more complex projects involving longer speech sequences or sentences.
2.13.2 Speech-to-Text Using Deep Learning
While the primary focus of this thesis is the examination of speech-to-text methodologies, Bredmar's work is remarkable in that it goes beyond mere conversion from "voice" to "text". It transforms voice into a representation that can be translated directly into another language, bypassing the traditional intermediary step of textual representation [50]. In contrast, many comparable projects, including the widely used Google Translate, perform speech-to-speech conversion through the three-step process listed below.
Initially, the speech-to-text model translates voice into written text
Subsequently, the written text undergoes translation into the desired language
Finally, the newly generated text is reconverted into audio format using a speech synthesis model, speech synthesizer, or text-to-speech model
The inherent flaw in this approach lies in the fact that the "language translation" phase occurs within the text domain: the speaker's voice characteristics, such as emotion or accent, are completely discarded during the speech-to-text step, and this loss is not accounted for in the subsequent text-to-speech step. Advances in NN and deep learning now make it possible to avoid this intermediate text representation altogether.
Bredmar's work seeks to rectify this issue by eliminating the intermediate text-to-text translation step. Instead, it constructs an LSTM neural network that directly takes audio files containing voices in a specific language as input. The output is another audio file that preserves the vocal characteristics of the original input, thereby achieving authentic speech-to-speech translation.
A dedicated dataset was curated for training the model. Bredmar advocates constructing a dataset with sentences of diverse lengths and tones, simulating common conversations. These sentences were sourced from dubbed films, chosen for the abundance of French dubbing, and the dataset aimed to encapsulate the nuances of different voices while maintaining voice characteristics. The final dataset comprised 16 movies, generating around 700 sentences per movie, ranging from 3 to 50 seconds, for a total of roughly 11,000 sentences. An LSTM network with two layers of 800 neurons each was employed for this task, using the Google TensorFlow library.
Adjusting the network size by increasing the number of neurons enhances network performance. Figure 2.26 demonstrates improved efficiency with larger networks, indicated by a reduced loss function. However, the performance gain is modest compared to the increase in network size. Notably, larger networks consistently yield lower errors during training and evaluation.
Figure 2.26 Graph showing how the loss function changes depending on the size of the trained network [50]
The second experiment investigates the impact of training set size on the loss function. Figure 2.27 reveals significant observations: the loss function decreases as the training set grows, and the reduction is more pronounced during the initial increases. This demonstrates that a larger training set is crucial for a more accurate model.
Figure 2.27 Graph showing how the loss function changes depending on the size of the training set [50]
The outcomes of this endeavor are particularly noteworthy. The experiment indicates that the LSTM network architecture exhibits ample complexity for handling such sequences, as evidenced by the training and test errors. The author acknowledges a limitation: the network's current size struggles to effectively manage more extensive or intricate datasets, exhibiting signs of underfitting. Nevertheless, the positive outcome underscores the potential of speech-to-speech translation without relying on intermediary speech-to-text or text-to-speech models. It also highlights the feasibility of preserving some speech characteristics from the input source, pending the enhancement of the network's architecture to accommodate greater complexity.
PROPOSED MODEL
Transformer Model
The appearance of the attention mechanism in seq2seq models has revolutionized the field of NLP. Initially, seq2seq models relied on encoder-decoder frameworks, typically built from RNNs, to transform sequences from one domain to another, such as translating a sentence from one language to another. However, RNNs and their advanced variants, LSTM and GRU, had limitations: the vanishing gradient problem made it difficult to capture long-range dependencies, and their strictly sequential processing made them computationally inefficient.
The attention mechanism addressed these limitations by allowing the decoder to focus on different parts of the input sequence at each step of output generation. Instead of compressing the entire source sentence into a fixed-size vector, the model could learn to attend to the most relevant parts of the input sequence, thereby capturing the nuances of longer sequences more effectively.
One of the notable types of attention mechanisms is the one proposed by Luong et al. [21], often referred to as Luong's attention. In contrast to the earlier "Bahdanau attention", Luong's attention proposes a more efficient approach by simplifying the alignment function, the component that scores how well the inputs around position j match the output at position i. Luong's attention offers several alignment functions, namely dot, general, and concat, providing a more versatile and computationally efficient way to learn the dependencies between source and target sequences.
The development of the Transformer model leveraged the concept of attention further by using self-attention, also known as intra-attention, which allows each position in a sequence to attend to all positions within the same sequence. This mechanism enabled the model to capture contextual information from the entire sequence, making it more effective than RNN-based seq2seq models, particularly for tasks requiring an understanding of long-range dependencies.
GPT models build upon the Transformer architecture, using self-attention to predict each subsequent token in a sequence. The distinction here is that GPT, as an autoregressive language model, uses masked self-attention in its decoder blocks so that each token can only attend to previous tokens, thus ensuring causality in text generation.
Figure 3.2 The Transformer and GPT Architecture [22][23]
Luong's attention mechanism is specifically a form of global attention, which considers all the hidden states of the encoder when generating each token. In contrast, self-attention in the Transformer model is more general; it is not just about aligning two sequences (as in translation) but about modeling all pairwise interactions within a single sequence for a variety of tasks. This kind of self-attention, which has been integral to the success of Transformer models, does not derive directly from Luong's attention but rather expands on the concept of attention as a whole, providing a more flexible and powerful way to model relationships in data.
The potency of the attention mechanism was notably showcased in the paper "Attention Is All You Need" [22]. The Transformer neural network architecture introduced there revolutionized the field of natural language processing. Departing from traditional recurrent encoder-decoder models, the Transformer relies exclusively on attention mechanisms, omitting recurrent or convolutional layers. The architecture consists of two primary blocks: an Encoder block and a Decoder block. The Encoder block captures contextual representations of input sequences, while the Decoder block uses these representations to generate output sequences.
The encoder block of the Transformer architecture, as delineated in the accompanying figure, is a testament to the model's innovative approach to processing sequential data. This block is the left block of Figure 3.2 and has the following parts.
Firstly, input embeddings. The operation commences with input embedding, where each token of the input sequence is translated into a high-dimensional vector, laying the groundwork for subsequent interpretative processing.
Secondly, positional encoding. Positional encodings are added to the input embeddings to infuse the sequence with temporal structure, compensating for the Transformer's lack of inherent sequential processing and ensuring that the model remains aware of the order of tokens within the sequence. Ashish Vaswani and colleagues used sine and cosine functions to impart positional information within the Transformer model: for each even-indexed element of the embedding, a sine function of the position is employed, and for each odd-indexed element, a cosine function. These values are added to the corresponding input embeddings, effectively encoding the position of each vector within the sequence. The selection of sine and cosine functions is strategic: for any fixed offset, the encoding of a later position can be expressed as a linear function of the encoding of an earlier one, which allows the model to readily learn and attend to relative positional dependencies.
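A minimal NumPy sketch of this sinusoidal encoding (the function name and dimensions are illustrative; the base of 10000 follows Vaswani et al. [22]):

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding: sine on even dimensions, cosine on odd ones."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                       # (1, d_model)
    # Each dimension pair (2i, 2i+1) shares the wavelength 10000^(2i / d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                         # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                    # even indices -> sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                    # odd indices -> cosine
    return pe

# The encoding is simply added to the token embeddings:
# x = token_embeddings + positional_encoding(seq_len, d_model)
print(positional_encoding(50, 512).shape)  # (50, 512)
```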
Thirdly, the Encoder layer's role within the Transformer framework is to transform input sequences into a sophisticated continuous representation that encapsulates the information learned from the entire sequence. This is accomplished through two core sub-modules: the multi-headed attention mechanism and a subsequent fully connected feed-forward network. In addition to these, residual connections are strategically placed around each sub-module to preserve the flow of information, and these are immediately followed by layer normalization to ensure consistent learning dynamics across the network's depth.
Within the encoder, the multi-headed attention mechanism employs a variant of attention known as self-attention. This process enables the model to correlate each word in the input with the other words. For instance, it might discern a contextual relationship between the words "you," "how," and "are." The model could also recognize that a sequence patterned in such a manner generally constitutes a question, prompting a suitable response. To realize self-attention, the input is passed through three separate fully connected layers to generate the corresponding query, key, and value vectors, which are fundamental to this attention process.
The linear layer processes the query, key, and value vectors, and a dot product matrix multiplication is performed between queries and keys to calculate a score matrix This matrix signifies the attention or focus each sequence word receives from other words Each word obtains a score for every other word at a specific time-step, with higher scores indicating greater attention or relevance This scoring mechanism maps queries to corresponding keys, allowing the self-attention process within the network.
Figure 3.5 Dot Product of Query and Key
The attention scores are scaled down by dividing them by the square root of the dimensionality of the query and key vectors. This scaling is crucial: without it, large dot products would push the SoftMax into regions with extremely small gradients and destabilize training, severely impeding the model's ability to learn. By scaling down, the model maintains more stable gradients, which allows for smoother optimization and a more reliable learning process.
Figure 3.6 Scaling Down the Attention Scores
Following the scaling of the attention scores, a SoftMax function is applied to these adjusted values to derive the attention weights The SoftMax operation effectively normalizes the scaled scores into a distribution of probabilities ranging from 0 to 1 This transformation accentuates higher scores while diminishing lower ones, creating a contrast that amplifies the significance of more relevant words As a result, the model assigns higher probabilities to words that are deemed more important within the context of each query, thereby refining its focus The SoftMax function, in essence, enables the model to make probabilistic decisions about which words to attend to with greater confidence, enhancing the specificity and clarity of the attention mechanism
Figure 3.7 SoftMax of the Scaled Scores
The obtained attention weights are then multiplied by the value vectors, resulting in an output vector This operation effectively scales the value vectors by the attention probabilities, allowing the model to emphasize the representation of words it has deemed more significant through higher SoftMax scores Conversely, the representations of less relevant words are attenuated, as their corresponding lower attention weights lead to a diminishment in their influence on the output vector This filtered output, which is a weighted sum of the value vectors, is then passed through a linear layer for further processing The linear layer functions as a transformation step, integrating and refining the weighted information in preparation for subsequent stages in the model
Figure 3.8 Multiply SoftMax Output with Value Vector
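Putting these steps together, the following is a minimal NumPy sketch of scaled dot-product self-attention as just described (shapes and names are illustrative, not the thesis's code; the optional mask argument corresponds to the masked self-attention used in GPT-style decoders):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K: (seq_len, d_k), V: (seq_len, d_v), produced by separate linear layers."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # dot product of queries and keys, scaled
    if mask is not None:
        scores = np.where(mask, scores, -1e9)      # e.g. causal mask for autoregressive decoding
    weights = softmax(scores, axis=-1)             # attention probabilities per query
    return weights @ V, weights                    # weighted sum of value vectors

# Toy example: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)  # (4, 8) (4, 4)
```

In the full model, this output is then passed through the linear projection described above before the feed-forward sub-module.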
Evaluation
This section presents the metrics used to evaluate the proposed models.
F1-Score, Precision, Recall: These metrics are particularly relevant for evaluating the accuracy and relevance of chatbot responses They are calculated as follows:
Precision for Chatbot Responses: Precision assesses the ratio of relevant and accurate responses generated by the chatbot to the total number of responses it produces. It is calculated using the formula:
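Written out, consistent with the definition above:

$$ \text{Precision} = \frac{\text{number of relevant and accurate responses generated by the chatbot}}{\text{total number of responses generated by the chatbot}} $$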
A high precision score implies that the chatbot predominantly provides correct and relevant responses
Recall for Chatbot Responses: Recall quantifies the ratio of relevant and accurate responses generated by the chatbot to the total number of correct and relevant responses present in the reference set. It is calculated using the formula:
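Written out, consistent with the definition above:

$$ \text{Recall} = \frac{\text{number of relevant and accurate responses generated by the chatbot}}{\text{total number of relevant responses in the reference set}} $$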
A high recall score indicates that the chatbot captures a significant portion of the relevant responses
F1-Score for Chatbot Responses: The F1-score is the harmonic mean of precision and recall, providing a holistic evaluation of response quality. It is calculated using the formula:
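In its standard form:

$$ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$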
The F1-score offers insights into the relevance and comprehensiveness of the chatbot's responses
BLEU: Originally designed for machine translation, BLEU can be adapted to assess chatbot responses. It evaluates the similarity between the chatbot's responses and reference responses provided by human evaluators. BLEU is calculated using a formula that considers the precision of n-grams in the generated response compared to the reference responses:
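In the standard BLEU formulation (assumed here), with uniform n-gram weights $w_n = 1/N$ up to a maximum n-gram order $N$:

$$ \text{BLEU} = BP \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right) $$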
BP stands for Brevity Penalty, which penalizes the score when the generated response is too short compared with the reference answer. The mathematical expression for BP is given as follows:
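In the standard formulation, with $c$ the length of the generated (candidate) response and $r$ the reference length:

$$ BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases} $$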
Pn is the n-gram modified precision score. It can be defined as follows:
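Using the standard clipped-count definition, in which each n-gram count in the candidate is clipped to its maximum count in the reference responses:

$$ p_n = \frac{\sum_{\text{n-gram} \in \text{candidate}} \mathrm{Count}_{\mathrm{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in \text{candidate}} \mathrm{Count}(\text{n-gram})} $$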
Fallback Rate: This refers to the frequency at which a chatbot fails to understand or process a user's request and "falls back" on a default response This could be a message like "I'm sorry, I don't understand" or "Can you rephrase that?"
A lower fallback rate indicates that the chatbot is more capable of understanding and responding to a variety of inputs without needing to revert to a generic response
Containment Rate: This is the percentage of interactions that are successfully resolved within the chatbot without needing to escalate to a human agent A high containment rate is usually desirable, as it indicates the chatbot can handle a large volume of queries, which reduces the workload on human agents and can lead to cost savings
Handoff Rate: This refers to the frequency with which a chatbot must transfer a user to a human agent The handoff can occur because the chatbot has reached the limits of its capabilities, the user has requested to speak to a human, or the query is too complex or sensitive for the bot to handle A lower handoff rate can be a sign of a more effective chatbot, assuming user issues are being resolved satisfactorily
WER quantifies the words that the system has missed or misrecognized, counting substitutions, deletions, and insertions against a reference transcript, through the following equation:
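In its standard form, with the variables defined below, this is:

$$ \mathrm{WER} = \frac{S + D + I}{N} = \frac{S + D + I}{S + D + C} $$

where: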
S is the number of substitutions,
D is the number of deletions,
I is the number of insertions,
C is the number of correct words,
N is the number of words in the reference (N = S + D + C)
PER is a measure of how well a probability distribution or probability model predicts a sample In the context of language models, perplexity is often used to evaluate how well a model predicts a sequence of words
Perplexity is calculated using the probability assigned by the model to the actual data For a given sequence of words W = w1, w2, …, wn, the perplexity PP(W) is calculated as:
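In the standard form, written here with the per-word probabilities defined below:

$$ PP(W) = P(w_1, w_2, \ldots, w_n)^{-\frac{1}{n}} = \left( \prod_{i=1}^{n} P(w_i) \right)^{-\frac{1}{n}} $$

where: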
P(w_i) is the probability assigned by the model to the i-th word in the sequence,
n is the total number of words in the sequence
IMPLEMENTATION
Dataset
In the context of creating a Seq2seq system for a chatbot, numerous datasets have been made available for research and practical model development. In this master's thesis, the student conducts experiments using seven datasets, which consist of approximately 20,000 question-answer pairs collected from advisory websites associated with different hospitals. The following are the specifics:
The general hospital in the Thu Duc region can be found at the following URL: http://www.benhvienkhuvucthuduc.vn/hoi-dap This website comprises approximately 250 pages, with each page containing 10 question-answer pairs
The Viet Duc Friendship Hospital can be found at the following URL: https://benhvienvietduc.org/hoi-dap It contains approximately 450 question-answer pairs
The MEDLATEC General Hospital can be found at the following URL: https://medlatec.vn/hoi-dap It contains approximately 590 pages, with 10 question-answer pairs on each page
The Thu Cuc International General Hospital is located at the following URL: https://benhvienthucuc.vn/hoi-dap-chuyen-gia It consists of approximately 200 pages, with 10 question-answer pairs on each page
The Vinmec International General Hospital Joint Stock Company is located at the following URL: https://www.vinmec.com/vi/tin-tuc/hoi-dap-bac-si/ It consists of approximately 560 pages, with 10 question-answer pairs on each page
The Binh Duong Obstetrics and Pediatrics Hospital is located at the following URL: https://benhvienpsnbd.com.vn/hoi-dap/trang/1.htm It consists of approximately 390 pages, with 5 question-answer pairs on each page
The Hanoi Traditional Medicine Multidisciplinary Hospital is located at the following URL: http://bvdkyhoccotruyenhanoi.vn/bacsituvan/index?page=1 It consists of approximately 95 pages, with 8 question-answer pairs on each page
Figure 4.1 Example of datasets for Seq2seq model
In the development of an Automatic Speech Recognition (ASR) system for the chatbot, a dataset exceeding 20 hours of audio recordings was compiled. These recordings, initially sampled at a rate of 44.1 kHz, were sourced from news programs across diverse television channels in Vietnam. Concurrently, subtitles were extracted from the corresponding YouTube videos. Through meticulous data preprocessing, the audio files were converted to a 16 kHz format and segmented into shorter clips of 3 to 4 seconds, aligned with the extracted subtitles.
The primary objective is to leverage this curated dataset for the comprehensive training of the ASR model It's noteworthy that the audio files, stemming from various news programs, do not necessitate a shared thematic focus, such as healthcare, to be instrumental in the ASR model training process.
Experiment
Experiment 1 for Seq2seq: Comparing the models used in the task of intent and entity recognition
In the experiment of the thesis, for each question’s intent recognition task, the system employs the two following main models
GPT-3, based on the Transformer: uses a decoder-only Transformer architecture with masked self-attention to understand the input and generate the output sequence. This model excels in overall performance but may be computationally expensive.
Bidirectional GRU and Bidirectional LSTM with Luong's attention: read the question from both directions and employ an attention mechanism to focus on important information during the decoding process, improving the model's ability to predict and comprehend the intent of the question.
AVG Execution Time (seconds/pair)
In terms of accuracy metrics (precision, F1-Score, BLEU-Score), GPT-3 stands out as the superior model, demonstrating a higher likelihood of providing correct answers. Meanwhile, Bi-GRU, although not matching GPT-3 in accuracy, offers a more efficient processing speed, outperforming GPT-3 in execution time. Bi-LSTM, with the lowest scores among the models, presents a trade-off between accuracy and efficiency, positioning itself as an intermediate option.
Experiment 2 for Seq2seq: Chat handoff and fallback
Table 4.3 Chat Handoff and Fallback
Model Total number of interactions
GPT-3 outperforms other models in accuracy and reliability, demonstrating its capability for standalone interactions This is evident in its 0% fallback rate and 100% containment rate, significantly higher than Bi-GRU and Bi-LSTM, which have containment rates of 40% and 28%, respectively.
While GPT-3 does exhibit a relatively high handoff rate, it does not adversely affect overall performance, as all interactions are successfully contained On the contrary, the higher fallback and handoff rates observed in Bi-GRU and Bi-LSTM models indicate their comparative limitations as chatbot models within this experiment GPT-3's ability to provide accurate responses and handle interactions independently positions it as the more effective choice for the specified chatbot application
The following is the conversation with the chatbot after the final test
Table 4.4 Conversation Log with Chatbot
1 Cho em hỏi bệnh viện có khám nam khoa không ạ
Bệnh viện có khám nam khoa bạn đăng ký tại phòng khám nam khoa (số 2 – khoa khám bệnh – phòng số 2)
2 Bé nhà em nay được 14 tháng 22 ngày rồi, bé cao 80cm và nặng
13,4kg, bé đã biết đi vững rồi
Nhưng tới bây giờ bé chỉ mọc được 4 cái răng cửa trên và dưới từ lúc 6 tháng tới giờ Em không biết bé có bị thiếu chất gì không
Làm cách nào để kiểm tra xem bé bị gì không ạ bạn thân mến!cảm ơn bạn đã gửi câu hỏi tới icnm,trường hợp của bé có thể bị các bệnh lý về răng, nhưng không biết được bệnh lý gì bạn nên đưa bé đến bệnh viện để được thăm khám và tư vấn cụ thể.mọi chi tiết về dịch vụ cần được hỗ trợ thêm
3 Cho em hỏi Sau khi sanh mổ yêu cầu tối đa thì khoảng thời gian bao lâu mẹ được gặp bé của mình ạ chào bạn, sau khi sanh mổ yêu cầu, một số bệnh nhân không có thể điều trị được bạn nên đến khám trực tiếp để được bác sĩ tư vấn và điều trị cho mình bạn có thể đến tại các bệnh viện thuộc hệ thống y tế vinmec để được bác sĩ tư vấn thêm thân mến chào bạn
4 thai 4 tuần chi phí phá thai như thế nào ạ chào bạn,cảm ơn bạn đã gửi câu hỏi tới icnm,thai 4 tuần chi phí phá thai là 1,5 triệu đồng
5 Chi phí khám phụ khoa nữ bao nhiêu chào bạn,về chi phí khám phụ khoa nữ bạn có thể tham khảo chi phí tại đây https//www.facebook.com/bophetm.vn/ thông tin đến bạn!
6 em bị sâu răng số 6-7 chỉ còn chân răng ko bị nhức em có nên nhổ ko ạ tại em sợ nhổ sẽ bị tiêu xương hàm chào bạn!cảm ơn bạn đã gửi câu hỏi tới icnm,trường hợp của bạn có nhổ răng số 6-
Nếu bạn đang gặp vấn đề về răng miệng như đau chân răng hay nhức răng, việc đến bệnh viện để được thăm khám và điều trị là điều cần thiết Tại bệnh viện, bạn sẽ được các bác sĩ chuyên khoa kiểm tra tình trạng răng miệng, xác định nguyên nhân gây đau và đưa ra phương pháp điều trị phù hợp Đừng ngần ngại đến bệnh viện để được tư vấn và chăm sóc răng miệng tốt nhất, tránh để tình trạng trở nên nghiêm trọng hơn.
7 e có thai lần 3, 2 lần trước e sinh e bé xg nhưng có sữa mặc dù đã cố gắng đủ mọi cách, sau đó phải nuôi con bằng sữa công thức lần này e đc 22 tuần nhưng e thấy ngực không phát triển mấy e muốn hỏi bác sĩ biết có cách nào để e có thể nuôi con bằng sữa mẹ sau khi sinh con không? chào bạn!cảm ơn bạn đã gửi câu hỏi tới icnm,trường hợp của bạn có thai lần 3, có thể bạn cần đến bệnh viện để được thăm khám và điều trị bạn có thể đến tại các bệnh viện thuộc hệ thống y tế vinmec để được bác sĩ tư vấn thêm.mọi chi tiết về dịch vụ
Bạn đến viện thăm khám, bác sĩ chỉ định làm xét nghiệm máu, cho thấy chỉ số % mono là 9,5/8, chỉ số đông máu bình thường Sau khi tụ máu ở khuỷu tay 1 tuần và hơi nhức, bạn không can thiệp nhưng vết tụ máu nhạt dần thì nhấn vào thấy cục u tròn, mềm, gây đau Cánh tay bạn vẫn đau nhức Cần xác định rõ thời gian xuất hiện cục u: trước hay sau khi tụ máu.
9 chào bác sĩ bé nhà em từ lúc sinh ra bé bị mot vết bớt ở mặt, nó có màu nâu đen và nhiều lông nhờ bác sĩ tư vấn giúp em đó là bệnh gì và có nguy hiểm vói bé không? chào bạn,cảm ơn bạn đã gửi câu hỏi tới icnm,với câu hỏi của bạn, tôi xin được trả lời như sauvết bớt ở mặt là bệnh lý nhưng không phải là bệnh lý nào vết bớt ở mặt có thể do nhiều nguyên nhân như viêm màu, viêm màu, viêm màu nhỏ, viêm màu nhỏ,
10 xin chào bác sĩ! vợ chồng con cưới nhau 3 năm mà vẫn chưa có em bé, con thì bình thường nhưng chồng con bị tinh trùng yếu di động pr tiến tới 6% xin bác sĩ cho con lời khuyên với kết quả này con muốn bơm tinh trùng vào tử cung thì có khả năng ko hay phải đi thụ tinh ạ? chào chị!cảm ơn chị đã gửi câu hỏi đến icnm,vợ chồng chị đã 3 năm mà vẫn chưa có em bé, vẫn có tinh trùng yếu tinh trùng yếu có thể do nhiều nguyên nhân, nhưng không phải là do tinh trùng yếu với những nguyên nhân này, chị cần đến viện để được khám và làm các xét nghiệm đán
11 bác sĩ ơi cho e hỏi, khi có bầu mình có thể để tay lên bụng nhẹ nhàng được không ạ? mỗi khi con đạp e hay đặt nhẹ tay lên bụng , lấy ngón tay dí nhẹ vào chỗ con đạp thôi chứ không xoa bụng , vậy có ảnh hưởng gì không ạ em cảm ơn ạ bạn thân mến!cảm ơn bạn đã gửi câu hỏi tới icnm,bạn có thể để tay lên bụng, nhẹ nhàng, đặt nhẹ tay lên bụng, lấy ngón tay đạp thôi chứ không xoa bụng, có thể là do bất thường như bạn đã đi khám và điều trị bằng thuốc tuy nhiên, bạn cần đến khám trực tiếp
Discussion
From the experimental results, the following assessments can be made:
While the chatbot was successful in identifying the most significant intent, it became evident that the dataset's size and diversity are limiting factors Expanding the dataset with more varied content is crucial to enhance the chatbot's ability to engage in meaningful conversations and provide relevant responses to a broader range of user inquiries
The importance of topic-specific datasets and tailored training was underscored during the evaluation It became evident that training the chatbot on specific domains or industries is essential for achieving high accuracy in its responses A lack of domain-specific data may lead to incorrect or irrelevant answers
The experimental task shed light on the limitations of applying the chatbot in the healthcare sector The chatbot's performance was less than satisfactory for issues requiring a high degree of precision, especially those related to human health, medical advice, data security, and academic expertise Further development and fine-tuning are necessary to make the chatbot a reliable resource in such critical areas
The accuracy of Wav2vec 2.0 is remarkably high when applied to Vietnamese datasets. Training with fewer than 10 epochs produced suboptimal results for English-transcribed names and numbers, but scenario three led to significantly improved accuracy. The experimental dataset used for evaluation is entirely distinct from the training dataset, consisting of 60 files with durations ranging from 2 to 3 seconds per file. The model's response time is 10.4 seconds, showcasing its potential for integration with a seq2seq model to develop a voice-input-supported chatbot.
Building a voice chatbot comes with several challenges that need to be addressed for successful implementation as below:
Effective chatbot communication hinges on coreference resolution, the ability to understand references to previously mentioned entities In Vietnamese particularly, speakers employ a variety of pronouns and colloquialisms that require this capability Without contextual information and a mechanism for resolving coreferences, chatbots struggle to identify the intended referents and can produce responses that lack precision, alignment with domain expertise, and relevance to the user's question This challenge is especially pronounced in extended conversations.
Due to resource limitations, the experimentation process has only been conducted on a dataset consisting of 20,000 question-answer pairs for Seq2seq model and 20 hours of audio files The experimentation process has not yet been able to collect large-scale datasets from various topics in real life for model training
As a result of time constraints during the thesis project and limited available efforts, the observed outcomes have thus far been limited to relatively simple and academic experiments. They have not yet been extended to practical, state-of-the-art models to demonstrate the model's real-world applicability, stability, and effectiveness.
CONCLUSION
In conclusion, this master's thesis has addressed several key aspects of the development of voice-to-text chatbots with enhanced semantic understanding. Through the comparative analysis of traditional Seq2Seq models and GPT-3 decoder-only approaches customized for the Vietnamese language, insights have been gained into the strengths and weaknesses of different model architectures, particularly in the context of structured dialogue processing within the healthcare domain.
ASR model enhancement focuses on improving accuracy through careful experiment selection and evaluation This approach addresses the challenges of noisy environments and acoustic variability, ultimately boosting ASR performance By optimizing model accuracy, it enhances the usability of ASR technology, particularly in applications demanding accurate transcription of spoken input.
Furthermore, the integration of text-to-text and speech-to-text models holds promise in developing chatbots capable of supporting voice interactions in Vietnamese across healthcare domains By leveraging advanced natural language processing techniques, dialogue management systems, and machine learning algorithms, these chatbots aim to engage in meaningful real-time conversations, providing enhanced user experiences and utility
Future research in the realm of chatbots employing the GPT-3 model holds promising avenues for advancement While current iterations of GPT-3 have demonstrated impressive natural language understanding and generation capabilities, there are several intriguing directions for further exploration
Enhanced Domain Specificity: One key area of future research involves refining GPT-3 for domain-specific applications This entails fine-tuning the model on specific domains such as healthcare, finance, or legal, allowing chatbots to provide more accurate and specialized responses within these fields
Semantic Depth: the voice-to-text chatbot's capabilities will be enhanced to include deeper semantic understanding, enabling it to comprehend and interact with previous questions and answers in real-time conversations. This enhancement involves integrating advanced natural language processing techniques, such as coreference resolution and sentiment analysis, to grasp context and extract relevant information. Dialogue management systems will help maintain coherence and flow in conversations by tracking dialogue history and providing contextually relevant responses.
Conversational Depth: Increasing the depth and coherence of chatbot conversations is an ongoing challenge Future research should aim to develop models that can engage in more extended and context-aware dialogues
Scalability: As GPT-3 models are resource-intensive, optimizing them for scalability is crucial Future research should focus on efficient model architectures and training methods
Overall, the contributions presented in this thesis pave the way for future research and advancements in voice-to-text chatbots, with a focus on deep semantic understanding and real-time interaction capabilities As technology continues to evolve, these developments will play a vital role in revolutionizing communication and accessibility in various domains, ultimately benefiting users and stakeholders alike
[1] Yacine Ikkache et al "Towards Quranic Reader Controlled by Speech" In: (IJACSA) International Journal of Advanced Computer Science and Applications, Vol 2, No 11, 2011
[2] Geoffrey Hinton et al “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups” In: IEEE Signal Processing Magazine, pp.82-97, 2012
[3] Karel Veselỳ et al “Sequence-discriminative training of deep neural networks” In: Interspeech.2013-548, 2013
[4] Warren S McCulloch and Walter Pitts “A logical calculus of the ideas immanent in nervous activity” In: Bulletin of Mathematical Biology, Vol 52,
[5] Fan Mei, Hong Chen and Yingke Lei "Blind Recognition of Forward Error Correction Codes Based on Recurrent Neural Network" In: Sensors, Vol 21, No 3884, https://doi.org/10.3390/s21113884, 2021
[6] Pavan Kumar T M et al "Predicting Volatility of Stocks for Time Series Data Using Statistical Forecasting Models" In: International Journal of Engineering Applied Sciences and Technology, Vol 6, Issue 8, ISSN No 2455-
[7] Colah's blog "Understanding LSTM Networks" URL: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ Accessed: Dec 15 2023
[8] Kyunghyun Cho et al "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" In: arXiv:1406.1078v1, 2014
[9] Michael Phi "Illustrated guide to LSTM's and GRU's: A step by step explanation" URL: https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21 Accessed: Dec 15 2023
[10] Ho Chung Wu and Robert Wing Pong Luk "Interpreting TF-IDF term weights as making relevance decisions" In: ACM Transactions on Information System,
[11] Michael Cochez et al “A first experiment on including text literals in KgloVe” In: arXiv:1807.1176v1, 2018
[12] Sepp Hochreiter et al "Long short-term memory" In: Neural Computation,
[13] Ronald J Williams “Simple statistical gradient-following algorithms for connectionist reinforcement learning” In: Machine learning, pp 229-256,
[14] Lalit R Bahl et al “Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition” In: Proceedings – ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, 1986
[15] Ilya Sutskever et al “Sequence to sequence learning with neural networks” In: arXiv:1409.3215v3, 2014
[16] Ping Jiang “Nowcasting Influenza Using Google Flu Trend and Deep Learning Model” In: Advances in Economics, Business and Management Research, Vol 155, 2020
[17] David Barry "What's New in Answering Services for Small Businesses?" URL: https://www.soundhound.com/voice-ai-blog/whats-new-in-answering-services-for-small-businesses/ Accessed: Dec 15 2023
[18] Colah's blog "Neural Networks, Types, and Functional Programming" URL: https://colah.github.io/posts/2015-09-NN-Types-FP/?ref=blog.paperspace.com Accessed: Dec 15 2023
[19] NIC: 13771 "PARRY Encounters the DOCTOR" URL: https://www.rfc-editor.org/rfc/rfc439 Accessed: Dec 15 2023
[20] Sean Robertson "Practical Pytorch: Translation with a Sequence to Sequence Network and Attention" URL: https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation-batched.ipynb
[21] Minh-Thang Luong et al “Effective Approaches to Attention-based Neural Machine Translation” In: arXiv:1508.04025v5, 2015
[22] Ashish Vaswani et al "Attention Is All You Need" In: arXiv:1706.03762v7, 2017
[23] Alec Radford et al “Improving language understanding by generative pre- training” In: OpenAI, 2018
[24] Peter J Liu et al “Generating wikipedia by summarizing long sequences” In: arXiv:1801.10198v1, 2018
[25] Zichao Yang et al “Hierarchical Attention Networks for Document Classification” In: Proceedings of NAACL-HLT, pp 1480-1489, 2016
[26] Dzmitry Bahdanau et al “Neural Machine Translation by Jointly Learning to Align and Translate” In: arXiv:1409.0473v7, 2014
[27] Jason Weston et al "Memory Networks" In: arXiv:1410.3916v11, 2015
[28] Jiasen Lu et al "Hierarchical question-image co-attention for visual question answering" In: arXiv:1606.00061v5, 2017
[29] Baosong Yang et al "Convolutional Self-Attention Networks" In: arXiv:1904.03107v1, 2019
[30] Zhilin Yang et al “XLNet: Generalized Autoregressive Pretraining for Language Understanding” In: arXiv:1906.08237v2, 2020
[31] Jacob Devlin et al “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” In: arXiv:1810.04805v2, 2019
[32] Guillaume Lample and Alexis Conneau “Cross-lingual Language Model Pretraining” In: arXiv:1901.07291v1, 2019
[33] Iz Beltagy et al “Longformer: The Long-Document Transformer” In: arXiv:2004.05150v2, 2020
[34] Angela Fan et al “Beyond English-Centric Multilingual Machine Translation” In: arXiv:2010.11125v1, 2020
[35] Colin Raffel et al “Exploring the Limits of Transfer Learning with a Unified Text-to-text Transformer” In: arXiv:1910.10683v4, 2019
[36] Teven Le Scao et al “Bloom: A 176b-parameter open-access multilingual language model” In: arXiv:2211.05100v4, 2023
[37] Hugo Touvron et al "Llama: Open and efficient foundation language models" In: arXiv:2302.13971v1, 2023
[38] Aakanksha Chowdhery et al “Palm: Scaling language modeling with pathways” In: arXiv:2204.02311v5, 2022
[39] Federica Laricchia "Barriers to voice technology adoption worldwide 2020" URL: https://www.linkedin.com/pulse/problem-current-speech-recognition-technology-dani-cherkassky/ Accessed: Dec 15 2023
[40] David E Rumelhart et al "Backpropagation: Theory, Architectures, and Applications" In: Mathematical Perspectives on Neural Networks, 1996
[41] Karel Veselỳ et al "Sequence-discriminative training of deep neural networks" In: Proc Interspeech, pp 2345-2349, 2013
[42] Frederik Bredmar "Speech-to-speech translation using deep learning" In: Department of Computer Science and Engineering, University of Gothenburg, 2017
[43] Connor Shorten “Introduction to ResNets” URL: https://towardsdatascience.com/introduction-to-resnets-c0a830a288a4
[44] Colah's blog "Understanding LSTM Networks" URL: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ Accessed: Dec 15 2023
[45] Nitish Srivastava, Geoffrey Hinton et al "Dropout: A simple way to prevent neural networks from overfitting" In: Journal of Machine Learning Research, Vol 15, pp 1929-1958, 2014
[46] Dong Yu and Li Deng “Automatic Speech Recognition” Signals and Communication Technology, 2016
[47] Schaller, R R “Moore’s law: past, present and future” In: IEEE Spectrum,
[48] Richard Werner “Developing an automatic speech recognition system based on Czech spoken language” Master’s thesis Faculty of Information
[49] Aditya Amerkar et al "Speech Recognition using Recurrent Neural Networks" In: IEEE International Conference on Current Trends toward
[50] Geoffrey Hinton et al "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups" In: IEEE Signal Processing Magazine, Vol 29, No 6, pp 82-97, 2012
[51] Nataliya Shakhovska et al "Development of the Speech-to-Text Chatbot Interface Based on Google API" Google, 2019
[52] Nitish Srivastava et al "Dropout: A simple way to prevent neural networks from overfitting" In: Journal of Machine Learning Research, Vol 15, pp 1929-1958, 2014
[53] CollegeSidekick "Boundless Psychology" URL: https://www.collegesidekick.com/study-guides/boundless-psychology/neurons Accessed: Dec 15 2023
[54] Eric Jang et al “Categorical Reparameterization with Gumbel-Softmax” In: arXiv:1611.01144v5, 2017
[55] Alex Krizhevsky et al "ImageNet Classification with Deep Convolutional Neural Networks" University of Toronto, 2012
[56] Cobus Greyling "General Chatbot Architecture, Design & Development Overview" URL: https://cobusgreyling.medium.com/general-chatbot-architecture-design-development-overview-58e145398608 Accessed: Dec 15 2023
[57] Lara Lloret Iglesias et al “A primer on deep learning and convolutional neural networks for clinicians” In: Insight Imaging, 2021.