The Sequence-to-Sequence model is a neural network architecture known for its success in natural language processing tasks, and this thesis focuses on exploring, experimenting with, and optimizing it.
THESIS INTRODUCTION
Overview
The idea of human-computer interaction through natural language has long been imagined in Hollywood movies. C-3PO is one of the legends of the Rebel Alliance in the Star Wars universe. This robot has served several generations of Skywalkers and is one of the most distinctive personality droids in that galaxy. Throughout the series, C-3PO not only displays gestures and communication very similar to a human's, but sometimes gives its owners valuable guidance. These films were far ahead of their era in predicting the future of Artificial Intelligence (AI). The fictional Star Wars universe is set in a galaxy where humans and alien creatures live in harmony with droids, robots capable of assisting people in daily life or on journeys across other planets. Similarly, in the Iron Man films, Tony Stark relies on his supercomputer assistant, JARVIS, for support in everyday life and on missions to save the world alongside other superheroes.
In particular, the film A.I. Artificial Intelligence (2001), adapted from the short story Supertoys Last All Summer Long, is worth mentioning. It depicts a 22nd-century world in which rising sea levels have washed away coastal cities, causing a serious decline in population. The Mecha line of robots simulates real people and is designed to integrate with humans; these robots can think well but do not know how to express emotions.
The history of Interactive Voice Response (IVR) systems began in the 1930s with the creation of the Voder machine, the first technology to analyze the English language and produce human-like sounds. The original speech recognition systems were rudimentary and understood only numerals, because engineers considered human language too complex. In 1952, Bell Laboratories designed "Audrey," a system for recognizing spoken digits. Ten years later, at the World's Fair, IBM demonstrated its "Shoebox" system, which could recognize 16 different English words. The vision behind these projects was that users could communicate with computers through natural language and therefore would not have to learn any specific commands or prompts. However, understanding spoken language turned out to be quite complicated. It can be argued that only entities living in the real world can effectively understand language, because without context the meaning of a word is incomprehensible.
IVR systems were widely adopted by businesses in the 1990s, and call queuing and automated call routing became popular in the middle of that decade. In the late 1990s, the move of multimedia into call centers led companies to invest in IVR systems with computer telephony integration (CTI), which allowed businesses to tie their call centers into their marketing campaigns. Continuous improvements also made IVR cheaper for companies to deploy. Contemporary platforms emerged in the 2010s, with a clear emphasis on integrating IVR with comprehensive analytics, automated SMS messaging, and advanced call monitoring. Modern IVR systems are now part of a larger solution and enable seamless integration of customer communications across channels. Unlike bulky and expensive standalone systems, these all-inclusive platforms offer options, giving customers the opportunity to choose their preferred method of contact. Today, IVR is integrated into the overall customer experience: it comes with a personalized brand voice, protects customer data, and detects fraud and spam. Besides routing calls to the department best placed to address customer needs, the tool is now integrated into marketing efforts. The self-service model has evolved significantly with the arrival of conversational IVR, AI-enabled technology that replicates the experience of talking to a live agent. Today's IVR systems provide solutions to customers faster, even without a direct operator connection, and are useful across many industries and use cases.
IVR can help manage hotel reservations, pay bills, conduct market research, sell tickets, and present information about products and services. These functions meet real market needs: Zendesk's research shows that 69% of people try to solve their own problems before contacting customer service. However, businesses must ensure that they follow IVR self-service best practices to increase customer satisfaction; a poorly designed automated system can hurt a business, especially if it wastes consumers' time without solving their problems.
Figure 1.1 The History of IVR [17]
The basic and significant problem with most IVR systems is that, no matter how useful they may be in offering options that cover a wide range of customer queries and issues, most customers really just want to talk to a person directly. Nothing beats direct human-to-human communication. This may seem counterintuitive when an IVR system is used to reduce staffing costs, but adding the option to talk to an advisor on the main menu, without forcing customers to navigate a verbal maze of menu options, makes most customers more satisfied and less frustrated. Automatic speech recognition (ASR) is a classic feature of many IVR systems; it allows users to communicate with the system by voice instead of pressing keys on a phone or laptop, which can be difficult for some users. But if the ASR component cannot recognize what people are saying, it makes the system frustrating and, worse, useless. Therefore, ensuring accuracy and optimizing the interface design of ASR is essential to meet users' high expectations.
Research Problem
There are several challenges associated with current implementations of voice-to-text chatbots. The first is the quality and coherence of generative chatbots. They generate responses based on statistical patterns learned from large text datasets; while they can produce more diverse and flexible responses than retrieval-based models, the quality and coherence of their output can vary widely. They may generate nonsensical or contextually inappropriate responses, especially when faced with input they have not encountered during training.
The accuracy of ASR models is crucial to their effectiveness in IVR systems, which allow customers to interact via voice instead of keyboards. High accuracy is challenging to achieve, as evidenced by the significant share of respondents (73%) in a survey citing it as the primary obstacle to adopting speech recognition technology. Word Error Rate (WER) is a widely used metric for assessing the accuracy and performance of ASR systems.
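To make the metric concrete, the following minimal Python sketch (not part of the thesis implementation) computes WER as the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words; the example strings are illustrative only.

# Illustrative sketch: computing Word Error Rate (WER) with dynamic programming.
# WER = (substitutions + deletions + insertions) / number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("cách điều trị hen suyễn", "cách điều trị hen suyển"))  # 0.2 (1 error in 5 words)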
Lastly, ASR models often exhibit domain-specific performance. This means that if a model is trained exclusively on a particular dataset or within a specific domain (such as healthcare, finance, or tourism), it may encounter challenges in recognizing and processing inputs beyond that training domain. Consequently, this limitation can result in recognition errors and diminished accuracy when the model is confronted with new or unfamiliar data.
Target of the Thesis
The objective of this master's thesis is to develop and train an intelligent chatbot using freely available data sources from online forums, FAQs, and videos on YouTube. The specific goals include:
Data Collection and Processing: Research and develop effective methods for collecting dialogue data from free online sources. This involves identifying appropriate data sources and filtering and cleaning the data to ensure quality and reliability.
Analysis and Modeling: Analyze the characteristics of dialogue data, such as structure, context, and linguistic diversity. Develop suitable machine learning or deep learning models for training the chatbot, focusing on researching and constructing algorithms based on deep learning methods and large language models in a sequence-to-sequence format.
Chatbot Training: Apply advanced techniques in artificial intelligence and machine learning to train the chatbot to understand and respond accurately and naturally.
Evaluation and Improvement: Evaluate the performance of the chatbot through testing methods and user feedback, and use the evaluation results for continuous improvement of the chatbot model.
Practical Application: Explore the potential application of chatbots in the healthcare sector, emphasizing the integration of ASR, and evaluate how this integration impacts user access and interaction to understand the changes in healthcare service delivery.
Scope of the Thesis
The scope of this master's thesis encompasses several key areas. Firstly, the primary focus is the healthcare sector, using datasets gathered from FAQs on hospital websites in Vietnam. Although the methodologies and technologies could be applied in other fields, the emphasis remains on healthcare. This ensures specialized attention to the unique requirements of the healthcare industry but limits the immediate applicability to other domains.
Secondly, in terms of methodology, the thesis employs a Sequence-to-Sequence (Seq2Seq) model built on deep learning techniques such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) cells. The model is structured with an encoder Recurrent Neural Network (RNN), which processes the input sequence, and a decoder RNN, responsible for generating the output sequence. Additionally, Luong's attention mechanism is employed to enhance the model's ability to focus on relevant parts of the input while generating each part of the output.
To streamline the decoding process, a greedy decoding strategy is used, in which the most probable next token is selected at each step. While this combination is effective for complex dialogue modeling, it poses challenges in terms of technical complexity and computational resources.
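As an illustration of greedy decoding, the sketch below (a simplification under stated assumptions, not the thesis code) picks the highest-probability token at each step until an end-of-sequence token is produced; decoder_step, sos_id and eos_id are hypothetical names for the decoder forward function and the special token ids.

import torch

# Minimal greedy decoding sketch (hypothetical decoder_step and token ids).
def greedy_decode(decoder_step, hidden, sos_id, eos_id, max_len=20):
    tokens, current = [], torch.tensor([sos_id])
    for _ in range(max_len):
        logits, hidden = decoder_step(current, hidden)  # logits: (1, vocab_size)
        current = logits.argmax(dim=-1)                 # pick the most probable next token
        if current.item() == eos_id:
            break
        tokens.append(current.item())
    return tokens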
GPT-3, the foundation model employed in this project, leverages the Transformer architecture and represents the forefront of AI technology. However, this choice limits flexibility in accommodating novel approaches and technologies due to its fixed structure.
Lastly, the performance of the text-to-text model is evaluated using a combination of F1-score, precision, recall, and the Bilingual Evaluation Understudy (BLEU), while the speech-to-text model is evaluated using Word Error Rate (WER) and Perplexity (PER), providing a comprehensive and widely accepted set of metrics. However, relying solely on these metrics may not fully capture all performance aspects in practical scenarios, as each metric emphasizes a different aspect of model performance.
Contribution
This master's thesis contributes in three significant ways. First, it compares the performance of a traditional Seq2Seq model enhanced with Luong's attention mechanism to the decoder-only Transformer approach of GPT-3, specifically tailored for Vietnamese. This analysis offers valuable insights into the strengths and limitations of each model architecture for structured dialogue processing and language comprehension, particularly in the healthcare field.
Secondly, it enhances the accuracy of the ASR model by selecting and evaluating the experiments with the best results on the dataset. This contribution aims to address the challenges of speech recognition in noisy environments and varying acoustic conditions, thereby improving the overall performance and usability of the ASR technology.
Lastly, it integrates the text-to-text and speech-to-text models to develop a chatbot supporting voice interactions in Vietnamese in the healthcare domain. This integration expands the chatbot's functionality to accommodate users who prefer or require voice-based interactions, thereby enhancing accessibility and user experience in healthcare services.
Thesis Structure
Thesis "Supporting Voice Communication in Chatbot" includes Five Chapters with the following main content:
Chapter 1, INTRODUCTION: presents an overview of the topic, the reasons for carrying out the research, the practical significance of the problem, and the scope and limitations of the topic. Finally, the tasks and structure of the thesis are described.
Chapter 2, BACKGROUND: synthesizes the academic material most relevant to solving the problem, focusing mainly on deep learning: the basics of HMMs, and the path from Artificial Neural Networks to Recurrent Neural Networks, LSTM, and GRU in the Seq2Seq model and the ASR model. This chapter also provides a general overview of related research that has been and is being conducted, as well as current trends in solving the problem (Luong's attention mechanism, the theory of encoders and decoders, and particularly the architecture of GPT-3). It also discusses and evaluates these methods, as they form an important basis for the student's research during the thesis.
Chapter 3, PROPOSED MODEL: introduces the proposed model for the chatbot, which enhances customer engagement and streamlines business operations, and presents the improvements and the motivation for adopting them. The student outlines a comprehensive experimental methodology to evaluate the impact of these improvements against a chosen baseline model, enabling data-driven analysis and informed decision-making.
Chapter 4, IMPLEMENTATION: covers the selection, training, evaluation, and integration of models to develop a robust and effective voice-to-text chatbot tailored for the Vietnamese language and the healthcare domain.
Chapter 5, CONCLUSION: summarizes the research journey, presenting the results accomplished from the initial hypothesis to the experimental execution. It acknowledges limitations and unresolved issues while recommending pathways for potential future work. Additionally, the thesis includes a Table of Contents, Lists of Figures and Tables, and a list of Acronyms at the outset, with references listed at the end.
BACKGROUND
Hidden Markov Model (HMM)
An HMM is a statistical model. In speech recognition, the HMM provides a statistical representation of the sound of words [1]. The architecture of an HMM for speech recognition is given in Figure 2.1. The HMM consists of a sequence of states.
In an HMM, the current state is hidden and only the output from each state can be observed. Each state in the HMM corresponds to a frame of the audio input. The model parameters estimated in acoustic training are θ = [{a_ij}, {b_j(·)}], where {a_ij} are the transition probabilities and {b_j(·)} the output observation distributions. The transition probability a_ij is the probability of moving from state i to state j.
Figure 2.1 HMM-based phone model [1]
An important feature of the HMM is that the self-loops a_ii make it possible to model the varying duration of a phone. When a transition is made into a new state of the HMM, a feature vector is generated from the distribution associated with that particular state. The first and last states of the HMM are non-emitting states; in Figure 2.1, s1 is the entry state and s5 is the exit state. They serve as the entrance and exit of the model and simplify the concatenation of HMM phone models to form words.
Deep Neural Networks
DNNs (Deep Neural Networks) provide an alternative to Gaussian mixture models in speech recognition. They comprise multiple hidden layers between the input and output layers. These networks employ nodes with weighted connections and activation functions to calculate outputs. Each node's input is derived from the preceding layer, enabling complex and hierarchical learning for speech recognition tasks.
For a DNN, each unit j in a hidden layer computes its total input as

x_j = b_j + Σ_i y_i w_ij

where b_j is the bias of unit j, i is an index over units in the layer below, w_ij is the weight on the connection to unit j from unit i in the layer below, and y_i is the output of unit i. The output passed to the layer above is then calculated by applying an activation function, typically the logistic function y_j = 1 / (1 + e^(−x_j)).
Deep Neural Networks utilize these hidden layers to model intricate, non-linear relationships in the data. For multiclass classification, output unit j uses a softmax function to convert its input x_j into a probability. In particular, in Kaldi the softmax function is used to estimate posterior probabilities for the Hidden Markov Model:
P(s | o_ut) = exp{a_ut(s)} / Σ_{s'} exp{a_ut(s')}

where o_ut denotes the observation at time t in utterance u and a_ut(s) is the activation at the output layer corresponding to state s [3].
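For clarity, a minimal NumPy sketch of the softmax computation described above is shown below; the activation values are made up purely for illustration.

import numpy as np

# Softmax over output-layer activations a_ut(s), yielding state posteriors P(s | o_ut).
def softmax(activations):
    z = activations - np.max(activations)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

a_ut = np.array([2.0, 1.0, 0.1])             # example activations for three HMM states
print(softmax(a_ut))                          # approx. [0.659 0.242 0.099], sums to 1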
The goal of training is to optimize an objective function and update the weights of the internal nodes based on the error signal propagated through the model. An important training parameter is the learning rate: the larger the learning rate, the faster the training proceeds, but the less stable and accurate it may become.
Artificial Neural Networks
Basically, an ANN is a computational model built on the structure and function of biological neural networks. Since the structure of an ANN is shaped by the information flowing through it, the network changes depending on its inputs and outputs. We can think of an ANN as a nonlinear statistical data-modeling tool: it captures complex relationships between inputs and outputs, and as a result many different network configurations are possible.
ANNs take their ideas from how the human brain works, namely by making the right connections, and hardware analogies use silicon and wires in place of living neurons and dendrites. In the human body, the brain is composed of roughly 86 billion neurons, each connected to thousands of other cells through axons. Because humans receive many different inputs from the senses, the body has many dendrites to help transmit this information; they generate electrical impulses that carry information through the neural network. The same holds for an ANN: when different problems need to be handled, neurons send messages to other neurons.
We can therefore say that an ANN consists of many internal nodes that mimic the biological neurons of the human brain. The network connects these neurons by links, and they interact with each other. Nodes in an ANN receive input data and perform simple operations on it; the results of these operations are then passed on to other neurons. The output at each node is called its activation value or node value. Every link in the network is associated with a weight, and the network is able to learn by changing these weight values. Figure 2.4 is an illustration of a simple ANN.
Figure 2.4 A Simple Example of the Structure of a Neural Network
The McCulloch-Pitts node, a fundamental neural network component, performs a simple threshold computation: it sums its inputs and outputs 1 if the sum exceeds a threshold, and 0 otherwise. Despite more intricate node models, the McCulloch-Pitts neuron remains a crucial starting point for understanding the basics of neural networks.
Figure 2.5 The McCulloch-Pitts Neuron [4]
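A minimal Python sketch of a McCulloch-Pitts style node is given below; the AND-gate weights and threshold are illustrative assumptions.

# Minimal McCulloch-Pitts style node: weighted sum followed by a hard threshold.
def mcculloch_pitts(inputs, weights, threshold):
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# Example: a 2-input AND gate (fires only when both inputs are 1).
print(mcculloch_pitts([1, 1], [1, 1], threshold=2))  # 1
print(mcculloch_pitts([1, 0], [1, 1], threshold=2))  # 0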
Along with the nodes, a neural network is also formed by so-called "edges", which carry the network's weights. What they do is essentially multiply the output of a node by their weight before transferring it to the next neuron to which they are connected. By updating these weights according to the output generated for an example, researchers can teach the network to distinguish which input data should produce which output. The procedure for updating the weights is called backpropagation and is described in a later section.
Convolutional Neural Network
CNNs are among the most advanced deep learning models; they allow researchers to build intelligent systems with very high accuracy. Figure 2.6 shows a basic architecture of a CNN network.
Figure 2.6 The architecture of CNN [57]
A convolution is a sliding-window operation applied to a matrix. Convolutional layers have learnable parameters that are adjusted to extract the most accurate information without hand-crafted feature selection. The convolution itself is an element-wise multiplication of matrix entries; the sliding window, also known as the kernel, filter, or feature detector, is a small matrix.
A CNN compares images piece by piece, and these pieces are called features. Instead of matching whole images, the CNN looks for similarity by finding raw features that match between two images. Each feature can be thought of as a mini-image, a small two-dimensional array, corresponding to a certain aspect of the image, and the features can fit together. The following are the basic layers of a CNN (a minimal PyTorch sketch of these layers follows the list).
Convolutional layer. This is the most important layer of a CNN and performs most of the computation. The important elements of a convolutional layer are the stride, padding, filter maps, and feature maps.
Stride is the number of pixels by which the filter is moved across the input, from left to right.
Padding consists of zero values added around the input.
A CNN uses filters applied to regions of the image. These filter maps are three-dimensional matrices of numbers, and these numbers are the learnable parameters.
The feature map represents the result of each filter map scanning over the input; the computation is performed at each position of the scan.
Rectified Linear Unit (ReLU) layer. The ReLU layer applies an activation function, which simulates the firing rate of a neuron along its axon. Common activation functions include ReLU, Leaky ReLU, Tanh, Sigmoid, and Maxout. Currently the ReLU function is the most widely used, mainly because it makes computation faster, although in practice one needs to tune the learning rate and watch for "dead" units. The ReLU layer is applied after the filter maps are computed, applying the ReLU function to the values of the filter map.
Pooling layer. Pooling layers are intermediate layers in deep neural networks used to reduce the size of the feature maps. They are typically placed between convolutional layers and come in two main types: max pooling and average pooling. Max pooling selects the maximum value from a defined area of the feature map, while average pooling calculates the average value within that area. Both methods reduce the number of parameters in the network, making it more efficient and less prone to overfitting.
Fully connected layer. This layer produces the result after the image has passed through the convolutional and pooling layers. At this point the model has extracted the information in the image, and fully connected layers are used to combine this information and produce the output. The fully connected layer turns the feature maps into a single vector, and its outputs act like votes that are evaluated to select the highest-quality prediction.
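The following minimal PyTorch sketch strings the layers described above together (convolution with stride and padding, ReLU, max pooling, and a fully connected layer); the input size, channel counts and kernel size are illustrative assumptions rather than the configuration used in the thesis.

import torch
import torch.nn as nn

# Illustrative CNN building blocks: convolution (filters, stride, padding),
# ReLU activation, max pooling, and a fully connected layer.
x = torch.randn(1, 1, 28, 28)                  # (batch, channels, height, width)
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, stride=1, padding=1)
relu = nn.ReLU()
pool = nn.MaxPool2d(kernel_size=2)             # halves the spatial size of the feature map
fc   = nn.Linear(8 * 14 * 14, 10)              # fully connected layer producing 10 scores

feature_map = pool(relu(conv(x)))              # shape: (1, 8, 14, 14)
scores = fc(feature_map.flatten(start_dim=1))  # shape: (1, 10)
print(feature_map.shape, scores.shape)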
Multilayer Perceptron
The name backpropagation comes from the term used by Rosenblatt (1962) for attempts to generalize the perceptron learning algorithm to several layers, even though none of the many attempts in the 1960s and 1970s was particularly successful [40]. The backpropagation algorithm is one of the most important tools of artificial neural networks: it is the part that handles the training of the network, i.e., where the network learns. During this process the network updates the weights of all edges so that it produces the correct output for a particular input.
Stochastic Gradient Descent (SGD) is used to train neural networks by iteratively minimizing a cost function. The loss function typically follows the principle of maximum likelihood, and the goal is to reduce the cross-entropy error between the target output and the predicted output. The gradient update equation is

θ ← θ − η ∇_θ L_n

where η denotes the learning rate and ∇_θ L_n the gradient of the loss function with respect to the parameters θ, calculated on the n-th batch of the dataset.
Many variants of SGD have been developed in recent years that allow better and faster convergence. For MLP parameters, the gradient of the objective function is calculated using the backpropagation algorithm [41]. At each layer the algorithm applies the chain rule to obtain partial derivatives with respect to the parameters. The backpropagation algorithm is explained here in a non-vectorized manner, i.e., the derivatives are calculated for scalars. First, a forward pass is performed by calculating the activation of each node and predicting the output ŷ. Next, the loss L is calculated from the target output y and the predicted output ŷ. The backward pass is then initiated by taking the derivatives of the loss function [42].
Denoting by a_j^l the activation of the j-th neuron in layer l and by z_j^l its weighted input, the error of the j-th neuron in the output layer L is

δ_j^L = (∂L/∂a_j^L) σ'(z_j^L)

According to the chain rule, the error δ_j^l of a neuron in a hidden layer l can be written as

δ_j^l = Σ_k w_kj^(l+1) δ_k^(l+1) σ'(z_j^l)

Hence the errors of the hidden layers' neurons can be obtained, and the gradients with respect to the model weights are calculated as

∂L/∂w_jk^l = a_k^(l−1) δ_j^l
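A minimal NumPy sketch of the forward pass, backward pass and SGD update for a one-hidden-layer MLP with sigmoid units and squared error is shown below; the layer sizes and learning rate are illustrative assumptions.

import numpy as np

# Minimal backpropagation + SGD sketch for a 1-hidden-layer MLP (sigmoid units, squared error).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x, y = rng.normal(size=(3, 1)), np.array([[1.0]])          # one training example
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 4))  # weights of the two layers
eta = 0.1                                                   # learning rate

# Forward pass: compute activations layer by layer.
z1 = W1 @ x;  a1 = sigmoid(z1)
z2 = W2 @ a1; a2 = sigmoid(z2)

# Backward pass: output error delta2, then hidden error delta1 via the chain rule.
delta2 = (a2 - y) * a2 * (1 - a2)
delta1 = (W2.T @ delta2) * a1 * (1 - a1)

# Gradient update (theta <- theta - eta * gradient).
W2 -= eta * delta2 @ a1.T
W1 -= eta * delta1 @ x.T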
A problem arises when many layers use certain activation functions, such as the sigmoid function. This function squashes a large input space into a small output range between 0 and 1, which means that a significant change in the input produces only a small change in the output; thus its derivative becomes small.
Figure 2.7 Sigmoid Function and its Derivative by [43]
This is caused by the chain rule used in backpropagation, which computes gradients by moving layer by layer from the last layer back to the first. According to the chain rule, each subsequent derivative is multiplied by the previously calculated value. Therefore, when there are n hidden layers using sigmoid activation functions, n small derivatives are multiplied together, and the gradient value decreases exponentially as the backpropagation algorithm progresses toward the early layers [43].
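The effect can be illustrated numerically: since the sigmoid derivative never exceeds 0.25, the product of n such factors shrinks exponentially with depth, as the small Python sketch below shows.

# The sigmoid derivative is at most 0.25, so the product of n such factors
# shrinks exponentially as the error is propagated back through n layers.
sigmoid_grad_max = 0.25
for n in (1, 5, 10, 20):
    print(n, sigmoid_grad_max ** n)   # 0.25, ~9.8e-04, ~9.5e-07, ~9.1e-13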
There are several ways to avoid this problem.
The first solution is to use another activation function, such as ReLU, whose derivative does not become vanishingly small.
The next solution is residual networks (ResNets). They provide residual connections straight to later layers, effectively bypassing the activation functions. This keeps the derivatives from shrinking and makes it possible to train much deeper networks [44].
The last one is batch normalization. The batch normalization method normalizes the input to a predefined range in which the sigmoid derivative is not small.
Regularization
It is important to consider the problem of overfitting when building neural networks or other machine learning algorithms. Overfitting occurs when the model begins to learn features that are too specific to the training set: the model learns not only the general rules that lead from input to output, but also additional rules that describe the training set yet are not necessarily valid in general. This leads to a reduced training error but an increased evaluation error, so the model performs worse on unseen data because of the overly specific rules it has learned from the training set. The opposite phenomenon, when the model learns rules that are too general, is called underfitting. An illustration of both phenomena can be found in Figure 2.8.
Figure 2.8 Underfitting, Optimal and Overfitting
Dropout is a technique used in neural network training that randomly skips units (network nodes) during each training pass. The skipped units are not considered in the forward and backward propagation of that pass. The probability of retaining a node is p, while (1 − p) is the probability of eliminating it. Dropout effectively combats overfitting by preventing excessive interdependence between units in fully connected layers, enhancing the generalization capability of the network.
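A minimal PyTorch sketch of dropout is shown below; note that PyTorch's nn.Dropout takes the probability of dropping a unit (here 0.2, i.e. a retention probability of 0.8), and the tensor of ones is only for illustration.

import torch
import torch.nn as nn

# Dropout with retention probability p = 0.8, i.e. each unit is dropped with probability 0.2.
drop = nn.Dropout(p=0.2)        # PyTorch's argument is the probability of dropping a unit
x = torch.ones(1, 10)

drop.train()                    # during training, surviving units are scaled by 1 / 0.8
print(drop(x))                  # zeros appear at random positions; kept values become 1.25

drop.eval()                     # at evaluation time dropout is disabled
print(drop(x))                  # all ones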
L2 regularization, or weight decay, is a regularization technique applied to the weights of a neural network. It adds a penalty on the weight norm to the loss function, so the network minimizes both the original loss and the penalty term. Weight decay is therefore simply an addition to the loss function of the network and can be described by the following equation:

L_new(w) = L_original(w) + λ wᵀw

where λ is a value that determines the strength of the penalty and L(w) is the chosen loss function. If λ is very small, weight decay will not help regularize the network; in contrast, if λ is too large, the penalty dominates and the network will only aim to keep its weights close to 0. This effect can be seen in Figure 2.9.
Figure 2.9 Underfitting, Optimal weight decay and Overfitting
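A minimal PyTorch sketch of the weight-decay penalty is given below; the λ value, layer sizes and random data are illustrative assumptions, and the optimizer's weight_decay argument is shown as a common shortcut for the same idea.

import torch
import torch.nn as nn

# L2 regularization (weight decay): L_new(w) = L_original(w) + lambda * w^T w.
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
lam = 1e-3

x, y = torch.randn(4, 10), torch.randn(4, 1)
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
loss = criterion(model(x), y) + lam * l2_penalty          # explicit penalty on the weight norm

# Alternatively, many optimizers implement weight decay directly via the weight_decay argument:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=lam)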
Recurrent Neural Networks
RNNs have transformed customer service by enabling chatbots that engage in natural dialogue. These networks recognize and remember patterns in sequences of words, enabling them to parse customer queries and maintain conversational context. RNNs can be trained on vast datasets of customer interactions to understand diverse requests and issues. However, they may struggle to maintain context over extended dialogues due to their limitations with longer sequences.
Figure 2.10 The Recurrent Neural Network [5]
The image shows an RNN and how it unfolds through time. At the heart of an RNN is a loop that allows information to persist; in the diagram, the loop is unrolled to show the entire sequence of operations over time. It works as follows:
Input (x): At each time step t, the RNN takes an input x_t and the previous hidden state s_{t−1}.
Hidden State (s): The hidden state s_t at time t is updated by applying a weight matrix U to the input x_t and another weight matrix W to the previous hidden state s_{t−1}. The function of the hidden state is to capture and carry forward information through time.
Output (o): The output o_t at time t is then computed from the current hidden state s_t and a weight matrix V. In many applications this output is passed through a function such as softmax to make decisions or predictions based on the learned patterns.
RNNs employ three sets of shared weights: U (input to hidden), W (hidden to hidden), and V (hidden to output). These weights are shared across all time steps, enabling the RNN to generalize across sequences of varying lengths. This shared-weight architecture allows the RNN to learn patterns and relationships common to different sequences, making it suitable for processing variable-length inputs.
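The computation described above can be sketched in a few lines of NumPy; the dimensions and random weights below are illustrative assumptions, not the thesis configuration.

import numpy as np

# One step of a vanilla RNN: s_t = tanh(U x_t + W s_{t-1}), o_t = softmax(V s_t).
# U, W and V are shared across all time steps.
def rnn_step(x_t, s_prev, U, W, V):
    s_t = np.tanh(U @ x_t + W @ s_prev)
    logits = V @ s_t
    o_t = np.exp(logits - logits.max())
    o_t /= o_t.sum()
    return s_t, o_t

rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(16, 8)), rng.normal(size=(16, 16)), rng.normal(size=(5, 16))
s = np.zeros(16)
for x_t in rng.normal(size=(3, 8)):        # a toy input sequence of length 3
    s, o = rnn_step(x_t, s, U, W, V)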
RNNs capitalize on their internal state, enhancing their performance in processing sequences of varying lengths. Despite this strength, their training can be hindered by challenges such as vanishing or exploding gradients, which limit them when dealing with long sequences and make it difficult to sustain context over longer durations. To overcome these hurdles, advanced RNN architectures such as LSTM and GRU have emerged; they are better at capturing long-term dependencies and maintaining context over prolonged conversations, which is invaluable in customer service interactions.
Long Short-Term Memory
LSTM networks, a specialized form of RNNs, are crafted to capture long-term dependencies within data sequences effectively. The architecture of an LSTM features a cell state, akin to a conveyor belt, which traverses the length of the network with minimal linear interaction, ensuring the preservation of information over time. This cell state is modulated by a series of gates: the forget gate uses a sigmoid function to decide which information the cell state should discard, the input gate decides which values to update and integrates new candidate values created by a tanh layer, and the output gate determines the next hidden state by filtering the cell state through a tanh layer and applying the sigmoid function's output. The hidden state, responsible for predictions, is updated with information from the cell state, giving the LSTM the capability to maintain relevance in sequential data over long periods. This quality is particularly beneficial for language modeling, where understanding context from extensive data sequences is paramount. These cells have three gates that control the flow of information, as described below [6]; a minimal usage sketch follows the list.
Input Gate: Decides the degree to which new input should affect the memory
Forget Gate: Determines what portions of the existing memory should be forgotten
Output Gate: Selects the parts of the memory to be used in the output
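A minimal PyTorch usage sketch of an LSTM layer is given below; the input, hidden and batch sizes are illustrative assumptions, and the gate computations are handled internally by nn.LSTM.

import torch
import torch.nn as nn

# Minimal LSTM usage sketch: the input, forget and output gates are handled internally.
lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
x = torch.randn(2, 10, 32)              # (batch, sequence length, feature size)
outputs, (h_n, c_n) = lstm(x)           # outputs: (2, 10, 64); h_n, c_n: (1, 2, 64)
print(outputs.shape, h_n.shape, c_n.shape)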
2.8.1 The Long-Term Dependency Problem
A key feature of RNNs is the idea of using preceding information to make predictions about the present, similar to how one uses previous scenes of a movie to understand the current scene. If RNNs could do this effectively they would be extremely useful, but whether they can depends on the specific case. Sometimes revisiting the most recent information is sufficient to understand the current situation. For example, in the sentence "Cách điều trị hen suyễn" ("How to treat asthma"), once we have read "Cách điều trị hen" it is enough to predict that the next word will be "suyễn". In this scenario the distance to the information needed for the prediction is short, so RNNs are entirely capable of learning it.
Figure 2.12 RNN and Short-Term Dependencies [7]
But in many situations we need more context to make an inference, for example when predicting the last words of the passage "Tôi bị bệnh hen suyễn… hen suyễn dị ứng, từ phấn hoa" ("I have asthma… allergic asthma, from pollen"). The most recent words tell us that what follows will describe some disease, but it is impossible to know exactly which one; to understand it, we need the additional, earlier context of the passage. Clearly the distance to the relevant information can become quite large, and unfortunately, as that distance grows, RNNs begin to struggle to remember and learn.
Theoretically, RNNs are fully capable of handling such "long-term dependencies," meaning that the current output can depend on a long sequence of previous information; in practice, however, RNNs lack this capability. This issue was highlighted by Hochreiter and by Bengio and colleagues, among others, as a fundamental challenge for the RNN model.
Figure 2.13 RNN and Long-Term Dependencies [7]
LSTMs are a special kind of RNN capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber and have since been refined and popularized by many researchers in the field. They are highly effective across a variety of problems, which has led to their widespread adoption. LSTMs are designed to avoid the long-term dependency problem: their ability to remember information for long periods is intrinsic to their design, so they do not require special training to acquire this capability.
All recurrent networks have the form of a chain of repeating neural network modules. In standard RNNs, these modules have a very simple structure, typically a single tanh layer.
Figure 2.14 The Repeating Modules in an RNN Contains One Layer [7]
Similarly, the LSTM has a chain architecture, but its modules have a different structure from those of standard RNNs: instead of a single neural network layer, they have four layers that interact in a very particular way.
GRU
The GRU is a newer generation of recurrent unit and is quite similar to the LSTM [8]. GRUs are designed to mitigate the vanishing gradient problem that can occur in standard RNNs. They do this by using gating mechanisms to control the flow of information; the architecture consists of two gates, a reset gate and an update gate.
Reset Gate: This gate determines how much of the past information should be forgotten. It can be thought of as a way to decide how much past information to discard, helping the model drop irrelevant information.
Update Gate: This gate decides how much of the past information will carry over to the current state. It is akin to a combination of the forget and input gates in an LSTM, allowing the model to determine how much of the past should influence the current state.
Figure 2.16 The Architecture of GRU [9]
The GRU uses the current input and the previous hidden state to regulate its operations. The reset gate determines which past information to discard, while the update gate controls the proportion of the previous hidden state to retain. By merging old information with the new input, the GRU creates a candidate hidden state; the final hidden state for the current time step is then formed by combining the candidate hidden state with the old state, modulated by the update gate.
This structure allows GRUs to keep relevant backpropagation error signals alive, making them capable of learning over many time steps, which is particularly useful for tasks requiring an understanding of long-term dependencies, such as language modeling and time-series analysis. GRUs offer a simpler and more computationally efficient alternative to LSTMs while providing similar benefits.
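A minimal NumPy sketch of a single GRU step, following the gate description above (biases omitted for brevity, dimensions chosen arbitrarily), is shown below.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One GRU step: reset gate r_t, update gate z_t, candidate state h_tilde, new hidden state h_t.
def gru_step(x_t, h_prev, Wr, Ur, Wz, Uz, Wh, Uh):
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)                 # what to forget from the past
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)                 # how much of the past to carry over
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r_t * h_prev))     # candidate hidden state
    return (1 - z_t) * h_prev + z_t * h_tilde             # blend old state and candidate

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
Wr, Wz, Wh = (rng.normal(size=(d_h, d_in)) for _ in range(3))
Ur, Uz, Uh = (rng.normal(size=(d_h, d_h)) for _ in range(3))
h = np.zeros(d_h)
h = gru_step(rng.normal(size=d_in), h, Wr, Ur, Wz, Uz, Wh, Uh)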
Word Embedding Model
Word Embedding is a general term for language models and feature-learning methods in Natural Language Processing (NLP) in which words or phrases are mapped to numerical vectors (usually of real numbers). It plays a crucial role in most machine learning and deep learning architectures that process text input, since these systems can only understand inputs in numerical form, enabling them to perform tasks such as classification and regression. Word embeddings are primarily categorized into two types:
Frequency-based Embedding [13] relies on the frequency of word occurrences to create word vectors, with the three most common types being:
Count Vector
tf-idf Vector (Term Frequency-Inverse Document Frequency)
Co-occurrence Matrix
These methods enable the capture of word significance and relationships based on how often they appear, offering a foundational approach to representing text in numerical form for various NLP tasks
The Count Vector is the simplest form of frequency-based embedding. Assuming there are D documents d_1, d_2, …, d_D and N is the size of the vocabulary, the vector representation of a word is an integer vector of length D, where the element at position i is the frequency with which that word appears in document d_i. In some cases, words with low frequency may be omitted or the vector entries altered (for example, replacing the frequency with a binary value indicating the word's presence), depending on the specific goal.
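As an illustration, the scikit-learn sketch below builds count vectors for two toy documents; the representation of a single word, as described above, is the corresponding column of the resulting document-term matrix (length D).

from sklearn.feature_extraction.text import CountVectorizer

# Count-vector sketch: each row is a document, each column the frequency of a vocabulary word.
docs = ["Dương chỉ không mập", "Dương chỉ gầy"]      # D = 2 toy documents
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(counts.toarray())        # shape (D, N); a word's count vector is the matching column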
Unlike count vectors, which only consider the frequency of words within a single document, tf-idf vectors take into account the frequency of words across the entire dataset, which gives them higher discriminative power. A tf-idf vector is a real-valued vector, also of length D, where D is the number of documents. It is calculated by multiplying two parts, tf and idf. For a given word and document i:

tf_i = n_i / N_i

where
i ranges from 1 to D, with D the number of documents
n_i is the number of occurrences of the word in document i
N_i is the total number of words in document i

idf = log_2(D / d)

where
D is the total number of documents in the dataset
d is the number of documents in which the word appears

tfidf_i = tf_i × idf [10]

tf-idf vectors have higher discriminative power than count vectors primarily because they are adjusted by the idf weight: if a word appears in many documents, its idf value decreases, diminishing its impact on the final result. However, a limitation of both methods is that they consider only the frequency of a word's occurrence, which often fails to capture contextual meaning. Co-occurrence matrices partially address this issue [11]. They preserve semantic relationships between words and are constructed from the frequency with which word pairs appear within a Context Window. A Context Window is defined by its size and direction, allowing the matrix to capture more nuanced word associations based on proximity and co-occurrence in the text. Taking the sentence "Dương chỉ không mập Dương chỉ gầy" as an example with a context window of size 1, we obtain the co-occurrence matrix below.
Table 2.1 Example of Co-occurrence Matrix
        Dương  chỉ  không  mập  gầy
Dương     0     2     0     1    0
chỉ       2     0     1     0    1
không     0     1     0     1    0
mập       1     0     1     0    0
gầy       0     1     0     0    0
In practice, due to the large vocabulary size, we often omit unnecessary words (such as stop words) or apply Singular Value Decomposition (SVD) to reduce the dimensionality of the word vectors. This clarifies the word representations while saving the memory used to store the co-occurrence matrix, which can be quite large.
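The small Python sketch below reconstructs the co-occurrence counts of Table 2.1 from the example sentence using a context window of size 1.

from collections import defaultdict

# Building the co-occurrence counts of Table 2.1 with a context window of size 1.
tokens = "Dương chỉ không mập Dương chỉ gầy".split()
window = 1
cooc = defaultdict(int)
for i, w in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if i != j:
            cooc[(w, tokens[j])] += 1

print(cooc[("Dương", "chỉ")])   # 2
print(cooc[("chỉ", "gầy")])     # 1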
GloVe, introduced in 2014, is one of the newer methods for building word vectors and is essentially based on co-occurrence matrices. GloVe fundamentally operates on probabilities, and its development stems from the ratio

P_ik / P_jk, with P_ik = X_ik / X_i

where
X_ik is the number of times word k appears in the context of word i
X_i is the total number of times word i appears in the context of all other words
The fundamental concept of GloVe is that a word's meaning can be discerned through its co-occurrence probabilities with other words. The model posits that the ratio of these co-occurrence probabilities can reveal deep semantic relationships between words. For example, words like "cây kem" ("ice cream") and "lạnh" ("cold") would have a high co-occurrence probability, indicating a strong semantic connection, whereas "cây kem" and a less related word like "âm nhạc" ("music") would show a lower co-occurrence probability. GloVe leverages this principle by transforming word co-occurrence counts into a matrix and then applying dimensionality reduction, effectively condensing complex semantic relationships into a compact vector space. The outcome is a set of vectors where the distances and directions between them can be used to deduce the similarities and differences in meaning among words.
This leads to the GloVe cost function

J = Σ_{i,j} f(X_ij) (W_iᵀ W_j + b_i + b_j − log X_ij)²

where
W_i, W_j are the word vectors being learned
b_i, b_j are the corresponding biases (added during the simplification and optimization steps)
X_ij is the entry corresponding to the word pair i, j in the co-occurrence matrix

The function f is known as the weighting function, introduced to mitigate the influence of word pairs that appear too frequently. This function satisfies three properties:
f(0) = 0
It is a non-decreasing function
It takes small values when x is very large
In practice, many functions satisfy these properties, but the following one is often chosen:

f(x) = (x / x_max)^α if x < x_max, and f(x) = 1 otherwise, with α = 3/4.

The cost function J can be minimized to find the word vectors W_i, W_j by various methods, the most standard being Gradient Descent. This technique involves iteratively adjusting the parameters
W_i, W_j in the direction that most reduces the cost function, gradually converging to the set of parameters that minimizes J. Gradient Descent is widely used due to its effectiveness and simplicity in handling large datasets, which is common in natural language processing tasks such as training GloVe models [13].
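A minimal Python sketch of the GloVe weighting function is given below; x_max = 100 and α = 3/4 are the values commonly used in the original GloVe work and are assumptions here.

# GloVe weighting function f(x): dampens very frequent co-occurrence pairs.
def glove_weight(x, x_max=100.0, alpha=0.75):
    return (x / x_max) ** alpha if x < x_max else 1.0

for x in (1, 10, 100, 1000):
    print(x, glove_weight(x))    # ~0.0316, ~0.178, 1.0, 1.0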
Prediction-based Embedding uses predictive models to construct word vectors, with Word2vec as a prominent example. Word2vec relies on the Continuous Bag of Words (CBOW) and Skip-gram models, each built on a three-layer neural network architecture. These models learn weights that serve as the word vector representations, improving word representation for various tasks.
The CBOW model operates on the principle of predicting the probability of a word given its context, which can include one or more words. The input to the model consists of one or more one-hot vectors of the context words, each of length V, where V is the size of the vocabulary, and the output is a probability vector, also of length V, for the target (missing) word. The hidden layer has length N, where N is also the size of the word representation vector. Below is the CBOW model with a single-word context.
Figure 2.17 The CBOW Model with One Input [14]
In Word2vec, the training dataset for models like CBOW is pivotal. The model's input encompasses two key elements: one-hot vectors representing the context and the one-hot vector of the target word.
Context One-hot Vectors: These are binary vectors representing the context words surrounding the target word. Each context word is represented by a one-hot vector in which only the index corresponding to the word in the vocabulary is set to 1 and all other indices are set to 0. The size of these vectors equals the size of the vocabulary (V).
One-hot Vector of the Target Word: Similar to the context vectors, the target word (the word the model aims to predict) is also represented as a one-hot vector. In the CBOW model, the goal is to predict this target word based on the context provided by the context words.
During training, the model takes these one-hot vectors as input and learns to predict the target word from its context. This learning process involves adjusting the weights of the neural network so that the output probabilities match the actual distribution of words in the given context. Over time, this leads to word vectors that capture semantic and syntactic relationships between words, making Word2vec a powerful tool for natural language processing tasks.
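As a usage illustration, the gensim sketch below trains a tiny CBOW model (sg=0) on two toy Vietnamese sentences taken from the earlier examples; the vector size, window and epoch count are illustrative assumptions.

from gensim.models import Word2Vec

# Training a small CBOW model (sg=0 selects CBOW; sg=1 would select Skip-gram).
sentences = [["cách", "điều", "trị", "hen", "suyễn"],
             ["hen", "suyễn", "dị", "ứng", "từ", "phấn", "hoa"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)
print(model.wv["suyễn"].shape)          # (50,) -- the learned word vector
print(model.wv.most_similar("hen"))     # nearest words in the embedding space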
The History of Chatbots
In 1966, computer scientist Joseph Weizenbaum created the first chatbot at the MIT AI lab and named it ELIZA. ELIZA simulated conversation using basic pattern matching and substitution. DOCTOR, the most famous ELIZA script, let the program interact with the user in the manner of a psychotherapist. ELIZA's responses were so convincing that some test users trusted the program and developed an emotional attachment to it, forgetting that they were only chatting with a chatbot. In fact, ELIZA was programmed to react to keywords in the user's natural language input and then turn the conversation back to the user.
Figure 2.19 ELIZA – The First Chatbot in the World, Created at MIT by Joseph Weizenbaum
Another chatbot, named PARRY, was developed by Stanford psychiatrist Kenneth Colby in 1972. It was designed to simulate, as accurately as possible, the thought patterns of a person with paranoid schizophrenia. In September 1972, ELIZA and PARRY met, and Figure 2.20 shows part of their interesting conversation.
Figure 2.20 The Conversation between ELIZA and PARRY [19]
Despite their widespread adoption, chatbots have faced criticism for their limitations. While they offer round-the-clock customer service and revenue-generating opportunities, their intelligence often falls short of user expectations. Users want chatbots that mimic human conversation, but such expectations are frequently unrealistic. Like children, chatbots require time and training to enhance their capabilities through machine learning algorithms, and they are therefore under continuous development to improve their performance and meet user demands.
In the same vein as customer service representatives, chatbots also require a learning process. Newly hired customer service agents are not expected to be proficient from the outset; they must be trained on products, services, and processes, and through continuous learning and practice they gradually become competent agents. Similarly, a learning process enables chatbots to gradually acquire knowledge about the brand's products, services, and processes, which is essential for providing effective and personalized customer service.
Regardless of what businesses use chatbots for today, the first step is to analyze chat data from previous customers in order to create sample chatbot scenarios, and then to experiment continuously with new customers. This helps chatbots respond to requests and answer more customer questions. The chatbot is one of the hottest areas of AI today and will remain so for years to come; chatbot AI will become more and more widespread in daily life and increasingly invisible inside online services. Looking back at the evolution of AI chatbots over the past 60 years, the biggest success is that algorithms have opened up a new world of "artificial intelligence". In the past few years we have grown accustomed to chatbot scenarios and have come to regard them as an advanced technology solution.
Using Luong’s Attention for Sequence 2 Sequence Model
The Seq2Seq model operates on the principle of transforming sequences via two RNNs, as identified by Sutskever and colleagues. This design enables the chatbot to maintain a dialogue flow by considering the context of the conversation.
Figure 2.21 Seq2Seq Model with GRUs [15]
Encoder: The encoder is a series of GRU (Gated Recurrent Unit) layers. Each GRU processes an input element X_i (such as a word in a sentence), captures its information, and passes it along to the next GRU. The final GRU in the encoder sequence outputs a context vector representing the information of the entire input sequence. This context vector aims to encapsulate the semantic meaning of the input sequence in a fixed-length format [16].
Context Vector: This is the encoder's final hidden state and serves as the condensed representation of the input sequence. It is used to initialize the decoder and theoretically contains all the semantic information needed to generate the output sequence.
Decoder: Mirroring the encoder, the decoder is also a series of GRU layers. It takes the context vector and begins the sequence generation process: starting with an initial input (typically a start-of-sequence token), the decoder GRU predicts the next element in the sequence (e.g., the next word in a sentence) [16].
Predictions: At each time step, the decoder outputs a prediction for the next sequence element along with a hidden state, which is passed on to the next GRU in the decoder. This process continues until an end-of-sequence token is generated or a predefined maximum length is reached.
The encoder RNN processes the input sentence token by token, generating an "output" vector and a "hidden state" vector at each time step. The hidden state vector is passed to the next time step, while the output vector is stored. This encoding step transforms the context observed at each point in the sequence into a collection of points in a high-dimensional space, which the decoder will later use to produce a meaningful output for the task at hand.
The encoder employs a multi-layered GRU, originally introduced by Cho et al. in 2014 [8]. In this model, a bidirectional variant of the GRU is used, which combines two independent RNNs: one processes the input sequence in its standard order, while the other processes it in reverse. The outputs of the two networks are merged at each time step. By employing a bidirectional GRU, the model encodes information from both past and future context, enhancing its understanding of the input sequence.
Input Sequence: The image represents an input sequence, which could be a sequence of words in a sentence or any sequential data
Embedding Layer: Typically, input tokens are converted into numerical vectors using an embedding layer. This step transforms each token into a dense vector representation suitable for neural network processing.
Bidirectional GRU: Instead of a single RNN, a bidirectional GRU employs two independent GRU layers. One processes the input sequence in the standard left-to-right order (the forward direction), while the other processes it in reverse (the backward direction). Each GRU layer can be pictured as a stack of cells connected horizontally (a minimal PyTorch sketch of such an encoder follows this list).
Forward GRU: This layer processes the sequence from left to right. At each time step it takes an input token and its hidden state from the previous time step in the forward direction, computes a new hidden state, and passes it to the next cell. This continues until the entire sequence is processed.
Backward GRU: In contrast, the backward GRU processes the sequence in reverse order, from right to left. It follows the same principles as the forward GRU but operates in the opposite direction.
Concatenation: After both the forward and backward GRUs have processed the input sequence, their hidden states at each time step are concatenated, combining information from both past and future context. This results in a new representation for each time step that encodes bidirectional context.
Output Layer: The concatenated hidden states can then be fed into an output layer, which typically includes one or more fully connected (dense) layers, depending on the specific task. The output layer's role is to make predictions or classifications based on the enriched representations obtained from the bidirectional GRU layers.
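The minimal PyTorch sketch below puts the pieces of this list together into a bidirectional GRU encoder; the vocabulary size, embedding and hidden dimensions are illustrative assumptions, not the thesis configuration.

import torch
import torch.nn as nn

# Minimal bidirectional GRU encoder sketch (vocabulary size and dimensions are illustrative).
class Encoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)          # (batch, seq_len, emb_dim)
        outputs, hidden = self.gru(embedded)          # outputs concatenate both directions
        return outputs, hidden                        # outputs: (batch, seq_len, 2 * hidden_dim)

encoder = Encoder()
outputs, hidden = encoder(torch.randint(0, 1000, (2, 7)))   # a batch of 2 sequences of length 7
print(outputs.shape, hidden.shape)                           # (2, 7, 512) and (2, 2, 256)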
The Seq2Seq model employs an encoder-decoder architecture. The encoder generates a series of context vectors capturing the encoded information of the input sequence. In a vanilla Seq2Seq model without attention, however, the decoder relies solely on the final context vector for initialization and token generation. This poses limitations, as the final context vector might not adequately represent all the relevant information from the input, particularly for lengthy sequences, leading to potential information loss.
To address this issue, the attention mechanism introduced by Bahdanau et al. allows the decoder to generate each token while selectively focusing on different parts of the input sequence. The decoder does this by computing a set of attention weights that reflect the importance of each input token for the current decoder output. For each token the decoder generates, it calculates attention weights by comparing the decoder's current hidden state with each of the encoder's context vectors. These attention weights are then applied to the encoder's context vectors to create a weighted combination, resulting in an "attended" context vector specific to the current token being generated.
This attended context vector is then used, along with the decoder's current hidden state, to generate the next token in the sequence. The process repeats token by token until the decoder outputs an end-of-sentence (EOS) token, signaling the end of the output sequence.
Figure 2.23 Attention Mechanism in Seq2Seq Model [20]
Sean Robertson's figure visually illustrates how the encoder's outputs are weighted so that the decoder focuses on the relevant parts of the input sequence. This weighting mechanism improves the accuracy of the output sequence, particularly for long inputs where the relevance of elements varies significantly across output steps.
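A minimal PyTorch sketch of the "dot" scoring variant of Luong-style attention is shown below; the batch size, sequence length and hidden dimension are illustrative assumptions.

import torch
import torch.nn.functional as F

# Luong-style (dot-product) attention sketch: score each encoder output against the
# current decoder hidden state, normalize with softmax, and build the attended context.
def luong_dot_attention(decoder_hidden, encoder_outputs):
    # decoder_hidden: (batch, hidden), encoder_outputs: (batch, seq_len, hidden)
    scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)  # (batch, seq_len)
    weights = F.softmax(scores, dim=1)                                           # attention weights
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)        # (batch, hidden)
    return context, weights

context, weights = luong_dot_attention(torch.randn(2, 256), torch.randn(2, 7, 256))
print(context.shape, weights.shape)    # (2, 256) and (2, 7)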
Automatic Speech Recognition
ASR has been an active research area for over fifty years and is an essential milestone in human-to-human and human-machine communication. Due to the insufficient performance of earlier technologies, ASR long remained an undesirable part of human-machine communication: the lack of computing power did not allow it to cross the usability bar for real users, and other means of communication, such as keyboards and mice, were significantly superior to speech in most aspects of computer interaction efficiency [46].
All of this has changed in recent years. Speech technology has begun to change the way we live and work, and for some devices it has become the primary means of interaction. According to [46], there are several key areas in which progress has enabled this trend.
The first area is Moore's law, which states that the number of transistors in a dense integrated circuit doubles approximately every two years [47]. As a result, the computing power of CPU/GPU clusters doubles roughly every two years, making it possible to train more complex and powerful models and therefore lowering the error rate of ASR systems.
The second is access to more data through the Internet and cloud computing. The availability of vast data collections enables the training of robust and realistic models that enhance accuracy and rely on fewer simplifying assumptions, leading to more dependable, data-driven outcomes.
Advancements in technology have brought a surge in the popularity of smart devices for mobile and residential use, as well as in-vehicle entertainment systems. Given the impracticality of traditional input methods like keyboards and mice in these scenarios, voice-based communication emerges as a natural and convenient alternative, mirroring the inherent human preference for verbal communication.
There are several ASR approaches based on different models, such as the Gaussian mixture model or the HMM. Since this thesis concerns neural networks (NN), the next section goes through RNN models in ASR.
2.13.1 Speech Recognition Using Recurrent Neural Networks
In this section, the author discusses the research by Aditya Amerkar, Gaurav Deshmukh, Parikshit Awasarmol and Piyush Dave on "Speech Recognition using RNNs" [49]. This work makes the differences between the two types of networks and their performance clearer.
The main difference between the two models is that a convolutional neural network is not capable of processing information sequentially: the network processes every input considering only the current state of its neurons, which is not modified by previous inputs. An RNN, by contrast, has this capability. Each neuron has a kind of memory, which allows it to analyze and process an individual input while taking into account some of the information contained in previously processed inputs. A CNN does a remarkable job for tasks such as recognizing one or a few isolated words, but for more complex speech recognition projects that must process longer utterances or even complete sentences, the use of an RNN is necessary.
2.13.2 Speech-to-Text Using Deep Learning
Bredmar's speech-to-text research extends beyond basic conversion; it treats voice as a comprehensive entity capable of direct language translation without requiring a textual representation. This contrasts with Google Translate's three-step process for speech-to-speech conversion, demonstrating Bredmar's impactful contribution to the field.
Initially, the speech-to-text model translates voice into written text
Subsequently, the written text undergoes translation into the desired language
Finally, the newly generated text is reconverted into audio format using a speech synthesis model, speech synthesizer, or text-to-speech model
The conventional approach to speech-to-speech translation suffers from a flaw: speaker characteristics such as emotion and accent are lost during the speech-to-text conversion and are therefore absent from the subsequent text-to-text translation and speech synthesis steps. However, advancements in natural language processing and deep learning have paved the way for a solution that eliminates this loss, preserving the speaker's voice characteristics throughout the translation process.
Bredmar's innovative approach revolves around eliminating the need for intermediate text-to-text translation in speech translation. Utilizing an LSTM neural network, it allows direct processing of audio input containing voices in specific languages. The resulting output is an audio file that maintains the original vocal characteristics, leading to seamless speech-to-speech translation. This breakthrough eliminates distortions and inaccuracies often associated with traditional methods, ensuring the preservation of true speech qualities throughout the translation process.
A dedicated dataset was curated for training the model. Bredmar advocates constructing a dataset enriched with sentences of diverse lengths and tones, simulating common conversations. These sentences were sourced from dubbed films, chosen because dubbed French audio is abundant. The dataset aimed to encapsulate the nuances of different voices while maintaining voice characteristics. The final dataset comprised 16 movies, generating around 700 sentences per movie, ranging from 3 to 50 seconds, totaling 11,000 sentences. A Long Short-Term Memory (LSTM) network with two layers of 800 neurons each was employed for this task, implemented with the Google TensorFlow library.
The initial experiment conducted by the paper's author assesses network performance by systematically increasing the number of neurons in both layers. The results demonstrate enhanced performance as the network size increases, as evident in Figure 2.26: the loss function drops noticeably for larger networks. The reduction in the loss function relative to the increase in the number of neurons is, however, relatively modest. Nevertheless, it is evident that larger networks yield lower training and evaluation errors.
Figure 2.26 Graph showing how the loss function changes depending on the size of the trained network [50]
Increasing the size of the training set leads to a significant reduction in the loss function, particularly during the initial stages of augmentation. This correlation highlights the importance of large datasets for building accurate models.
Figure 2.27 Graph showing how the loss function changes depending on the size of the training set [50]
The outcomes of this endeavor are particularly noteworthy. The experiment indicates that the LSTM network architecture exhibits ample complexity for handling such sequences, as evidenced by the training and test errors. The author acknowledges a limitation: the network's current size struggles to effectively manage more extensive or intricate datasets, exhibiting signs of underfitting. Nevertheless, the positive outcome underscores the potential of speech-to-speech translation without relying on intermediary speech-to-text or text-to-speech models. It also highlights the feasibility of preserving some speech characteristics from the input source, pending the enhancement of the network's architecture to accommodate greater complexity.
PROPOSED MODEL
Transformer Model
The appearance of the attention mechanism in seq2seq models has revolutionized the field of NLP. Initially, seq2seq models relied on encoder-decoder frameworks, typically using RNNs, to transform sequences from one domain to another, such as translating a sentence from one language to another. However, RNNs and their advanced variants, LSTM and GRU, had limitations: the vanishing gradient problem made it difficult to capture long-range dependencies, and their sequential nature made them computationally inefficient.
The attention mechanism overcomes these limitations by enabling the decoder to concentrate on specific sections of the input sequence during each output generation step. Instead of summarizing the entire input sentence into a fixed-size vector, the model learns to focus on the most relevant sections, capturing the intricacies and context of longer sequences more accurately.
One of the notable attention mechanisms is that proposed by Luong et al., often referred to as Luong's attention. In contrast to the earlier "Bahdanau attention", Luong's attention proposed a more efficient approach by simplifying the alignment function, the component that scores how well the inputs around position j match the output at position i. Luong's attention mechanism offers several alignment functions, namely dot, general, and concat, providing a more versatile and computationally efficient way to learn the dependencies between source and target sequences.
The development of the Transformer model leveraged the concept of attention further by using self-attention, also known as intra-attention, which allows each position in a sequence to attend to all positions within the same sequence. This mechanism enables the model to capture contextual information from the entire sequence, making it more effective than RNN-based seq2seq models, particularly for tasks requiring an understanding of long-range dependencies.
GPT models are built upon the Transformer architecture, using self-attention to predict each subsequent token in a sequence. The distinction here is that GPT, as an autoregressive language model, uses masked self-attention in its decoder blocks so that each token can only attend to previous tokens, thus ensuring causality in text generation.
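For illustration, a minimal sketch of such a causal (look-ahead) mask, applied to raw attention scores before the softmax (shapes and values are placeholders), is:

```python
# Minimal sketch (assumption): a causal mask as used in GPT-style masked
# self-attention, so position i can only attend to positions <= i.
import torch

def causal_mask(seq_len):
    # Upper-triangular positions (future tokens) are set to -inf before softmax.
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

scores = torch.randn(1, 4, 4)            # (batch, seq_len, seq_len) raw attention scores
masked = scores + causal_mask(4)         # future positions become -inf
weights = torch.softmax(masked, dim=-1)  # each row attends only to itself and earlier tokens
```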
Figure 3.2 The Transformer and GPT Architecture [22][23]
Luong's attention mechanism is specifically a form of global attention, which considers all the hidden states of the encoder when generating each token. In contrast, self-attention in the Transformer model is more generalized; it is not just about aligning two sequences (as in translation) but about modeling all pairwise interactions within a single sequence for a variety of tasks. This kind of self-attention, which has been integral to the success of Transformer models, does not directly derive from Luong's attention but rather expands on the concept of attention as a whole, providing a more flexible and powerful way to model relationships in data.
The potency of the attention mechanism was notably showcased in the paper "Attention Is All You Need," where the authors unveiled an innovative neural network architecture known as the Transformer. This architecture operates on the principles of attention and represents a significant departure from previous models by relying solely on attention mechanisms, without recurrent or convolutional layers. The Transformer architecture consists of two main blocks: the encoder block and the decoder block.
The encoder block of the Transformer architecture, as delineated in the accompanying figure, is a testament to the model's innovative approach to processing sequential data. This block is shown on the left of Figure 3.2 and has the following parts.
Firstly, input embeddings: the operation commences with input embedding, where each token of the input sequence is translated into a high-dimensional vector, laying the groundwork for subsequent interpretative processing.
Positional encoding compensates for the Transformer's lack of sequential processing by infusing positional information into the input embeddings. Sine and cosine functions are used to encode positions for even-indexed and odd-indexed dimensions, respectively. These vectors are then added to the input embeddings, ensuring the model retains awareness of the sequence order. The choice of sine and cosine functions aligns with the model's learning process, facilitating the identification of and attention to positional dependencies.
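A minimal NumPy sketch of this sinusoidal encoding, assuming the standard formulation (sequence length and model dimension are placeholders), is:

```python
# Minimal sketch (assumption) of sinusoidal positional encoding: sine on
# even-indexed dimensions, cosine on odd-indexed dimensions, added to embeddings.
import numpy as np

def positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]                       # (max_len, 1)
    dims = np.arange(d_model)[None, :]                            # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                         # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                         # odd dimensions
    return pe

embeddings = np.random.randn(50, 512)          # (seq_len, d_model) token embeddings
encoded = embeddings + positional_encoding(50, 512)
```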
The Encoder layer transforms input sequences into an intricate continuous representation through its multi-headed attention mechanism and fully connected feed-forward network. Residual connections safeguard information flow, complemented by layer normalization to stabilize learning dynamics throughout the network.
Within the encoder, the multi-headed attention mechanism uses self-attention to relate each input word to the others, enabling the model to identify relationships between words such as "you," "how," and "are." Through three separate fully connected layers, the model generates the query, key, and value vectors that are essential for self-attention. This process allows the model to recognize patterns in the input, such as a question, and generate appropriate responses.
Once the query, key, and value vectors are processed through a linear layer, a matrix multiplication via dot product is performed between the queries and keys to yield a score matrix. This score matrix quantifies the degree of attention or focus that each word in the sequence should receive from other words. Consequently, every word is assigned a score that corresponds to every other word at that specific time-step, with higher scores indicating a greater degree of focus or relevance. This scoring mechanism effectively maps the queries to their corresponding keys, thus facilitating the self-attention process within the network.
Figure 3.5 Dot Product of Query and Key
The attention scores are scaled down by dividing them by the square root of the dimensionality of the query and key vectors. This scaling is crucial because, for large dimensionalities, the dot products can grow very large in magnitude, pushing the SoftMax function into regions where its gradients become extremely small and thereby impeding the model's ability to learn. By scaling down, the model maintains more stable gradients, which allows for smoother optimization and a more reliable learning process.
Figure 3.6 Scaling Down the Attention Scores
To refine the attention mechanism, the attention scores are scaled and subjected to a SoftMax function, a process that normalizes the scores, creating a probability distribution between 0 and 1. This transformation amplifies higher scores, accentuating their significance. The model assigns higher probabilities to relevant words, enhancing its focus. Consequently, the SoftMax function allows the model to make probabilistic decisions, assigning greater confidence to words that warrant attention, improving the specificity and clarity of the attention mechanism.
Figure 3.7 SoftMax of the Scaled Scores
The obtained attention weights are then multiplied by the value vectors, resulting in an output vector. This operation effectively scales the value vectors by the attention probabilities, allowing the model to emphasize the representation of words it has deemed more significant through higher SoftMax scores. Conversely, the representations of less relevant words are attenuated, as their corresponding lower attention weights diminish their influence on the output vector. This filtered output, which is a weighted sum of the value vectors, is then passed through a linear layer for further processing. The linear layer functions as a transformation step, integrating and refining the weighted information in preparation for subsequent stages in the model.
Figure 3.8 Multiply SoftMax Output with Value Vector
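Putting the steps illustrated in Figures 3.5 to 3.8 together, the following is a minimal single-head self-attention sketch, assuming a PyTorch setting; dimensions and names are placeholders rather than the configuration used in this thesis.

```python
# Minimal sketch (assumption): Q/K/V projections, dot-product scores, scaling
# by sqrt(d_k), softmax, and the weighted sum of value vectors, followed by a
# final linear layer.
import math
import torch
import torch.nn as nn

class SingleHeadSelfAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)    # final linear layer

    def forward(self, x):                              # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = torch.bmm(q, k.transpose(1, 2))       # (batch, seq_len, seq_len)
        scores = scores / math.sqrt(q.size(-1))        # scale by sqrt(d_k)
        weights = torch.softmax(scores, dim=-1)        # attention probabilities
        attended = torch.bmm(weights, v)               # weighted sum of value vectors
        return self.out_proj(attended)

attn = SingleHeadSelfAttention(d_model=64)
out = attn(torch.randn(2, 10, 64))                     # (2, 10, 64)
```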
Evaluation
In the evaluation of chatbot performance, it is crucial to employ appropriate metrics that measure the quality of its responses. Four key metrics, namely F1-score, precision, recall, and BLEU, play a significant role in assessing different aspects of the Seq2seq model. In addition, WER and PER are the key metrics used to evaluate the ASR model.
F1-Score, Precision, Recall: These metrics are particularly relevant for evaluating the accuracy and relevance of chatbot responses They are calculated as follows:
Precision measures the effectiveness of chatbot responses by calculating the proportion of relevant and accurate replies among all responses generated. It is calculated as the number of accurate and relevant responses divided by the total number of responses. A higher precision score indicates that the chatbot is providing accurate and pertinent information to users.
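In its standard form, this can be written as:

$$\text{Precision} = \frac{\text{number of relevant and accurate responses generated}}{\text{total number of responses generated}}$$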
A high precision score implies that the chatbot predominantly provides correct and relevant responses
Recall for Chatbot Responses: Recall quantifies the ratio of relevant and accurate responses generated by the chatbot to the total number of correct and relevant responses present in the reference set. It is calculated using the formula:
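In its standard form:

$$\text{Recall} = \frac{\text{number of relevant and accurate responses generated by the chatbot}}{\text{total number of relevant responses in the reference set}}$$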
A high recall score indicates that the chatbot captures a significant portion of the relevant responses
F1-Score for Chatbot Responses: The F1-score is a harmonized measure that combines precision and recall, providing a holistic evaluation of response quality. It is calculated using the formula:
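In its standard form, this is the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$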
The F1-score offers insights into the relevance and comprehensiveness of the chatbot's responses
BLEU: Originally designed for machine translation, BLEU can be adapted to assess chatbot responses. It evaluates the similarity between the chatbot's responses and reference responses provided by human evaluators. BLEU is calculated using a formula that considers the precision of n-grams in the generated response compared to reference responses.
BP stands for Brevity Penalty, which penalizes the score when the generated response is shorter than the reference answer. The mathematical expression for BP is given as follows:
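In its standard form, with c the length of the generated response and r the length of the reference:

$$\text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$

and the overall score is $\text{BLEU} = \text{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right)$, where the $w_n$ are the n-gram weights.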
P_n is the modified n-gram precision score. It can be defined as follows:
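In its standard form, with candidate n-gram counts clipped by their counts in the reference:

$$p_n = \frac{\sum_{\text{n-gram} \in \text{candidate}} \min\!\big(\text{Count}_{\text{cand}}(\text{n-gram}),\ \text{Count}_{\text{ref}}(\text{n-gram})\big)}{\sum_{\text{n-gram} \in \text{candidate}} \text{Count}_{\text{cand}}(\text{n-gram})}$$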
Fallback Rate: This refers to the frequency at which a chatbot fails to understand or process a user's request and "falls back" on a default response. This could be a message like "I'm sorry, I don't understand" or "Can you rephrase that?"
A lower fallback rate indicates that the chatbot is more capable of understanding and responding to a variety of inputs without needing to revert to a generic response
Containment Rate: This is the percentage of interactions that are successfully resolved within the chatbot without needing to escalate to a human agent. A high containment rate is usually desirable, as it indicates the chatbot can handle a large volume of queries, which reduces the workload on human agents and can lead to cost savings.
Handoff Rate: This refers to the frequency with which a chatbot must transfer a user to a human agent. The handoff can occur because the chatbot has reached the limits of its capabilities, the user has requested to speak to a human, or the query is too complex or sensitive for the bot to handle. A lower handoff rate can be a sign of a more effective chatbot, assuming user issues are being resolved satisfactorily.
WER quantifies recognition errors by counting the words that the system substituted, deleted, or inserted relative to the reference transcript.
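In its standard form, the word error rate is computed as:

$$\text{WER} = \frac{S + D + I}{N}$$

where: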
S is the number of substitutions,
D is the number of deletions,
I is the number of insertions,
C is the number of correct words,
N is the number of words in the reference (N = S + D + C)
PER (perplexity) is a measure of how well a probability distribution or probability model predicts a sample. In the context of language models, perplexity is often used to evaluate how well a model predicts a sequence of words.
Perplexity is calculated using the probability assigned by the model to the actual data For a given sequence of words W = w1, w2, …, wn, the perplexity PP(W) is calculated as:
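In the standard formulation (each P(w_i) is conditioned on the preceding words):

$$PP(W) = P(w_1 w_2 \ldots w_n)^{-\frac{1}{n}} = \exp\!\left(-\frac{1}{n}\sum_{i=1}^{n} \log P(w_i)\right)$$

where: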
P(w_i) is the probability assigned by the model to the i-th word in the sequence,
n is the total number of words in the sequence
IMPLEMENTATION
Dataset
To develop a robust Seq2seq chatbot system, a broad selection of data sources is crucial. This study utilized seven datasets, together containing around 20,000 question-answer pairs obtained from the following hospital advisory websites.
The general hospital in the Thu Duc region can be found at the following URL: http://www.benhvienkhuvucthuduc.vn/hoi-dap. This website comprises approximately 250 pages, with each page containing 10 question-answer pairs.
The Viet Duc Friendship Hospital can be found at the following URL: https://benhvienvietduc.org/hoi-dap. It contains approximately 450 question-answer pairs.
The MEDLATEC General Hospital can be found at the following URL: https://medlatec.vn/hoi-dap. It contains approximately 590 pages, with 10 question-answer pairs on each page.
The Thu Cuc International General Hospital is located at the following URL: https://benhvienthucuc.vn/hoi-dap-chuyen-gia. It consists of approximately 200 pages, with 10 question-answer pairs on each page.
The Vinmec International General Hospital Joint Stock Company is located at the following URL: https://www.vinmec.com/vi/tin-tuc/hoi-dap-bac-si/. It consists of approximately 560 pages, with 10 question-answer pairs on each page.
The Binh Duong Obstetrics and Pediatrics Hospital is located at the following URL: https://benhvienpsnbd.com.vn/hoi-dap/trang/1.htm. It consists of approximately 390 pages, with 5 question-answer pairs on each page.
The Hanoi Traditional Medicine Multidisciplinary Hospital is located at the following URL: http://bvdkyhoccotruyenhanoi.vn/bacsituvan/index?page=1. It consists of approximately 95 pages, with 8 question-answer pairs on each page.
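A hypothetical sketch of how question-answer pairs could be collected from such paginated pages is shown below; the page-URL pattern and CSS selectors are illustrative assumptions only, not the actual markup of any of the sites listed above.

```python
# Hypothetical sketch of collecting question-answer pairs from a paginated
# hospital Q&A page. The URL pattern and the selectors ".qa-item", ".question",
# and ".answer" are assumptions for illustration only.
import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for item in soup.select(".qa-item"):               # assumed container class
        question = item.select_one(".question")        # assumed selectors
        answer = item.select_one(".answer")
        if question and answer:
            pairs.append((question.get_text(strip=True), answer.get_text(strip=True)))
    return pairs

# Example: iterate over a few pages of one site (the page-URL pattern is assumed)
all_pairs = []
for page in range(1, 4):
    all_pairs.extend(scrape_page(f"https://medlatec.vn/hoi-dap?page={page}"))
```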
Figure 4.1 Example of datasets for Seq2seq model
In the development of an Automatic Speech Recognition (ASR) system for a chatbot, a dataset exceeding 20 hours of audio recordings was compiled. These recordings, initially sampled at a rate of 44.1 kHz, were sourced from news programs across diverse television channels in Vietnam. Concurrently, subtitles were extracted from the corresponding YouTube videos. Through meticulous data preprocessing, the audio files were converted to a 16 kHz format and segmented into shorter clips of 3 to 4 seconds, aligned with the extracted subtitles.
The primary objective is to leverage this curated dataset for the comprehensive training of the ASR model It's noteworthy that the audio files, stemming from various news programs, do not necessitate a shared thematic focus, such as healthcare, to be instrumental in the ASR model training process.
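The resampling and segmentation step described above can be sketched as follows, assuming the librosa and soundfile libraries; file names and the clip length are placeholders.

```python
# Minimal sketch (assumption): resample a 44.1 kHz recording to 16 kHz and
# cut it into short clips of a few seconds each.
import librosa
import soundfile as sf

def resample_and_segment(in_path, out_prefix, clip_seconds=4, target_sr=16000):
    audio, sr = librosa.load(in_path, sr=target_sr)        # load and resample to 16 kHz
    samples_per_clip = clip_seconds * target_sr
    for i in range(0, len(audio), samples_per_clip):
        clip = audio[i:i + samples_per_clip]
        sf.write(f"{out_prefix}_{i // samples_per_clip:04d}.wav", clip, target_sr)

resample_and_segment("news_program.wav", "news_program_clip")
```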
Experiment
Experiment 1 for Seq2seq: Comparing the models used in the task of intent and entity recognition
In this experiment of the thesis, for the intent recognition task on each question, the system employs the following two main models.
GPT-3 based on the Transformer: uses a decoder-only Transformer architecture with masked self-attention to understand the input and generate the output sequence. This model excels in overall performance but may be computationally expensive.
Bidirectional GRU and Bidirectional LSTM with Luong's Attention: read the question in both directions and employ an attention mechanism to focus on important information during the decoding process, improving the model's ability to predict and comprehend the intent of the question.
AVG Execution Time (seconds/pair)
In terms of accuracy metrics (precision, F1-Score, BLEU-Score), GPT-3 stands out as the superior model, demonstrating a higher likelihood of providing correct answers. Meanwhile, Bi-GRU, although not matching GPT-3 in accuracy, offers a more efficient processing speed, outperforming GPT-3 in execution time. Bi-LSTM, with the lowest scores among the models, presents a trade-off between accuracy and efficiency, positioning itself as an intermediate option.
Experiment 2 for Seq2seq: Chat handoff and fallback
Table 4.3 Chat Handoff and Fallback
Model Total number of interactions
According to the presented data, GPT-3 emerges as the superior model, showcasing its reliability in delivering accurate responses with a notably low fallback rate of 0%. Moreover, GPT-3 exhibits robust standalone interaction capabilities, evidenced by a high containment rate of 100%, surpassing the performance of Bi-GRU and Bi-LSTM, which report containment rates of 40% and 28%, respectively.
While GPT-3 does exhibit a relatively high handoff rate, it does not adversely affect overall performance, as all interactions are successfully contained. On the contrary, the higher fallback and handoff rates observed in Bi-GRU and Bi-LSTM models indicate their comparative limitations as chatbot models within this experiment. GPT-3's ability to provide accurate responses and handle interactions independently positions it as the more effective choice for the specified chatbot application.
The following is the conversation with the chatbot after the final test
Table 4.4 Conversation Log with Chatbot
1. Q: May I ask whether the hospital offers andrology (men's health) examinations?
A: The hospital does offer andrology examinations; please register at the andrology clinic (No. 2 – outpatient department – room No. 2).
2. Q: My baby is now 14 months and 22 days old, is 80 cm tall, weighs 13.4 kg, and can already walk steadily. However, since 6 months of age the baby has only grown 4 upper and lower incisors. I do not know whether the baby is deficient in any nutrients. How can I check what might be wrong?
A: Dear friend! Thank you for sending your question to icnm. The baby's case may involve dental conditions, but it is not known which one. You should take the baby to the hospital to be examined and given specific advice. For any further details about services that need additional support
3. Q: May I ask, after an elective caesarean section, how long until the mother can meet her baby?
A: Hello, after an elective caesarean section, some patients cannot be treated. You should come in for a direct examination so a doctor can advise and treat you. You can go to any hospital in the Vinmec health system for further advice from a doctor. Kind regards.
4. Q: For a 4-week pregnancy, what is the cost of an abortion?
A: Hello, thank you for sending your question to icnm. For a 4-week pregnancy, the abortion cost is 1.5 million VND.
5. A: To find out the cost of a gynecological examination, you can refer to the information at https://www.facebook.com/bophetm.vn/ We provide detailed information about the fees related to gynecological examinations.
6. Q: I have decay in teeth number 6-7, with only the roots left; they do not ache. Should I have them extracted? I am afraid extraction will cause jawbone loss.
A: Hello! Thank you for sending your question to icnm. In your case, with extraction of teeth number 6-7, roots and aching, you should go to the hospital to be examined and treated. You can go to the hospital to be examined and receive more specific advice. For any further details about services
7. Q: I am pregnant for the third time. The two previous times, after giving birth I had no breast milk despite trying every possible way, and had to raise the babies on formula. This time I am at 22 weeks but my breasts have hardly developed. I would like to ask the doctor whether there is any way for me to breastfeed after giving birth.
A: Hello! Thank you for sending your question to icnm. In your case, pregnant for the third time, you may need to go to the hospital to be examined and treated. You can go to any hospital in the Vinmec health system for further advice from a doctor. For any further details about services
8. Q: Hello doctor, I have had a hematoma (bruise) on my elbow for a week; it aches a little and I have not touched it. Then, as the bruise faded, I pressed on it and found a soft round lump that hurts near the bruise, and my arm still aches. I went to the hospital for a check-up; the doctor ordered a blood test, the mono percentage was 9.5/8 and the coagulation indices were normal. Is the lump anything serious, and does anything else need to be checked?
A: Hello, thank you for sending your question to icnm. Coagulation is one of the signs of thyroid disease; however, coagulation can be a sign of thyroid disease, but it is not a sign of thyroid disease? It is unclear whether the lump appeared first or the hematoma formed and then the lump?
9. Q: Hello doctor, my baby has had a birthmark on the face since birth; it is dark brown and has a lot of hair. Could the doctor advise me what condition this is and whether it is dangerous for the baby?
A: Hello, thank you for sending your question to icnm. To your question, I would like to answer as follows: a birthmark on the face is a condition but not any particular condition; a birthmark on the face can be due to many causes such as pigment inflammation, pigment inflammation, small pigment inflammation, small pigment inflammation,
10. Q: Hello doctor! My husband and I have been married for 3 years but still have no baby. I am normal, but my husband has weak sperm with progressive (PR) motility of 6%. Please give me advice: with this result, is intrauterine insemination possible, or do we have to pursue IVF?
A: Hello! Thank you for sending your question to icnm. You and your husband have been married for 3 years and still have no baby, and still have weak sperm. Weak sperm can be due to many causes, but it is not due to weak sperm. With these causes, you need to come to the hospital to be examined and have diagnostic tests
11. Q: Doctor, may I ask: during pregnancy, can I gently place my hand on my belly? Whenever the baby kicks I gently place my hand on my belly and lightly press a finger where the baby kicked; I do not rub the belly. Does this have any effect? Thank you.
A: Dear friend! Thank you for sending your question to icnm. You can place your hand on the belly, gently, place your hand lightly on the belly, just use your finger to kick and not rub the belly; it could be due to an abnormality as you have been examined and treated with medication; however, you need to come for a direct examination
Discussion
From the experimental results, the following assessments can be made:
While the chatbot was successful in identifying the most significant intent, it became evident that the dataset's size and diversity are limiting factors. Expanding the dataset with more varied content is crucial to enhance the chatbot's ability to engage in meaningful conversations and provide relevant responses to a broader range of user inquiries.
The importance of topic-specific datasets and tailored training was underscored during the evaluation. It became evident that training the chatbot on specific domains or industries is essential for achieving high accuracy in its responses. A lack of domain-specific data may lead to incorrect or irrelevant answers.
The experimental task shed light on the limitations of applying the chatbot in the healthcare sector. The chatbot's performance was less than satisfactory for issues requiring a high degree of precision, especially those related to human health, medical advice, data security, and academic expertise. Further development and fine-tuning are necessary to make the chatbot a reliable resource in such critical areas.
The accuracy of wav2vec 2.0 is remarkably high when applied to Vietnamese datasets. Training with fewer than 10 epochs produced suboptimal results for English-transcribed names and numbers, but scenario three led to significantly improved accuracy. The experimental dataset used for evaluation is entirely distinct from the training dataset, consisting of 60 files with durations ranging from 2 to 3 seconds per file. The model's response time is 10.4 seconds, showcasing its potential for integration with a seq2seq model to develop a voice-input-supported chatbot.
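For reference, a minimal sketch of running inference with a wav2vec 2.0 CTC model through the Hugging Face transformers API is shown below; the Vietnamese checkpoint name is an illustrative assumption, not the model trained in this thesis.

```python
# Minimal sketch (assumption): transcribing a 16 kHz clip with a wav2vec 2.0
# CTC model via the Hugging Face transformers API. The checkpoint name below
# is illustrative only.
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model_name = "nguyenvulebinh/wav2vec2-base-vietnamese-250h"   # assumed checkpoint
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

speech, _ = librosa.load("news_program_clip_0001.wav", sr=16000)
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```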
Building a voice chatbot comes with several challenges that need to be addressed for successful implementation as below:
Addressing coreference resolution is crucial. In both spoken and written language, people often use shortened references for previously mentioned objects. For example, in Vietnamese, various pronouns or regional colloquialisms might be employed by speakers and writers. When a chatbot lacks contextual information and a coreference resolution mechanism, it faces significant difficulties in determining the specific objects to which these words refer. Failure to accurately identify these referents can result in the chatbot misinterpreting the user's conversation. This challenge becomes particularly evident in lengthy conversations, leading to responses from the chatbot that may lack precision, alignment with domain expertise, and relevance to the user's specific question.
Due to resource limitations, the experimentation process has only been conducted on a dataset consisting of 20,000 question-answer pairs for the Seq2seq model and 20 hours of audio files. The experimentation process has not yet been able to collect large-scale datasets from various real-life topics for model training.
As a result of time constraints during the thesis project and limited available efforts, the observed outcomes have thus far been limited to relatively simple and academic experiments. They have not yet been extended to practical, state-of-the-art models to demonstrate the model's real-world applicability, stability, and effectiveness.
CONCLUSION
This thesis explores the development of voice-to-text chatbots with enhanced semantic understanding, comparing traditional Seq2Seq models with a decoder-only GPT-3 approach customized for Vietnamese. The analysis provides insights into the strengths and weaknesses of these architectures, especially for structured dialogue processing in the healthcare domain.
Additionally, efforts have been made to improve the accuracy of Automatic Speech Recognition (ASR) models by carefully selecting and evaluating experiments to overcome challenges in noisy environments and varying acoustic conditions. This contributes significantly to enhancing the overall performance and usability of ASR technology, especially in critical applications where accurate transcription of spoken input is essential.
Furthermore, the integration of text-to-text and speech-to-text models holds promise in developing chatbots capable of supporting voice interactions in Vietnamese across healthcare domains. By leveraging advanced natural language processing techniques, dialogue management systems, and machine learning algorithms, these chatbots aim to engage in meaningful real-time conversations, providing enhanced user experiences and utility.
Future research in the realm of chatbots employing the GPT-3 model holds promising avenues for advancement. While current iterations of GPT-3 have demonstrated impressive natural language understanding and generation capabilities, there are several intriguing directions for further exploration.
To enhance the usefulness of GPT-3, future research will focus on refining it for specific domains like healthcare, finance, or legal By fine-tuning the model within these domains, chatbots will be able to deliver more accurate and specialized responses, catering to the unique information needs and requirements of each field.
Depth Semantic: the voice-to-text chatbot's capabilities will be enhanced to include deeper semantic understanding, enabling it to comprehend and interact with previous questions and answers in real-time conversations. This enhancement involves integrating advanced natural language processing techniques, such as coreference resolution and sentiment analysis, to grasp context and extract relevant information. Dialogue management systems will help maintain coherence and flow in conversations by tracking dialogue history and providing contextually relevant responses.
Conversational Depth: Increasing the depth and coherence of chatbot conversations is an ongoing challenge. Future research should aim to develop models that can engage in more extended and context-aware dialogues.
Scalability: As GPT-3 models are resource-intensive, optimizing them for scalability is crucial. Future research should focus on efficient model architectures and training methods.
Overall, the contributions presented in this thesis pave the way for future research and advancements in voice-to-text chatbots, with a focus on deep semantic understanding and real-time interaction capabilities. As technology continues to evolve, these developments will play a vital role in revolutionizing communication and accessibility in various domains, ultimately benefiting users and stakeholders alike.
[1] Yacine Ikkache et al "Towards Quranic Reader Controlled by Speech" In: (IJACSA) International Journal of Advanced Computer Science and Applications, Vol 2, No 11, 2011
[2] Geoffrey Hinton et al “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups” In: IEEE Signal Processing Magazine, pp.82-97, 2012
[3] Karel Veselỳ et al “Sequence-discriminative training of deep neural networks” In: Interspeech.2013-548, 2013
[4] Warren S McCulloch and Walter Pitts “A logical calculus of the ideas immanent in nervous activity” In: Bulletin of Mathematical Biology, Vol 52,
[5] Fan Mei, Hong Chen and Yingke Lei "Blind Recognition of Forward Error Correction Codes Based on Recurrent Neural Network" In: Sensors, Vol 21, No 3884, 2021 https://doi.org/10.3390/s21113884
[6] Pavan Kumar T M et al “Predicting Volatility of Stocks for Time Series Data Using Statistical Forecasting Models” In: International Journal of Engineering Applied Sciences and Technology, Vol 6, Issue 8, ISSN No 2455-
[7] Colah’s blog “Understanding LSTM Networks” URL: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ Accessed: Dec
[8] Kyunghyun Cho et al "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" In: arXiv:1406.1078v1, 2014
[9] Michael Phi "Illustrated guide to LSTM's and GRU's: A step by step explanation" URL: https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21 Accessed: Dec 15
[10] Ho Chung Wu and Robert Wing Pong Luk "Interpreting TF-IDF term weights as making relevance decisions" In: ACM Transactions on Information System,
[11] Michael Cochez et al “A first experiment on including text literals in KgloVe” In: arXiv:1807.1176v1, 2018
[12] Sepp Hochreiter et al "Long short-term memory" In: Neural Computation,
[13] Ronald J Williams “Simple statistical gradient-following algorithms for connectionist reinforcement learning” In: Machine learning, pp 229-256,
[14] Lalit R Bahl et al “Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition” In: Proceedings – ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, 1986
[15] Ilya Sutskever et al “Sequence to sequence learning with neural networks” In: arXiv:1409.3215v3, 2014
[16] Ping Jiang “Nowcasting Influenza Using Google Flu Trend and Deep Learning Model” In: Advances in Economics, Business and Management Research, Vol 155, 2020
[17] David Barry, "What's New in Answering Services for Small Businesses?" URL: https://www.soundhound.com/voice-ai-blog/whats-new-in-answering-services-for-small-businesses/ Accessed: Dec 15 2023
[18] Colah's blog "Neural Networks, Types, and Functional Programming" URL: https://colah.github.io/posts/2015-09-NN-Types-FP/?ref=blog.paperspace.com Accessed: Dec 15 2023
[19] NIC: 13771 "PARRY Encounters the DOCTOR" URL: https://www.rfc-editor.org/rfc/rfc439 Accessed: Dec 15 2023
[20] Sean Robertson “Practical Pytorch: Translation with a Sequence to Sequence Network and Attention” URL: https://github.com/spro/practical- pytorch/blob/master/seq2seq-translation/seq2seq-translation-batched.ipynb
[21] Minh-Thang Luong et al “Effective Approaches to Attention-based Neural Machine Translation” In: arXiv:1508.04025v5, 2015
[22] Ashish Vaswani et al “Attention Is All You Need” In: arXiv: 1706.03762v7,
[23] Alec Radford et al "Improving language understanding by generative pre-training" In: OpenAI, 2018
[24] Peter J Liu et al “Generating wikipedia by summarizing long sequences” In: arXiv:1801.10198v1, 2018
[25] Zichao Yang et al “Hierarchical Attention Networks for Document Classification” In: Proceedings of NAACL-HLT, pp 1480-1489, 2016
[26] Dzmitry Bahdanau et al “Neural Machine Translation by Jointly Learning to Align and Translate” In: arXiv:1409.0473v7, 2014
[27] Jason Weston et al "Memory Networks" In: arXiv:1410.3916v11, 2015
[28] Jiasen Lu et al "Hierarchical question-image co-attention for visual question answering" In: arXiv:1606.00061v5, 2017
[29] Baosong Yang et al "Convolutional Self-Attention Networks" In: arXiv:1904.03107v1, 2019
[30] Zhilin Yang et al “XLNet: Generalized Autoregressive Pretraining for Language Understanding” In: arXiv:1906.08237v2, 2020
[31] Jacob Devlin et al “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” In: arXiv:1810.04805v2, 2019
[32] Guillaume Lample and Alexis Conneau “Cross-lingual Language Model Pretraining” In: arXiv:1901.07291v1, 2019
[33] Iz Beltagy et al “Longformer: The Long-Document Transformer” In: arXiv:2004.05150v2, 2020
[34] Angela Fan et al “Beyond English-Centric Multilingual Machine Translation” In: arXiv:2010.11125v1, 2020
[35] Colin Raffel et al “Exploring the Limits of Transfer Learning with a Unified Text-to-text Transformer” In: arXiv:1910.10683v4, 2019
[36] Teven Le Scao et al “Bloom: A 176b-parameter open-access multilingual language model” In: arXiv:2211.05100v4, 2023
[37] Hugo Touvron et al "Llama: Open and efficient foundation language models" In: arXiv:2302.13971v1, 2023
[38] Aakanksha Chowdhery et al “Palm: Scaling language modeling with pathways” In: arXiv:2204.02311v5, 2022
[39] Federica Laricchia "Barriers to voice technology adoption worldwide 2020" URL: https://www.linkedin.com/pulse/problem-current-speech-recognition-technology-dani-cherkassky/ Accessed: Dec 15 2023
[40] David E Rumelhart et al "Backpropagation: Theory, Architectures, and Application" In: Mathematical Perspectives on Neural Networks, 1996
[41] Karel Veselỳ et al "Sequence-discriminative training of deep neural networks" Proc Interspeech, pp 2345-2349, 2013
[42] Frederik Bredmar "Speech-to-speech translation using deep learning" In: Department of Computer Science and Engineering at University of Gothenburg, 2017
[43] Connor Shorten “Introduction to ResNets” URL: https://towardsdatascience.com/introduction-to-resnets-c0a830a288a4
[44] Colah's blog "Understanding LSTM Networks" URL: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ Accessed: Dec 15 2023
[45] Nitish Srivastava, Geoffrey Hinton et al "Dropout: A simple way to prevent neural networks from overfitting" Journal of Machine Learning Research, Vol 15, pp 1929-1958, 2014
[46] Dong Yu and Li Deng “Automatic Speech Recognition” Signals and Communication Technology, 2016
[47] Schaller, R R “Moore’s law: past, present and future” In: IEEE Spectrum,
[48] Richard Werner “Developing an automatic speech recognition system based on Czech spoken language” Master’s thesis Faculty of Information
[49] Aditya Amerkar et al "Speech Recognition using Recurrent Neural Networks" In: IEEE International Conference on Current Trends toward
[50] Geoffrey Hinton et al "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups" In: IEEE Signal Processing Magazine, Vol.29, No.6, pp.82-97, 2012
[51] Nataliya Shakhovska et al "Development of the Speech-to-Text Chatbot Interface Based on Google API" Google, 2019
[52] Nitish Srivastava et al "Dropout: A simple way to prevent neural networks from overfitting" Journal of Machine Learning Research, Vol 15, pp 1929-1958, 2014
[53] CollegeSidekick "Boundless Psychology" URL: https://www.collegesidekick.com/study-guides/boundless-psychology/neurons Accessed: Dec 15 2023
[54] Eric Jang et al “Categorical Reparameterization with Gumbel-Softmax” In: arXiv:1611.01144v5, 2017
[55] Alex Krizhevsky et al "ImageNet Classification with Deep Convolutional Neural Networks" University of Toronto, 2012
[56] Cobus Greyling, "General Chatbot Architecture, Design & Development Overview" URL: https://cobusgreyling.medium.com/general-chatbot-architecture-design-development-overview-58e145398608 Accessed: Dec 15
[57] Lara Lloret Iglesias et al “A primer on deep learning and convolutional neural networks for clinicians” In: Insight Imaging, 2021.