(Master's thesis) Advanced Deep Learning Methods and Applications in Open-domain Question Answering

Open-domain Question Answering

Problem Statement

In QA systems, questions are formulated as natural language sentences and can be categorized semantically into various types, including factoid, list, causal, confirmation, and hypothetical questions. Among these, factoid questions, which typically start with Wh- words like What, When, Where, and Who, are the most prevalent in the research literature. Open-domain QA allows users to pose questions about any topic, without restriction to a specific domain. The answers provided are factual and are usually presented in text format.

An open-domain QA system has a straightforward input-output structure: the input is an unrestricted natural language question and the output is a coherent answer presented as text. These systems draw on web resources or databases to provide answers, and the task is usually divided into smaller sub-tasks. The primary sub-tasks are document retrieval and document comprehension, each handled by a dedicated component. Typically, an open-domain QA system consists of two main modules: a Document Retriever for the retrieval task and a Document Reader for comprehension. These modules are combined in a pipeline to create a complete open-domain QA system.

Figure 1.2: The pipeline architecture of an Open-domain QA system.

In an open-domain question-answering system, the input is a question (q) and the output is an answer (a). The Document Retriever identifies the top-k relevant documents from a vast search space, which ideally encompasses all world knowledge. Because an unlimited search space is impractical, common knowledge sources on the Internet, particularly Wikipedia, are used instead. A document is deemed relevant if it contains the answer to q, but it must also be understandable and semantically aligned with the question. The Retriever quantifies the relevance of each document with a score and returns the k highest-scoring documents.

Each candidate document d is scored as f(q, d) (1.1), where f(·) is the scoring function. After obtaining a workable list of documents, the Document Reader processes the question (q) and the retrieved set of documents (D) to identify the most likely text span that answers the question. Unlike the Retriever, which handles a large volume of documents, the Reader works on a much smaller selection but must analyze it more deeply to locate the precise answer within the text. This task demands strong comprehension skills and the ability to reason and deduce effectively.
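The retrieval step described above can be sketched as a top-k selection over scored documents. This is only an illustration: `retrieve_top_k` and `keyword_overlap` are hypothetical names, and the keyword-overlap score is a deliberately naive stand-in for the scoring function f, not the model proposed in the thesis.

```python
import heapq
from typing import Callable, List, Tuple


def retrieve_top_k(question: str, documents: List[str],
                   score: Callable[[str, str], float],
                   k: int) -> List[Tuple[float, str]]:
    """Return the k highest-scoring (score, document) pairs, best first."""
    scored = [(score(question, d), d) for d in documents]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def keyword_overlap(question: str, document: str) -> float:
    """Toy scoring function (an assumption): count of shared lower-cased words."""
    q_words = set(question.lower().split())
    d_words = set(document.lower().split())
    return float(len(q_words & d_words))
```

Any function with the same `(question, document) -> float` signature can be passed as `score`, which is where a learned scoring model would plug in.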

Difficulties and Challenges

Open-domain Question Answering (QA) presents significant challenges due to its inherent complexities. While the goal of an open-domain QA system is to answer any question, achieving this is hindered by our limited understanding of the world and by the constraints of information retrieval (IR) systems, which can only process digitized content. This content spans various formats, including text, videos, images, and audio, each requiring its own processing methods. The sheer scale of the data available online is itself a challenge for open-domain QA systems, particularly for their retrieval modules, and it is compounded by the ever-changing nature of online content.

The vast number of documents in the search space necessitates a fast retrieval process, often at the expense of accuracy. Many Document Retrievers struggle to select relevant documents because they lack sophisticated comprehension capabilities. As a result, the retrieved documents may not contain the required answer, either because information on the web is imprecise and unreliable or because the model fails to grasp the semantic meaning of the question. For instance, a retrieval model might return documents based on isolated keywords like "diamond" and "hardest gem," while overlooking the overall meaning of the query. Conversely, a document may align with the semantic intent but still provide incorrect information.

Table 1.1: An example of problems encountered by the Document Retriever.

Question: What is the second hardest gem after diamond?

(1) Diamond is a native crystalline carbon that is the hardest gem.

(2) Corundum, the main ingredient of ruby, is the second hardest material known after diamond.

(3) After graphite, diamond is the second most stable form of carbon.

Open-domain question-answering (QA) systems typically operate in a pipeline structure, which leads to a significant issue known as cascading errors. In this setup, the performance of the Reader depends heavily on the effectiveness of the Retriever. Consequently, a poorly performing Retriever becomes a substantial bottleneck that degrades the accuracy of the whole system.

Deep learning

Deep learning has emerged as a leading trend in machine learning research, owing to its effectiveness in addressing practical challenges. Although it has gained widespread attention only recently, its roots trace back to the 1940s with the pioneering work of Walter Pitts and Warren McCulloch, who developed the first mathematical model of a neural network. The rapid advances in deep learning can be attributed to the vast amounts of training data available on the Internet, coupled with significant improvements in computer hardware and software. As a result, deep learning has achieved remarkable success across various fields, including computer vision, speech recognition, and natural language processing.

Figure 1.3: The relationship among three related disciplines.

For a machine learning system to function effectively, raw data must be processed into feature vectors through multiple feature extractors. Traditional machine learning methods cannot learn these extractors automatically and often rely on domain experts to identify useful features, a process known as "feature engineering." As Andrew Ng noted, "Coming up with features is difficult, time consuming, and requires expert knowledge," highlighting that "applied machine learning" fundamentally revolves around feature engineering.

Deep learning, a subset of machine learning, distinguishes itself by requiring few or no hand-crafted features: it generates useful features automatically from the input data. This process creates new representations of the data, so deep learning functions as both a computational model and a representation-learning technique with multiple levels of abstraction. Significantly, the representations learned for one task can be reused for similar tasks, a concept known as "transfer learning."

Supervised learning is the most prevalent approach in both machine learning and deep learning and is applicable across many domains. In this setting, each training instance comprises input data paired with its corresponding label, which is the desired output for that input. In classification tasks, labels denote the class to which each data point belongs, so the number of label values is finite. The training dataset is represented as \( T = \{(x_i, y_i) \mid x_i \in X, y_i \in Y, 1 \le i \le n\} \), where X is the set of input data and Y is the set of labels. To enable learning, a loss function is defined to quantify the discrepancy between predicted and actual labels. Learning then consists of adjusting the model's parameters to minimize this loss function, typically with the back-propagation algorithm, which computes the gradient vector that indicates how the loss changes with the parameters so that they can be updated accordingly.

A deep learning model, specifically a multi-layer neural network, can represent complex non-linear functions, denoted \( h_W(x) \), where x is the input data and W the trainable parameters. As illustrated in Figure 1.4, the example model consists of an input layer with four units (x1, x2, x3, x4), a hidden layer with three units (a1, a2, a3), and an output layer with two units (y1, y2). This architecture is a fully-connected feed-forward neural network: its connections contain no cycles, and each unit in one layer connects to every unit in the subsequent layer. The output of one layer serves as the input of the next.

Figure 1.4: The architecture of a simple feed-forward neural network.

The value of unit j in the k-th layer (for k ≥ 2), given the values of the units in the previous layer (including the bias), is calculated as follows:

\[ a^k_j = g(z^k_j) \]

where \( z^k_j \) is the weighted sum of the previous layer's values and g is the activation function.

In a neural network, the forward-propagation process calculates the output vector of each layer while keeping the parameters fixed. For each unit in the k-th layer, the weights connecting it to the units of the previous layer are applied, followed by a non-linear activation function such as the sigmoid function. The resulting vector of the k-th layer is then passed as input to the next layer, and the process continues until the output layer is reached. The predicted vector for the input data x is finally obtained as \( \hat{y} = h_W(x) \).
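As a rough illustration of forward propagation, the sketch below runs a 4-3-2 fully-connected network like the one described for Figure 1.4. The random weights, zero biases, and the choice of sigmoid in every layer are assumptions made only for the example.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, weights, biases):
    """Propagate x through each layer: a = g(W a_prev + b)."""
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b      # weighted sum of the previous layer (plus bias)
        a = sigmoid(z)     # non-linear activation
    return a


rng = np.random.default_rng(0)
# 4 input units -> 3 hidden units -> 2 output units
weights = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
biases = [np.zeros(3), np.zeros(2)]
x = np.array([1.0, 0.5, -0.5, 2.0])
y_hat = forward(x, weights, biases)
```

Here `y_hat` plays the role of the prediction h_W(x); training would adjust `weights` and `biases`, which this sketch keeps fixed.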

Objectives and Thesis Outline

Although various models exist for machine comprehension tasks, advanced document retrieval models for open-domain question answering (QA) remain underexplored, even though the performance of the retriever is crucial to the system's success. To address this gap, Dhingra et al. introduced the QUASAR dataset, which supports open-domain QA research by requiring relevant documents to be retrieved from a vast corpus based solely on the question. Building on this foundation and on previous research, this thesis focuses on developing an advanced model for document retrieval.

• The thesis proposes a method for learning question-aware self-attentive document encodings that, to the best of our knowledge, is the first to be applied in document retrieval.

• The Reader from DrQA [7] is utilized and combined with the Retriever to form a pipeline system for open-domain QA.

• The system is thoroughly evaluated on the QUASAR-T dataset and achieves performance exceeding that of other state-of-the-art methods.

The structure of the thesis includes:

Chapter 1 introduces Question Answering and focuses on Open-domain Question Answering systems as well as their difficulties and challenges. A brief introduction to deep learning is presented and the objectives of the thesis are stated.

Chapter 2 provides an overview of the background knowledge and relevant research, covering the deep learning techniques employed throughout the study. It also introduces the pairwise learning-to-rank approach and reviews significant related work in the literature.

Chapter 3 describes the proposed Retriever, which consists of four key components: an Embedding Layer, a Question Encoding Layer, a Document Encoding Layer, and a Scoring Function. The Retriever is integrated with the Reader from DrQA to create an open-domain Question Answering (QA) system. The chapter also outlines the training procedures for both the Retriever and the Reader.

Chapter 4 covers the implementation of the models, including the specific hyperparameter configurations. Both the Retriever and the overall system are evaluated on the QUASAR-T dataset. Their performance is then compared with baseline models, including several state-of-the-art ones, to demonstrate the system's effectiveness.

Conclusions: The summary of the thesis and future work.

Deep learning in Natural Language Processing

Distributed Representation

Unlike in computer vision, where models take raw images as numerical tensors, models in natural language processing (NLP) take sequences of words or characters as input. For deep learning models to process such text, a mapping in the first layer must convert words or characters into vector representations.

The embedding look-up mechanism, illustrated in Figure 2.1, uses an embedding matrix whose rows are embedding vectors that can be initialized randomly or learned through representation learning methods. When embeddings are learned on preliminary tasks before being applied to the model, they are referred to as pre-trained embeddings. Depending on the problem, pre-trained embeddings can either remain fixed or be fine-tuned during training. The look-up mechanism operates in the same way for word embeddings and character embeddings; however, the effects of each type of embedding can differ significantly.

Figure 2.1: Embedding look-up mechanism.
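A minimal sketch of the look-up mechanism, assuming a toy four-entry vocabulary, a randomly initialized embedding matrix, and a hypothetical `<unk>` entry for out-of-vocabulary tokens (none of these details come from the thesis):

```python
import numpy as np

# Hypothetical toy vocabulary; index 0 is reserved for unknown tokens.
vocab = {"<unk>": 0, "what": 1, "is": 2, "diamond": 3}
embedding_dim = 5

rng = np.random.default_rng(0)
# One row per vocabulary entry; random here, but could be pre-trained.
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))


def embed(tokens):
    """Map each token to its index, then select the matching matrix rows."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in tokens]
    return embedding_matrix[ids]


vectors = embed(["what", "is", "ruby"])  # "ruby" falls back to <unk>
```

The same code works for character embeddings if the tokens are single characters and the vocabulary is a character set.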

A word embedding is a distributional vector assigned to a word. It can be generated randomly, but random vectors provide no meaningful representation for learning. To support learning, word embeddings should capture the similarities between words, and various methods exist to achieve this.

Word embeddings gained prominence through two key models, continuous bag-of-words (CBOW) and skip-gram [34, 35]. These models are based on the distributional hypothesis, which posits that similar words appear in similar contexts. The CBOW model computes the conditional probability of a target word given its surrounding context words, using a sliding window. For instance, with a window size of 2, the probability \( P(w_i \mid w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}) \) is calculated, where the context words serve as input and the target word as output. The skip-gram model reverses this process, using a single word as input to predict its context words. The purpose of these models is not the word prediction itself but the vector representations produced by the hidden layer after training, which serve as the word embeddings. Because they are efficient to train and use, word embeddings have become essential in natural language processing (NLP) tasks and are a significant factor behind many state-of-the-art results in the field.
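To make the sliding window concrete, the sketch below builds the (input, output) training examples that CBOW and skip-gram would see for a tokenized sentence. The function names are hypothetical, and truncating the window at sentence boundaries is an assumption; it shows only how the examples are formed, not the models themselves.

```python
def cbow_examples(tokens, window=2):
    """For each position, pair the surrounding context words with the target."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        examples.append((left + right, target))
    return examples


def skipgram_examples(tokens, window=2):
    """Reverse of CBOW: each (target, context word) pair is one example."""
    return [(target, c)
            for context, target in cbow_examples(tokens, window)
            for c in context]
```

With a window of 2, the middle word of a five-word sentence gets all four neighbours as context, matching the probability expression above.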

Character embedding focuses on the morphological representation of words rather than only capturing syntactic and semantic information as word embedding does. This approach offers several advantages. In many languages, such as English and Vietnamese, the character vocabulary is much smaller than the word vocabulary, which reduces the number of embedding vectors that need to be learned. Since all words are made up of characters, character embedding also addresses the out-of-vocabulary problem that affects word embedding methods even when large word vocabularies are used. Moreover, combining character embedding with word embedding has shown significant improvements in various applications [10, 13, 32]. Some other methods use only character embedding and still achieve positive results [6, 25].

Long Short-Term Memory network

Natural Language Processing (NLP) typically involves processing input as a stream of tokens, such as sentences or paragraphs. Converting these tokens into embedding vectors yields a list of input features. A traditional fully-connected feed-forward neural network is a poor fit for such input, because each input feature has its own set of parameters, which makes it difficult for the model to capture the position-independent nature of language. For instance, whether the sentence is "I need to find my key" or "My key is what I need to find," the expected answer to the question "What do I need to find?" should consistently be "my key."

Recurrent Neural Networks (RNNs) are designed to handle sequential data by sharing parameters across time steps. This sharing significantly reduces the number of parameters and allows RNNs to process variable-length sequences, such as sentences, including lengths not seen in the training dataset. Consequently, RNNs require less training data while retaining statistical strength that applies to every input feature.

Recurrent Neural Networks (RNNs) can be illustrated through two diagrams. The first shows the network at time step t, with input \( x_t \), output \( h_t \), and a single function \( A \) that processes both the current input and the previous output, accumulating information up to time step t. The second diagram unfolds this representation, displaying all time steps side by side, where each step performs the same computation on a different state. Typically, RNNs operate in one direction, so the state at time t is influenced solely by past states. To let the output \( h_t \) take both past and future information into account, the input sequence can be reversed and fed to a second RNN; combining the outputs of both networks gives a bi-directional recurrent neural network.

Although recurrent neural networks (RNNs) are a natural choice for natural language processing (NLP), the standard RNN is difficult to train because of the vanishing and exploding gradient problems. To address these issues, long short-term memory (LSTM) networks were introduced, incorporating a gating mechanism that keeps gradients flowing over long durations and thereby improves the model's ability to learn from long sequences. [15] proposed a weighted gating mechanism that can be learned rather than fixed.

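In the traditional RNN shown previously, function A is just a simple non-linear function.

Figure 2.3: An LSTM cell, with cell state Ct, hidden state ht, forget gate ft, input gate it, and output gate ot.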

LSTM networks incorporate an internal loop within their cells, as illustrated in Figure 2.3, which allows them to learn long-term dependencies. This design makes them considerably better at handling sequential data than traditional recurrent networks.

The operation illustrated in Figure 2.3 can be expressed through formulas for each time step, covering the input gate \( i_t \), the forget gate \( f_t \), the output gate \( o_t \), and the cell state \( c_t \). The equations show how these gates combine the parameters (U and W) with the current input and the previous hidden state to manage the flow of information. Specifically, \( i_t \) controls the intake of new data, \( f_t \) dictates how much past information is retained, and \( o_t \) governs the output at each time step. The robustness of LSTM networks has led to their successful application in various fields, including machine translation, speech recognition, and image caption generation.
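The gate interactions can be sketched as a single LSTM time step. The code below uses the widely used standard formulation, with U applied to the input and W to the previous hidden state; the bias terms, the dictionary layout of the parameters, and the helper `init_params` are assumptions, and the thesis's own equations may differ in such details.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, U, W, b):
    """One LSTM time step; U, W, b are dicts keyed by 'i', 'f', 'o', 'c'."""
    i_t = sigmoid(U["i"] @ x_t + W["i"] @ h_prev + b["i"])   # input gate
    f_t = sigmoid(U["f"] @ x_t + W["f"] @ h_prev + b["f"])   # forget gate
    o_t = sigmoid(U["o"] @ x_t + W["o"] @ h_prev + b["o"])   # output gate
    c_tilde = np.tanh(U["c"] @ x_t + W["c"] @ h_prev + b["c"])  # candidate
    c_t = f_t * c_prev + i_t * c_tilde   # keep part of the past, add new data
    h_t = o_t * np.tanh(c_t)             # exposed hidden state
    return h_t, c_t


def init_params(input_size, hidden_size, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = np.random.default_rng(seed)
    U = {g: rng.normal(scale=0.1, size=(hidden_size, input_size)) for g in "ifoc"}
    W = {g: rng.normal(scale=0.1, size=(hidden_size, hidden_size)) for g in "ifoc"}
    b = {g: np.zeros(hidden_size) for g in "ifoc"}
    return U, W, b
```

Processing a sentence amounts to calling `lstm_step` once per token embedding, carrying `h_t` and `c_t` forward from one call to the next.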

Attention Mechanism

Attention in humans lets us process information effectively by focusing on what is important and filtering out irrelevant details, since our brains have limited processing power compared with the vast amount of information available. The attention mechanism in machine learning differs from human cognition, but it shares the fundamental idea of concentrating on specific parts of the input data.

Recurrent Neural Networks (RNNs) are designed to handle sequential data, but when input sequences become excessively long, the accumulated information saturates and the resulting representation becomes ineffective or misleading. This problem persists even in more advanced models such as Long Short-Term Memory (LSTM) networks.

The encoder-decoder architecture is widely used in machine translation and text summarization, where the encoder compresses the input into a meaningful intermediate representation for the decoder. Encoding long input sequences in this way is difficult. The attention mechanism addresses this by allowing the model to focus on the parts of the input sequence that are most relevant to each part of the output, rather than considering the entire input at once.

The attention mechanism was introduced by [3] for the machine translation task.

The authors propose an extension to the traditional encoder-decoder architecture that lets the model automatically (soft-)search for the parts of the input sequence relevant to each output. This concept is illustrated in Figure 2.4. They define the conditional probability of each output as \( p(y_t \mid y_1, \dots, y_{t-1}, x) = g(y_{t-1}, s_t, c_t) \), where \( s_t \) is the hidden state of the decoder at time step t, defined by \( s_t = f(s_{t-1}, y_{t-1}, c_t) \).

Figure 2.4: Attention mechanism in the encoder-decoder architecture.

The context vector \( c_t \) at time step t is calculated as the weighted sum of all the hidden states in the encoder:

\[ c_t = \sum_{j=1}^{n} \alpha_{tj} h_j \quad (2.9) \]

The weight of the hidden state \( h_j \) at time step \( t \) is given by \( \alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{n} \exp(e_{tk})} \), where \( e_{tj} = A(s_{t-1}, h_j) \). Here, \( A \) is the alignment model, implemented as a feed-forward neural network. With this mechanism the encoder no longer has to compress all input information into a single fixed-length vector, and the decoder can focus on the most relevant parts of the input at each generation step.
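The computation of the attention weights and the context vector can be sketched as follows. The alignment model A is only described as a feed-forward network, so the single-hidden-layer form used here, together with the parameter names `W_s`, `W_h`, and `v`, is an assumption for illustration.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()


def alignment(s_prev, h_j, W_s, W_h, v):
    """Hypothetical alignment model A(s_{t-1}, h_j): one tanh hidden layer."""
    return float(v @ np.tanh(W_s @ s_prev + W_h @ h_j))


def context_vector(s_prev, H, W_s, W_h, v):
    """H holds one encoder hidden state per row; returns (c_t, attention weights)."""
    scores = np.array([alignment(s_prev, h_j, W_s, W_h, v) for h_j in H])
    alpha = softmax(scores)   # weights over encoder positions, summing to 1
    return alpha @ H, alpha   # weighted sum of the encoder hidden states
```

The decoder would call `context_vector` once per output step, each time with its latest state `s_prev`, so each output can attend to a different part of the input.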

The attention mechanism introduced in [3] has become a foundational technique. While research in this area continues to evolve [50], its effectiveness has led to widespread application beyond natural language processing (NLP), including in computer vision [49].

The self-attention mechanism proposed in [30] is a variation of the attention framework that aims to improve sentence embeddings and make them more interpretable. To create a fixed-size vector from variable-length input sentences, conventional methods rely on the final hidden states of RNNs or use pooling techniques. However, RNNs struggle to maintain semantic information over many time steps. To address this limitation, the authors introduce a self-attention mechanism that automatically identifies the semantically significant parts of the input sentence and encodes them into its embedding. First, an RNN is applied to the input sequence to generate a series of hidden states.

The hidden states are collected in a matrix \( H \in \mathbb{R}^{u \times n} \), where \( u \) is the size of each hidden state and \( n \) is the length of the sequence. The attention weights over the columns of \( H \) are computed as \( \alpha = \text{softmax}(w^T \tanh(WH)) \), where \( W \in \mathbb{R}^{r \times u} \) is a weight matrix, \( w \in \mathbb{R}^{r} \) is a weight vector, and \( r \) is a hyperparameter.

The embedding that represents an input sentence is the weighted sum of the hidden states:

\[ e = \sum_{i=1}^{n} \alpha_i h_i \quad (2.14) \]
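A small sketch of this self-attentive embedding, under the assumption that the hidden states are the columns of H (shape u × n) and that the weight vector w is stored as a 1-D array of length r; the function name is hypothetical.

```python
import numpy as np


def self_attentive_embedding(H, W, w):
    """H: (u, n) hidden states, W: (r, u), w: (r,). Returns (embedding, weights)."""
    scores = w @ np.tanh(W @ H)           # one score per time step, shape (n,)
    scores = scores - scores.max()        # stabilise the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    embedding = H @ alpha                 # weighted sum of the columns of H
    return embedding, alpha
```

The embedding has length u regardless of the sequence length n, which is what makes it usable as a fixed-size representation of a sentence or document.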

The proposed self-attention mechanism was evaluated on three distinct tasks and proved effective on each. The technique is also applicable to longer inputs, including entire paragraphs and documents.

Employed Deep learning techniques

Rectified Linear Unit activation function

Deep learning's strength lies in its ability to model complex non-linear functions, which it owes to non-linear activation functions. Without them, even the deepest neural network would merely perform a linear transformation from input to output, which would severely limit its usefulness.

Figure 2.5: The Rectified Linear Unit function.

The Rectified Linear Unit (ReLU) is a widely used activation function known for its simplicity and fast computation. Unlike the sigmoid function, which saturates large values to 1, ReLU is unbounded for positive inputs, so large inputs are not squashed. This characteristic makes ReLU a preferred choice for constructing neural networks. The ReLU function is defined as y = max(0, x) (2.15).

The graph of Equation 2.15 is shown in Figure 2.5. A notable variant of ReLU is the Noisy Rectified Linear Unit (NReLU), defined by the formula y = max(0, x + N(0, σ(x))), where N(0, σ(x)) is Gaussian noise with a mean of zero and a variance of σ(x).
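Both functions are short enough to sketch directly. For the noisy variant, the text does not define σ(x); taking it to be the logistic sigmoid is an assumption made here for the example.

```python
import numpy as np


def relu(x):
    """y = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)


def noisy_relu(x, rng):
    """y = max(0, x + N(0, sigma(x))); sigma assumed to be the logistic sigmoid."""
    variance = 1.0 / (1.0 + np.exp(-x))
    noise = rng.normal(loc=0.0, scale=np.sqrt(variance))  # scale is a std dev
    return np.maximum(0.0, x + noise)
```

Note that `rng.normal` expects a standard deviation, hence the square root of the variance.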

Mini-batch gradient descent

Model training is a systematic procedure for adjusting the parameters so as to minimize the loss function; in other words, it is an optimization problem. Back-propagation rests on two mechanisms: gradient descent and the chain rule. Let E(W) denote the loss function, where W represents the real-valued parameters. The gradient descent algorithm can then be stated as follows:

1. Initialize the parameters \( W_0 \).

2. Update W until the loss value, E(W), is acceptable, using the following equation:

\[ W_k = W_{k-1} - \eta \nabla_W E(W) \quad (2.17) \]

where k is the iteration index, k ≥ 1; the gradient is evaluated at \( W_{k-1} \); and η, which is a hyperparameter, is a scalar that determines the step size (or learning speed) of each update.

Batch gradient descent uses the entire training dataset for each update of the model weights (W), which gives a stable direction towards optimal points. However, it is computationally expensive and impractical for the large datasets that are common today, especially when memory is limited. In contrast, stochastic gradient descent (SGD) updates W using a single training example at a time, which makes each step much cheaper. SGD follows a noisier path and requires more iterations, but it is well suited to large datasets and to models that change over time, such as those used in online learning, since it does not require loading the entire dataset at once.

Mini-batch gradient descent combines elements of batch gradient descent and stochastic gradient descent (SGD) by splitting the training data into small batches, each containing more than one data point but far fewer than the whole dataset. An epoch is one complete pass through all the batches. As with SGD, the data are shuffled before training. Each batch is used for a single update step, which makes mini-batch gradient descent faster than batch gradient descent while converging more stably than SGD.
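A minimal sketch of mini-batch gradient descent on a toy linear regression problem with a mean squared error loss. The problem, the learning rate, the batch size, and the choice to reshuffle at the start of every epoch are all assumptions made for illustration.

```python
import numpy as np


def minibatch_gd(X, y, lr=0.1, batch_size=16, epochs=50, seed=0):
    """Fit w in y ≈ X w by mini-batch gradient descent on the squared error."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))          # shuffle before each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ error / len(idx)   # gradient on this batch
            w -= lr * grad                       # one update step per batch
    return w


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w                                   # noiseless toy targets
w_hat = minibatch_gd(X, y)
```

Setting `batch_size=len(X)` recovers batch gradient descent, and `batch_size=1` recovers SGD.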

Adaptive Moment Estimation optimizer

Adaptive Moment Estimation (Adam) is a variant of gradient descent known for its simplicity and efficiency. It requires little memory, uses only first-order derivatives, and scales well to problems with large amounts of data and many parameters. Unlike plain gradient descent, which uses a fixed learning rate, Adam adapts the learning rate for each parameter. The algorithm is detailed in Algorithm 2.1.

Input: step size α; exponential decay rates for the moment estimates β1, β2 ∈ [0, 1); stochastic objective function f(θ); initial parameter vector θ0

The hyperparameters α, β1, and β2 are commonly set to the default values 0.001, 0.9, and 0.999, respectively, together with a small constant of 10^-8 for numerical stability; these defaults are recommended for a wide range of machine learning tasks. In practice, Adam often performs well compared with other optimization methods.
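The update rule can be sketched as below, following the standard published form of Adam with the inputs and default values listed above. The fixed step count, the function signature, and the toy quadratic objective are assumptions for the example; `grad_fn` stands for the gradient of the objective f(θ).

```python
import numpy as np


def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, steps=1000):
    """Minimise an objective given its gradient function, using Adam updates."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)   # first moment estimate (mean of gradients)
    v = np.zeros_like(theta)   # second moment estimate (mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias correction for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter step
    return theta


# Toy objective f(theta) = (theta - 3)^2, whose gradient is 2 (theta - 3).
theta_min = adam(lambda th: 2.0 * (th - 3.0), theta0=[0.0], alpha=0.1, steps=2000)
```

The division by `np.sqrt(v_hat)` is what gives each parameter its own effective learning rate.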

Dropout

When training data is limited, a neural network with too many layers and parameters may simply memorize the training instances, which yields high accuracy on the training dataset. However, this leads to poor performance on new examples, a phenomenon known as "overfitting." An overfitted model is more sensitive to noise and generalizes poorly.

Dropout is an effective technique for reducing overfitting in neural networks. It works by randomly and temporarily disabling certain units and their connections during training, which results in a new, thinned network with fewer connections. A network with n units can be thinned in \( 2^n \) different ways. However, all the parameters are shared among these networks. According to [43], training a neural network with Dropout is equivalent to training \( 2^n \) smaller networks, where not every network is guaranteed to be trained.

Dropout reduces overfitting by diminishing the co-adaptation of units: each unit is encouraged to learn useful features on its own while still working with a randomly selected set of other units. The method has been successful across various applications, including object classification, speech recognition, and biomedical data analysis, often leading to significant improvements and state-of-the-art performance.
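A minimal sketch of applying Dropout to a layer's activations. It uses the "inverted" variant, which rescales the surviving units during training so that nothing needs to change at test time; this variant and the `keep_prob` parameter name are assumptions, not necessarily the formulation in [43].

```python
import numpy as np


def dropout(activations, keep_prob, rng, training=True):
    """Randomly zero units with probability 1 - keep_prob during training."""
    if not training:
        return activations                      # all units are used at test time
    mask = rng.random(activations.shape) < keep_prob   # True = unit is kept
    return activations * mask / keep_prob       # rescale to preserve the expectation
```

Each call samples a fresh mask, so successive training steps effectively run different thinned networks that share the same parameters.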

Early Stopping

In addition to Dropout, early stopping is an effective technique for preventing overfitting. Overfitting can be spotted by plotting the training and validation losses over time: it sets in once the validation loss has passed its minimum and starts to rise while the training loss continues to decline.

The early stopping strategy monitors model performance during training, keeps a copy of the best model parameters seen so far, and reverts to them when training shows no improvement over a specified period. This approach is formally outlined in Algorithm 2.2.

Early stopping is an effective regularization technique with several advantages over other methods. It is simple to implement, as Algorithm 2.2 shows, and it does not require changes to the training process, unlike techniques that modify the objective function. Early stopping can therefore be combined with other regularization and optimization approaches.

Algorithm 2.2: The early stopping algorithm [17].

Input: the number of training steps before evaluation, n; the number of times we are willing to observe a worse validation error before giving up, p; the initial parameters θ0

Output: best parameters θ*, best number of training steps i*
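The inputs and outputs above suggest a loop like the following sketch. The callables `train_n_steps` and `validation_error` are hypothetical placeholders for the training and evaluation code, and `patience` stands for the number of worse evaluations tolerated before giving up.

```python
import copy


def early_stopping(theta0, train_n_steps, validation_error, n, patience):
    """Train in chunks of n steps; stop after `patience` evaluations without improvement."""
    theta = theta0
    best_theta, best_steps = copy.deepcopy(theta0), 0
    best_error = float("inf")
    steps, bad_evals = 0, 0
    while bad_evals < patience:
        theta = train_n_steps(theta, n)      # run n more training steps
        steps += n
        error = validation_error(theta)
        if error < best_error:               # improvement: remember this model
            best_error = error
            best_theta, best_steps = copy.deepcopy(theta), steps
            bad_evals = 0
        else:                                # no improvement: use up patience
            bad_evals += 1
    return best_theta, best_steps
```

The function returns the remembered best parameters rather than the last ones, which is the "reverting" step described above.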

Pairwise Learning to Rank approach

Information Retrieval (IR) encompasses various ranking problems, including document retrieval and sentiment categorization, which makes ranking methods essential to the field. Researchers have explored numerous algorithms over the decades, leading to the emergence of a significant topic known as "learning to rank." This approach uses machine learning to build and train ranking models that sort instances by criteria such as relevance or importance. In document retrieval, a typical solution converts queries and documents into feature vectors, applies a similarity metric, and then sorts the documents by their scores. Documents and queries can take various forms, including text, images, audio, and web pages, as long as they can be represented as vectors.

There are three main approaches to learning to rank: pointwise, pairwise, and listwise, each characterized by distinct input/output spaces and objective functions. The pairwise approach is the most widely used and is described in greater detail below.

In pairwise methods, the model takes two documents at a time during training and outputs a score for each. The preferred order is determined by the defined metric, typically that a higher score is better; the preferred document is labeled positive and the other negative. The ranking model is expressed as a scoring function f(q, d), where q and d are the embeddings of the query and of a document. The model aims to ensure that the score of the positive document (d+) exceeds that of the negative document (d−), so each training input is a tuple (q, d+, d−). This objective is expressed through the margin ranking loss function.

In the dataset D, the expression (q,d + ,d −) is utilized to establish a scoring system, where the margin value α ensures a minimum score difference between positive and negative documents This approach enables the model to effectively learn to distinguish between positive and negative documents by a margin of at least α.
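As a small illustration, the margin ranking loss can be sketched in its usual hinge form, max(0, α − f(q,d+) + f(q,d−)), summed over training tuples. Summing rather than averaging is an assumption of this sketch, and the scores are passed in as plain numbers instead of being produced by a model.

```python
def margin_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Sum over pairs of max(0, margin - f(q, d+) + f(q, d-)).

    pos_scores[i] and neg_scores[i] are the scores of the positive and the
    negative document of the i-th training tuple (assumed precomputed).
    """
    return sum(max(0.0, margin - pos + neg)
               for pos, neg in zip(pos_scores, neg_scores))
```

A pair whose positive score already exceeds the negative score by at least the margin contributes zero; every other pair contributes the amount by which it violates the margin.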

The pairwise learning to rank approach is used in various fields, including natural language processing for tasks such as question answering, as well as computer vision, particularly face verification. In face verification, the triplet loss function is employed instead of the margin ranking loss function, providing a similar yet distinct way to optimize the model.

Here, \( g(\cdot) \) is the embedding function that maps an anchor image \( x_i^a \), a positive image \( x_i^p \), and a negative image \( x_i^n \) into a common vector space. Although the triplet loss and the margin ranking loss have different formulas, they pursue the same goal: the triplet loss minimizes the Euclidean distance between the anchor and the positive image while maximizing the distance to the negative image, so a smaller distance signifies a higher preference for the object. \( N \) is the total number of training instances and \( \alpha \) is the margin value. The function thus models both the embedding process and the scoring metric.

When implementing pairwise ranking, it is crucial to choose the right training instances to keep training efficient. The loss function drives the model to learn the condition \( f(q, d^+) > f(q, d^-) \). If a training example \( (q, d^+, d^-) \) already meets this criterion, it does not contribute to model improvement and only slows down training. To speed up training, it is advisable to select only those instances that can actually influence learning, defined as \( T = \{(q, d^+, d^-) \mid f(q, d^+) - \alpha < f(q, d^-)\} \).
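The selection rule for the set T can be sketched as a simple filter. The `score` argument stands for the current scoring function f and is a hypothetical callable here.

```python
def informative_triplets(triplets, score, margin):
    """Keep the triplets (q, d_pos, d_neg) that still violate the margin,
    i.e. score(q, d_pos) - margin < score(q, d_neg)."""
    return [(q, d_pos, d_neg)
            for q, d_pos, d_neg in triplets
            if score(q, d_pos) - margin < score(q, d_neg)]
```

Because the filter depends on the current scores, the selected set changes as the model is updated and would have to be recomputed during training.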

Related work

Open-domain question answering (QA) differs from closed-domain QA in that it is not limited to a specific area and does not require manually built knowledge bases. Instead, it seeks to answer a wide range of questions using extensive world knowledge drawn from large corpora such as Wikipedia. Numerous datasets, including SQuAD, WikiReading, and the recently introduced QUASAR dataset, have been developed to support the advancement of open-domain QA systems. Among these, SQuAD is the most prominent, featuring over 100,000 questions sourced from Wikipedia.

To develop models capable of understanding and reasoning over open-domain questions, the SQuAD dataset is often used, as it provides a context document for each question that is guaranteed to contain the answer. However, SQuAD alone is insufficient for building a complete open-domain QA system, which requires both a document retriever and a machine reader. Recognizing this gap, Dhingra et al. introduced the QUASAR dataset, which comprises two sub-datasets targeting different question-answering styles. The QUASAR-S dataset features over 37,000 fill-in-the-gap queries sourced from Stack Overflow, which makes it a closed-domain dataset. In contrast, the QUASAR-T dataset contains approximately 43,000 open-domain trivia questions from various sources and supports both document retrieval and reading by providing associated documents for each question-answer pair.

The document retriever can be trained to rank these documents and return only the highest-scored ones.

The advancement of deep learning, particularly the attention mechanism, has significantly improved machine reading comprehension. Wang et al. introduced Gated Attention-based recurrent networks, which extract key evidence from documents through a self-matching layer and achieved top results on the SQuAD leaderboard. Similarly, Dhingra et al. used a bi-directional Gated Recurrent Unit to encode questions and documents, incorporating a Gated-Attention module in their multi-hop architecture. Cui et al. presented the Attention-over-Attention (AoA) reader, which emphasizes the interactive information between queries and documents and outperformed existing systems. Furthermore, Seo et al. developed the Bi-directional Attention Flow (BiDAF) network, a hierarchical architecture that integrates various levels of document representation and produced state-of-the-art results on the SQuAD dataset.

In addition to methods focused solely on machine comprehension, there are complete open-domain QA systems that integrate both document retrieval and machine reading. A prominent example is DrQA, whose reader includes a paragraph encoding layer using a multi-layer bi-directional long short-term memory (BiLSTM) network on a selected feature set, alongside a question encoding layer that generates a single vector representation of the question. DrQA employs two independently trained classifiers to predict the answer span boundaries. To speed up retrieval, it uses a TF-IDF weighted bag-of-words technique for document selection, although this approach constrains retrieval performance and leaves room for improvement. This thesis builds on DrQA's reader for the machine reading module while proposing an enhanced document retrieval method; DrQA's reader is discussed in detail in Chapter 3.

The R³ system integrates document retrieval and machine comprehension into a single model that is trained jointly, unlike most open-domain QA systems, which treat these tasks separately. Its authors emphasize that the overall system's performance relies heavily on the document retriever: a poor retriever prevents the reader from extracting correct answers. R³, or Reinforced Ranker-Reader, employs a Ranker for document retrieval and a Reader for comprehension, and uses reinforcement learning to connect them. The Ranker, trained with reinforcement learning, provides a probability distribution over documents based on the Reader's performance on the top-ranked results, establishing a feedback loop that improves both components. This approach surpasses traditional ranking methods such as TF-IDF, while the Reader is optimized with gradient descent to identify answer spans in the documents. R³ achieves state-of-the-art performance in both document retrieval and machine comprehension.

In the open-domain QA setting, reading comprehension models, despite their high accuracy, rely significantly on document retrieval to obtain relevant information. For instance, the Gated-Attention (GA) model achieves a reading accuracy of 60% on the QUASAR-T test set; however, once the performance of the document retriever is factored in, the overall accuracy drops to 26.4%. Consequently, there is a growing emphasis on improving the document retrieval process to raise overall QA performance.

An open-domain QA system typically consists of two main components: a Document Retriever for document retrieval and a Document Reader for machine comprehension. Our system follows this framework, with an emphasis on the Document Retriever. Specifically, the Document Retriever is an end-to-end deep learning model that can be subdivided into four components.

The proposed system consists of an Embedding Layer that transforms words from questions and documents into a vector space, followed by a Question Encoding Layer and a Document Encoding Layer that generate the final representations of questions and documents. Additionally, it features a neural Scoring Function designed to measure the similarity between two fixed-size vectors. To apply our Document Retriever in an open-domain QA setting, we use the Document Reader from DrQA to extract answers from the retrieved documents.

Document Retriever

Embedding Layer

An embedding layer (EL) is the foundational component of deep learning models for various NLP tasks: it assigns distributional vectors to the tokens of the input sequence for further processing. This layer represents the first level of abstraction in question/document representation learning. Our approach uses both token-level and character-level embeddings to capture the semantic and morphological aspects of words. While it is not mandatory to employ both types, their combined use has become a best practice because they compensate for each other's limitations. At this low level, there is no linguistic distinction between questions and documents, so the same parameters are used for both to enhance representation power.

Before converting tokens into vectors, it is essential to pre-process the data by extracting all tokens from the raw documents. Although this task seems straightforward, achieving fully accurate extraction is not trivial.

In written texts, tokens often include ambiguous characters, and identifying token boundaries requires a strong understanding of the language. To address this issue, non-word and non-number characters are removed, and the text is converted to lowercase. If a document contains a URL, it is treated as a lengthy, non-informative token. To streamline the text, a template matching method is used to locate and replace all URLs with the term "url." At this stage, the document remains a single string of text. We use the best English tokenizer model from spaCy¹ to obtain a list of tokens from a document.

Figure 3.2: The architecture of the Embedding Layer.
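The pre-processing steps described above can be sketched with regular expressions. This is an illustrative approximation only: the URL pattern is an assumed template, removed characters are replaced by spaces (an assumption, so that neighbouring words are not glued together), and a plain whitespace split stands in for the spaCy tokenizer.

```python
import re

# Assumed URL template; the thesis does not specify the exact pattern.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def preprocess(text):
    """Replace URLs with 'url', lowercase, and drop non-word/non-number characters."""
    text = URL_PATTERN.sub(" url ", text)       # handle URLs before stripping punctuation
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # keep only letters and digits
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace


def tokenize(text):
    """Whitespace split; a stand-in for the spaCy tokenizer used in the thesis."""
    return preprocess(text).split()
```

URLs are replaced first because stripping punctuation beforehand would break them into several meaningless fragments.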

Token embedding maps each token to its corresponding embedding using a look-up table. We use pre-trained English word vectors from fastText, which are trained with the Continuous Bag of Words (CBOW) model with position weights. These vectors, trained on extensive datasets such as Common Crawl and Wikipedia, have 300 dimensions and cover a vocabulary of over 2.5 million tokens. We do not fine-tune them during training, because adjusting them on a small dataset could distort the overall structure of the embedding space and degrade the representation of the tokens.

¹ https://spacy.io/models/en#en_core_web_lg

Character embedding addresses the out-of-vocabulary problem of token embeddings, which persists even with a vocabulary of 2.5 million tokens. Unlike token embeddings, character embeddings do not rely on pre-trained models, as the number of characters is far smaller than the number of tokens. After pre-processing, the dataset contains only word and number characters, 36 characters in total. Additionally, since characters lack inherent semantic structure, it is most effective to learn their vector representations directly from the training data.

Let \( V_C \) be the character set. The character embedding matrix \( C \in \mathbb{R}^{|V_C| \times n} \) is initialized randomly with Glorot initialization and subsequently refined as a trainable model parameter. For each token \( t \), a look-up table is used to obtain a sequence of character embeddings \( T = \{c_1, c_2, \ldots, c_{|T|}\} \), where each \( c_i \) is a row of \( C \). A single bi-directional long short-term memory (BiLSTM) layer is then applied to \( T \), resulting in the character-level embedding \( e^c = \overrightarrow{h}_{|T|} \oplus \overleftarrow{h}_{|T|} \), where \( \oplus \) denotes the concatenation of the last hidden states of the forward and backward directions.

The final embedding is computed for a sequence of tokens \( P = \{t_i\}_{i=1}^{|P|} \), which can represent either a question or a document. The output of the embedding layer (EL) is a sequence of embeddings \( E = \{e_i\}_{i=1}^{|P|} \), where each embedding \( e_i \) combines the pre-trained token embedding \( e_i^t \) and the character embedding \( e_i^c \). The resulting embedding matrix \( E \) captures the input information of the entire sequence, whether it is a question or a document, and serves as the input to the subsequent layers of the model.

Question Encoding Layer

This layer generates a fixed-size vector representation for questions, which are typically shorter than documents. Factoid questions, reflecting specific information needs, are usually concise and straightforward. Therefore, it is unnecessary to complicate the question encoding process with complex mechanisms that could lead to overfitting. To create a single vector for a question, we apply a BiLSTM layer to the output of the Embedding Layer, where \( |Q| \) denotes the question's length.

This BiLSTM is used to model the contextual information. The last hidden states of the forward and backward LSTM are concatenated into one vector:

\( h^q = \overrightarrow{h}^q_{|Q|} \oplus \overleftarrow{h}^q_{|Q|} \) (3.5)

Then, two fully-connected layers are placed on top of this vector, with the output of the first one activated by the ReLU function:

\( a^q = \mathrm{ReLU}(W^{(1)} h^q + b^{(1)}) \) (3.6)

\( q = W^{(2)} a^q + b^{(2)} \) (3.7)

where \( W^{(1)} \), \( W^{(2)} \), \( b^{(1)} \), and \( b^{(2)} \) are trainable weights and biases of the model. At this step, we obtain the final encoding, namely \( q \), of the input question.

Document Encoding Layer

First of all, to take advantage of its contextual representation power, we reuse the BiLSTM from the Question Encoding Layer (QEL), along with its parameters, to encode the document embeddings into hidden states \( h_i^d \).

In this layer, all hidden states from the BiLSTM are combined with the question encoding \( q \), rather than just the last hidden states as in QEL. This mirrors the way a human reader keeps the essence of a question in mind while evaluating the relevance of a document. By concatenating \( q \), which encapsulates the overall meaning of the question, with the BiLSTM's hidden states, we create question-aware hidden states \( H \). This improves the model's ability to assess document relevance with respect to a specific question.

The document encoding therefore depends on the question: \( H = \{h_i^d \oplus q\}_{i=1}^{|D|} = \{h_i\}_{i=1}^{|D|} \). Here, \( |D| \) denotes the number of tokens in the document, and \( E^d = \{e_i^d\}_{i=1}^{|D|} \) represents the word embeddings from EL. By integrating the question encoding with the document's hidden states, the model generates different representations of the same document depending on the question posed.

A document is composed of numerous sentences, yet a factoid question can often be answered with just one sentence or a fragment of it. A document thus contains many pieces of information, some relevant to a specific question and others not. To support answer selection, it is essential to encode only the most pertinent information. This is accomplished with a self-attentive network, which uses the question signal to determine the most relevant parts of the document. The attention weights are computed from the combined information of the document and the question, starting from \( a_i^d = \mathrm{ReLU}(W h_i + b) \).

The final document representation \( d \) is a linear transformation of the attention-weighted sum of the BiLSTM hidden states \( H_d \): \( d = Wc + b \), where \( W \) and \( b \) are learnable parameters and \( c = \sum_{i=1}^{|D|} \alpha_i h_{i}^{d} \). Here, \( \alpha \) represents the attention weights, \( h_{i}^{d} \) are the hidden states, and \( l \) denotes the question encoding size. Note that \( d \) has the same dimension as \( q \), so both live in the same vector space.
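The weighted sum c = Σ α_i h_i^d can be illustrated with a short sketch. How the attention weights α are obtained from the activations a_i^d is not spelled out in the text, so this sketch assumes one scalar relevance score per token and a softmax normalization; both are assumptions, not the thesis's confirmed formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_pool(hidden_states, token_scores):
    """Return (alpha, c) with c = sum_i alpha_i * h_i.

    hidden_states: list of equally sized vectors h_i^d, one per document token
    token_scores:  one scalar per token (assumed to be derived from a_i^d)
    """
    alpha = softmax(token_scores)
    dim = len(hidden_states[0])
    c = [sum(a * h[k] for a, h in zip(alpha, hidden_states)) for k in range(dim)]
    return alpha, c
```

Tokens with a high score dominate the pooled vector, which is the intended effect of focusing on the parts of the document most relevant to the question.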

Scoring Function

To assess the relevance between two fixed-size vectors, \( q \) representing the question and \( d \) representing the document, a scoring function is needed. The two most prevalent choices are the (squared) Euclidean distance, \( \|q - d\|^2 \), and the cosine similarity, \( \frac{q \cdot d}{\|q\| \, \|d\|} \).

Euclidean distance measures the proximity of vectors, with smaller values indicating greater relevance, while cosine similarity evaluates the angle between vectors, yielding values between -1 and 1, where values closer to 1 signify higher similarity. Although these measures are useful, their fixed form limits how well they capture interactions between the dimensions of the vectors. To gain flexibility, we let the model learn the scoring function, so that it can adapt to the output vectors of the previous layers.

Our scoring function is a neural network with two feed-forward layers, akin to matching methods that identify relationships between two vectors. This setup allows our Document Retriever to be trained end-to-end, with errors backpropagated through both the scoring function and the encoding layers. As a result, the model learns to measure similarity and to improve the representations of questions and documents at the same time.

The scoring function takes a feature vector that combines the question encoding \( q \), the document encoding \( d \), their element-wise product, and their absolute element-wise difference. These features mimic the calculations of cosine similarity and Euclidean distance, which helps the scoring function converge. The similarity score between the two encodings is then computed from this vector.

Concretely, the scoring network is defined by \( x = q \oplus d \oplus (q \circ d) \oplus |q - d| \), \( a = \mathrm{ReLU}(Wx + b) \), and \( s = w^\top a + b_s \), where \( \circ \) is the Hadamard (element-wise) product and \( \oplus \) is concatenation. Here, \( W \) is a matrix of dimensions \( r \times 4l \), \( w \) and \( b \) are vectors in \( \mathbb{R}^r \), and the output bias \( b_s \) is a scalar; all are trainable parameters of the network. The encoding size is denoted by \( l \) and the number of hidden units by \( r \). The resulting scalar \( s \) serves as the similarity score between question \( q \) and document \( d \).
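A minimal pure-Python sketch of this learned scoring function is given below, assuming the feature vector is the concatenation of q, d, their element-wise product, and their absolute difference. The parameter name `b_out` for the scalar output bias is introduced here for illustration only.

```python
def similarity_score(q, d, W, b, w, b_out):
    """Two-layer feed-forward score for a question/document encoding pair.

    q, d:  encodings of size l
    W, b:  hidden layer weights (r rows of length 4l) and biases (length r)
    w:     output weights (length r)
    b_out: scalar output bias (hypothetical name)
    """
    # Feature vector of size 4l: [q, d, q * d, |q - d|]
    x = (list(q) + list(d)
         + [qi * di for qi, di in zip(q, d)]
         + [abs(qi - di) for qi, di in zip(q, d)])
    # Hidden layer with ReLU activation
    a = [max(0.0, sum(wij * xj for wij, xj in zip(row, x)) + bi)
         for row, bi in zip(W, b)]
    # Scalar similarity score
    return sum(wi * ai for wi, ai in zip(w, a)) + b_out
```

For example, with `q = [1.0, 2.0]` and `d = [3.0, 1.0]` the feature vector is `[1, 2, 3, 1, 3, 2, 2, 1]`, which has 4l = 8 entries, matching the \( r \times 4l \) shape of the hidden weight matrix.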

Training Process

The Scoring Function is the final layer of the network and produces a score for a given question and document during the forward pass. The model is designed to assign higher scores to relevant documents than to irrelevant ones, following the pairwise ranking approach. Each training example is a 3-tuple consisting of a question, a positive document, and a negative document. The Document Encoding Layer is applied in parallel to create encodings for both the positive and the negative document. The Document Retriever is trained by minimizing the margin ranking loss, with \( S \) representing the scoring function.

Without a margin, the loss is positive when the score of the positive document encoding d+ is lower than that of the negative document encoding d−. If d+ scores higher, the loss becomes zero and the model parameters are not updated. To avoid halting the learning process when the scores of the positive and negative documents are equal (S(q, d+) = S(q, d−)), a margin value is introduced. This margin ensures that the model keeps improving as long as the difference between the scores, S(q, d+) − S(q, d−), is less than one.

The entire neural network, from the Embedding Layer to the Scoring Function, is trained through backpropagation and mini-batch gradient descent with the Adam optimizer. To mitigate overfitting, we use Dropout and early stopping. The margin ranking loss serves as a single objective, so the model learns good question and document encodings and optimizes the scoring function at the same time.

To use the margin ranking loss, it is necessary to define which documents count as positive and which as negative, and this depends largely on the training dataset. Since there are no ground-truth labels for the ranking task, we adopt pseudo labels, similar to the methodology in [46]. In this setting, a document is designated as positive if it contains an exact match of the answer span from the machine comprehension task.
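The pseudo-labelling rule can be sketched as a substring test. Matching case-insensitively is an assumption of this sketch; the text only states that an exact match of the answer span is required.

```python
def split_by_answer(documents, answer):
    """Split documents into (positives, negatives) by answer-span match.

    A document is a pseudo-positive if it contains the answer string;
    case-insensitive matching is an assumption made for this sketch.
    """
    needle = answer.lower()
    positives = [doc for doc in documents if needle in doc.lower()]
    negatives = [doc for doc in documents if needle not in doc.lower()]
    return positives, negatives
```

Such labels are noisy: a document can mention the answer string without actually supporting the answer, and a relevant document can express the answer in different words.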

The selection of training examples is vital for an effective training process. Typically, each question has only a few positive documents, while the majority are negative. Generating all possible training instances is impractical and resource-intensive, as many negative documents are easily distinguished from positive ones and do not help the model learn. Therefore, the number of negative examples should be reduced to keep training efficient.

Algorithm 3.1: Pseudocode of the training procedure.

Input: Number of epochs with no improvement before stopping training (patience) p; maximum number of negative documents n.

5 (Training examples generated by randomly selecting n negative documents, each paired with all positive ones.)

7 (Training examples generated by selecting the top-n highest-scored negative documents using the current saved model, each paired with all positive ones.)

9 (Train the model with mini-batch gradient descent.)

10 dev_acc ← (the accuracy on the development set)

11 if dev_acc > best_dev_acc then

14 best_dev_acc ← dev_acc

To improve training, it is important to provide challenging examples by dynamically selecting the highest-scored negative documents. However, this requires scoring all negative documents with the most recent parameters at each training step to identify the top candidates. While this can improve the model's capability, it significantly slows down the training process.

Algorithm 3.1 demonstrates how the training procedure works. Early stopping is based on the accuracy on the development set. To speed up training while still providing challenging examples, we combine two negative sampling techniques: random and top-n. Normally, n random negative documents are selected, so sampling is fast. However, when the model stops improving, the current top-n highest-scored negative documents are used instead. This helps the optimization escape local optima and keep improving, since these are the most difficult, yet most useful, training instances.
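The switch between the two negative sampling techniques can be sketched as follows. The `improving` flag and the `score` callable are hypothetical stand-ins for the development-set check and the currently saved model; the surrounding training loop is omitted.

```python
import random


def sample_negatives(negatives, n, improving, score=None, rng=random):
    """Pick up to n negative documents for one question.

    improving=True : fast random sampling
    improving=False: top-n highest-scored negatives according to `score`
                     (hypothetical callable wrapping the current saved model)
    """
    if improving or score is None:
        return rng.sample(negatives, min(n, len(negatives)))
    return sorted(negatives, key=score, reverse=True)[:n]


def build_examples(question, positives, negatives):
    """Pair every selected negative document with all positive ones."""
    return [(question, pos, neg) for neg in negatives for pos in positives]
```

Random sampling needs no forward pass, whereas top-n sampling scores every negative document, which is why this sketch reserves it for phases in which the model has stopped improving.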

Document Reader

DrQA Reader

The DrQA Reader aims to identify the text span within the returned documents that best answers a given question. It involves three key components: paragraph encoding, question encoding, and prediction.

To make Recurrent Neural Networks (RNNs) more efficient, lengthy documents are segmented into n paragraphs. The paragraph encoding layer processes each paragraph \( p = \{p_1, p_2, \ldots, p_m\} \), a sequence of m tokens, and transforms it into a matrix in which each row corresponds to the encoding of a token. First, the authors create a feature vector \( \tilde{p}_i \) for each token \( p_i \) by combining multiple sources of information:

• The word embedding \( f_e(p_i) \), taken from pre-trained GloVe word embeddings. Most embeddings are kept fixed; only those of the 1000 most common question words, such as "what," "when," and "how," are fine-tuned during training.

• The exact match indicator vector, which contains three binary values signaling whether \( p_i \) appears in the question \( q \) in its original, lowercase, or lemma form: \( f_{em}(p_i) = \{I(p_i \in q), I(\mathrm{lowercase}(p_i) \in q), I(\mathrm{lemma}(p_i) \in q)\} \)

• Other token features, including the part-of-speech (POS) tag, the named entity recognition (NER) tag, and the term frequency (TF) value: \( f_t(p_i) = \{\mathrm{POS}(p_i), \mathrm{NER}(p_i), \mathrm{TF}(p_i)\} \)

• The aligned question embedding for a paragraph token \( p_i \), an attention-weighted sum of the question word embeddings: \( \sum_{j=1}^{l} a_{i,j} f_e(q_j) \). The attention score \( a_{i,j} \) between \( p_i \) and each question word \( q_j \) is \( a_{i,j} = \frac{\exp(\alpha(f_e(p_i)) \cdot \alpha(f_e(q_j)))}{\sum_{k=1}^{l} \exp(\alpha(f_e(p_i)) \cdot \alpha(f_e(q_k)))} \), where \( \alpha(\cdot) \) is a fully-connected feed-forward layer with the ReLU activation function.

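After obtaining the sequence of \( \tilde{p}_i \), a multi-layer bi-directional RNN is applied. The output of the paragraph encoding layer is \( \{\mathbf{p}_1, \ldots, \mathbf{p}_m\} = \mathrm{RNN}(\{\tilde{p}_1, \ldots, \tilde{p}_m\}) \).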

Question encoding generates a single vector representation for the entire question instead of a sequence of vectors, one per token. This is achieved by applying a Recurrent Neural Network (RNN) to the word embeddings of the question: \( \{q_1, \ldots, q_l\} = \mathrm{RNN}(f_e(q_1), \ldots, f_e(q_l)) \). The final question embedding is \( q = \sum_j b_j q_j \), where \( b_j = \frac{\exp(w \cdot q_j)}{\sum_k \exp(w \cdot q_k)} \) indicates the contribution of each word to the overall question vector and \( w \) is a learnable weight vector.

Prediction. At this phase, the authors build two independent classifiers, one for the answer span's start position, which yields \( P_{start}(i) \), and one for its end position, which yields \( P_{end}(i) \).

The final answer prediction across all paragraphs is the sequence of tokens from position \( i \) to position \( j \) such that \( P_{start}(i) \times P_{end}(j) \), with \( i \le j \le i + k \), is maximized, where \( k \) is the maximum allowed answer length.
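The span selection rule can be illustrated with a brute-force search over one paragraph. This is a sketch of the stated criterion only; the probabilities are assumed to be given, and handling several paragraphs would mean repeating the search and keeping the overall best span.

```python
def best_span(p_start, p_end, k):
    """Return ((i, j), score) maximizing p_start[i] * p_end[j] with i <= j <= i + k."""
    best, best_score = (0, 0), -1.0
    for i, start_prob in enumerate(p_start):
        last = min(i + k, len(p_end) - 1)       # do not run past the paragraph
        for j in range(i, last + 1):
            score = start_prob * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score
```

The constraint j ≤ i + k keeps the search at O(m·k) for a paragraph of m tokens instead of O(m²).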

Training Process and Integrated System

In DrQA's training process, the Reader receives a question, an answer, and a list of documents, with the requirement that at least one document contains an exact match of the answer. As a consequence, during inference DrQA always produces an answer, even if the answer is not present in the documents. The preparation of the document list for each training instance is therefore vital: using only positive documents may mislead the model into expecting an answer in every provided document, which does not reflect real-world conditions. Moreover, in the integrated system the Reader's input documents come from the Retriever's output, which does not guarantee that all returned documents are relevant.

To improve the Reader, training should simulate the inference phase by providing the mix of positive and negative documents that the Retriever actually produces. This is accomplished by applying the trained Document Retriever to the training dataset and selecting the top 50 highest-scored documents. Including both positive and negative documents in the training data significantly improves the Reader's overall effectiveness.

Once the Document Retriever and Document Reader are trained, the system operates as a pipeline. When executed, the QASA Retriever ranks all documents in the database for each question, which poses scalability challenges as the database grows. To address this issue, the QASA Retriever can be combined with a faster, simpler retriever module. For instance, filtering out documents that share no words with the question significantly reduces the document pool with minimal loss of accuracy. This preliminary filtering step improves efficiency before the QASA Retriever is applied.
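The word-overlap filter mentioned above can be sketched in a few lines. Lowercasing and whitespace tokenization are assumptions of this sketch, and no stop-word handling is included, so very common words also count as overlap.

```python
def has_overlap(question, document):
    """True if the question and the document share at least one word."""
    question_words = set(question.lower().split())
    document_words = set(document.lower().split())
    return bool(question_words & document_words)


def prefilter(question, documents):
    """Drop documents without any word overlap before the neural retriever runs."""
    return [doc for doc in documents if has_overlap(question, doc)]
```

Only the documents that survive this cheap test would then be passed to the neural retriever for scoring.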

Tools and Environment

The Retriever is developed using Python and TensorFlow, an open-source machine learning platform created by Google. TensorFlow offers a flexible ecosystem of tools and libraries that facilitate advanced research in machine learning. The source code for the QASA Retriever, along with detailed training and usage instructions, is available on GitHub at https://github.com/trangnm58/QASA. The models used in the experiments are trained in the environment described in Table 4.1.

Table 4.1: Environment configuration.
Component	Specification	Quantity
CPU	Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz	2
RAM	16 GB DIMM ECC DDR4 @ 2400MHz	8

Dataset

The Retriever and Reader models are trained on the QUASAR-T dataset, which includes 43,012 factoid questions collected from various sources. Each question is linked to 100 pseudo-documents from the ClueWeb09 dataset, which contains approximately one billion web pages. The documents come in two variants: long ones, limited to 2048 characters, and short ones, capped at 200 characters. These documents were initially filtered by a fast retriever and now require a more advanced model for re-ranking. Answers to the questions are free-form text spans, which may not necessarily appear within the documents, posing challenges for both the ranking and reading models. An example of a question, its answer, and associated pseudo-documents is shown in Figure 4.1.

Question: Lockjaw is another name for which disease

Tetanus, commonly known for causing muscle spasms in the jaw, is often referred to as "lockjaw." This condition arises as the infection advances, leading to tight and rigid jaw muscles that prevent individuals from opening their mouths.

Tetanus , commonly called lockjaw , is a bacterial disease that affects the nervous system

Figure 4.1: Example of a question with its corresponding answer and contexts from QUASAR-T.

The QUASAR-T dataset statistics are detailed in Table 4.2. As highlighted in section 3.1.5, the dataset lacks the ground-truth labels needed for training the Retriever. Consequently, for each question it is important to check whether any document in the provided list is relevant.

Among the 100 pseudo-documents of a question, those containing the exact answer are classified as positive, while the others are deemed negative. Notably, there are instances where none of the associated documents is positive, so the Retriever can only return negative or unrelated documents.

Such instances are categorized as invalid in Table 4.2, where "Valid" denotes the instances whose ground-truth answer appears in at least one pseudo-document. This establishes the upper bound for the performance of both the retriever and the reader, calculated as the ratio of valid instances to total instances. Notably, the upper bound for the test set is 77.37%.

The authors of [12] assess the quality of the QUASAR-T dataset with various methods, from basic heuristics to advanced deep neural networks, and also with human testers. Their findings reveal that the top-performing model, BiDAF [41], reaches an accuracy of 28.5%, whereas human testers achieve 60.6%. Human performance therefore still falls 16.77 percentage points short of the upper bound calculated above, indicating the dataset's significant level of difficulty.

QUASAR-T is an open-domain QA dataset that covers a wide range of topics, including music, science, and food. While a complete categorization of the dataset was not provided, annotators labeled 144 randomly selected questions from the development set with 214 question genres and 122 entity types for answers. The distribution of these question genres and answer entity types is illustrated in Figure 4.2.

Figure 4.2: Distribution of question genres (left) and answer entity-types (right).

Experiments

Title	Advanced Deep Learning Methods and Applications in Open-Domain Question Answering
Author	Nguyen Minh Trang
Supervisors	Assoc. Prof. Ha Quang Thuy, Ph.D. Nguyen Ba Dat
University	Vietnam National University, Hanoi University of Engineering and Technology
Major	Computer Science
Type	Master thesis
Year of publication	2019
City	Ha Noi

Format
Pages	67
File size	1.3 MB