The model presented in Chapter 3 and the results presented in Chapter 4 were previously published in the Proceedings of NLLP 2023 as “Joint Learning for Legal Text Retrieval and Textual En
Challenges in legal AI with NLP
Legal document retrieval (LDR)
The legal document retrieval problem is a vital step that should be performed before any other legal NLP task. In this task, relevancy is the main property considered. The input of the LDR problem includes two components:
• $q$: a legal query written in natural language, which describes a legal issue or a real-life situation.
• $D = \{d_1, d_2, d_3, \ldots, d_n\}$: a legal corpus which contains $n$ legal articles.
The task of the retrieval model is to find a subset of articles $D_q \subseteq D$ where the content of all articles is related to the legal issue mentioned in the query.
To solve this problem, a familiar method is to transform the retrieval problem into a document ranking problem. The input of the ranking model is a pair consisting of a query and an article $(q, d_i)$, and the output is the relevance score of that pair, normalized to a range from 0 to 1. Let the ranking model be the function $f_{retrieval}$; then the input and output of this function are denoted as in equation 1.1.

$f_{retrieval}(q, d_i) = R_i \in [0, 1]$  (1.1)

where $R_i$ is the relevance score and has a value between 0 and 1. The set of relevant articles is selected based on the relevance score of each article, using an appropriate threshold or top-k selection technique.
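As a minimal sketch of this selection step, the following Python snippet picks relevant articles from precomputed relevance scores using either a threshold or top-k selection. The function name `select_relevant`, the article identifiers, and the score values are illustrative assumptions and are not taken from the proposed system.

```python
from typing import Dict, List, Optional


def select_relevant(
    scores: Dict[str, float],
    threshold: Optional[float] = None,
    top_k: Optional[int] = None,
) -> List[str]:
    """Return article ids ordered by decreasing relevance score.

    `scores` maps a (hypothetical) article id to its relevance score in [0, 1].
    If `threshold` is given, only articles scoring at least that value are kept.
    If `top_k` is given, at most that many articles are returned.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(article, s) for article, s in ranked if s >= threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return [article for article, _ in ranked]


# Illustrative scores only; a real system would obtain them from the ranking model.
scores = {"article_1": 0.91, "article_2": 0.15, "article_3": 0.67}
by_threshold = select_relevant(scores, threshold=0.5)
by_top_k = select_relevant(scores, top_k=1)
```

The two strategies can also be combined, for example keeping at most k articles among those above the threshold.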
Legal textual entailment (LTE)
Based on the retrieved documents, LTE is the following task, which determines the “affirmation” of the query. This is equivalent to answering the question of whether the content of the query entails or aligns with the content of the related legal documents. The input of the LTE problem includes:
• $q$: a legal query written in natural language, which describes a legal issue or a real-life situation.
• $D_q = \{d_{i_1}, d_{i_2}, \ldots, d_{i_k}\}$: a set of documents related to the input query $q$.
The output of the problem is one of two labels, negative (represented as 0) and positive (represented as 1). The negative label means that the content of the query $q$ does not entail the content mentioned in the related documents. Conversely, the positive label means that the meaning of the query $q$ entails the content of the document set $D_q$. Let $f_{entailment}$ be the target model; then its input and output are denoted as in equation 1.2.

$f_{entailment}(q, D_q = \{d_{i_1}, d_{i_2}, \ldots, d_{i_k}\}) = E_q$  (1.2)

where $E_q \in \{0, 1\}$ represents the negative or positive label, respectively.
Main contribution
This thesis's contributions focus on improving the retrieval results of the LDR problem. The first main contribution uses the close relationship between the relevancy and affirmation properties to propose a new problem and a retrieval system that employs a multi-task model. The second contribution is a retrieval system with an additional re-ranking phase that utilizes the power of LLMs through prompting. To conclude, the three main contributions of the thesis can be summarized as follows:
• To address the LDR problem, a multi-task model based on the BERT architecture was developed to leverage the relationship between relevancy and affirmation. The retrieval results of the proposed system outperformed the best team participating in the COLIEE 2022 and 2023 competitions. This result emphasizes the effectiveness of tackling the legal retrieval problem through the legal relevancy-affirmation (LRA) problem. This improvement demonstrated the supportive relationship between the relevancy and affirmation properties.
• Addressing the problem with a multi-task approach allows the model to be trained more effectively and to achieve significantly higher results without increasing the number of model parameters. Ablation studies have been conducted to show the impact of the multi-task setting and the difference in retrieval results between single-task and multi-task models.
• Finally, to exploit the basic logical reasoning ability of LLMs on complicated, scenario-based queries, a re-ranking phase utilizing LLMs through prompting is proposed as the last phase of the retrieval system. The experimental results show that the LLM re-ranking phase significantly improves the retrieval results.
Thesis structure
This thesis comprises several chapters. Chapter 1 provides an introduction, outlining its scope, motivation, and the specific problems addressed. Chapter 2 presents related studies and foundational knowledge on lexical ranking algorithms, language models (BERT, Multilingual-BERT, Mono-T5), and the Cross-Entropy loss function. Chapter 3 explores the relationship between legal document retrieval and textual entailment, introduces a ranking model combining BM25 and BERT, and proposes a re-ranking approach using LLMs. Chapter 4 evaluates the effectiveness of these models on the COLIEE dataset and describes the experimental setups. The ablation studies chapter presents ablation experiments that analyze and highlight the role of the multi-task setting. The conclusion chapter points out the conclusions drawn from the experimental process.
Chapter 2: Related work and background knowledge
In this chapter, previous research related to legal information retrieval is described to give an overall picture of the development of this research field. Retrieval techniques range from simple methods based on lexical features to those involving semantic features and logical inference from the content of the query.
Additionally, the foundational models and techniques used in this thesis will be presented to provide a more detailed understanding of the methods employed.
Related work on legal information retrieval
Legal information retrieval is an important problem in juris-informatics. The methodologies have evolved considerably over time, with methods ranging from simple to complex, and from older to newer approaches [17, 19, 25, 39]. Early studies focused on extracting relevant legal information using information retrieval techniques.
Information Retrieval (IR)-based approaches have been utilized to extract relevant legal documents based on specific situations, such as the use of topic keywords by Gao et al. (2019) [10]. However, this method relies heavily on extracted keywords, which may not fully capture the complexity of legal texts. Kim et al. (2019) proposed a method based on the assumption that an article related to a query would likely generate the query itself [13]. This approach incorporates statistical language models and TF-IDF scores to create a ranking model that achieved a superior F-measure at Task 3 of COLIEE 2019.
As research shifted toward deep learning architectures, attention-based models such as BERT-PLI were proposed to achieve better representations of legal texts.
To mitigate the challenge of lengthy legal documents, various approaches have emerged. Nguyen et al. utilized attentive deep neural networks in the Attentive CNN and Paraformer architectures to capture important information within lengthy texts. Vuong et al. addressed the scarcity of labeled data by employing a heuristic method based on TextRank and a pre-trained supporting model. Kim et al. leveraged the sentence-transformer model and a histogram of similarity to identify relevant paragraphs. Additionally, Lawformer employed sliding window attention instead of global attention to reduce computational complexity and increase input length.
With the appearance of large language models, more studies focus on exploiting LLMs for various legal NLP tasks, especially legal information retrieval. Chain-of-thought prompting in large language models has been investigated, as demonstrated by Wei et al. (2022) [37]. This research leveraged the power of large attentive language models to display connected reasoning results, providing an improved way of eliciting reasoning in LLMs for legal text processing. Meanwhile, Sun et al. (2023) explored the potential of using generative LLMs like ChatGPT and GPT-4 for relevance ranking in IR [29]. They showed that these large language models could achieve better performance than conventional supervised methods, even outperforming state-of-the-art models like monoT5-3B on various benchmarks.
Natural language processing background
TF-IDF algorithm
TF-IDF, a renowned algorithm, measures the relevance of terms within documents by considering both the term's frequency within the document and its inverse frequency across the entire corpus. This calculation ensures that terms with a high frequency in a particular document and low frequency overall are deemed more significant, highlighting their relevance to that document's content.
Words with high frequency in a corpus are considered trivial and have little impact on ranking. The weight of a word in a document is calculated using formula 2.1, where $D$ is the document collection and $d$ is a document within that collection.

$w_d = f_{w,d} \cdot \log\left(\frac{|D|}{f_{w,D}}\right)$  (2.1)

where $f_{w,d}$ is the number of occurrences of $w$ in $d$, $|D|$ is the size of the corpus, and $f_{w,D}$ is the number of documents containing $w$.
To find the set of documents related to a query $q$ which contains a set of words $q = \{t_1, t_2, \ldots, t_n\}$, the weight of each word $t_i$ is first computed using formula 2.1.
Then, the relevance score $R_i$ of the document $d_i$ and the query $q$ is simply computed by equation 2.2.
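The weighting scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: documents are pre-tokenized lists of words, the relevance score is taken to be the sum of the weights of the query words (an assumption about the combination step), and the toy corpus and function names are hypothetical.

```python
import math
from typing import List


def tf_idf_weight(word: str, doc: List[str], corpus: List[List[str]]) -> float:
    """Weight of `word` in `doc`: f_{w,d} * log(|D| / f_{w,D})."""
    term_freq = doc.count(word)                       # f_{w,d}
    doc_freq = sum(1 for d in corpus if word in d)    # f_{w,D}
    if term_freq == 0 or doc_freq == 0:
        return 0.0
    return term_freq * math.log(len(corpus) / doc_freq)


def relevance(query: List[str], doc: List[str], corpus: List[List[str]]) -> float:
    """Assumed combination rule: sum the weights of all query words."""
    return sum(tf_idf_weight(word, doc, corpus) for word in query)


# Toy corpus for illustration only.
corpus = [
    ["contract", "sale", "buyer"],
    ["contract", "lease", "tenant"],
    ["tort", "damage", "liability"],
]
query = ["lease", "contract"]
scores = [relevance(query, doc, corpus) for doc in corpus]
```

In this toy example the second document ranks first because it contains the rare word "lease", while the third document shares no word with the query and scores zero.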
Okapi-BM25 algorithm
One of the traditional algorithms commonly used in retrieval systems is Okapi BM25 (Best Match 25) [24]. By analyzing the frequency of terms/tokens appearing in the query and in each document across the entire corpus, the model calculates a relevance score for each document and the query. Specifically, with the input query $q$ containing $n$ tokens $\{t_1^q, t_2^q, t_3^q, \ldots, t_n^q\}$ and a legal document $d$ containing $m$ tokens $\{t_1^d, \ldots, t_m^d\}$, the relevance score between $q$ and $d$ can be calculated by formula 2.3.

$R_{score}(q, d) = \sum_{i=1}^{n} IDF(t_i^q) \cdot \frac{f(t_i^q, d) \cdot (k_1 + 1)}{f(t_i^q, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{avgdl}\right)}$  (2.3)

where $f(t_i^q, d)$ is the function that counts the number of occurrences of the token $t_i^q$ in document $d$. The notation $|d|$ represents the length (number of tokens) of the document $d$, and $avgdl$ is the average length of all documents, computed by the following formula, in which $N_D$ is the number of documents in the corpus:

$avgdl = \frac{\sum_{d_i \in D} |d_i|}{N_D}$
The weight of a term is calculated using the tf-idf (term frequency-inverse document frequency) formula, where tf is the number of occurrences of the term in the document, and idf is the inverse document frequency, which is calculated as the logarithm of the total number of documents in the corpus divided by the number of documents that contain the term The tf-idf formula assigns higher weights to terms that occur more frequently in a document and less frequently in the corpus as a whole, which helps to identify important and relevant terms.
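A minimal Python sketch of the BM25 score follows. Several choices here are assumptions made for illustration rather than the setup used in this thesis: the IDF is the plain logarithm of the corpus size over the document frequency, as in the description above, the defaults `k1 = 1.5` and `b = 0.75` are commonly used values, and documents are pre-tokenized lists.

```python
import math
from typing import List


def bm25_score(
    query: List[str],
    doc: List[str],
    corpus: List[List[str]],
    k1: float = 1.5,   # assumed default, not specified in the text
    b: float = 0.75,   # assumed default, not specified in the text
) -> float:
    """Okapi BM25 relevance score between a tokenized query and document."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for token in query:
        doc_freq = sum(1 for d in corpus if token in d)
        if doc_freq == 0:
            continue  # token never occurs in the corpus
        idf = math.log(n_docs / doc_freq)
        freq = doc.count(token)
        length_norm = 1 - b + b * len(doc) / avgdl
        score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
    return score


# Toy corpus for illustration only.
corpus = [
    ["contract", "sale", "buyer"],
    ["contract", "lease", "tenant"],
    ["tort", "damage", "liability"],
]
query = ["lease", "contract"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
```

Compared with raw TF-IDF, the term frequency saturates through `k1`, and `b` controls how strongly long documents are penalized.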
Transformers
A language model is a type of probabilistic model that contains information about the distribution of natural language, learned from a very large corpus. Thus, a language model can encompass knowledge of the entire corpus. With the strong development of deep learning models, language models have progressed from using distributions of preceding n-grams to learning the distribution based on contextual semantics. One of the most notable and widely used deep learning-based language models is the BERT model [8], based on the encoder component of the Transformer [33] architecture. Based on the BERT architecture, pre-trained parameters trained on various datasets using different training techniques have emerged. In this and the following sections, basic knowledge about advanced deep learning-based language models is presented.
The Transformer [33] is a component built on neural networks and the self-attention mechanism. It is commonly used to learn how to represent sequential data. Many models based on this architecture are applied to several fields including Natural Language Processing [8, 16], Computer Vision [9, 35], and even spatio-temporal modeling [3]. In this section, the architecture and intuition behind the workings of the Transformer are discussed.
Input and output format. An input data sample for the transformer model can be considered as a sequence consisting of $N$ tokens $t_0, t_1, \ldots, t_{N-1}$. Each token $t_i$ is represented by a $D$-dimensional vector $x_i^{(0)} \in \mathbb{R}^D$. The vectors representing each token can be arranged into a matrix $X^{(0)}$ of dimensions $D \times N$, with each vector $x_i^{(0)}$ being the value of the $i$-th column. The vector representing each token can either be fixed or learned during the model's pre-training or fine-tuning process. With text data, each token may be a sub-word that has been split from an original word.
The Transformer takes the matrix $X^{(0)}$ as input and returns the representation matrix $X^{(M)}$ with dimensions $D \times N$. The $i$-th column of the matrix is the vector representing the $i$-th token. This output representation, thanks to the self-attention mechanism, incorporates the contextual semantics from other tokens in the input sequence. This vector is used to predict the $(i + 1)$-th token, or used as a global representation for classification problems, etc.
Figure 2.2: Components in a Transformer block (labels recovered from the figure: Normalization; $Y^{(m)} = X^{(m-1)} + \mathrm{MHSA}(X^{(m-1)})$)
Transformer block. The output representation (the matrix $X^{(M)}$) can be obtained by iteratively applying the transformer block $M$ times according to the formula:

$X^{(m)} = \text{transformer-block}(X^{(m-1)})$
The transformer block comprises two stages The first stage, self-attention, attends to the relationships between tokens within a sequence The second stage, feed-forward, applies non-linear transformations to refine token representations.
The matrix $Y^{(m)}$, with dimensions $D \times N$, is the output of the first stage. This matrix is computed by aggregating contextual information across the input sequence through the attention mechanism. The $i$-th column of matrix $Y^{(m)}$, denoted as $y_i^{(m)}$, represents the $i$-th position of the sequence and is computed as the weighted average of all column vectors $x_j^{(m-1)}$ of the previous matrix $X^{(m-1)}$. Equation 2.5 shows the computation of the $i$-th column of matrix $Y^{(m)}$:

$y_i^{(m)} = \sum_{j=0}^{N-1} x_j^{(m-1)} A_{j,i}^{(m)}$  (2.5)

where $A^{(m)}$ is the attention matrix, which represents the attention weight of each position with respect to the others in the sequence. The columns of the attention matrix are normalized, which means $\sum_{j=0}^{N-1} A_{j,i}^{(m)} = 1$.

Based on equation 2.5, the representation vector of the most relevant location (based on the attention matrix) has the greatest impact on the value of the new representation vector. Formula 2.5 can be generalized by the following matrix multiplication:

$Y^{(m)} = X^{(m-1)} A^{(m)}$  (2.6)
The values of the attention matrix come from a technique called self-attention. Its purpose is to compute the attention weight of a position with respect to the other positions in the same sequence. To avoid confusion between attention information and the information of the sequence itself, a linear transformation using the matrices $U_q$ and $U_k$ is applied before computing the attention weight (equation 2.7). The use of two transformation matrices allows the attention matrix to be asymmetric, a necessary property because in natural language we sometimes need token ‘a’ to attend to token ‘b’, but not the reverse.

$A_{j,i}^{(m)} = \frac{\exp\left(x_j^{\top} U_k^{\top} U_q x_i\right)}{\sum_{j'=0}^{N-1} \exp\left(x_{j'}^{\top} U_k^{\top} U_q x_i\right)}$  (2.7)

where $x_i$ abbreviates $x_i^{(m-1)}$, and $U_q$ and $U_k$ often project into a lower-dimensional space.
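To make the self-attention computation concrete, the sketch below builds an attention matrix with column-wise softmax normalization and then forms each output column as a weighted average of the input columns. It is a toy illustration in pure Python: the token vectors and the projection matrices `u_q` and `u_k` are made-up values, and tokens are stored as a list of column vectors rather than as a $D \times N$ matrix.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # list of rows


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def attention_matrix(tokens: List[Vector], u_q: Matrix, u_k: Matrix) -> Matrix:
    """attn[j][i] is the weight that position i places on position j."""
    queries = [matvec(u_q, x) for x in tokens]
    keys = [matvec(u_k, x) for x in tokens]
    n = len(tokens)
    attn = [[0.0] * n for _ in range(n)]
    for i in range(n):
        logits = [sum(k * q for k, q in zip(keys[j], queries[i])) for j in range(n)]
        peak = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        for j in range(n):
            attn[j][i] = exps[j] / total  # each column i sums to 1
    return attn


def attend(tokens: List[Vector], attn: Matrix) -> List[Vector]:
    """Output column i is the attention-weighted average of all input columns."""
    n, dim = len(tokens), len(tokens[0])
    return [
        [sum(tokens[j][d] * attn[j][i] for j in range(n)) for d in range(dim)]
        for i in range(n)
    ]


# Toy example: N = 3 tokens of dimension D = 2, projected to 1 dimension.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
u_q = [[1.0, 0.0]]
u_k = [[0.0, 1.0]]
attn = attention_matrix(tokens, u_q, u_k)
outputs = attend(tokens, attn)
```

Because the query and key projections differ, `attn[0][1]` and `attn[1][0]` are not equal in this example, which illustrates the asymmetry discussed above.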
The training process updates the values of $U_q$ and $U_k$ toward an optimal point. To generalize the attention values, multi-head self-attention (MHSA) is utilized.
In MHSA, $H$ attention heads are used, where each attention head corresponds to a different pair of $U_q$ and $U_k$ matrices. The full formula of the first stage with multi-head attention is given in equation 2.9.

$Y^{(m)} = MHSA_\theta(X^{(m-1)}) = \sum_{h=1}^{H} V_h^{(m)} X^{(m-1)} A_h^{(m)}$  (2.9)

where the $H$ matrices $V_h^{(m)} \in \mathbb{R}^{D \times D}$ project the $H$ self-attention outputs down to the $D$-dimensional output.
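The multi-head variant can be sketched by summing the projected outputs of several heads. The snippet below is a simplified pure-Python illustration: each head is a hypothetical triple `(u_q, u_k, v)` with made-up values, tokens are stored as a list of column vectors, and no attention scaling or masking is applied.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # list of rows
Head = Tuple[Matrix, Matrix, Matrix]  # (u_q, u_k, v), all illustrative


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(values: Vector) -> Vector:
    peak = max(values)
    exps = [math.exp(value - peak) for value in values]
    total = sum(exps)
    return [e / total for e in exps]


def head_output(tokens: List[Vector], u_q: Matrix, u_k: Matrix, v: Matrix) -> List[Vector]:
    """Columns of V_h X A_h for a single attention head."""
    queries = [matvec(u_q, x) for x in tokens]
    keys = [matvec(u_k, x) for x in tokens]
    dim = len(tokens[0])
    columns = []
    for query in queries:
        weights = softmax([sum(a * b for a, b in zip(key, query)) for key in keys])
        mixed = [sum(w * x[d] for w, x in zip(weights, tokens)) for d in range(dim)]
        columns.append(matvec(v, mixed))
    return columns


def mhsa(tokens: List[Vector], heads: List[Head]) -> List[Vector]:
    """Sum the projected outputs of all heads, column by column."""
    total = [[0.0] * len(tokens[0]) for _ in tokens]
    for u_q, u_k, v in heads:
        for i, column in enumerate(head_output(tokens, u_q, u_k, v)):
            total[i] = [t + c for t, c in zip(total[i], column)]
    return total


# Toy example with H = 2 heads on N = 3 tokens of dimension D = 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
heads = [
    ([[1.0, 0.0]], [[0.0, 1.0]], identity),
    ([[0.0, 1.0]], [[1.0, 0.0]], identity),
]
outputs = mhsa(tokens, heads)
```

Each head can attend to a different pattern in the sequence, and the matrices `v` decide how the per-head results are combined.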
The second stage refines the representation using a non-linear transformation. In this stage, each column is fed forward separately through a multi-layer perceptron (MLP) with shared parameters. Equation 2.10 presents the stage 2 computation.

$x_i^{(m)} = MLP_\theta(y_i^{(m)})$  (2.10)
In the complete architecture of a Transformer component, MHSA and MLP are stacked together. The residual connection [30] is utilized to stabilize the training process and provide a sensible inductive bias. This technique is applied to both the MHSA and MLP layers, each of which makes a non-linear transformation to the output representation. The idea of the residual connection is that instead of directly defining $x^{(m)} = f_\theta(x^{(m-1)})$, the formulation of $x^{(m)}$ becomes:

$x^{(m)} = x^{(m-1)} + res_\theta(x^{(m-1)})$

This formula turns the goal into modeling the difference between two representations, $x^{(m)} - x^{(m-1)} = res_\theta(x^{(m-1)})$. Another common technique in recent deep learning models is LayerNorm [2], which also has a stabilizing effect. LayerNorm is applied to each token separately by removing the mean and dividing by the standard deviation:

$\bar{x}_{d,i} = \frac{1}{\sqrt{var(x_i)}} \left(x_{d,i} - mean(x_i)\right) \gamma_d + \beta_d = LayerNorm(X)_{d,i}$  (2.11)

where $mean(x_i) = \frac{1}{D} \sum_{d=1}^{D} x_{d,i}$ and $var(x_i) = \frac{1}{D} \sum_{d=1}^{D} \left(x_{d,i} - mean(x_i)\right)^2$. The two parameters $\gamma_d$ and $\beta_d$ are a learned scale and shift. This normalization prevents the representation vector from exploding in magnitude. Figure 2.2 illustrates a standard transformer block including all components described above.
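The two stabilization techniques can be illustrated with a short pure-Python sketch operating on a single token vector. The small constant `eps` is an assumption added here to avoid division by zero and does not appear in the formula above; the function names and input values are illustrative only.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, gamma: Vector, beta: Vector, eps: float = 1e-5) -> Vector:
    """Normalize one token vector, then apply the learned scale and shift.

    `eps` is an assumed numerical-stability constant, not part of the formula.
    """
    mean = sum(x) / len(x)
    var = sum((value - mean) ** 2 for value in x) / len(x)
    scale = 1.0 / math.sqrt(var + eps)
    return [(value - mean) * scale * g + b for value, g, b in zip(x, gamma, beta)]


def residual(x: Vector, res_fn: Callable[[Vector], Vector]) -> Vector:
    """Residual connection: the layer only models the difference to its input."""
    return [a + r for a, r in zip(x, res_fn(x))]


# Toy token vector with D = 4; scale 1 and shift 0 leave the normalized values as is.
x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4
normed = layer_norm(x, ones, zeros)
updated = residual(x, lambda v: layer_norm(v, ones, zeros))
```

After normalization the vector has approximately zero mean and unit variance, regardless of the magnitude of the input.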
Positional Encoding. The transformer block considers the input sequence as a set of tokens, which means the representations of the tokens are computed in parallel, with no need for sequential computation. This brings strong parallelism to the model but loses the information about the order of the sequence. In [33], the authors proposed adding a positional encoding vector directly to the embedding vector. The positional encoding vector is calculated by formula 2.12.
$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/D}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/D}}\right)$  (2.12)

where $pos$: $0 \le pos < L/2$ is the position of a token in the input sequence, $i$: 0