NGUYEN VINH KHIEM
APPLICATION OF LARGE LANGUAGE MODELS IN TEXT-TO-SQL
THIS THESIS IS COMPLETED AT HO CHI MINH UNIVERSITY OF TECHNOLOGY – VNU-HCM
Supervisors:
Assoc. Prof. Huynh Tuong Nguyen
Assoc. Prof. Quan Thanh Tho

Examiner 1: Dr. Le Thanh Van
Examiner 2: Dr. Le Thi Thuy

This master's thesis is defended at Ho Chi Minh City University of Technology (HCMUT) – VNU-HCM on 17th June 2024.
Master’s Thesis Committee:
Approval of the Chairperson of the Master's Thesis Committee and the Dean of the Faculty of Computer Science and Engineering after the thesis has been corrected (if any).
CHAIRPERSON OF THESIS COMMITTEE          DEAN OF FACULTY OF COMPUTER SCIENCE AND ENGINEERING
VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
SOCIALIST REPUBLIC OF VIETNAM Independence – Freedom - Happiness
THE TASK SHEET OF MASTER’S THESIS
I. THESIS TITLE (in English): Application of large language models in Text-to-SQL
II. THESIS TITLE (in Vietnamese): Ứng dụng mô hình ngôn ngữ lớn trong việc tạo câu truy vấn (Application of large language models in query generation)
III. TASKS AND CONTENTS:
a. Research and design a model capable of generating SQL queries from text.
b. Implement, test, and evaluate the model.
IV. THESIS START DAY: 15/01/2024
V. THESIS COMPLETION DAY: 20/05/2024
VI. SUPERVISORS:
1. Assoc. Prof. Huynh Tuong Nguyen
2. Assoc. Prof. Quan Thanh Tho
Ho Chi Minh City, date 05/08/2024
(Full name and signature)
I wish to express my profound gratitude to the esteemed professors and lecturers of the Department of Computer Science and Engineering, and to Ho Chi Minh City University of Technology at large. The knowledge they imparted is priceless and has been instrumental in the completion of this thesis. I am also thankful to my colleagues at GiaoHangNhanh Company for granting me the chance to engage deeply in research and improve my professional expertise, as well as providing resources essential for training my models.

Lastly, I owe deep gratitude to my family, friends, and classmates, all of whom have been supportive, encouraging, and provided the emotional and practical support needed to complete this thesis.

With heartfelt gratitude, I wish good health and all the best to the professors and lecturers of the Department of Computer Science and Engineering at Ho Chi Minh City University of Technology, Vietnam National University Ho Chi Minh City.
2.2 Graphix-T5: Mixing Pre-Trained Transformers with Graph-Aware Layers for Text-to-SQL Parsing
2.3 T5QL: Taming language models for SQL generation
2.3.0.1 Constrained Decoding
2.3.0.2 SQL Grammar for constrained decoding
2.4 Discussion
3 Theoretical Background
3.1 Recurrent Neural Networks (RNNs)
3.2 Transformer
3.2.1 Self-Attention
3.2.2 Feed Forward Network
3.2.3 Positional Encoding
3.3 Pre-trained Language Model
3.3.1 GPT - Generative Pretrained Transformer
3.3.2 BERT - Bidirectional Encoder Representations from Transformers
4.2 Evaluation metric
4.3 Dataset Description
4.4 Experimental Results
5 Conclusion
5.1 Achieved Results
5.2 Issues and Challenges
5.3 Future Works
References
List of Figures

1.1 Percentage of Programming Languages used by Professional Developers
1.2 Text-to-SQL problem [2]
2.1 The technique taxonomy for text-to-SQL
2.2 Visualization of RAT-SQL model [5]
2.3 Relationship between members in schema
2.4 Visualization of Graphix-T5 model [7]
2.5 Example of Multi-hop relation between nodes
2.6 Visualization of No-Match and Bridge Node Mode
2.7 T5QL model architecture [8]
2.8 Pseudo code for Constrained Decoding
2.9 SQL Grammar Rule
3.2 Visualization of Transformer architecture [6]
3.3 Visualization of Scaled Dot-Product Attention
3.4 Multi-Head Attention consists of several attention layers running in parallel
3.5 Feed Forward Network
3.6 Overview of some popular LLMs based on Transformers [11]
3.7 Architecture of GPT model [12]
3.8 Input transformations for fine-tuning on different tasks [12]
3.9 The overview of BERT Architecture [14]
3.10 The overview of BERT Architecture
3.11 Overview of Flan-T5 finetuning data and task [3]
4.1 Architecture of proposed model
List of Tables

4.1 ROUGE Score for 2 circumstances
Chapter 1

Topic Introduction
The progression in the field of natural language processing has been significantly accelerated with the advent of large language models (LLMs). Models like GPT-3, BERT, and their successors have drastically improved our proficiency in processing, understanding, and generating text that is remarkably human-like. These models have been meticulously trained on vast collections of text, which has endowed them with a nuanced understanding of language. This breakthrough has laid a foundation for pioneering applications in several linguistic tasks, representing a formidable leap in technology that has transformed the way we interact with machines.

SQL's role in managing and analyzing data within relational databases is indisputably vital in our modern data-centric world. The ubiquity of SQL across various sectors underscores its importance for organizing and retrieving critical data. According to the yearly survey conducted by StackOverflow [1], SQL maintains its status as one of the globally dominant languages. It is observed that among the technologies professionals most frequently utilize, JavaScript, HTML/CSS, and SQL emerge as the top three, with JavaScript and HTML/CSS nearly reaching parity as the leading languages for coding novices.
Figure 1.1: Percentage of Programming Languages used by Professional Developers [1]
The task of converting natural language into SQL commands, known as Text-to-SQL, has gained prominence. It grants non-experts the ability to access database information, significantly broadening the scope of data utility and facilitating informed decision-making across diverse user groups.

While LLMs hold the potential to simplify the interaction between natural language and SQL queries, the task of Text-to-SQL generation comes with distinct challenges. LLMs need to acquire a profound semantic understanding of the queries, efficiently generate SQL commands, and interpret the users' intent with high accuracy. The intricacies involved in SQL's structure and the variable nature of natural language queries add layers of complexity to this task. Integrating LLMs into Text-to-SQL systems is a complex endeavor that goes beyond technical challenges. It requires the selection of suitable models, rigorous experimental design, and the development of reliable metrics to gauge performance. In addition, it is imperative to consider the wider implications for user experience and database functionality, striving towards a solution that is not only seamless and efficient but also scalable.
In the current era, data has become a critical asset essential for a wide range of human endeavors, encompassing both commercial activities and scientific investigations. However, the burgeoning volume and escalating intricacy of data present significant challenges in its querying and exploration, even for those with expertise in the field. Present-day data query interfaces are generally bifurcated into two categories: form-based interfaces, which are user-friendly but offer constrained querying capabilities, and more advanced, low-level tools. These advanced tools permit the synthesis of queries in native database languages, such as SQL, but are primarily designed for a specialized audience, like SQL professionals. To democratize data access and utilization, ensuring that everyone can effectively engage with, comprehend, and extract value from data, it is crucial to remove the technical obstacles that hinder data accessibility and reduce reliance on IT specialists. Adopting natural language for query expression can democratize data accessibility.
In this vein, there is a growing scholarly interest in the development of Natural Language (NL) Interfaces for Databases (NLIDBs). These interfaces enable users to articulate queries in natural language and convert them into the database's native query language. Specifically, Text-to-SQL (or NL-to-SQL) systems are designed to transform queries from NL into SQL.

Initial attempts in this domain have predominantly utilized database schemas and data indexes to construct corresponding SQL queries from NL inputs. In these systems, a query response is conceptualized as a graph, with nodes representing the database relations containing the query's keywords, and edges indicating the joins between these relations. Parsing-based methodologies analyze the input question's grammatical structure, subsequently aligning it with the structure of the intended SQL query. Recently, there has been a surge of interest in neural machine translation (NMT) techniques. These techniques approach the text-to-SQL challenge as a problem of language translation, training neural networks on extensive collections of NL query/SQL pairs. The flourishing of these methods can be attributed to recent breakthroughs in deep learning and natural language processing (NLP), complemented by the introduction of two substantial datasets (WikiSQL and Spider), which are pivotal for training text-to-SQL systems.
Figure 1.2: Text-to-SQL problem[2]
The text-to-SQL challenge involves converting a natural language query (NLQ) into an equivalent SQL query that maintains the same meaning and is compatible with a given Relational Database (RDB) schema. The objective is to generate a SQL query that, upon execution, yields results aligned with the user's intentions. This task encompasses difficulties not only in comprehending the original NL query but also in formulating a SQL query that is both syntactically and semantically accurate in relation to the specific database schema [2].
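To make the task concrete, the following minimal input/output pair illustrates what a Text-to-SQL system receives and what it must produce. The question, schema, and query below are the author's own illustration, not drawn from any specific benchmark.

```python
# A natural-language question posed against a relational schema.
question = "How many singers do we have?"
schema = {"singer": ["singer_id", "name", "country"]}  # table -> columns

# The SQL query a Text-to-SQL system should generate: executing it
# against the database yields the answer the user intended.
expected_sql = "SELECT count(*) FROM singer"
```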
Captivated by the diversity and the challenges presented in the Text-to-SQL problem, the author has decided to pursue a master's thesis on the topic "Application of large language models in Text-to-SQL". This thesis will discuss related research works, the common methods used to develop a system for extracting SQL queries from questions, and select an appropriate approach to address this problem. To achieve the aforementioned objectives, the author will sequentially tackle the following issues:
• Research published papers and projects to identify a suitable methodology for addressing Text-to-SQL challenges.
• Acquire a dataset to train and fine-tune the model.
• Construct a model capable of transforming questions into SQL queries utilizing the selected method.
• Evaluate the model's effectiveness and accuracy.
• The pre-trained LLM Flan-T5 [3] base will be used as the text encoder and decoder. Flan-T5 has been trained on a mix of unsupervised and supervised tasks, all reformatted into text-to-text tasks, enhancing its versatility for various linguistic applications.
• Fine-tune the pre-trained LLM to execute the Text-to-SQL task on the training dataset.
• Evaluate the performance of the model after training.
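Because Flan-T5 is a text-to-text model, the question and database schema must be serialized into a single input string before fine-tuning. The template below is a sketch of one common serialization; the exact prompt format used in the thesis may differ, and the example schema is illustrative.

```python
def format_input(question: str, schema: dict) -> str:
    """Serialize a question and a {table: [columns]} schema into one
    text-to-text input string for a T5-style model. The template here
    is an assumption for illustration, not the thesis's verbatim format."""
    tables = " | ".join(
        f"{table}: {', '.join(columns)}" for table, columns in schema.items()
    )
    return f"translate to SQL: {question} | schema: {tables}"

schema = {"singer": ["singer_id", "name", "country"]}
prompt = format_input("How many singers do we have?", schema)
```

A fine-tuning loop would then feed such prompts as inputs and the gold SQL strings as targets, e.g. via the Hugging Face `transformers` sequence-to-sequence training utilities.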
Chapter 2
Related Works
Although the use of deep learning for the text-to-SQL challenge has gained prominence only in recent times, a variety of systems with differing methodologies and innovative features have already been developed. These systems showcase a range of approaches, highlighting the problem's complexity and the broad spectrum of potential solutions being investigated. Despite this diversity, certain core components are common to most systems, enabling the creation of a generalized model to enhance our understanding of them.

Based on the technique taxonomy for text-to-SQL, the neural network's core is bifurcated into two primary components: the encoder and the decoder. The encoder's role is to transform variable-shaped inputs into a set of fixed-shape internal representations, which are then utilized by the decoder. This process also involves the enrichment of each input's representation by integrating information from other inputs, thereby crafting a more comprehensive representation apt for the specific problem instance. The decoder, in turn, leverages these representations to predict the most likely SQL query.
Considering the textual nature of the inputs (NLQ, DB, schema links), their conversion into an efficient numerical form suitable for the encoder is pivotal.

Figure 2.1: The technique taxonomy for text-to-SQL

This conversion, termed input encoding, involves not only the restructuring of inputs into a compatible format for the encoder but also the selection of an optimal encoder network for processing these inputs into a hidden internal representation. Subsequently, output decoding involves both designing the prediction structure and selecting the right network to make these predictions, like a SQL query, which can be interpreted either as a simple string or a structured program adhering to a specific grammar. While some systems distinguish between natural language representation and encoding, others merge these steps, with the possibility of amalgamating all three steps into a single process. Neural training denotes the methodology adopted to train the neural network.

The taxonomy's final aspect is output refinement, employed during decoding to minimize errors and enhance outcomes. It is crucial to note that output refinement, while closely linked to and interacting with output decoding, is not an inherent part of the neural network. Therefore, it can often be added or removed after system creation and training.
2.1 RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers

In the context of converting natural language queries into SQL queries for database interrogation, modern semantic parsing models face difficulties in adapting to new, unseen database schemas. This challenge primarily arises from two tasks: (a) effectively encoding the database relationships for the parser's use, and (b) establishing a correspondence between database columns and their references in a query. RAT-SQL [4] presents itself as a comprehensive solution, designed to encapsulate relational structure both in the database schema and the posed question. This framework aims to streamline schema encoding, schema linking, and feature representation within the Text-to-SQL encoding process. It employs relation-aware self-attention, a technique that blends broad-spectrum reasoning across schema entities and query terms with structured reasoning over pre-defined schema relationships.
Figure 2.2: Visualization of RAT-SQL model[5]
The Relation-Aware Transformer [4], commonly referred to as RAT, is an innovative approach for embedding sequences that are semi-structured. This model is unique in its ability to encode, in a unified embedding, both the already existing relationships within the input and the 'soft' relations that it generates among elements of the sequence. Moreover, this framework facilitates the integration of solutions for embedding and linking schemas. One of the key components of RAT is its use of the self-attention encoder, also known as a Transformer, a concept introduced by Vaswani [6] and his colleagues in 2017. The distinctive feature of RAT lies in its capability to integrate known relational information into the encoding process by augmenting the attention mechanism with these relational representations.
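The core mechanism can be sketched in a few lines: compared with standard self-attention, a relation embedding for each pair (i, j) is added to the key (when computing compatibility scores) and to the value (when aggregating). The sketch below is a simplified single head; the learned projection matrices W_Q, W_K, W_V of the real layer are omitted for brevity.

```python
import numpy as np

def relation_aware_attention(x, rel_k, rel_v):
    """One head of relation-aware self-attention (RAT-style sketch).
    x:     (n, d)    element embeddings
    rel_k: (n, n, d) relation embedding added to the key of pair (i, j)
    rel_v: (n, n, d) relation embedding added to the value of pair (i, j)
    Learned projections are omitted; this only shows the relational biasing."""
    n, d = x.shape
    scores = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # known relations bias how strongly element i attends to j
            scores[i, j] = x[i] @ (x[j] + rel_k[i, j]) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax over j
    out = np.zeros_like(x)
    for i in range(n):
        for j in range(n):
            out[i] += weights[i, j] * (x[j] + rel_v[i, j])
    return out

n, d = 3, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))
rel_k = np.zeros((n, n, d))  # all-zero relations reduce to plain attention
rel_v = np.zeros((n, n, d))
z = relation_aware_attention(x, rel_k, rel_v)
```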
The RAT-SQL framework approaches the translation of natural language queries into SQL by conceptualizing the data schema and the question as a "question-contextualized schema graph." This graph, denoted as G = ⟨V, E⟩, comprises nodes (V) and edges (E). The nodes represent the various elements of the database and the query: V = C ∪ T ∪ Q, where C denotes the columns in the database tables, T represents the tables themselves, and Q stands for the individual words in the query. The edges (E) define the relationships between the members of the schema (Column-Column, Column-Table, Table-Table), the relationships between the question words themselves, and the relationships between schema members and the question words, effectively linking the database structure to the query context.

In modeling text-to-SQL generation, RAT-SQL employs an encoder-decoder framework. The encoder's role is to interpret the input graph and transform it into joint representations for each element: cᵢ for columns cᵢ ∈ C, tᵢ for tables tᵢ ∈ T, and qᵢ for question words qᵢ ∈ Q. These representations encapsulate both the structure of the database and the semantics of the query. Subsequently, the decoder utilizes these representations to calculate a probability distribution Pr(P | G_Q) over possible SQL programs. This distribution reflects the likelihood of each potential SQL query being the correct translation of the given natural language question, guided by the structured context provided by the schema graph.
Figure 2.3: Relationship between members in schema
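The graph construction described above can be sketched as follows. Node and edge type names are illustrative, and the schema-linking rule is a naive exact string match standing in for RAT-SQL's richer matching rules.

```python
def build_schema_graph(tables, question_tokens):
    """Build a toy question-contextualized schema graph G = <V, E>.
    tables:          {table_name: [column_name, ...]}
    question_tokens: list of words in the question
    Returns (nodes, edges) with typed nodes and labeled edges."""
    nodes, edges = [], []
    for table, columns in tables.items():
        nodes.append(("table", table))
        for col in columns:
            nodes.append(("column", f"{table}.{col}"))
            edges.append((f"{table}.{col}", table, "column-table"))
    for tok in question_tokens:
        nodes.append(("question", tok))
        # naive string-match schema linking: question word == column name
        for kind, name in nodes:
            if kind == "column" and tok == name.split(".")[1]:
                edges.append((tok, name, "question-column-match"))
    return nodes, edges

nodes, edges = build_schema_graph({"user": ["id", "name"]}, ["show", "name"])
```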
2.2 Graphix-T5: Mixing Pre-Trained Transformers with Graph-Aware Layers for Text-to-SQL Parsing

Relational databases, pivotal in decision-making across various domains like health care, sports, and entertainment, have become increasingly prevalent in the era of big data. These databases enable efficient information retrieval through structured query languages such as SQL. However, the intricate nature of SQL often results in a steep learning curve for non-technical users. This challenge has sparked significant interest in text-to-SQL technologies, which translate natural language instructions or queries into SQL commands. Addressing this issue, a novel architecture, GRAPHIX-T5 [5], is introduced. This architecture excels in modeling relational structure information while preserving the robust contextual encoding abilities of the pretrained T5 [7].
Figure 2.4: Visualization of Graphix-T5 model [7]
The GRAPHIX layer represents an innovative integration of semantic and structural data in neural network architecture. It functions by combining semantic information, derived from each transformer block, with the structural insights of a relational graph neural network (GNN) block. Initially, the semantic aspects are encoded using a Transformer block, while the structural components are formulated through a Relational Graph Attention Network (RGAT).

In each GRAPHIX layer, there is a harmonious blending of semantic and structural data. This is achieved by utilizing the information from both these domains, allowing for a more comprehensive and integrated approach to information processing. The incorporation of the GRAPHIX layers into a new encoder, which replaces the original T5 encoder, marks a significant advancement. Notably, the parameters of the semantic block within each GRAPHIX layer are deliberately initialized from T5. This strategy is employed to retain the powerful contextualized encoding capabilities inherent in the pre-trained T5 model.

By doing so, the GRAPHIX layers not only introduce a novel method of data integration but also ensure that the strengths of the existing T5 architecture are not lost. This creates a synergistic effect, where the new architecture benefits from the robustness of T5's contextual encoding while also bringing in the added advantages of structural data integration from GNNs.
In the Graphix-T5 model, input encoding is achieved through a sophisticated method of graph construction. This process involves encoding both the input question and the data schema into a heterogeneous graph, denoted as G = ⟨V, R⟩. This graph comprises three types of nodes, V = Q ∪ C ∪ T, representing the questions, columns, and tables, respectively. Furthermore, it includes multiple types of relations, R = {r1, …, r|R|}, with each ri signifying a one-hop relation between nodes. Multi-hop relations, expressed as rk, are defined as a composition of these one-hop relations: rk = r1 ∘ r2 ∘ ⋯ ∘ rI.

Figure 2.5: Example of Multi-hop relation between nodes

The relationship sets within this graph are categorized into three main types: Schema relations, which describe connections between schema elements like tables and columns; Schema linking relations, which bridge the gap between the question and the schema; and Question relations, which connect individual tokens within the question.
Prior approaches often added dummy edges, labeled as NO-MATCH, to indicate unlinked but potentially related question tokens and schema tokens. However, this method can lead to an over-smoothing problem, as it introduces numerous noisy neighbors, complicating the computation of attention scores. For example, if there are A question tokens and B schema items that are semantically relevant but not linked by existing string-match rules, the number of NO-MATCH edges required would be A × B.
Figure 2.6: Visualization of No-Match and Bridge Node Mode
Graphix-T5 addresses this issue innovatively by using a special token, ⋆, as a bridge node. This approach significantly reduces the complexity of the graph, lowering the number of necessary edges from A × B to A + B. By doing so, all schema nodes become reachable from the question token nodes, streamlining the process of attention computation and enabling a more efficient and effective encoding of the input.
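The edge-count reduction is easy to verify directly. The sketch below contrasts the two constructions on toy token lists; edge labels are illustrative.

```python
# Fully-connected NO-MATCH edges: one edge per (question token, schema item)
# pair, i.e. A * B edges in total.
def no_match_edges(question_tokens, schema_items):
    return [(q, s, "NO-MATCH") for q in question_tokens for s in schema_items]

# Bridge-node construction: every node connects only to the bridge token "*",
# so A + B edges suffice and all schema nodes stay reachable from the question.
def bridge_edges(question_tokens, schema_items, bridge="*"):
    return ([(q, bridge, "q-bridge") for q in question_tokens]
            + [(bridge, s, "bridge-s") for s in schema_items])

q = ["how", "many", "singers"]
s = ["singer", "singer.name", "concert"]
```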
2.3 T5QL: Taming language models for SQL generation

In the field of computer science, the task of automated code generation has long been a key focus. Traditional methods have faced limitations in flexibility, but recent advancements in deep learning (DL) have introduced new capabilities in this area. Some DL-based methods offer assistance in code completion, while others can transform natural language (NL) inputs into code. A particularly challenging application of this is converting NL into SQL queries, as NL queries can be ambiguous, such as when columns from different tables share names. Additionally, creating labeled pairs of NL queries and SQL is a difficult, time-intensive process that demands expertise in SQL.
Contemporary leading-edge techniques in semantic parsing rely heavily on large language models (LLMs), which are effective but require high-end GPUs, limiting their broader use. Moreover, these state-of-the-art methods are not always reliable in producing valid SQL. In response to these challenges, the paper introduces T5QL [8], an innovative SQL generation approach that not only performs well on benchmark datasets but also does so using smaller language models, specifically T5-Base, and demonstrates improvements over existing top methods. A significant advantage of T5QL is its consistent output of valid SQL, assured by using a context-free grammar to guide the SQL generation process.
Figure 2.7: T5QL model architecture[8]
The T5QL method for SQL generation employs a two-model architecture to enhance the accuracy and validity of generated SQL queries. The first component, a generator model based on T5, takes natural language (NL) input along with the database schema and begins the generation process with an empty string. It uses a constrained decoder which acts as a guide by determining the valid next tokens that can be used at each step, ensuring that the generated SQL adheres to the correct syntax as defined by the SQL grammar.

This generation process utilizes a beam search strategy, which allows the model to explore and keep track of multiple generation paths simultaneously, effectively creating a set of k potential SQL candidates. These candidates are then passed to the second component of the architecture, the ranker model. The ranker model, which could be based on a model like CodeBERT [9], evaluates the candidates and reorders them based on their likelihood of being the correct translation of the original NL query into SQL. The highest-ranked SQL query is then selected as the output. This two-step process combines the generative power of T5 with the evaluative precision of the ranker to produce SQL queries that are not only syntactically valid but also semantically aligned with the NL input.
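The control flow of this generate-then-rank pipeline can be sketched as below. `generate_candidates` and `score` are hypothetical stand-ins for the constrained T5 generator (beam search producing k candidates) and a CodeBERT-style ranker; the toy implementations only demonstrate the wiring.

```python
def generate_then_rank(nl_query, schema, generate_candidates, score, k=4):
    """Two-stage pipeline: beam-search k SQL candidates, then return the
    candidate the ranker scores highest for the given NL query."""
    candidates = generate_candidates(nl_query, schema, beam_size=k)
    return max(candidates, key=lambda sql: score(nl_query, sql))

# Toy stand-ins showing the control flow (not real models):
def fake_generator(nl_query, schema, beam_size):
    return ["SELECT id FROM user", "SELECT name FROM user"][:beam_size]

def fake_ranker(nl_query, sql):
    return 1.0 if "name" in sql else 0.0  # pretend the ranker prefers `name`

best = generate_then_rank("list user names", {}, fake_generator, fake_ranker)
```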
2.3.0.1 Constrained Decoding
Constrained decoding functions as a pivotal component of the T5QL methodology, serving to refine the selection of tokens that the generator can use for predicting the next token in a SQL query. This process is governed by a context-free grammar (CFG) that outlines the structure of valid SQL statements. The mechanism starts by identifying the maximum parsable prefix P∗ from the current generation P, ensuring that this segment of the SQL query is syntactically correct.

Once P∗ is determined, the lookahead capability of the parser comes into play. It anticipates all feasible suffixes that can follow P∗ and integrates them into a trie T, essentially acting as a predictive model for subsequent tokens. This process involves tokenizing the potential suffixes and updating the trie accordingly.
Figure 2.8: Pseudo code for Constrained Decoding
In the final phase, T5QL employs the trie to discern possible tokens for extending P, ensuring they align with the syntactical rules of SQL. An important aspect of the constrained decoding is the FILTERWRONGTOKENS function, which applies contextual awareness to the decoding process. This function filters out any tokens that do not correspond to the columns specified in the 'from' clause of the query and correctly associates table aliases with their respective original tables. This context-aware approach ensures that the generated SQL not only follows the grammatical structure but also remains contextually relevant to the database schema in question.
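The trie-based lookahead can be sketched as follows. Here the grammar-derived suffixes and the word-level tokenization are toy examples; a real implementation would enumerate suffixes from the CFG and use the model's subword tokenizer.

```python
class Trie:
    """Minimal trie over token sequences: valid grammar suffixes are
    inserted, and the decoder queries which tokens may come next."""
    def __init__(self):
        self.children = {}

    def insert(self, tokens):
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, Trie())

    def allowed_next(self, prefix):
        node = self
        for tok in prefix:
            if tok not in node.children:
                return set()  # prefix is not derivable from the grammar
            node = node.children[tok]
        return set(node.children)

trie = Trie()
# Toy suffixes the parser's lookahead might produce for one schema:
for suffix in [["from", "user", "select", "user.id"],
               ["from", "user", "select", "user.name"]]:
    trie.insert(suffix)
```

At each decoding step the generator's vocabulary is masked down to `allowed_next(current_prefix)`, which is what guarantees the final output parses under the grammar.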
2.3.0.2 SQL Grammar for constrained decoding
The SQL grammar for constrained decoding is specifically designed to ensure the generation of syntactically correct and contextually appropriate SQL statements. In this grammar, components wrapped in square brackets denote optional elements, allowing for the creation of SQL queries with or without certain clauses, such as the 'where' clause. The grammar is engineered to support SQL 'select' statements, which are the primary means for data retrieval from databases. These 'select' statements can stand alone or be part of complex queries connected by set operations like unions, intersects, and excepts.
Figure 2.9: SQL Grammar Rule
A key feature of this grammar is the inversion of the 'from' and 'select' statements. Such an arrangement is not merely to enforce syntactical correctness but also to ensure that the generated SQL queries only reference valid table names and column names, which must exist within the given database schema. The logic behind this inversion is strategic: by parsing the 'from' statement first, the grammar can ascertain the available tables. Consequently, when the 'select' statement is parsed, the grammar can then restrict token predictions to columns that are present within these tables.
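The effect of parsing 'from' before 'select' can be sketched with a small helper that computes which column tokens remain legal once the tables in scope are known. The token format is illustrative, not the thesis's actual grammar.

```python
def valid_select_columns(tokens, schema):
    """Given a token stream that starts with a 'from' clause, return the
    column tokens a constrained decoder could legally predict after
    'select': only columns of tables already named in 'from'."""
    assert tokens[0] == "from"
    sel = tokens.index("select")
    tables = tokens[1:sel]  # tables brought into scope by the 'from' clause
    return {f"{t}.{c}" for t in tables for c in schema.get(t, [])}

schema = {"user": ["id", "name"], "account": ["country"]}
cols = valid_select_columns(["from", "user", "select"], schema)
# account.country is excluded: its table is not in the 'from' clause
```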
For a particular query and database schema combination, the grammar is augmented with additional production rules that specify the valid tables and columns. For the example in figure 2.7 above, if the current query starts with "from User select", then tokens like "user.ID" and "user.name" are deemed valid for prediction, while tokens like "account.country" are not, as they do not belong to any table referenced in the 'from' clause.