Learning to Transform Vietnamese Natural Language Queries into SQL Commands

Thi-Hai-Yen Vuong, Thi-Thu-Trang Nguyen, Nhu-Thuat Tran
University of Engineering & Technology, Vietnam National University, Hanoi, Vietnam
yenvth@vnu.edu.vn, 15021317@vnu.edu.vn, thuattn@vnu.edu.vn

Le-Minh Nguyen
Information Science School, Japan Advanced Institute of Science & Technology, Ishikawa, Japan
nguyenml@jaist.ac.jp

Xuan-Hieu Phan
University of Engineering & Technology, Vietnam National University, Hanoi, Vietnam
hieupx@vnu.edu.vn

Abstract—In the field of data management, users traditionally manipulate their data using the Structured Query Language (SQL). However, this method requires an understanding of relational databases, the data schema, and SQL syntax and semantics. Database manipulation using natural language is therefore much more convenient, since ordinary users can interact with their data without a background in databases and SQL. This is difficult to achieve, however, because transforming natural language commands into SQL queries is a challenging task in natural language processing and understanding. In this paper, we propose a novel two-phase approach to automatically analyzing and converting natural language queries into the corresponding SQL forms. In our approach, the first phase is component segmentation, which identifies primary clauses in SQL such as SELECT, FROM, WHERE, ORDER BY, etc. The second phase is slot-filling, which extracts sub-components for each primary clause such as SELECT column(s), SELECT aggregation operation, etc. We conducted an empirical evaluation of our method using conditional random fields (CRFs) on a medium-sized corpus of natural language queries in Vietnamese, and achieved promising results with an average accuracy of more than 90%.

Index Terms—Understanding natural
language query, transforming natural language queries into SQL queries

978-1-7281-3003-3/19/$31.00 ©2018 IEEE

I. INTRODUCTION

Relational databases store a vast amount of data in most current information systems. A common way to access and manage the data in those databases is to use the Structured Query Language (SQL). However, this method requires an understanding of relational databases, the data schema, and SQL syntax and semantics. Database manipulation using natural language is therefore much more convenient, since ordinary users can interact with their data without a background in databases and SQL. An example of the text-to-SQL generation task is shown in the figure below.

Fig. Example of transforming a natural language query into a SQL query.

Various approaches have recently been proposed to solve this task. In the first group, the components of the SQL query are identified manually by using rules [11]. Other approaches formalize the text-to-SQL task as a machine translation problem [5], [8], [9]. Semantic parsing has also been applied to generate the SQL query structure in the manner of a context-free grammar [14]. Most current approaches handle only very basic queries such as those in the WikiSQL dataset [16]. However, a large number of queries have a complex structure and include optional components such as joins, GROUP BY clauses, and nested queries. For example, the following query returns the names of students whose Math scores are greater than the average Math score of all students in Ha Noi city:

    SELECT Ho_ten FROM Diem_thi WHERE diem_toan > (SELECT AVG(diem_toan) FROM Diem_thi WHERE cum_thi = 'Ha Noi')

In our work, we categorize SQL queries into three levels based on the complexity of the query structure: simple, medium, and complex. At the simple level, SQL queries contain only the primary components. For example, the following query returns the names of students whose Math scores are greater than 9:

    SELECT Ho_ten FROM Diem_thi WHERE diem_toan > 9

At the medium level, the input query requires
knowledge beyond the query text in order to be analyzed and converted into a SQL query. For example, the query “the names of cities with more than 10 students whose total score is greater than 27” corresponds to the SQL query below. Transforming this input query requires knowing that “total score” is the sum of three scores (diem_toan, diem_li, and diem_hoa).

    SELECT cum_thi FROM Diem_thi WHERE (diem_toan + diem_li + diem_hoa) > 27 GROUP BY cum_thi HAVING COUNT(*) > 10;

At the highest level, complex, the input query includes sophisticated components such as joined tables, nested queries, and sub-queries. For example, the following query returns the most frequent Math score in Ha Noi city:

    SELECT diem_toan FROM (SELECT diem_toan, COUNT(*) AS so_thi_sinh FROM Diem_thi WHERE cum_thi = 'Ha Noi' GROUP BY diem_toan ORDER BY so_thi_sinh DESC LIMIT 1) AS t

Most current approaches can handle only a part of the simple level and a small part of the complex level.

Analyzing a natural language query and converting it into a SQL query is challenging for two reasons. First, a SQL query can have a complex structure, such as multiple levels of nesting. Second, the task depends not only on the input query but also on the database architecture, including the list of tables, the table structures, and the relations among tables.

In this paper, we propose a new approach to analyzing natural language queries and converting them into SQL queries. Our approach consists of two major phases: (1) a component segmentation phase, which identifies primary clauses in the SQL structure such as SELECT, WHERE, GROUP BY, etc., and (2) a slot-filling phase, which extracts the sub-components of each primary clause such as SELECT column(s), SELECT aggregation operation, etc. We focus on natural language queries at the simple level. Our work has three main contributions:
• We propose a novel two-phase approach to analyzing and understanding natural language queries.
• We build machine learning models to solve the component segmentation problem and the slot-filling
problem using the conditional random field (CRF) method.
• We build a medium-sized dataset of Vietnamese natural language queries for evaluation and achieve promising results on it.

II. RELATED WORK

Semantic parsing. In semantic parsing based on representation learning for sequence generation, natural language descriptions are parsed into logical forms [3]. Text-to-SQL is a sub-task of semantic parsing, and earlier work focuses on specific databases [4], [7], [10], [12]. Recent research considers generalizing to new databases by incorporating user guidance [5]. Another direction incorporates the data in the table as an additional input [8], [9]. These approaches are limited by security and scalability issues when handling large-scale user databases. In 2017, Zhong et al. proposed Seq2SQL [16], a deep neural network combined with policy-based reinforcement learning. In 2017, Xu et al. proposed SQLNet to solve the ordering issue that Seq2SQL encountered [13]. In 2018, Yu et al. proposed TypeSQL, which builds on the architecture of SQLNet and formulates the task as a slot-filling problem [15].

Natural language interfaces for databases. One pioneering study is PRECISE [10], which maps the tokens of the query to column attributes and values in the database table. Giordani and Moschitti translate questions to SQL queries by first generating candidate queries from a grammar and then ranking them using tree kernels [4]. These approaches depend on the accuracy of the grammar and are not suitable for tasks that require generalization to a new schema. Iyer et al. use a sequence-to-sequence (Seq2Seq) neural network model [5]; the limitations of the Seq2Seq model can be overcome by adding human feedback.

III. OUR APPROACH

A. Analyzing and Understanding Natural Language Query

The two figures below show a detailed example of the two phases and the input and output of the process.

Fig. An example of the two phases in the task.

Fig. The process of the two phases in the task.

In the component segmentation problem, we define a list of five component types, $L_{cs} = \{\text{select}, \text{condition}, \text{group by}, \text{order by}, \text{other}\}$, as shown in Table I. Given an input Vietnamese natural language query $x = (x_1, x_2, \ldots, x_n)$, component segmentation splits the input query into a list of SQL clauses $C_x = \{c_i(l_i^{cs}, s_i, e_i)\}$. For each component $c_i(l_i^{cs}, s_i, e_i) \in C_x$, $l_i^{cs} \in L_{cs}$ is the component type, and $s_i$ and $e_i$ are the positions of the start token and the end token of $c_i$ in $x$. In this work, we focus on natural language commands at the simple level, which means that components are non-overlapping. Component segmentation is formalized as a sequence tagging problem.

TABLE I
CLAUSE TYPES

CLAUSE TYPE | SHORT    | DESCRIPTION
select      | sel      | Component that executes a query retrieving information from the database table
condition   | cond     | Condition for filtering information in the database table
Group by    | group by | Merging records with the same value in one or more columns
Order by    | order by | Sorting the results and performing operations (max, min)

Similarly, slot-filling is also formalized as a sequence tagging problem. The list of labels in the slot-filling phase consists of four types, $L_{sf} = \{\text{column}, \text{aggregator}, \text{operator}, \text{value}\}$, as shown in Table II.

TABLE II
SLOT TYPES

SLOT TYPE  | SHORT | DESCRIPTION
column     | col   | Column name in the SELECT, WHERE, etc. clause
aggregator | agg   | Operations: sum, count, average, none
Operator   | op    | Comparison operations in the WHERE clause for text and numeric data types (datetime data are converted to numeric)
Value      | val   | The values in the WHERE clause and the ORDER BY clause (desc, asc)

B. Building the Analyzing and Understanding Natural Language Query Model with Conditional Random Fields

For segmentation and slot-filling problems, linear-chain graphical models such as conditional random fields (CRFs) [6] and hidden Markov models (HMMs) [1] have proven effective because they encode the sequential dependencies between consecutive positions. We use CRFs to solve both problems. We use the IOB format to represent labels for both
the component segmentation task and the slot-filling task. In the component segmentation task, the set of class labels is $L_{cs} = \{\text{select}, \text{condition}, \text{group by}, \text{order by}, \text{other}\}$. The tag B-<component type> marks the first token of a component, I-<component type> marks each subsequent token of that component up to and including the last one, and O marks tokens outside any component. Similarly, the set of slot-filling labels is $L_{sf} = \{\text{column}, \text{aggregator}, \text{operator}, \text{value}\}$.

Training a CRF model, that is, estimating its parameters, means searching for the optimal weight vector $\theta^* = (\lambda_1^*, \lambda_2^*, \ldots, \lambda_n^*)$. This is commonly done by maximizing the likelihood function with advanced convex optimization techniques; recent studies have shown that L-BFGS is efficient for this purpose. The label sequence for a new input $x$ is predicted as $y^* = \arg\max_{y} p_{\theta^*}(y \mid x)$, where $y$ ranges over label sequences drawn from the label set $L$.

C. Feature Templates for Building the Analyzing and Understanding Natural Language Query Model

Feature selection is an important part of the analyzing and understanding natural language query model. The more label-specific characteristics the feature templates cover, the higher the accuracy of the model. Therefore, we design a variety of highly discriminative features, shown in Table III. The first is the contextual feature. We use a window to extract contextual information from the words around the current position: $\{w_{-n}, \ldots, w_{-2}, w_{-1}\}$ are the previous words, $w_0$ is the word at the current position, and $\{w_1, w_2, \ldots, w_n\}$ are the next words, where $n$ is the window size. The window size $n$ is set separately for the component segmentation model and the slot-filling model.

TABLE III
FEATURE TEMPLATES TO TRAIN THE CRFS MODEL

FEATURE GROUP      | FEATURE                                          | CONTEXT PREDICATE TEMPLATES
Contextual feature | Left context                                     | $[w_{-n}], \ldots, [w_{-2}], [w_{-1}]$
Contextual feature | Current token                                    | $[w_0]$
Contextual feature | Right context                                    | $[w_1], [w_2], \ldots, [w_n]$
POS tag            | Current token                                    | A part-of-speech tag $[pos_0]$
Orthographic       | Current token                                    | Orthographic projection $[or_0]$
Dictionaries       | is column name, is table name                    | Text templates for matching database information. 1-token: $[w_0]$; 2-token: $[w_{-1} w_0]$, $[w_0 w_1]$; 3-token: $[w_{-2} w_{-1} w_0]$, $[w_0 w_1 w_2]$
Component          | in select, in condition, in group by, in orderby | Type of the component that the current token is in, $[l_0^{cs}]$; only in the slot-filling model

In addition to text content, the models use the part-of-speech (POS) tags of words. For languages like Vietnamese, word boundaries must first be identified. Hence, the models actually use three kinds of information: word tokens, word orthography, and the POS tags of the segmented words. These add richer features to the models and therefore help to achieve better component segmentation performance. We also use dictionary and clause information as look-up features: is column name and is table name. The slot-filling model has one additional feature, the component type, which is the label information from the previous phase: in select, in condition, in group by, and in order by.

IV. EVALUATION

A. Experimental Data

To evaluate the proposed method, we asked annotators to annotate a dataset of Vietnamese natural language queries. We obtained a medium-sized dataset consisting of 1258 queries over three databases: a high school final test scores database, a flight database, and a book database. The two figures below show statistics of the dataset, including the number of samples for each component label and each slot-filling label and their proportions in the entire dataset.

Fig. Label statistics in the component segmentation phase.

Fig. Label statistics in the slot-filling phase.

We divide the dataset into folds with train/test splits and report the results of the best model for each phase.

B. Experimental Results and Analysis

To assess the performance of the proposed CRF model, we also built HMM and support vector machine (SVM) models [1], [2] with similar feature selection and analyzed the experimental results carefully. The results show that CRFs achieve the best result among the three models in both the component segmentation phase and the slot-filling phase, as can be seen in the two figures below. In CRFs, predicting the most likely output label sequence is based not only on the current observation and the current state but also on past and future observations and states, which is why CRFs outperform HMMs. SVMs achieve the lowest result among the three models because they do not consider state-to-state and observation-to-state dependencies as CRFs and HMMs do. Furthermore, CRFs model the probability of an entire state sequence given the observed sequence, which mitigates this issue, while SVMs only separate the data into categories by mapping the data points onto an optimal linear separating hyperplane.

Fig. Precision, recall and F1 score of the CRF, HMM and SVM models in the component segmentation phase.

Fig. Precision, recall and F1 score of the CRF, HMM and SVM models in the slot-filling phase.

Table IV shows the performance for each label in the component segmentation phase. The micro-averaged F1-score is 93.48%, which means that this feature selection achieves a high level of accuracy. Table IV also indicates that the precision for select and group by is higher than for condition and order by. This is understandable because the WHERE and ORDER BY clauses are more ambiguous.

TABLE IV
PRECISION, RECALL AND F1-SCORE OF THE COMPONENT SEGMENTATION MODEL WITH CRFS

Type          | Precision | Recall | F1-score
sel           | 96.46     | 95.73  | 93.92
cond          | 92.43     | 89.69  | 80.94
group by      | 97.50     | 97.50  | 97.50
order by      | 88.89     | 90.91  | 89.89
Average micro | 93.43     | 93.54  | 93.48

The results reported in Table V show the precision, recall and F1-score of each slot type. The slot-filling phase achieves high performance, with a micro-averaged F1-score of 91.9%. The aggregator and operator slots perform well because the aggregator belongs only to the SELECT clause and the operator belongs only to the WHERE clause. The column and value information in a SQL query, in contrast, can belong to many different clauses, such as the SELECT, GROUP BY, and ORDER BY clauses, which can lead to ambiguity when predicting the precise clause for the column information. Therefore, their performance is lower and less stable.
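As a rough illustration of how micro-averaged scores like those in Tables IV and V can be obtained from IOB-tagged output, the sketch below decodes IOB tag sequences into labeled spans and counts exact span matches. The paper does not state whether its scores are computed per token or per span, so exact-match span scoring is an assumption here, as are the tag spellings (`B-sel`, `I-cond`, ...) and the function names `iob_to_spans` and `micro_prf`.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (label, start index, end index), end inclusive


def iob_to_spans(tags: List[str]) -> List[Span]:
    """Decode an IOB tag sequence (e.g. B-sel, I-sel, O) into labeled spans."""
    spans: List[Span] = []
    label, start = None, 0
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel closes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.append((label, start, i - 1))
            label = None
        if tag.startswith("B-"):
            label, start = tag[2:], i
        # An I- tag with no open span of the same type is treated like O.
    return spans


def micro_prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over exact span matches (assumed metric)."""
    tp = n_pred = n_gold = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans = set(iob_to_spans(gold_tags))
        pred_spans = set(iob_to_spans(pred_tags))
        tp += len(gold_spans & pred_spans)
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy query of six tokens: the predicted condition span ends one token too early.
gold = [["B-sel", "I-sel", "O", "B-cond", "I-cond", "I-cond"]]
pred = [["B-sel", "I-sel", "O", "B-cond", "I-cond", "O"]]
print(iob_to_spans(gold[0]))
print(micro_prf(gold, pred))
```

In this toy case only the select span matches exactly, so one of two predicted spans and one of two gold spans count as correct. A token-level variant would give partial credit to the truncated condition span, which is one reason the choice of scoring unit matters when comparing numbers across papers.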
TABLE V
PRECISION, RECALL AND F1-SCORE OF THE SLOT-FILLING MODEL WITH CRFS

Type          | Precision | Recall | F1-score
sel-col       | 87.23     | 80.39  | 83.67
cond-col      | 93.56     | 92.31  | 92.93
group by-col  | 97.50     | 97.50  | 97.50
order by-col  | 87.76     | 87.76  | 87.7
sel-agg       | 96.91     | 98.95  | 97.92
cond-op       | 93.75     | 96.15  | 94.94
cond-val      | 91.81     | 93.08  | 92.94
order by-val  | 92.31     | 87.80  | 90.90
Average micro | 92.62     | 91.18  | 91.90

The figure below presents some examples of predictions by the model together with the ground truth. At the simple level, where queries contain only non-overlapping components, our model segments the SQL clauses and extracts the sub-components quite accurately, as in the first and second examples. For higher-level queries such as the third example, our model predicts “Khoi A” (Combination A) as a column, which is incorrect. In this case, domain knowledge is needed to understand “combination A”. Additionally, our model cannot cover complex queries that contain joined tables and nested queries.

Fig. Example of predictions by the model and ground truth results. Q denotes the natural language query. L and L’ denote the ground truth label and the label produced by the model. S and S’ denote the ground truth SQL query and the SQL query produced by the model.

V. CONCLUSION

In this work, we propose a new two-phase approach to converting Vietnamese natural language queries into the Structured Query Language. The results of this research may help both experienced and inexperienced users manage databases easily. In our approach, both phases are formalized as sequence tagging problems, to which well-known machine learning methods can readily be applied. Within the scope of this research, the CRF model outperforms the HMM and SVM models. The approach handles natural language queries at the simple level well, with micro-averaged F1-scores above 90% in both phases. As a next step, the approach can be developed further to handle natural language queries at the medium and complex levels.

REFERENCES

[1] Baum, L.E. and Petrie, T., 1966. “Statistical inference for probabilistic functions of
finite state Markov chains.” The Annals of Mathematical Statistics, 37(6), pp. 1554-1563.
[2] Cortes, C. and Vapnik, V., 1995. “Support-vector networks.” Machine Learning, 20(3), pp. 273-297.
[3] Dong, L. and Lapata, M., 2016. “Language to Logical Form with Neural Attention.” CoRR, abs/1601.01280.
[4] Giordani, A. and Moschitti, A., 2012. “Translating questions to SQL queries with generative parsers discriminatively reranked.” Proceedings of COLING 2012: Posters, pp. 401-410.
[5] Iyer, S., Konstas, I., Cheung, A., Krishnamurthy, J. and Zettlemoyer, L.S., 2017. “Learning a Neural Semantic Parser from User Feedback.” ACL.
[6] Lafferty, J., McCallum, A. and Pereira, F.C., 2001. “Conditional random fields: Probabilistic models for segmenting and labeling sequence data.”
[7] Li, Y., Yang, H. and Jagadish, H.V., 2006. “Constructing a generic natural language interface for an XML database.” In EDBT, Vol. 3896, pp. 737-754.
[8] Mou, L., Lu, Z., Li, H. and Jin, Z., 2017. “Coupling Distributed and Symbolic Execution for Natural Language Queries.” ICLR.
[9] Pasupat, P. and Liang, P.S., 2015. “Compositional Semantic Parsing on Semi-Structured Tables.” ACL.
[10] Popescu, A.M., Etzioni, O. and Kautz, H., 2003. “Towards a theory of natural language interfaces to databases.” In Proceedings of the 8th International Conference on Intelligent User Interfaces, pp. 149-157. ACM.
[11] Stratica, N., Kosseim, L. and Desai, B.C., 2005. “Using semantic templates for a natural language interface to the CINDI virtual library.” Data & Knowledge Engineering, 55(1), pp. 4-19.
[12] Wang, C., Cheung, A. and Bodik, R., 2017. “Synthesizing highly expressive SQL queries from input-output examples.” In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 452-466. ACM.
[13] Xu, X., Liu, C. and Song, D., 2017. “SQLNet: Generating structured queries from natural language without reinforcement learning.”
[14] Yaghmazadeh, N., Wang, Y., Dillig, I. and Dillig, T., 2017. “SQLizer: Query synthesis
from natural language.” Proceedings of the ACM on Programming Languages, 1(OOPSLA), p. 63.
[15] Yu, T., Li, Z., Zhang, Z., Zhang, R. and Radev, D., 2018. “TypeSQL: Knowledge-based type-aware neural text-to-SQL generation.”
[16] Zhong, V., Xiong, C. and Socher, R., 2017. “Seq2SQL: Generating structured queries from natural language using reinforcement learning.”