2015 Seventh International Conference on Knowledge and Systems Engineering

L2S: Transforming Natural Language Questions into SQL Queries

Duc Tam Hoang, Faculty of Information Technology, VNU University of Engineering and Technology, Vietnam National University, Hanoi (tamhd1990@gmail.com)
Minh Le Nguyen, School of Information Science, Japan Advanced Institute of Science and Technology (nguyenml@jaist.ac.jp)
Son Bao Pham, Faculty of Information Technology, VNU University of Engineering and Technology, Vietnam National University, Hanoi (sonpb@vnu.edu.vn)

978-1-4673-8013-3/15 $31.00 © 2015 IEEE. DOI 10.1109/KSE.2015.38

Abstract: The reliability of a question answering system is bounded by the availability of resources and linguistic tools. In this paper, we introduce a hybrid approach to transforming natural language questions into structured queries. It alleviates the lack of experts for domain observation and the deficient performance of linguistic tools. Specifically, we exploit semantic information to map natural language terminology to structured-query elements, and a bipartite graph model for the matching phase. Experimental results on the Vietnam national university entrance exam dataset and the Geoqueries880 [27] dataset achieve accuracies of 91.14% and 87.55%, respectively.

I. INTRODUCTION

A reliable QA system requires an approach capable of exploiting details from the question as well as the background knowledge of the domain, which makes a closed-domain QA system domain-dependent.

Question answering has been addressed by numerous systems following a variety of approaches. Current approaches are generally categorized into two main types: statistical and non-statistical.

Statistical approaches are dedicated to the manipulation of statistical models. From a set of pairs between natural language and machine language, a statistical approach first builds a training set of correct and incorrect question-answer pairs [13]. The QA problem is then transformed into a binary classification task whose labels are correct and incorrect mappings between the question and the answer or query. Features for machine learning may be extracted from the tokens and syntactic trees of questions and queries [12]. State-of-the-art approaches generally learn from a set of annotated training data. If the domain has rich training data, it is feasible to develop a statistical approach. However, if training data is absent, especially for under-resourced languages such as Vietnamese, statistical approaches are at a deadlock [10].

Rule-based (grammar-based) approaches pose a different type of problem. Firstly, extensive effort from experts is required for handcrafting rules. Secondly, if the database changes, a typical grammar-based approach demands a huge workload to create a new set of rules; otherwise the accuracy falls significantly [11].

L2S aims to deal with the problems described above. Our objective is an approach which tolerates the lack of an annotated corpus, of experts' effort in domain analysis, and of powerful linguistic tools. In this sense, linguistic tools are the underlying tools which support the QA process, such as the tokenizer, part-of-speech tagger, syntactic parser, dependency parser and named-entity recognizer. A single failure of a linguistic tool generally leads to an incorrect answer.

We propose a hybrid approach which utilizes semantic information and a graph model as a converter from questions posed in natural language into structured (SQL) queries. Firstly, we propose a method for mapping natural language statements to SQL elements by exploiting semantic information. Secondly, we construct a bipartite graph to evaluate the remaining tokens. L2S aims to require no training data and only minimal domain knowledge. Currently, in the data preprocessing phase, an understanding of the domain is necessary, but it takes far less effort than building a set of rules or collecting an annotated training dataset.

Apart from this introduction, the rest of the paper is arranged as follows. Section II provides a review of related work on closed-domain QA systems. Section III presents the architecture of L2S. Section IV proposes our approach to the problem of interlingual mapping. Two experiments on L2S are presented in Section V. Finally, Section VI summarizes the main points and discusses future extensions.

II. RELATED WORKS

In this section, we provide an insight into publications related to closed-domain question answering systems and the databases that they target.

Notably, statistical approaches have only begun to receive serious attention recently, in some specific areas such as the medical domain [9]. With a huge set of annotated training data, statistical approaches demonstrate promising results [14] [24].

Regarding non-statistical approaches, the majority are rule-based [5]. Compared to statistical approaches, which gain attention in research-oriented work, rule-based approaches find favour in real-life industrial systems. The main component of a rule-based approach is a set of QA patterns [23] [8] [22]. Because the patterns are written by experts with extensive domain knowledge, a rule-based system obtains promising results with a relatively small set of patterns. When the set of patterns grows larger, it becomes difficult to manage or improve; system accuracy does not always improve when a new rule is added, as it may conflict with the existing rules.

The database type plays a crucial role in building a QA system. Over recent years, the growth of data in tabular presentation has brought attention to QA over relational databases. Structured databases have become a main data format, beside written free text. Moreover, the conversion from semi-structured resources (such as tables) to a relational database is straightforward compared to other knowledge representations such as ontologies.
There are also non-rule approaches, such as syntactic-based analysis [2], Prolog-like representation [25] and graph-based approaches [21].

A. Converting a natural language question into a structured query

The database is a critical factor in QA. High-level representations of data (such as ontologies) support complex operations, but the quantity of available data in such formats is sparse. In many systems, the knowledge has to be converted manually from written text: if a method uses an ontology, a stage converting the data, whether a structured database or free text, into the ontology format is inevitable. There is thus a trade-off between accessing the database and transforming text into a database. Aqualog [15], QuestIO [7] and follow-up systems achieve promising results, but they are not popular in industry because of their knowledge representation (ontology, OWL). The ad-hoc transformation between a written corpus or a structured query database and an ontology is laborious. Automatic processing for the transformation is far from meeting the requirements of QA systems; sometimes it is difficult even for humans to model the data in ontology relations.

VnQAS [19] is a follow-up system of Aqualog and one of the notable QA systems for Vietnamese [17]. Based on the structure of Aqualog, the authors of VnQAS transformed the question into a knowledge representation via pattern matching. For the demonstration, they manually created an ontology (15 concepts, 17 attributes and 78 instances) and a set of hand-written patterns for questions in the domain of organizational structure.

III. L2S ARCHITECTURE

Given a question Q relating to a specific database D, L2S aims to transform Q into a SQL query of the form SELECT ... FROM ... WHERE. The WHERE field can be either a single condition or a conjunction of multiple conditions; a condition is a pair of two variables ei and ej compared by an operator o.

L2S consists of three modules. First, Q and D are analysed by the Preprocessing module. Then, in the Matching module, the data is processed through the Semantic Matching component and the Graph-based Matching component, respectively. Finally, the Generating module is responsible for building a complete SQL query. Because the Matching module is the most difficult part of the QA problem, each QA system undertakes a different technique; the method used in L2S is described in detail in Section IV.

A. Preprocessing module

The Ambiguity Solving component guarantees a sound (unambiguous) input for the matching module. Taking the output of the Linguistic component and the Lexicon component, L2S compares all words of W with the elements of D. A word w1 is evaluated as ambiguous if it matches zero, or more than two, elements of the same type. We tried ellipsis, and choosing the candidate with the highest possibility, but both were inefficient and sometimes led to failure. Instead, L2S first retrieves all words {w2, w3, ..., wn} for which the similarity function Ps(wi, w1) > λ; it is then possible to engage the user by clarifying the input with suggestions drawn from similar words in the available data.

B. Matching module

The Matching module is the core of a QA system. It is sometimes called by different names but serves the same purpose of interlingual mapping. In L2S, it is responsible for producing a list of specific SQL elements from the output of the previous step. The Matching module has to support three tasks. Firstly, function words in Q are identified. These are the words associated with a function in the SQL query, such as question words, comparison words and linking words, and they are processed differently from other words. Secondly, each input word wi is matched with an associated element oj from the database, where oj may be a value, an attribute or a relation; the Matching module has to resolve the ambiguity when one input word matches multiple objects. Thirdly, each pair (wi, oj) is linked to another pair (wk, oh) if there is a conditional relation between wi and wk. If no pair (wk, oh) is given by the question, L2S has to retrieve it.
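The ambiguity test and the similarity function Ps used by the Ambiguity Solving component can be sketched as follows. This is an illustrative reconstruction: the paper does not specify Ps, so a normalized edit-ratio stands in for it, and the threshold value λ is assumed.

```python
from difflib import SequenceMatcher

LAMBDA = 0.8  # similarity threshold (the paper's lambda; exact value assumed)

def similarity(a, b):
    """String similarity in [0, 1], standing in for the paper's Ps."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_matches(word, lexicon, threshold=LAMBDA):
    """Return lexicon entries whose similarity to `word` exceeds the threshold."""
    return [(term, kind) for term, kind in lexicon if similarity(word, term) > threshold]

def is_ambiguous(word, lexicon):
    """A word is ambiguous when it matches zero or several elements of the same type."""
    matches = find_matches(word, lexicon)
    by_type = {}
    for term, kind in matches:
        by_type.setdefault(kind, []).append(term)
    return len(matches) == 0 or any(len(v) > 1 for v in by_type.values())

# Toy lexicon of database elements: (term, element type)
lexicon = [("name", "attribute"), ("mark", "attribute"), ("marks", "attribute")]
print(is_ambiguous("marks", lexicon))  # True: matches both "mark" and "marks"
print(is_ambiguous("name", lexicon))   # False: exactly one match
```

When a word is flagged ambiguous, the candidates collected by `find_matches` are exactly the suggestions that can be shown to the user for clarification.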
From Q and D, the Preprocessing module extracts all the features for the matching phase. There are three preprocessing components: the Linguistic component (to analyse Q), the Lexicon component (to analyse D) and the Ambiguity Solving component (to correct the input).

C. Generating module

Taking the list E of all elements from the Matching module, the Generating module is responsible for creating a sound SQL query which delivers the exact answer.

Operator Action: considering the possible SQL queries which can be drawn from the database, we introduce a new phase to assign a suitable operator to each conditional pair. The set of operators between an attribute and a value includes = (equal), > (greater) and >= (greater or equal), among others.

The graph-based matching computes a maximum flow over the bipartite graph:

    initialize f(w, e) ← 0 for every edge (w, e)
    while there is a path p from source s to sink t with cf(p) > 0:
        cf(p) = min{ cf(w, e) : (w, e) ∈ p }
        for (w, e) ∈ p:
            f(w, e) ← f(w, e) + cf(p)    (send flow along the path)
            f(e, w) ← f(e, w) − cf(p)    (update the backward edge)
    return f

The maximum flow goes through all the nodes which are expected to appear in the SQL query.

Once a statement is processed, an answer is output by L2S or by the baseline. The answer is correct if its query derives the exact information that was asked for. One question may accept several queries, but it has a unique result, and the queries are evaluated by human annotators. The annotators perform two actions: first, they execute the query; then, if the query executes successfully, the result is compared with the precise answer provided by experts. The answer is correct if and only if the query is executable and the obtained result is identical to the precise answer. If the obtained result differs from the answer, it is incorrect. If the query is not delivered or is unexecutable, the answer is invalid.

V. EXPERIMENT

In this section, we present an empirical evaluation to assess the effectiveness of L2S in Vietnamese. We conduct experiments on two sample datasets from two domains. Each dataset consists of two parts: testing questions and a database.

A. Dataset

The first dataset was taken from the domain of university national entrance exam marks (UEEM). The database was taken from top universities in Vietnam. It contains the candidates' results along with their information, including name, date of birth, hometown and identification number.

In terms of testing questions, 429 questions were collected from human users. We divide the testing set into two types: straightforward questions (68) and general questions (361). Straightforward questions have a simple structure, indicated by the providers: the question carries a complete meaning, with one question word and no ambiguous term. In contrast, general questions are not bounded by the definition of simplicity; they were expressed in a natural way. They might have more than one question word, or no question word at all. Some terms were omitted and some unrelated words were added, which leads to the problem of ambiguity; however, this preserves the perception of natural language. It follows the rule of the noisy channel: the noisy channel makes what people say different from what they think.

The database of UEEM contains three types of information: relations (the names of the universities), attributes ("identification number", "name", "marks" and the titles of other information) and values (the value of each instance in the database). It is common for two candidates to receive the same mark in [0-10], so one value may belong to several instances.

B. Experiment

We first analyze the testing sets (68 straightforward and 361 general questions). Most questions belong to three main categories: entity, asking about a specific subject or object; number, asking for the quantity or ranking of a group of subjects; and ratio, asking for a proportion.

Table I shows the difference between the two testing sets of UEEM. While the majority of straightforward questions are entity questions (70.59%), general questions are divided more evenly. Notably, 9.14% of the general questions are statement sentences, which do not contain a question word. This would lead to errors with hand-written patterns designed to capture the question structure.
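For illustration, once the matching phase has produced its elements, the target SELECT-FROM-WHERE form can be assembled as in the sketch below. The schema here (a `candidates` table with `hometown` and `total_mark` columns) is invented for the example and is not the actual UEEM schema.

```python
def build_query(select_attr, table, conditions):
    """Assemble a SELECT-FROM-WHERE query from matched elements.
    `conditions` is a list of (attribute, operator, value) triples,
    i.e. the conditional pairs plus the operator assigned to each."""
    where = " AND ".join(f"{a} {op} {v!r}" for a, op, v in conditions)
    return f"SELECT {select_attr} FROM {table} WHERE {where}"

# Hypothetical mapping for "Which candidates from Hanoi scored higher than 25?"
q = build_query("name", "candidates",
                [("hometown", "=", "Hanoi"), ("total_mark", ">", 25)])
print(q)  # SELECT name FROM candidates WHERE hometown = 'Hanoi' AND total_mark > 25
```

Each triple corresponds to one condition of the WHERE field described in Section III; conjunction with AND mirrors the "joint of multiple conditions" case.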
This makes the database highly ambiguous. To guarantee a sound input, we implemented the preprocessing module to enhance performance on this QA task. The source code and all updates of the completed system have been published online (http://sourceforge.net/projects/l2s/).

The second dataset is in the domain of geography (GEO). By translating the Geoqueries880 [27] question set (http://www.cs.utexas.edu/users/ml/geo.html) into Vietnamese, we collected an original set of 880 questions. All American proper names were substituted with the corresponding Vietnamese names. We filtered out similar questions to keep 498 distinct questions for testing. We employed three translators working separately; they then discussed their versions to produce one final set of testing questions.

Both datasets were collected carefully. We not only test our approach experimentally, but also create two standard testing datasets for QA in the Vietnamese language.

For each domain, we manually created a mapping table between the terminology of the columns and the possible expressions in questions. The table was created based on common knowledge of the domain, and we keep the same table for both experiments.

To measure the performance of L2S, we built a baseline with the graph-based approach, which treats the database as a dictionary [21]. Without a technique for solving ambiguity, the baseline ignores all questions which are intractable, i.e. questions with ambiguous words.

Table I: Straightforward and general test sets of the UEEM dataset

                           Straightforward   General
  Total                          68            361
  Entity question              70.59%        48.48%
  Number question              20.59%        35.46%
  Ratio question                8.82%         8.31%
  Non-question                    0%          9.14%

  500 highest-IDF words
  Named entity                   84%           85%
  Numeric                       29.2%          26%
  Proper noun                   30.5%         33.5%
  Question words                 4.5%          6.2%
  Other types                   19.8%         20.3%

Table II illustrates the precision, recall, F-measure and accuracy on the two testing sets. Precision is the percentage of correct answers among all given answers. Recall is the percentage of correct answers in the set of correct answers and no-answers. Accuracy is the percentage of correct answers over all questions. L2S answers all questions, while the baseline only answers tractable questions; therefore the recall of L2S is always 100%.

Table II: Experiment with the UEEM dataset

  Sim test      Baseline     L2S
  Precision      76.47%    98.53%
  Recall           100%      100%
  F-measure      86.67%    99.26%
  Accuracy       76.47%    98.53%

  Gen test      Baseline     L2S
  Precision      75.96%    91.13%
  Recall         23.51%      100%
  F-measure      35.91%    95.36%
  Accuracy       21.89%    91.13%

The precision of the baseline remains stable at around 76%. However, it refuses to answer intractable questions, which are abundant in the general set, so its recall and F-measure drop considerably.

The baseline results are interesting because they indicate that the graph-based method is somewhat effective: the less ambiguous the domain, the better it performs (and the UEEM database is the more ambiguous of the two). However, it ignores the majority of questions because they are intractable, so this method alone cannot be used in an actual system.

L2S achieves a high accuracy of 91.13% on the general test, compared to 21.89% for the baseline. There are three main reasons for this distinction:

  • Linguistic tool failures, including the tokenizer and the parser. A minor mis-position in the parser makes a question intractable; L2S resolves this by using semantic information.
  • L2S recognizes comparison words such as "lớn hơn" (greater) and "nhỏ hơn" (smaller) as named entities. The baseline cannot find an equivalent element for them in the database and treats such questions as intractable.
  • General questions contain unknown words, either stop words or words not in the database.

Next, we measure the performance of L2S in the new domain. Because the configuration of L2S is mostly the same between the two experiments, we compare the results and put forward our evaluation. The precision and F-measure of L2S in the second experiment are lower than in the first.
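The measures reported in Table II can be computed as below. The raw counts are not given in the paper; the ones used here are reconstructed from the reported percentages (361 general questions, of which the baseline answered about 104 with 79 correct), so treat them as approximate.

```python
def evaluate(correct, answered, total):
    """Metrics as defined in the paper: precision = correct/answered,
    recall = correct/(correct + unanswered), accuracy = correct/total."""
    unanswered = total - answered
    precision = correct / answered
    recall = correct / (correct + unanswered)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = correct / total
    return precision, recall, f_measure, accuracy

# Baseline on the general test set (counts reconstructed, approximate)
p, r, f, acc = evaluate(correct=79, answered=104, total=361)
print(f"P={p:.2%} R={r:.2%} F={f:.2%} Acc={acc:.2%}")
```

With these counts the formulas reproduce the reported baseline row: precision 75.96%, recall 23.51%, F-measure 35.91% and accuracy around 21.9%. A system that answers everything (answered = total) always gets recall 100% under this definition, which is why L2S's recall is fixed at 100%.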
The main reason is the failure of tokenization and named-entity recognition. Like any other linguistic system, L2S has one crucial requirement: the tokenizer has to work precisely.

Overall, the performance of L2S is promising. L2S has proved its robustness on two sample datasets. With the same linguistic tools for an under-resourced language like Vietnamese, L2S requires neither annotated training data nor a set of hand-crafted rules. Comparing this result with the first experiment, L2S has demonstrated its effectiveness in dealing with a different domain. This supports the effectiveness of our hybrid approach combining semantic information and a graph-based model. However, some mistakes in tokenization and question-word processing lead to incorrect answers from both L2S and the baseline.

C. Experiment

A brief analysis of the second dataset shows that the proportion of named entities among the 500 words with the highest IDF values is 81.3%. As in the UEEM dataset, named entities in the second experiment are mostly proper names of locations, rivers and mountains, together with question words and comparison words. They play an important role in the question answering task.

D. Discussion

The experiments were conducted on two different datasets with no annotated training set or hand-crafted rules available. Human intervention was minimized to the common knowledge of the domains. The results strengthen our hypothesis that it is viable to build a reliable question answering system for under-resourced languages. The two experiments were conducted with the same system configuration: no extensive observation of the domain for hand-written rules or annotated features is required, and the whole workload is significantly smaller than the time and effort spent on writing structured rules in a typical rule-based system.

The word frequency in the second dataset is not as high as in the first. Given a random element, such as the name of a river, the height of a mountain or the population of a province, its value is effectively unbounded.
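The IDF profiling used above to characterize the question sets (the 500 highest-IDF words) can be sketched as follows. The paper does not state its exact formula, so the standard idf(w) = log(N / df(w)) is assumed here.

```python
import math
from collections import Counter

def idf_ranking(documents, top_k=5):
    """Rank vocabulary by inverse document frequency, idf(w) = log(N / df(w)).
    `documents` is a list of token lists (here, tokenized questions)."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        for w in set(doc):  # count each word once per document
            df[w] += 1
    idf = {w: math.log(n / df[w]) for w in df}
    return sorted(idf, key=idf.get, reverse=True)[:top_k]

# Toy question set (English stand-ins for the Vietnamese questions)
questions = [["which", "river", "is", "longest"],
             ["which", "mountain", "is", "highest"],
             ["which", "province", "has", "the", "largest", "population"]]
print(idf_ranking(questions))
```

Words appearing in every question ("which") get an IDF of zero, while domain terms such as proper names appear in few questions and float to the top, which is consistent with the 81.3% named-entity share reported above.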
The number of elements which share the same value is smaller than in the first database; in other words, this database is less ambiguous than the first one.

Nevertheless, this dataset provides a new challenge. In the database, 35.6% of the values (including mountains, lakes and rivers) have unusual names originating from local languages, such as "T'nưng" (a lake), "Phan Xi Pan" (a mountain) and "Xi Giơ Pao" (a mountain). These localized names lead to failures of linguistic tools; in our prior analysis, none of the available Vietnamese linguistic tools could handle them.

Table III lists the results of the baseline and L2S on this testing set. Since we are interested in the performance on a new domain, we evaluate the F-measure of both systems.

Table III: Experiment with the Geoquery dataset

  Gen test      Baseline     L2S
  Precision       83.3%    87.55%
  Recall         22.01%      100%
  F-measure      34.82%    95.36%
  Accuracy       21.08%    87.55%

The baseline results show that, for the geographical questions, mismatched queries are less common; this makes the precision of the baseline higher (83.3%), which is not surprising.

We sampled incorrect results from all experiments; each incorrect answer was analyzed to find its main cause.

Table IV: The main causes of errors

  Problem                     Baseline     L2S
  Linguistic tool failure      38.23%     4.86%
  Special tags                  5.36%     0.26%
  Ambiguous question words      4.15%     4.15%
  Complex questions            26.60%     3.26%

Table IV shows a significant reduction in three main causes of error. Problems on which the baseline refused to answer were resolved by L2S. L2S leverages the precise output of one linguistic tool to overcome the failure of others. For example, for the question "Thí sinh có ngày sinh 06-03-1990?" ("Which candidate has the date of birth 06-03-1990?"), the dependency parser failed to identify the relation between "ngày sinh" (date of birth) and "06-03-1990". However, L2S detects that "06-03-1990" carries the named-entity tag "date", which is mapped to a synonym of "ngày sinh".
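The date-tag recovery in the example above can be sketched as a small rule: a token matching a date pattern receives the tag "date", which a synonym table maps to the attribute "ngày sinh". The tag-to-attribute table and the "điểm" (mark) entry are invented for illustration; they are not taken from the actual L2S lexicon.

```python
import re

# Assumed synonym table: a named-entity tag -> the database attribute it implies
TAG_TO_ATTRIBUTE = {"date": "ngày sinh", "number": "điểm"}

DATE_RE = re.compile(r"^\d{2}-\d{2}-\d{4}$")

def tag_token(token):
    """Assign a coarse named-entity tag to a token (illustrative rules only)."""
    if DATE_RE.match(token):
        return "date"
    if token.isdigit():
        return "number"
    return None

def attribute_for(token):
    """Map a tagged token to the database attribute it implies, if any."""
    tag = tag_token(token)
    return TAG_TO_ATTRIBUTE.get(tag)

print(attribute_for("06-03-1990"))  # pairs the value with the "ngày sinh" attribute
```

This is how a pair (value, attribute) can still be formed even when the attribute word itself is missing from the question: the value's tag alone selects the attribute.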
Even when the word "ngày sinh" is removed from the question, L2S still retrieves the appropriate pair for "06-03-1990" and successfully delivers the answer. We are developing a method to overcome the limitations of question classification based on lexical tags.

VI. CONCLUSION

In this paper, we presented our hybrid approach for developing QA systems in specialized domains. L2S contains a novel method for semantic processing and a follow-up method for graph-based processing. It overcomes the weaknesses of current approaches with regard to the lack of training data, the need for domain observation, and weak underlying linguistic tools.

L2S exploits information from the underlying tools to select reliable semantic information. The graph-based processing then handles the remaining tokens and entities between the two languages (natural language and SQL). By combining the two approaches, L2S alleviates the heavy dependence on linguistic applications and domain knowledge.

The experiments indicate the effectiveness of the hybrid method. The first experiment measured to what extent the presented approach can answer both straightforward and tricky questions. The second experiment demonstrated the robustness of L2S across different domains. The results show that L2S maintains its accuracy over different domains and requires only a small workload to switch domains.

ACKNOWLEDGMENT

This work is supported by the Nafosted project 102.01-2014.22.

REFERENCES

[1] Alexandrescu, A., Kirchhoff, K., 2009. Graph-based Learning for Statistical Machine Translation. In: Proc. of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 119-127.
[2] Androutsopoulos, I., 1995. Natural Language Interfaces to Databases: an Introduction. Journal of Natural Language Engineering 1, 29-81.
[3] Asratian, A.S., Denley, T.M.J., Häggkvist, R., 1998. Bipartite Graphs and Their Applications.
Cambridge University Press, New York, NY, USA.
[4] Cohen, W.W., Ravikumar, P., Fienberg, S.E., 2003. A Comparison of String Distance Metrics for Name-Matching Tasks. In: Proc. of the IJCAI-03 Workshop on Information Integration, pp. 73-78.
[5] Costa, P., Almeida, J., Pires, L., van Sinderen, M., 2008. Evaluation of a Rule-Based Approach for Context-Aware Services, pp. 1-5.
[6] Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., 2002. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In: Proc. of ACL, pp. 168-175.
[7] Damljanovic, D., Tablan, V., Bontcheva, K., 2008. A Text-based Query Interface to OWL Ontologies. In: Proc. of the Sixth International Conference on Language Resources and Evaluation, Marrakech, Morocco.
[8] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B., Pham, D.D., 2011. Ripple Down Rules for Part-of-Speech Tagging. In: Proc. of the 12th International Conference on Computational Linguistics and Intelligent Text Processing, pp. 190-201.
[9] Demner-Fushman, D., Lin, J., 2007. Answering Clinical Questions with Knowledge-Based and Statistical Techniques. Computational Linguistics 33, MIT Press, Cambridge, MA, USA, pp. 63-103.
[10] Dien, D., Kiem, H., 2003. POS-Tagger for English-Vietnamese Bilingual Corpus. In: Proc. of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 88-95.
[11] Garcia, K.K., Lumain, M.A., Wong, J.A., Yap, J.G., Cheng, C., 2008. Natural Language Database Interface for the Community Based Monitoring System. In: Proc. of the 22nd Pacific Asia Conference on Language, Information and Computation, De La Salle University (DLSU), Manila, Philippines, pp. 384-390.
[12] Giordani, A., 2008. Mapping Natural Language into SQL in a NLIDB. In: Proc. of the 13th International Conference on Natural Language and Information Systems, Springer-Verlag, Berlin, Heidelberg, pp. 367-371.
[13] Giordani, A., Moschitti, A., 2010. Semantic Mapping between Natural Language Questions and SQL Queries via Syntactic Pairing. In: Proc. of the 14th International Conference on Applications of Natural Language to Information Systems, Springer-Verlag, Berlin, Heidelberg, pp. 207-221.
[14] Kusumoto, T., Akiba, T., 2012. Statistical Machine Translation without Source-side Parallel Corpus Using Word Lattice and Phrase Extension. In: Proc. of the Eighth International Conference on Language Resources and Evaluation (LREC'12), European Language Resources Association (ELRA), Istanbul, Turkey.
[15] Lopez, V., Uren, V., Motta, E., Pasin, M., 2007. AquaLog: An Ontology-driven Question Answering System for Organizational Semantic Intranets. Web Semantics 5, 72-105.
[16] Moreda, P., Llorens, H., Saquete, E., Palomar, M., 2011. Combining Semantic Information in Question Answering Systems. Information Processing and Management 47, 870-885.
[17] Nguyen, D.T., Hoang, D.T., Pham, S.B., 2012. A Vietnamese Natural Language Interface to Database. In: Sixth IEEE International Conference on Semantic Computing (ICSC), Palermo, Italy, pp. 130-133.
[18] Nguyen, D.B., Hoang, S.H., Pham, S.B., Nguyen, T.P., 2010. Named Entity Recognition for Vietnamese. In: Proc. of the Second International Conference on Intelligent Information and Database Systems: Part II, Springer-Verlag, Berlin, Heidelberg, pp. 205-214.
[19] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B., 2009. A Vietnamese Question Answering System. In: Proc. of KSE, pp. 26-32.
[20] Nguyen, P.T., Vu, X.L., Nguyen, T.M.H., Nguyen, V.H., Le, H.P., 2009. Building a Large Syntactically-Annotated Corpus of Vietnamese. In: Proc. of the Third Linguistic Annotation Workshop, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 182-185.
[21] Popescu, A.M., Etzioni, O., Kautz, H., 2003. Towards a Theory of Natural Language Interfaces to Databases. In: Proc. of the 8th International Conference on Intelligent User Interfaces, ACM, New York, NY, USA, pp. 149-157.
[22] Saxena, A.K., Sambhu, G.V., Subramaniam, L.V., Kaushik, S., 2007. IITD-IBMIRL System for Question Answering Using Pattern Matching, Semantic Type and Semantic Category Recognition. In: Proc. of the Sixteenth Text REtrieval Conference, Gaithersburg, Maryland, USA.
[23] Sneiders, E., 2002. Automated Question Answering Using Question Templates That Cover the Conceptual Model of the Database. In: Andersson, B., Bergholtz, M., Johannesson, P. (Eds.), Natural Language Processing and Information Systems, Springer Berlin Heidelberg, pp. 235-239.
[24] Suzuki, J., Sasaki, Y., Maeda, E., 2002. SVM Answer Selection for Open-Domain Question Answering. In: Proc. of the 19th International Conference on Computational Linguistics, Stroudsburg, PA, USA, pp. 1-7.
[25] Waltz, D.L., 1978. An English Language Question Answering System for a Large Relational Database. Communications of the ACM 21, pp. 526-539.
[26] Wieling, M., Nerbonne, J., 2010. Hierarchical Spectral Partitioning of Bipartite Graphs to Cluster Dialects and Identify Distinguishing Features. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 33-41.
[27] Wong, Y.W., Mooney, R., 2007. Learning Synchronous Grammars for Semantic Parsing with Lambda Calculus. Association for Computational Linguistics, Prague, Czech Republic, pp. 960-967.
