2012 IEEE Sixth International Conference on Semantic Computing

A Vietnamese Natural Language Interface to Database

Dat Tien Nguyen
University of Engineering and Technology, Vietnam National University, Hanoi
Email: datnt88@gmail.com

Tam Duc Hoang
University of Engineering and Technology, Vietnam National University, Hanoi
Email: tamhd1990@gmail.com

Son Bao Pham
University of Engineering and Technology, Vietnam National University, Hanoi
Email: sonpb@vnu.edu.vn

978-0-7695-4859-3/12 $26.00 © 2012 IEEE. DOI 10.1109/ICSC.2012.33

Abstract: In this paper, we present a Vietnamese Natural Language Interface to a survey database for individuals and businesses who want to obtain economic information from economic surveys. We analyze various Vietnamese question types and investigate the stability of our approach using the GATE framework and the R language. Our system uses R specifically to handle statistical question types. Experimental results are promising: our system achieves an accuracy of 78.5% on the economic survey data domain.

I. INTRODUCTION

Generally, a computer database is accessed via technical commands or queries that require a certain level of database knowledge from users. A Natural Language Interface (NLI) aims to allow everyone to ask questions to the database in natural language. The database can take the form of an SQL database or a spreadsheet. Thus, an NLI system needs to take a natural language question and turn it into an appropriate technical query for the database.

With recent advances in natural language processing and human-computer interaction, users are no longer expected to remember complex structures of machine language to interact with computers. Instead, users expect computers to be intelligent and able to interact with humans in a natural way. An NLI therefore becomes a necessary requirement in facilitating the interaction between users and the database.

In this paper, we propose an approach to building a Vietnamese Natural Language Interface to Database. The database of our system is collected from an economic survey of three thousand companies in Vietnam, described in more detail in Section IV. Our system has two components: the Question Analysis module and the Result Computing module. The first component identifies question types and extracts core information from users' questions. For the domain of economic survey data, users are interested in various statistical measures. The second component identifies the particular data that the user requests and computes the target statistics. The statistics are computed by formulating the user's question as statements in R, a popular statistical programming language [1], [2].

Our paper is organized as follows. In Section II, we review some existing approaches that have been proposed for developing a question answering system. In Section III, we present our system, and we evaluate it in Section IV. The conclusion and future work are presented in Section V.

II. RELATED WORK

Esther Kaufmann, Abraham Bernstein, and Renato Zumstein introduced QUERIX (a natural language interface to query ontologies based on clarification dialogs) [4]. It extracts the sequence of the main word categories: Noun, Verb, Preposition, Wh-Word and Conjunction. Then its core component, the matching center, performs three steps: identifying the subject-property-object patterns of a query, searching for all matches between the synonyms in the ontology, and composing SPARQL queries from the joined triples. Their system achieved an average recall of 87.11% and an average precision of 86.08%, which suggests that a method with a simple structure can be effective.

PRECISE [5] is an NLIDB question answering system that takes a natural language question as its input and generates a corresponding SQL query. PRECISE showed high precision (over 80% on a list of hundreds of English questions). However, PRECISE requires all tokens in input questions to be distinct and to appear in its lexicon.

Aqualog [6] is an ontology-based question answering system for English. Aqualog takes a natural language question and an ontology as its input, and returns an answer based on the semantic analysis of the question and the corresponding elements in the ontology. The system uses many language processing resources, such as word segmentation, sentence segmentation and part-of-speech tagging, to analyze the semantics and syntax of the user's question. The natural language question is transferred into a Query-Triple with the format (generic term, relation, second term). The system uses JAPE grammars in GATE to identify terms and their relations, and then produces the exact result by calculating on the database.

In Vietnamese research, K. Nguyen and H. Le [7] introduced an NLIDB (Natural Language Interface to DataBases) question answering system for Vietnamese employing semantic grammars. Their system contains two main modules: QTRAN and TGEN. The Query Translator (QTRAN) maps a natural language question to an SQL query, and the Text Generator (TGEN) generates answers. To do so, QTRAN parses the question into a syntax tree via the CYK algorithm, using limited context-free grammars. The syntax tree is then converted into an SQL query by using a mapping dictionary that determines the names of attributes. The TGEN module finds the relationship between meta-data and relations in the database tables to generate exact answers. Pattern-based and keyword-based approaches are used in the TGEN module.

Based on previous research for English, V. M. Tra, V. D. Nguyen, O. T. Tran, U. T. T. Pham, and T. Q. Ha [8] proposed an implementation of a Vietnamese question answering system that combines the SnowBall system [9] with semantic relation extraction using search engines. Their experimental results on the traveling domain indicate that the proposed method is sufficient for a Vietnamese question answering system: they achieved 89.7% precision and a 91.4% rate of returning an answer.

III. OUR VIETNAMESE NATURAL LANGUAGE INTERFACE TO DATABASE

Our system is implemented in Java as a web application using a client-server model. The general architecture of our NLI system is shown in Figure 1.

Figure 1: System Architecture

A. Question Analysis Component

In our domain, a typical question contains three parts: question term, type of question and question information. The question term indicates the target answer type. The question type captures the statistical measure that users are after. The question information is the type of requested data, which is used to retrieve the corresponding fields in the database. For example, consider the following question:

"Cho biết trung bình doanh số thu năm 2003 bao nhiêu?"

Here the phrases "Cho biết" and "Là bao nhiêu" (how much) are Question Terms. The word "trung bình" (average) indicates the statistical type of the question, which is the average value in this case. The phrase "doanh số thu năm 2003" (sales gained in 2003) contains the main content of the question, capturing the core data that users are looking for. The above question has the same meaning and similar question components as the following:

"Năm 2003, doanh số trung bình thu bao nhiêu?"

The task of the Question Analysis Component is to return the Question Term, Type of Question and Question Information from the user's input question. Figure 4 shows how the Question Analysis Component works.

Figure 4: Question Analysis Component

1) Detecting Question Term: Questions generally contain not only the requested information but also special words or phrases. These special words, called Question Terms, often appear at the beginning or the end of the question. Examples of Vietnamese question terms are "tìm ra" (list), "bao nhiêu" (how many) and "như nào" (how). We built JAPE rules to detect Question Terms and integrated the results into the Vietnamese word segmenter VnTokenizer [10] so that those question terms are correctly recognized as words. An example of a JAPE rule is shown in Figure 2.

Figure 2: A JAPE pattern that captures a Question Term

2) Question Type: There are many types of questions in Vietnamese. Some common question types are Yes/No, Calculating, Give Reason (Why) and Comparing. In this paper, however, we work with a survey database that contains statistical economic information. We build a statistical question answering system that specifically meets users' needs for statistical information about the survey data. Thus, our system focuses on questions that calculate statistical measures such as average, sum, count, deviation and decline. The question category is indicative of the answer type. It also guides the R Engine of the second component in computing results from the survey database. To identify the question type, we created JAPE rules that use noun phrase annotations and the question-word information identified during preprocessing. Figure 3 shows an example of a JAPE pattern that captures the deviation question type annotation.

Figure 3: A JAPE pattern that captures the deviation question type

B. Result Computing Component

The task of this component is to find the columns in the database that correspond to the Question Information. In order to do this, the system analyzes the survey questionnaires and ranks the information of each question in the questionnaire. We call this process Relevant Data Retrieval. After determining the required columns in the database, the R Engine uses the columns and the Question Type as parameters to compute the final result.

1) Relevant Data Retrieval: Before a question can be answered, the relevant knowledge sources must be identified. If the answer to a question is not available in the data sources, a correct result will not be found no matter how well the other components work. This component uses the question information to determine the relevant data in the database. This part is very important because it needs to give the exact parameters (columns) to the R Engine for computing results on the survey database. The meaning of each column in the database is represented by a phrase, so the most straightforward way would be to match the question information of the input question against the column phrase. Exact word matching alone, however, would achieve only limited results. To utilize words with similar meanings, one can construct synonym dictionaries using Term Extraction [11], [12] or Term Collocation [13]. For our system, we construct synonym dictionaries for the economic and statistical data domain.

Vietnamese Synonyms: By identifying semantic similarity [14], we automatically build a synonym dictionary for all nouns, verbs and adjectives that appear in the survey questionnaire data. Because users sometimes use abbreviations, we also add abbreviations to the synonym dictionary. All words and abbreviations that have the same meaning are organized in one group, so they are treated as equivalent in the ranking step. For example, "công ty" (company), "doanh nghiệp" (enterprise), "DN" and "Cty" are synonyms, where "DN" and "Cty" are abbreviations for "doanh nghiệp" (enterprise) and "công ty" (company) respectively.

Ranking questioned Information: This step uses the question's segmented words [10] and the synonym dictionary to rank the relevant columns in the questionnaire data (shown in Figure 5). In Vietnamese, tokens or phrases can change position with each other to form different sentences that still keep the same meaning. In this paper, we use the Bag of Words approach to represent the information in the questionnaire and in input questions. A question is represented by its segmented words and their corresponding synonyms. Columns in the database are then ranked by the number of tokens they share with the input question.

Figure 5: Ranking questioned information process

Sometimes, however, the user's question is ambiguous, which leads to several columns in the database having the same ranking. In these situations, we present this information to the users as suggestions for related information. We think that this is very useful, as in many cases users ask one question but expect all related answers. For instance, when the user asks "Cho biết tổng số lao động năm 2003 bao nhiêu?" (the number of employees in 2003), our system also suggests related information by providing "the number of employees in 2004". Our system ensures that users always get some answer. Even if the answer might not be totally correct, it is the best attempt to find the most relevant one.

2) R Engine: The R language provides a wide variety of statistical and graphical tools, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others. The task of the R Engine is to compute the statistical results for the user's question. The R Engine uses the columns in the survey database and the Question Type as parameters to compute the result. In addition, to improve the interaction between users and the system, the R Engine also calculates the results for the top-ranked columns in order to present related answers to users.

IV. EXPERIMENTAL RESULT

A. Data setup

We use a large economic survey data set to test our NLI system. The survey was conducted on approximately three thousand small and medium-sized enterprises in Vietnam. The questionnaire includes many closed questions about the financial situation of a company as well as questions on organizational methods, the number of employees, the salary status, etc. All participating companies fully completed the questionnaire. For example:

Câu hỏi Q11: "Cho biết số lượng công nhân công ty?" (How many employees are there in your company?)
Trả lời (Answer): người (employees)

The answers of each company to the questionnaire form a row in the database. Each question is a column in the database. For the question above, the answers of all companies are stored in the column named Q11. The results of the survey are stored in a ".csv" file that contains 300 columns and 2975 rows.

Experimental setup: We collected 500 real questions from users who were seeking information from the survey. 300 questions were used in the training phase to build our system. The remaining 200 questions were used to evaluate the correctness of our system.

B. Evaluation and results

Out of these 200 questions, 192 are correctly processed by the Question Analysis component, resulting in 96% accuracy. The 4% of errors are due to the lack of coverage of our JAPE grammars in capturing the question types as well as the Question Terms. After testing the first component, we evaluate the overall correctness of the system. The results are shown in Table I.

Table I: Correctness of the system

                      Correct answers   Wrong answers   Non-segmented questions
Number of questions   157               35              8
Percent               78.5              17.5            4.0

Our system gives correct answers to 157 questions. Incorrect results are returned for 35 questions. This is because of ambiguity in the word segmentation performed by VnTokenizer [10]. For example, consider the question "Doanh số trung bình doanh nghiệp năm 2004?" (What is the average income of all companies in 2004?). After segmentation, we have the phrase set: "doanh-số" (sales), "trung-bình" (average), "các-doanh-nghiệp" (all enterprises), "năm2004" (year 2004). VnTokenizer considers "năm-2004" (year 2004) to be a single phrase in Vietnamese, whereas it should be two words, "năm" (year) and "2004". In this and similar cases, the correct answer can still be found among the related answers suggested by the system.

Figure 6: Correct answers with suggestions

Interesting Results: Our system is able to answer questions with a general meaning. When a user asks a general question such as "Hãy cho biết tổng doanh số theo năm?" (calculating the sum of income in each year), our system returns all related results. In other words, it computes the income information for the years 2003 and 2004 stored in the survey database.

V. CONCLUSIONS

Question answering research attempts to deal with a wide range of question types. It requires more sophisticated techniques than traditional information retrieval tasks such as document retrieval. In this paper, we propose an approach that combines Vietnamese language processing techniques with the R language to develop a complete Natural Language Interface system for individuals and businesses who want to get statistical information from economic surveys. We introduce a Vietnamese natural language interface to a survey database. Our system consists of two components, namely Question Analysis and Result Computing. Experimental results of the system on a wide range of statistical questions are promising: our system achieves an accuracy of 78.5%. In addition, the system helps users find related information in the survey database by suggesting related results. In the future, we will improve the accuracy of the system by using Vietnamese grammar in analyzing questions, and we will include additional question types.

REFERENCES

[1] J. Fox and R. Andersen, "Using the R statistical computing environment to teach social statistics courses," Department of Sociology, McMaster University, 2006.
[2] A. Vance, "Data analysts captivated by R's power," New York Times, 2009.
[3] G. G. Hendrix, E. D. Sacerdoti, D. Sagalowicz, and J. Slocum, "Developing a natural language interface to complex data," ACM Transactions on Database Systems (TODS), 1978.
[4] E. Kaufmann, A. Bernstein, and R. Zumstein, "Querix: A natural language interface to query ontologies based on clarification dialogs," in Proceedings of the 5th International Semantic Web Conference (ISWC 2006), Athens, GA, 2006.
[5] A. Popescu, O. Etzioni, and H. Kautz, "Towards a theory of natural language interfaces to databases," in Proceedings of IUI, 2003.
[6] V. Lopez, V. Uren, E. Motta, and M. Pasin, "Aqualog: An ontology-driven question answering system for organizational semantic intranets," Journal of Web Semantics, 5(2):72-105, Elsevier, 2007.
[7] T. H. Le and K. A. Nguyen, "Natural language interface construction using semantic grammars," in Proceedings of PRICAI, 2008.
[8] V. M. Tra, V. D. Nguyen, O. T. Tran, U. T. T. Pham, and T. Q. Ha, "An experimental study of Vietnamese question answering system," International Conference on Asian Language Processing (IALP), 2009.
[9] E. Agichtein, L. Gravano, J. Pavel, V. Sokolova, and A. Voskoboynik, "Snowball: A prototype system for extracting relations from large text collections," ACM Special Interest Group on Management of Data (SIGMOD), 2001.
[10] D. D. Pham, G. B. Tran, and S. B. Pham, "A hybrid approach to Vietnamese word segmentation using part of speech tags," International Conference on Knowledge and Systems Engineering (KSE), 2009.
[11] Y. Sasaki, "Question answering as question-biased term extraction: a new approach toward multilingual QA," in Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (ACL '05), 2005.
[12] T. R. Lynam, C. L. A. Clarke, and G. V. Cormack, in Proceedings of the First International Conference on Human Language Technology Research (HLT '01), 2001.
[13] N. J. McCracken, A. R. Diekema, G. Ingersoll, S. C. Harwell, E. E. Allen, O. Yilmazel, and E. D. Liddy, "Modeling reference interviews as a basis for improving automatic QA systems," in Proceedings of the Interactive Question Answering Workshop at HLT-NAACL (IQA '06), 2006.
[14] D. T. Nguyen and S. B. Pham, "Finding the semantic similarity in Vietnamese," International Conference on Asian Language Processing (IALP), 2010.
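As an illustration of the Relevant Data Retrieval and R Engine steps described in Section III, the following Python sketch ranks database columns by the number of tokens they share with a segmented question (Bag of Words with synonym groups), reports lower-ranked matches as related suggestions, and then dispatches on the question type to compute a statistic. It is only a minimal sketch under stated assumptions, not the system's implementation, which is written in Java and delegates statistics to R. The column descriptions for Q20 and Q21, the synonym table, the function names and the use of the population standard deviation for the "deviation" type are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev

# Hypothetical synonym groups: each surface form maps to one group representative.
SYNONYMS = {
    "doanh_nghiệp": "công_ty",
    "dn": "công_ty",
    "cty": "công_ty",
}

# Hypothetical column descriptions, given as word-segmented questionnaire phrases.
COLUMNS = {
    "Q11": ["số_lượng", "công_nhân", "công_ty"],
    "Q20": ["doanh_số", "thu", "năm", "2003"],
    "Q21": ["doanh_số", "thu", "năm", "2004"],
}


def normalize(tokens):
    """Lower-case the tokens and map each one to its synonym-group representative."""
    return {SYNONYMS.get(t.lower(), t.lower()) for t in tokens}


def rank_columns(question_tokens, columns=COLUMNS):
    """Return (column, score) pairs, best first; score = number of shared tokens."""
    q = normalize(question_tokens)
    scores = {name: len(q & normalize(desc)) for name, desc in columns.items()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def best_and_suggestions(question_tokens, columns=COLUMNS):
    """Split matching columns into the top-scoring ones and lower-scoring suggestions."""
    ranked = [(c, s) for c, s in rank_columns(question_tokens, columns) if s > 0]
    if not ranked:
        return [], []
    top = ranked[0][1]
    best = [c for c, s in ranked if s == top]
    related = [c for c, s in ranked if s < top]
    return best, related


# Assumed mapping from question type to a statistic (stand-in for the R Engine).
STATISTICS = {"average": mean, "sum": sum, "count": len, "deviation": pstdev}


def compute(question_type, values):
    """Apply the statistic selected by the question type to one column of values."""
    return STATISTICS[question_type](values)


if __name__ == "__main__":
    best, related = best_and_suggestions(["doanh_số", "thu", "năm", "2003"])
    print("best:", best, "related:", related)
    print("average:", compute("average", [10, 20, 30]))
```

With these toy columns, a question about sales in 2003 selects Q20 and offers Q21 (sales in 2004) as a related suggestion, which mirrors the behaviour described for the number-of-employees example. An ambiguous question that mentions no year ties Q20 and Q21, so both are returned as best matches.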