A Community-Based Vietnamese Question Answering System

Quan Hung Tran, Nien Dinh Nguyen, Kien Duc Do, Thinh Khanh Nguyen, Dang Hai Tran, Minh Le Nguyen, and Son Bao Pham

Abstract. Most recent Vietnamese QA systems have not used data crawled from community web services as a resource. In this paper, we take this community-based resource into account to build a Vietnamese question answering system named VnCQAs. Our system comprises three modules, which build the database of question-answer pairs, analyze questions, and choose the best answer, respectively. Experimental results show that our system achieves promising performance.

1 Introduction

Nowadays, community web services play a crucial role in helping users find the responses they need, especially in the technology domain. Users often post their queries on Yahoo! Answers, technology web forums, or Facebook to seek help as well as advice based on the personal experience of others. However, queries are often complex and contain multiple sub-questions, whilst the feedback and comments of other users miss valuable information or deal with only part of these queries. For example, the question “Có nên mua Samsung Galaxy S4 không?” (Should I buy a Samsung Galaxy S4?)
expects an answer about individual opinions rather than the specifications of the phone itself.

Quan Hung Tran · Nien Dinh Nguyen · Kien Duc Do · Thinh Khanh Nguyen · Dang Hai Tran · Son Bao Pham
Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi
e-mail: {quanth_55,niennd_55,kiendd_55,thinhnk_55,dangth,sonpb}@vnu.edu.vn

Minh Le Nguyen
School of Information Science, Japan Advanced Institute of Science and Technology
e-mail: nguyenml@jaist.ac.jp

© Springer International Publishing Switzerland 2015. V.-H. Nguyen et al. (eds.), Knowledge and Systems Engineering, Advances in Intelligent Systems and Computing 326, DOI: 10.1007/978-3-319-11680-8_10

Suppose we have a collection of users' queries from community web services and a corresponding collection of feedback and comments. Building a community-based question answering (cQA) system that returns the best answer to each user's whole query is challenging, because the task involves three key research problems: how to construct the database of question-answer pairs, how to analyze the questions in users' queries, and how to produce the best answer. Related research addresses question identification [5, 16, 6, 15], question similarity [2], question generation [17], question analysis [10, 9], answer summarization [3] and answer re-ranking [13].

So far, most recent Vietnamese QA systems have not used community web services as a resource for such research. Existing Vietnamese QA systems [8, 14, 7, 12] are usually rule-based or grammar-based and utilize structured databases or crawled web pages. One Vietnamese QA system that does use community data is described by Dang et al. [1]. It responds to a new question by finding similar questions on Yahoo! Answers. However, this system did not return the answers of those
similar questions, and the reported accuracy was not high.

In this paper, we present a community-based Vietnamese question answering system named VnCQAs. Our system addresses the issues of domain adaptability and the inability to answer complex questions. Furthermore, VnCQAs uses machine learning techniques to obtain high accuracy. The system contains three main modules, Database Construction, Question Analysis and Answer Selection, which are responsible for building the database, analyzing questions and choosing the best answer, respectively. Figure 1 illustrates an example with the input question “nên mua Ipad hay laptop” (Should I buy an iPad or a laptop?); an online demonstration is available at http://150.65.242.39:8080/VNQA/. The output includes the best available answer and related questions. Users can find the answers to the related questions by clicking the corresponding links.

The rest of the paper is organized as follows. Section 2 gives an overview of the whole system and describes its modules. Section 3 reports the experimental results. Section 4 presents the conclusion and future work.

2 System Architecture

In this section, we introduce the VnCQAs system architecture (as displayed in Figure 2) and briefly describe all modules in the system. When a new question is presented, the system finds the most similar questions in the database. The answers to these similar questions are called candidate answers. The candidate answers are then processed to output the final answer. The system has three modules: Database Construction, Question Analysis and Answer Selection. The database construction module is responsible for building the database of question-answer pairs.

Fig. 1. The system user interface

Fig. 2. The system architecture

The question analysis module analyzes the questions and provides useful information about them, such as keywords, question types and synonyms of words. The answer
selection module processes the candidate answers and returns the final answer for the given question.

2.1 Database Construction Module

This module extracts question-answer pairs from the community data to construct the database in two steps: question detection and answer detection (as shown in Figure 3).

Fig. 3. The database construction module

Our main community sources of question-answer pairs are threads collected from popular Vietnamese technology forums such as Vatgia, VnZoom and Tinhte. These sites consist of series of threads; each thread has a specific topic and is further divided into posts. Typically, one of the posts presents the question and some of the other posts contain the answer. Because the community data crawled from these sites have different layouts, we standardize them by parsing them into our predefined XML format for later processing. Figure 4 gives the question “Mấy anh cho em hỏi em vào My computer hay bị treo vài giây” (Why does the computer stop responding for a few seconds when I open the My Computer folder?). The only suitable answer is the last post: “cũng phần mềm AV phần mềm máy nên bạn kiểm tra lại máy xem nhé.” (Maybe it is because of the AV software or some other software; you should check your computer again.)

2.1.1 Question Detection

In question detection, we use a machine learning model to classify whether a post is a question post or not. The features are sequence patterns based on a generalized form of the text. For example, the sentence “Subnet Mask dùng để làm gì?” (What is Subnet Mask used for?)
can be represented in sequence form as “Np V E làm gì”, in which Np, V and E are the part-of-speech (POS) tags of the respective words. We define a set of question words that appear frequently in questions, e.g., giúp (help), phân biệt (distinguish) and đánh giá (evaluate), as well as words meaning “how to do” and “why”. These question words are kept in their original form and arranged into 18 groups:
• q0000: (what)
• q0001: nào, (which)
• q0002: (who)
• q0003: đâu (where)
• q0004: hay (or)
• q0005: sao, sao, (why)
• q0006: (how to do)
• q0007: làm (what to do)
• q0008: sao, nào, (how)
• q0009: (how many), (how long), bao xa (how far)
• q0010: không (not), chưa (not yet)
• q0100: giúp (help), tư vấn (advise), dạy (teach), hướng dẫn (instruct), dẫn (instruct)
• q0101: hỏi (ask), thắc mắc (worry)
• q0102: khắc phục (overcome)
• q0103: vấn đề (problem)
• q0104: cách (solution)
• q0105: so sánh (compare), đánh giá (evaluate)
• q0106: phân biệt (distinguish)

Fig. 4. Thread XML example

The other words in a question are replaced by their POS tags to make the sequence more general. The sequence patterns are extracted using the PrefixSpan algorithm [15]. We then select the patterns that contain question words and apply the “multiple minimum supports” method [4] to guarantee the quality of the patterns.

2.1.2 Answer Detection

After finding the questions in the previous step, we detect the corresponding answers for each question by classifying the remaining posts with an SVM model that uses the following features:
• Does the post belong to the author of the thread?
• Does the post contain a quote from the questioner?
• Does the post contain a quote from other users?
• The relative position of the post in the thread
• The relative length of the post compared to others in the thread
• The similarity between the post and the detected question
• The proportion of nouns, verbs and pronouns in the post

If a post is from the owner of the question, it is unlikely to be the answer. Among the remaining posts, one that quotes the questioner is highly likely to be the answer.

2.2 Question Analysis Module

The question analysis module extracts important information from questions so that similar questions can be found in the later module. It consists of three steps: question classification, keyword identification and similar word identification, as presented in Figure 5. Figure 6 shows an example of the data extracted by the question analysis module from the question “Làm để tạo vùng nhớ ảo thay RAM” (How to create virtual memory to replace RAM).

Fig. 5. The question analysis module

Fig. 6. An example of analyzing a question

2.2.1 Question Classification

We classify questions because a past question whose type differs from that of the input question is unlikely to be similar to it. Moreover, the question type provides constraints for verifying the answers. We categorize questions into three classes, Fact, Solution and Explanation, using a machine learning method with three features: unigrams, bigrams and similarity. The unigram and bigram features are Boolean values, while the similarity feature, which measures the similarity between two questions, is estimated using phrasal overlap.

2.2.2 Keyword Identification

Questions that are likely to be similar usually share the same set of keywords. Besides, many questions in online forums and QA sites contain unnecessary words and phrases, and removing them improves the ability to find similar questions. The keyword
identification step aims to find the most important words in a question. We compute a score for each word appearing in the question corpus using the term frequency-inverse document frequency (tf*idf) weighting scheme. We then use a threshold to determine whether the word is a keyword or not.

2.2.3 Similar Word Identification

To improve the performance of the system in finding similar questions, we also use a synonym dictionary to retrieve words that have the same meaning as the words in the input question.

2.3 Answer Selection

The answer selection module is responsible for finding similar questions, together with their corresponding candidate answers, in the database, and finally gives the best answer. Figure 7 shows an example of how an input question is processed in this module. As shown in Figure 8, after the input question is analyzed by the question analysis module, we use the extracted information as input for finding similar questions with Lucene (http://lucene.apache.org/). For each candidate answer corresponding to a similar question, we then apply supervised learning to estimate a classification confidence score. This score is used to re-rank the list of candidate answers, and the candidate answer with the highest score is selected as the final answer.

Fig. 7. An example of giving the answer

Fig. 8. The answer selection module

We consider the triplet (Qnew, Qpast, A), where Qnew is the input question, Qpast is a similar question and A is a candidate answer for Qpast. A triplet is classified as satisfied if the answer A can be used to respond to the question Qnew; otherwise, it is classified as unsatisfied. The supervised learning uses the following features:
• Text length
• Number of question marks
• Number of stopwords
• IDF statistics
• Query clarity
• Cosine similarity
• Topic model

3 Evaluation

3.1 Experimental Result

We evaluate our system on two tasks: finding similar questions and giving the correct answer for each test question. We collect the community data from three popular Vietnamese technology forums: Vatgia, VnZoom and Tinhte. For each question, we assign a graded score to each candidate answer, in which exact answers are given the score of 4 and irrelevant answers the lowest score.

The evaluation data for finding similar questions consist of 1704 questions obtained from the database construction module. These questions are checked by hand to ensure that no question is similar to any other. We then paraphrase each question into different versions with the same meaning. The paraphrased questions and their corresponding candidate answers are indexed in Lucene as described in Section 2.3. We use the 1704 original questions as input questions to test system performance: if the returned question is one of the paraphrases of the input question, we count it as a good result; otherwise, we count it as a bad result. The accuracy of finding similar questions is presented in Table 1.

Table 1. The accuracy results for finding the similar questions
Method                                   Accuracy (%)
Cosine similarity                        80.86
Cosine similarity + Question analysis    87.61

From the 1704 questions, we choose 605 questions that each have an exact answer as input questions to test the performance of giving the correct answer. We consider a returned answer satisfied if it matches the exact answer that we assigned. The accuracy of giving the correct answer is shown in Table 2.

Table 2. The accuracy result for finding the answer
Method                                   Accuracy (%)
Baseline using the default of Lucene     59.66
Our approach                             71.19

4 Conclusion

In this paper, we proposed a community-based question answering system for Vietnamese. Our system consists of three modules:
database construction, question analysis and answer selection. The database construction module creates the database of question-answer pairs, in which each question corresponds to its candidate answers. The question analysis module is responsible for extracting useful information such as keywords, question types and synonyms. The answer selection module uses the information extracted from the input question to find similar questions in the database, and then re-ranks the list of corresponding candidate answers to give the best answer. Experimental results are promising: the question analysis module improves the accuracy of finding similar questions from 80.86% to 87.61%, and the answer selection module achieves an accuracy of 71.19%, which is 11.53 percentage points higher than the baseline using the default of Lucene.

In the future, we will extend the question analysis module with additional features based on the dependency tree [11]. We will also expand the database to handle a wider range of questions and improve the answer selection module.

Acknowledgment. This work is partially supported by the Research Grant from Vietnam National University, Hanoi No. QG.14.04.

References

[1] Son, D.T., Dung, D.T.: Apply a mapping question approach in building the question answering system for Vietnamese language. In: Proceedings of the Conference on Green Technology and Sustainable Development (2012)
[2] Bernhard, D., Gurevych, I.: Answering learners' questions by retrieving question paraphrases from social Q&A sites. In: Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications, pp. 44–52. Association for Computational Linguistics (June 2008)
[3] Chan, W., Zhou, X., Wang, W., Chua, T.-S.: Community answer summarization for multi-sentence question with group L1 regularization. In: Proceedings of the 50th Annual Meeting of the Association for
Computational Linguistics: Long Papers, vol. 1, pp. 582–591 (2012)
[4] Jindal, N., Liu, B.: Identifying comparative sentences in text documents. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2006, pp. 244–251 (2006)
[5] Li, B., Jin, T., Lyu, M.R., King, I., Mak, B.: Analyzing and predicting question quality in community question answering services. In: Proceedings of the 21st International Conference Companion on World Wide Web, WWW 2012 Companion, pp. 775–782 (2012)
[6] Li, B., Si, X., Lyu, M.R., King, I., Chang, E.Y.: Question identification on Twitter. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM 2011, pp. 2477–2480 (2011)
[7] Nguyen, A.K., Le, H.T.: Natural language interface construction using semantic grammars. In: Ho, T.-B., Zhou, Z.-H. (eds.) PRICAI 2008. LNCS (LNAI), vol. 5351, pp. 728–739. Springer, Heidelberg (2008)
[8] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B.: A Vietnamese Question Answering System. In: Proceedings of the 2009 International Conference on Knowledge and Systems Engineering, KSE 2009, pp. 26–32 (2009)
[9] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B.: A Semantic Approach for Question Analysis. In: Jiang, H., Ding, W., Ali, M., Wu, X. (eds.) IEA/AIE 2012. LNCS (LNAI), vol. 7345, pp. 156–165. Springer, Heidelberg (2012)
[10] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B.: Systematic Knowledge Acquisition for Question Analysis. In: Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, pp. 406–412 (2011)
[11] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B., Nguyen, P.-T., Le Nguyen, M.: From Treebank Conversion to Automatic Dependency Parsing for Vietnamese. In: Métais, E., Roche, M., Teisseire, M. (eds.)
NLDB 2014. LNCS, vol. 8455, pp. 196–207. Springer, Heidelberg (2014)
[12] Nguyen, D.T., Hoang, T.D., Pham, S.B.: A Vietnamese natural language interface to database. In: Proc. of the 2012 IEEE Sixth International Conference on Semantic Computing, pp. 130–133 (2012)
[13] Surdeanu, M., Ciaramita, M., Zaragoza, H.: Learning to rank answers on large online QA collections. In: Proceedings of ACL 2008 HLT (June 2008)
[14] Tran, V.M., Nguyen, V.D., Tran, O.T., Pham, U.T.T., Ha, T.Q.: An experimental study of Vietnamese question answering system. In: International Conference on Asian Language Processing, IALP 2009, pp. 152–155 (December 2009)
[15] Wang, K., Chua, T.-S.: Exploiting salient patterns for question detection and question retrieval in community-based question answering. In: Proceedings of the 23rd International Conference on Computational Linguistics, COLING 2010, pp. 1155–1163 (2010)
[16] Yang, L., Bao, S., Lin, Q., Wu, X., Han, D., Su, Z., Yu, Y.: Analyzing and predicting not-answered questions in community-based question answering services. In: AAAI (2011)
[17] Zhao, S., Wang, H., Li, C., Liu, T., Guan, Y.: Automatically generating questions from queries for community-based question answering. In: Proceedings of 5th International Joint Conference on Natural Language Processing, pp. 929–937 (2011)
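
As an illustration of the keyword identification step described in Section 2.2.2, the following Python sketch scores each word of a question with tf*idf and keeps the words whose score exceeds a threshold. It is a minimal sketch under stated assumptions, not the implementation used in VnCQAs: the paper does not specify the tokenizer, the exact tf*idf variant or the threshold value, so the function names `idf_table` and `extract_keywords`, the toy English corpus and the threshold of 0.1 are all hypothetical. Real Vietnamese input would also need word segmentation before this step.

```python
import math
from collections import Counter


def idf_table(corpus):
    """Inverse document frequency for every word in a tokenized question corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each word once per question
    # log(N / df) is 0 for a word that appears in every question
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def extract_keywords(tokens, idf, threshold):
    """Return the words of one question whose tf*idf score exceeds the threshold."""
    counts = Counter(tokens)
    total = len(tokens)
    keywords = []
    for word in dict.fromkeys(tokens):  # first-occurrence order, no repeats
        # Assumption: words unseen in the corpus get idf 0 and are never keywords.
        score = (counts[word] / total) * idf.get(word, 0.0)
        if score > threshold:
            keywords.append(word)
    return keywords


# Hypothetical toy corpus of already-tokenized questions.
corpus = [
    ["how", "to", "create", "virtual", "memory"],
    ["how", "to", "replace", "ram"],
    ["how", "to", "install", "windows"],
]
idf = idf_table(corpus)
keywords = extract_keywords(corpus[0], idf, threshold=0.1)  # assumed threshold
print(keywords)
```

In this toy corpus, "how" and "to" occur in every question, so their idf is zero and they are filtered out, which mirrors the goal of removing unnecessary words before searching for similar questions.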