Question Analysis for a Community-Based Vietnamese Question Answering System

Quan Hung Tran, Minh Le Nguyen, and Son Bao Pham

Quan Hung Tran · Son Bao Pham
Faculty of Information Technology, University of Engineering and Technology, Vietnam National University, Hanoi
e-mail: {quanth_55,sonpb}@vnu.edu.vn

Minh Le Nguyen
School of Information Science, Japan Advanced Institute of Science and Technology
e-mail: nguyenml@jaist.ac.jp

© Springer International Publishing Switzerland 2015. V.-H. Nguyen et al. (eds.), Knowledge and Systems Engineering, Advances in Intelligent Systems and Computing 326, DOI: 10.1007/978-3-319-11680-8_51

Abstract

This paper describes our approach to analyzing questions in our community-based Vietnamese question answering system (VnCQAs), focusing on two subtasks: question classification and keyword identification. Question classification uses machine learning with a feature that measures the similarity between two questions, while keyword identification uses dependency-tree-based features. Experimental results are promising: question classification reaches an accuracy of 95.7% and keyword identification an accuracy of 85.8%. Furthermore, these two subtasks improve the accuracy of finding similar questions in our VnCQAs by 6.75 percentage points.

1 Introduction

Question answering systems usually have a module that analyzes questions in order to extract important information such as keywords, question types, or semantic constraints. In this research, we focus on two subtasks of question analysis: question classification and keyword identification.

Identifying important words in a set of documents is an important task in information retrieval and question answering, with two main approaches: using corpus-based statistics for term weighting [7, 9] and employing supervised methods [3, 18, 10].

The question classification aims to classify
questions into several pre-defined classes in order to find suitable answers. Li and Roth [8] proposed a two-layer taxonomy with six coarse classes and 50 fine-grained classes, while Bu et al. [4] introduced a six-type taxonomy. Furthermore, regarding classification methods, some studies employed rule-based approaches [6] or machine learning algorithms [4, 1], while others combined rule-based and machine-learning-based techniques [5].

Recently, some question analysis techniques have been examined for Vietnamese [11, 13, 12, 16]. However, these studies experimented on standard corpora, where spelling is generally good. Question analysis for our VnCQAs system, on the other hand, has to deal with noisy data from community-based resources. In this paper, we propose dependency-tree-based features for finding keywords. We also introduce a new feature, called the “similarity feature”, for classifying questions. To the best of our knowledge, this is the first time question analysis has been adapted to Vietnamese community data.

The paper is organized as follows. Section 2 briefly describes the architecture of our VnCQAs system, including an overview of our approach to analyzing questions. Question classification and keyword identification are introduced in Sections 3 and 4, respectively. Section 5 gives the experimental results, and Section 6 concludes.

2 The VnCQAs System Architecture

The architecture of our question answering system [17] is shown in Figure 1. It includes three modules: Database Construction, Question Analysis, and Answer Selection. The database construction module builds the database of question-answer pairs, while the question analysis module extracts useful information such as keywords, question types, and synonyms. The answer selection module finds the questions in the database that are most similar to the input question, where each similar question
corresponds to a candidate answer. The candidate answers are then processed to output the best answer.

In this paper, we focus on the question analysis module, with two main tasks: Question Classification and Keyword Identification. In this module, we also use a dictionary of 6626 entities to find synonyms in the question. Figure 2 shows an example of the question analysis module for the question “Làm để tạo vùng nhớ ảo thay RAM” (How to create virtual memory to replace RAM).

Fig. 1. The system architecture

Fig. 2. An example for the question analysis module

3 Question Classification

3.1 Question Types

We classify questions into three types, Fact, Explanation, and Solution, according to the main purpose of the questioner.

• Fact: The question is only about objects and resources; the expected answer is about general facts and/or attributes. E.g., for the question “Tấm dán hình từ tính gì?” (What are magnetic screen stickers?), the object is “Tấm dán hình từ tính” (magnetic screen stickers), and the expected attribute in the answer is the definition.
• Explanation: The question requires an explanation or an opinion, e.g., “Vì điện thoại hay bị sóng?” (Why does my phone frequently lose signal?)
• Solution: The question asks for a solution to a problem, e.g., “Chỉ cho em cách vào facebook Iphone?” (How to access Facebook from an iPhone?)
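
As a minimal sketch of the information handled by the question analysis module (a question type from Section 3.1, a keyword list, and synonyms found by dictionary lookup), the Python fragment below may help. The names `QuestionType`, `QuestionAnalysis`, and `find_synonyms`, the keyword list, and the one-entry dictionary are hypothetical illustrations; they are not the system's actual code or its 6626-entity dictionary.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class QuestionType(Enum):
    """The three question types of Section 3.1."""
    FACT = "Fact"
    EXPLANATION = "Explanation"
    SOLUTION = "Solution"


@dataclass
class QuestionAnalysis:
    """Hypothetical container for the output of question analysis."""
    question: str
    question_type: QuestionType
    keywords: List[str] = field(default_factory=list)
    synonyms: Dict[str, List[str]] = field(default_factory=dict)


def find_synonyms(keywords: List[str],
                  entity_dictionary: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return the synonyms of every keyword present in the dictionary.

    Lookup is case-insensitive; dictionary keys are assumed to be lowercase.
    """
    found = {}
    for keyword in keywords:
        entry = entity_dictionary.get(keyword.lower())
        if entry:
            found[keyword] = list(entry)
    return found


# Illustrative data only: the keyword list and dictionary entry are made up.
toy_dictionary = {"iphone": ["điện thoại iphone"]}
keywords = ["cách", "facebook", "Iphone"]
analysis = QuestionAnalysis(
    question="Chỉ cho em cách vào facebook Iphone?",
    question_type=QuestionType.SOLUTION,
    keywords=keywords,
    synonyms=find_synonyms(keywords, toy_dictionary),
)
print(analysis.question_type.value, analysis.synonyms)
```

Keeping the type, keywords, and synonyms in one record is only one possible design; it simply mirrors the three kinds of information that the module is described as extracting.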
3.2 Methodology

In the VnCQAs system, we use support vector machines (SVMs) to learn the classifier (as shown in Figure 3) with a set of features that includes unigrams and bigrams. Unigram and bigram features are common in natural language processing tasks; each such feature takes a Boolean value indicating whether that unigram or bigram occurs in the question.

Another feature used for training the SVM model is the similarity feature, which measures the similarity between two questions and is estimated using the phrasal overlap [2, 15]:

\[ \mathrm{overlap}_{\mathrm{phrase}}(s_1, s_2) = \sum_{i=1}^{n} \sum_{m} i^{2} \]

where, for each phrase length i (up to n words), m is the number of i-word phrases that appear in both sentences. The overlap is then normalized by the sentence lengths:

\[ \mathrm{sim}_{\mathrm{overlap,phrase}}(s_1, s_2) = \tanh\left( \frac{\mathrm{overlap}_{\mathrm{phrase}}(s_1, s_2)}{|s_1| + |s_2|} \right) \]

Fig. 3. The question classification

4 Keyword Identification

This section describes keyword identification using machine learning with dependency-tree-based features.

4.1 Keyword Definition

We define keywords as:

• The most informative words in the question: the set of keywords contains most of the information (e.g., topics, main objects, and actions).
• Words that can be used to distinguish different questions: two questions that have the same set of keywords are likely to be similar.

E.g., the question “Hỏi cách xóa tin nhắn iphone?” (How to delete messages on iphone?)
has the keywords “cách” (how), “xóa” (delete), “tin nhắn” (messages), and “iphone”. The main verb “hỏi” (ask) is not considered a keyword because it carries no important information and cannot be used to distinguish between questions.

4.2 Methodology

We use the dependency tree to find keywords on the premise that, to decide whether a word is informative, we should take into account its relationships with the other words in the question. These relationships are obtained with the Vietnamese dependency parser [14]. For each question, the parser creates a tree that contains the tree structure, the relations between the words, and other information such as part-of-speech tags (Figure 4 shows the tree for the question “How do I delete messages in Iphone?”).

Fig. 4. An example of the dependency tree

The features are then extracted from the tree and used to train the SVM model, which classifies each word as a keyword or not (as presented in Figure 5). These features are grouped as follows:

• The part-of-speech (POS) tag of each word: POSW
• The POS tag of the parent of each word in the dependency tree: POSP
• Unigrams
• The common dependent words: CDW

Fig. 5. Dependency tree method work flow

The POSW feature is used because words with certain POS tags (e.g., noun, verb, and adjective) are more likely to be keywords of a sentence. The POSP feature helps to capture the fact that words that are children of verbs are more likely to be the object of the question. The CDW feature is employed because the children of certain words (e.g., “cách” (solution)) have a high chance of being keywords. To implement the CDW feature, we use a map from each word to an integer that counts the number of times that word is the parent of a keyword. A manually chosen threshold is then used to identify which words are frequently the parent of keywords.

5 Experimental Results

5.1 Question Classification Evaluation

We use a
set of 1013 manually tagged questions from the technology domain, labeled with the three types mentioned above: Fact, Explanation, and Solution. The tagged questions come from the database construction module of our VnCQAs system. They are kept in their original form, with no modifications. However, some questions from online forums are not understandable because they lack the information or context needed to interpret them; these questions are removed from the data set. The question distribution is shown in Figure 6.

Fig. 6. The distribution of questions in the tagged data set

To learn the SVM model, we use the LIBLinear and SVM-SMO algorithms with a 10-fold cross-validation scheme. The experiments were conducted on a Windows PC with a Core i7 CPU and 8 GB of RAM. The highest accuracy (95.7%) is achieved with the combination of phrasal overlap, unigram, and bigram features (as shown in Table 1).

Table 1. The question classification accuracy on the community data

Features                               Accuracy (SVM-LIBLinear)   Accuracy (SVM-SMO)
Bigram                                 89.8%                      89.5%
Unigram                                94.8%                      95.2%
Phrasal overlap                        93.4%                      92.9%
Phrasal overlap + unigram + bigram     95.6%                      95.7%

Although the accuracy of question classification is around 95.7%, our corpus differs from the corpora of other published works, so it is hard to compare our method directly with other available methods for question classification. To make a meaningful comparison, we also evaluate on the TREC corpus in Vietnamese [16]. Our accuracy is comparable to that of the approach of Tran et al. [16] under the same 10-fold cross-validation scheme (as shown in Table 2).

Table 2. The question classification accuracy on the TREC data

Classes                                Our method   Tran’s method
Fine-grained classes classification    85.0%        86.0%
Coarse classes classification          84.9%        84.7%

5.2 Keyword Identification Evaluation

To test the performance of the keyword identification, we use a set of
753 words tagged from the sentences in our database (Figure 7 shows an example for the question “How do I delete messages in Iphone?”).

Fig. 7. An example of keywords

For comparison, we implement a baseline that identifies keywords using the term frequency-inverse document frequency (TF-IDF) method. The TF-IDF score of each word in a sentence is calculated, and a threshold is chosen to decide whether the word is a keyword. The accuracy of the TF-IDF method is presented in Figure 8. Our method outperforms the TF-IDF method, as can be seen from the accuracies in Table 3.

Fig. 8. The TF-IDF method’s accuracy

Table 3. The accuracy of the keyword identification

Features                    Accuracy (SVM-LIBLinear)   Accuracy (SVM-SMO)
POSW                        79.1%                      79.1%
POSW + POSP                 81.4%                      81.1%
POSW + POSP + BOW           85.4%                      85.4%
POSW + POSP + BOW + CDW     85.8%                      85.4%

5.3 Question Analysis Evaluation

In this section, we evaluate the contribution of question analysis to our VnCQAs system by measuring the improvement in finding similar questions. To evaluate the ability of the VnCQAs system to find similar questions, we use a set of 1704 questions that are checked by hand to ensure that no question is similar to any other. We then paraphrase each question into different versions; the paraphrased questions must be close in meaning to the originals. We use the 1704 original questions as input for testing system performance: if the returned question is one of the paraphrased questions, we count it as a good result; otherwise, we count it as a bad result. Table 4 shows the accuracy improvement in finding similar questions.

Table 4. The question analysis evaluation

Method                                   Accuracy
Cosine similarity                        80.86%
Cosine similarity + question analysis    87.61%

6 Conclusion

In this paper, we described the question analysis module of our VnCQAs system on two subtasks: question
classification and keyword identification. We classify questions into three types (Fact, Explanation, and Solution) using support vector machines (SVMs) with a set of features: unigrams, bigrams, and similarity. Our classification accuracy is high even though we have to deal with noisy community data. Furthermore, on the Vietnamese TREC corpus, we obtain competitive accuracy. For the keyword identification subtask, we used machine learning with dependency-tree-based features and achieved an accuracy of 85.8%, which outperforms the TF-IDF method. In the future, we will improve the size and quality of the data sets used in both subtasks. We will also examine other methods to further improve the accuracy of question analysis.

Acknowledgment

This work is partially supported by the Research Grant from Vietnam National University, Hanoi, No. QG.14.04.

References

[1] Paliwal, M., Kumar, U.A.: Neural networks and statistical techniques: A review of applications. Expert Systems with Applications 36(1), 2–17 (2009)
[2] Banerjee, S., Pedersen, T.: Extended gloss overlaps as a measure of semantic relatedness. In: Proceedings of the 18th International Joint Conference on Artificial Intelligence, IJCAI 2003, pp. 805–810 (2003)
[3] Bendersky, M., Croft, W.B.: Discovering key concepts in verbose queries. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, pp. 491–498 (2008)
[4] Bu, F., Zhu, X., Hao, Y., Zhu, X.: Function-based question classification for general QA. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP 2010, pp. 1119–1128 (2010)
[5] Huang, Z., Thint, M., Qin, Z.: Question classification using head words and their hypernyms. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, pp. 927–936 (2008)
[6]
Hui, Z., Liu, J., Ouyang, L.: Question classification based on an extended class sequential rule model. In: Proceedings of 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand, pp. 938–946. Asian Federation of Natural Language Processing (November 2011)
[7] Lan, M., Tan, C.L., Su, J., Lu, Y.: Supervised and traditional term weighting methods for automatic text categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(4), 721–735 (2009)
[8] Li, X., Roth, D.: Learning question classifiers. In: Proceedings of the 19th International Conference on Computational Linguistics, COLING 2002, vol. 1, pp. 1–7 (2002)
[9] Luhn, H.P.: A business intelligence system. IBM J. Res. Dev. 2(4), 314–319 (1958)
[10] Luo, X., Raghavan, H., Castelli, V., Maskey, S., Florian, R.: Finding what matters in questions. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, pp. 878–887. Association for Computational Linguistics (June 2013)
[11] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B.: A Vietnamese Question Answering System. In: Proceedings of the 2009 International Conference on Knowledge and Systems Engineering, KSE 2009, pp. 26–32 (2009)
[12] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B.: A Semantic Approach for Question Analysis. In: Jiang, H., Ding, W., Ali, M., Wu, X. (eds.) IEA/AIE 2012. LNCS, vol. 7345, pp. 156–165. Springer, Heidelberg (2012)
[13] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B.: Systematic Knowledge Acquisition for Question Analysis. In: Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, pp. 406–412 (2011)
[14] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B., Nguyen, P.-T., Le Nguyen, M.: From treebank conversion to automatic dependency parsing for Vietnamese. In: Métais, E., Roche, M., Teisseire, M. (eds.)
NLDB 2014. LNCS, vol. 8455, pp. 196–207. Springer, Heidelberg (2014)
[15] Ponzetto, S.P., Strube, M.: Knowledge derived from Wikipedia for computing semantic relatedness. Journal of Artificial Intelligence Research 30(1), 181–212 (2007)
[16] Tran, D., Chu, C., Pham, S., Nguyen, M.: Learning based approaches for Vietnamese question classification using keywords extraction from the web. In: Proceedings of the Sixth International Joint Conference on Natural Language Processing, pp. 740–746. Asian Federation of Natural Language Processing (October 2013)
[17] Tran, Q.H., Nguyen, N.D., Do, K.D., Nguyen, T.K., Tran, D.H., Le Nguyen, M., Pham, S.B.: A Community-based Vietnamese Question Answering System. In: Proceedings of the 2014 International Conference on Knowledge and Systems Engineering, KSE 2014 (2014)
[18] Zhao, L., Callan, J.: Term necessity prediction. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM 2010, pp. 259–268 (2010)
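
As a supplementary illustration of the phrasal overlap similarity used as the similarity feature in Section 3.2, the following Python sketch computes the overlap score and its tanh normalization. It rests on several assumptions that the paper does not state: sentences are lowercased and split on whitespace (no Vietnamese word segmentation), each distinct shared i-word phrase is counted once, and n defaults to the length of the shorter sentence. The function names are hypothetical.

```python
import math


def ngrams(tokens, i):
    """Set of distinct i-word phrases (as tuples) in a token list."""
    return {tuple(tokens[k:k + i]) for k in range(len(tokens) - i + 1)}


def phrasal_overlap(s1, s2, max_n=None):
    """Sum i**2 over every distinct i-word phrase shared by s1 and s2."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    # Assumption: n defaults to the length of the shorter sentence.
    n = min(len(t1), len(t2)) if max_n is None else max_n
    total = 0
    for i in range(1, n + 1):
        shared = ngrams(t1, i) & ngrams(t2, i)
        total += len(shared) * i * i
    return total


def phrasal_similarity(s1, s2):
    """tanh(overlap / (|s1| + |s2|)), with lengths counted in tokens."""
    length = len(s1.split()) + len(s2.split())
    if length == 0:
        return 0.0  # two empty sentences: define the similarity as 0
    return math.tanh(phrasal_overlap(s1, s2) / length)


if __name__ == "__main__":
    a = "how to delete messages"
    b = "how to delete photos"
    # Shared phrases: 3 unigrams, 2 bigrams, 1 trigram.
    print(phrasal_overlap(a, b), phrasal_similarity(a, b))
```

For the two toy sentences, the shared phrases contribute 3·1 + 2·4 + 1·9 = 20, so the similarity is tanh(20 / 8). How such a pairwise score is turned into a per-question SVM feature is not specified in the text, so the sketch stops at the pairwise measure.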

Date posted: 16/12/2017, 15:57