
Automatically Learning Patterns in Subjectivity Classification for Vietnamese


Article · January 2015. DOI: 10.1007/978-3-319-11680-8_50

Automatically Learning Patterns in Subjectivity Classification for Vietnamese

Tran-Thai Dang¹, Nguyen Thi Xuan Huong¹,², Anh-Cuong Le¹, and Van-Nam Huynh³

¹ University of Engineering and Technology, Vietnam National University, Hanoi, 144 Xuanthuy, Caugiay, Hanoi, Vietnam
² Haiphong Private University, 36 Danlap, Duhangkenh, Lechan, Haiphong, Vietnam
³ Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa, Japan
{thaidangtran12@gmail.com; huong_ntxh@hpu.edu.vn; cuongla@vnu.edu.vn; huynh@jaist.ac.jp}

Abstract. Opinions are subjective expressions that describe people's viewpoints, perspectives, or feelings about entities and events. They are essential information for sentiment analysis; therefore opinion detection, which is also called subjectivity classification, is an important task. In this paper, we propose a statistical method to automatically create patterns for detecting opinions in various resources on the web. The learned patterns are more
flexible and adaptive to domain in comparison with manually created ones. In this work, we obtained approximately 84% accuracy on Vietnamese comment data.

1 Introduction

The sentiment analysis process includes crawling, extracting, and analyzing people's opinions shared on forums, news portals, and social networks. It helps manufacturers gain real feedback to improve their products, and helps customers get useful information when deciding what to buy. After crawling data from the Internet, we have to determine whether each comment is subjective (it contains opinions) or objective (it merely states facts). Subjectivity classification is considered the first step of the sentiment analysis process; the subjective comments are normally used in the next step, which determines whether they are positive, negative, or neutral.

Several methods have been introduced to find words and phrases that express opinions. Most previous work was carried out on English data; however, those methods are not fully effective when applied to Vietnamese data. In this paper, we focus on determining subjective comments in Vietnamese data. Investigating comments on several Vietnamese forums and blogs shows that people usually use adjectives and verbs to express their opinions, for example: “đẹp” (nice), “xấu” (ugly), “tốt” (good), “mượt mà” (smooth), “thích” (like), “ghét” (hate), “cảm thấy” (feel), etc. Therefore, adjectives and verbs can be strong clues for distinguishing subjective from objective comments.

The words and phrases that express opinions can be extracted based on a sentiment dictionary, n-grams, or syntactic patterns. Among these, syntactic patterns are useful for enriching the feature set. Patterns can be created manually based on knowledge of the specific language, such as its grammar and POS tags. For example, we can build a POS pattern as follows: “Con/Nu Nokia/Np này/P nhìn/V
rất/R đẹp/A.” (This Nokia looks very nice) (Nu: unit noun; Np: proper noun; P: pronoun; V: verb; R: adverb; A: adjective). In this example, the phrase “nhìn rất đẹp” (looks very nice) is a component that expresses an opinion, and it can be extracted with the pattern V-R-A. Manual creation not only requires much time but also makes it difficult to cover all the rules needed to find subjective features. On Vietnamese forums and blogs, people often use spoken language and slang, which are short and informal. Hence, we need to propose suitable patterns and then investigate and evaluate their influence.

To deal with this problem, we introduce a statistical method that lets the system learn syntactic patterns and evaluate them on labeled training data. The learning process includes two main steps: pattern identification and pattern evaluation. The system determines whether each pattern is frequently used to express opinions. In our work, the training data are tagged with two labels, subjectivity and objectivity. After that, the system extracts and evaluates the subjective patterns to build the feature set.

The patterns may be created using a syntactic parse tree or POS information. We chose POS for two reasons: firstly, people often use spoken language that includes incomplete sentences (sentences lacking a subject or predicate), so it is difficult to obtain a correct parse tree; secondly, POS information is easier to adapt to a domain than a parse tree, because we can learn POS tags with a statistical approach. This method can be applied to many languages without deep knowledge of their syntax. Moreover, the POS patterns learned from the training data are more flexible and adaptive to domain than manually created ones.

2 Related Work

Subjectivity classification focuses on building a good feature set to improve the system's performance. Janyce Wiebe [6] identified strong clues of subjectivity based on distributional
similarity, using a small manually annotated seed set to develop promising adjective features. In [1], Pang et al. used n-grams as features for polarity classification. In another approach to enriching the feature set, Peter D. Turney [10] used patterns to extract phrases containing adjectives or verbs. Similarly to English, people usually use adjectives and verbs to express their opinions in Vietnamese data, so these are important information for extracting features. E. Riloff et al. [3] used two bootstrapping algorithms that extract patterns to learn a set of subjective nouns. In [4], [5], and [7], syntactic information was used to create the patterns. Creating patterns to extract features requires linguistic knowledge or a small set of sentiment seed words [3]; in contrast, we propose a statistical method which learns the patterns from labeled training data.

POS information has also been used in various previous studies to determine features. Gamon [14] performed sentiment analysis on feedback data and analyzed the role of linguistic features like POS tags. Pak and Paroubek [13] reported that both POS and bigram features help subjectivity classification of tweets. Barbosa and Feng [15] proposed using syntax features of tweets, such as retweets, hashtags, links, punctuation, and exclamation marks, in conjunction with features like the prior polarity of words and POS tags. Agarwal et al. [16] extended Barbosa and Feng (2010) by using a combination of real-valued polarity with POS and reported that POS features are important to classification accuracy. In [9], M. Sokolova and G. Lapalme proposed a hierarchical text representation and built domain-independent rules that do not rely on domain content words and emotional words. Other observed characteristics may also be used to recognize subjective expressions, such as word lengthening in [12] and emoticons in [11]. After extracting the features, many classification techniques are typically used to assign labels to comments. Pang and Lee used Support Vector Machines (SVM), Naive
Bayes (NB), and Maximum Entropy (ME) in [1]. Riloff used SVM in [4]. We also investigate some classification algorithms, such as SVM and NB, in our empirical work.

3 Our Approach

Motivated by the use of syntactic patterns and POS information to extract features on English data in previous works, we build a set of Vietnamese patterns that help to enrich the feature set. Unlike previous research, our approach chooses POS information to build the patterns because it is easy to adapt to a domain. In the learning process, we determine the forms of the patterns, which are called templates. This process is illustrated in Figure 1 (POS pattern learning process). In the first stage of this process, we extract all patterns from the labeled training data using the predefined templates. In the second stage, the patterns are evaluated to select the best set of patterns. The evaluation process contains two steps:

• Evaluating to get sets of acceptable patterns. A pattern is acceptable if and only if it appears more frequently in subjective comments than in objective comments.
• Evaluating to get the best set of patterns. The best set is selected from the sets of acceptable patterns by classifying the training data with the set of features (adjectives and verbs) extracted from the patterns.

3.1 Training Data

In the training data, each comment is tagged with one of two labels, subjectivity or objectivity, and the number of subjective comments is equal to the number of objective comments. For example:

• Giá tốt (good price)
• Xấu tệ hại (very ugly)
• Tôi không thích thiết kế (I don't like the design of this machine)
• Tôi tính mua em thay cho em Qmobile M45 (I am considering buying this machine instead of the Qmobile M45)
• Lại đổi giao diện (change the interface)
• Có màu bạc không có hàng (got this product but there is no silver product)

Because we work on Vietnamese text, the training data is spell-checked and segmented into words before being tagged with POS labels.

3.2 Template Definition

We mainly use
adjectives and verbs as the features, so the templates are created from them and their surrounding POS labels. The surrounding POS label may be a noun (N), proper noun (Np), another adjective (A) or verb (V), adverb (R), coordinating conjunction (Cc), or auxiliary (T). We propose two types of templates to learn:

• Type 1: templates that extract patterns containing only POS labels. We consider the verb or adjective together with its surrounding POS labels on the left side, the right side, or both sides.
• Type 2: similar to type 1, but more specific: these templates extract patterns that contain the word itself (an adjective or verb) and its surrounding POS labels.

We can use templates of type 1 or type 2; in this paper, we show experimental results when applying both types for extracting patterns. Table 1 and Table 2 illustrate examples of the templates of the two types. For example, the example sentence in Section 1 has one adjective; if we use the tag-tag[-1] template of Table 1 on it, we extract the pattern R-A. The templates shown in the two tables can be expanded by increasing the number of surrounding POS labels.

Table 1. Templates of type 1 (applied when the current tag is an adjective or verb)
tag-tag[+1]: the template contains the current tag and the next tag
tag-tag[-1]: the template contains the current tag and the previous tag
tag-tag[-1] & tag[+1]: the template contains the current tag, the previous tag, and the next tag (the tags on both sides of the current tag)
tag-tag[+2]: the template contains the current tag and the two next tags
tag-tag[-2]: the template contains the current tag and the two previous tags

Table 2. Templates of type 2 (applied when the current tag is an adjective or verb)
word-tag[+1]: the template contains the current word and the next tag
word-tag[-1]: the template contains the current word and the previous tag
word-tag[-1] & tag[+1]: the template contains the current word, the previous tag, and the next tag (the tags on both sides of the current word)
word-tag[+2]: the template contains the current word and the two next tags
word-tag[-2]: the template contains the current word and the two previous tags

As in previous works, N-grams (unigrams, bigrams) of words are also used as classification features. N-grams (computed after segmenting the training data into words) are learned in the same way as the patterns; we consider an N-gram to be the equivalent of a phrase extracted from a pattern.

3.3 Pattern Extraction and Evaluation

The predefined templates (Section 3.2) are applied to the POS-tagged training data to extract all possible patterns. After that, we evaluate the patterns to get the best set, which is the result of the learning process. The evaluation process contains the two steps mentioned in Section 3.

Acceptable pattern evaluation: We aim to find the patterns in the training data which characterize subjective expressions. That means we only consider patterns that satisfy the constraint:

    P(subjective | pattern_i) > P(objective | pattern_i)

i.e., a pattern is acceptable if and only if it appears more frequently in subjective comments than in objective comments in the training data. The following formula is used to obtain the sets of acceptable patterns:

    P(subjective | pattern_i) / (P(subjective | pattern_i) + P(objective | pattern_i)) > threshold

To satisfy the constraint above, the threshold must be greater than 0.5. The threshold can be increased to get the
different sets of acceptable patterns; its range is [0.5, 1.0) (0.5 ≤ threshold < 1.0). In other words, we can generate a new set by changing the threshold. By increasing the threshold, a new set is generated whose number of patterns may be smaller than the old set's, so the set of acceptable patterns is narrowed. This evaluation step is considered the first filter in the learning process: it helps us remove a large number of patterns that are unlikely to express opinions. The best set of patterns is selected from the remaining patterns.

Best pattern set evaluation: This step aims to get the best set of patterns from the sets of acceptable patterns. We use the training data for evaluation; in this case, the training data plays the role of development data. A set of acceptable patterns is selected as the best set if it achieves the highest accuracy when classifying the training data. We assume that the best set of patterns on the training data will also be the best set on other data; that is, other data is similar to the training data in its sentence grammar, so the best training data must cover most syntactic structures of subjective sentences. In fact, however, the set of patterns achieving the best performance on the training data can be worse on other data because of differences in the distribution and grammar of sentences. From each set of acceptable patterns, we extract phrases and then take the adjectives and verbs in those phrases as the features. These features are used to classify the training data, evaluated by 10-fold cross-validation; we then select the set with the highest accuracy. Using the training data as development data to evaluate acceptable patterns adapts the method to the domain and satisfies our assumption. This helps us build a set of patterns which is more flexible and diverse; the quality of the patterns depends on the training data.

4 Experiment and Discussion

4.1 Experimental Data

Our experiment is conducted on
technical product review data (reviews of mobile phones, laptops, tablets, cameras, and TVs). We collected data from some Vietnamese technical forums such as tinhte.vn, voz.vn, and thegioididong.com using the Scrapy framework (http://scrapy.org). After that, we removed non-diacritic comments and corrected spelling errors in the comments. We manually labeled 9000 collected Vietnamese comments with the two kinds of labels, subjective and objective (Section 3.1). We then divided this annotated data into two parts. The first part contains 3000 subjective comments and 3000 objective comments, used as the training data to learn the patterns. The remaining 3000 comments are used to test the quality of the learned patterns. The training and testing data are segmented into words and tagged with POS labels. We used some classification tools in Weka (http://www.cs.waikato.ac.nz/ml/weka/) both for evaluation in the learning process and for evaluating the quality of the learned patterns on test data.

4.2 Experimental Results

Learning process: Firstly, we learned N-grams (unigrams, bigrams) of words from the training data with thresholds in the range [0.5; 0.6; 0.7; 0.8; 0.9]. To reduce the number of bigrams, we only used bigrams that appear at least twice in the training data to build the feature set. The results of this process are shown in Table 3. Unigrams and bigrams are used as features to classify the training data. We used the LIBLINEAR library in Weka, which implements SVM, for classification, and evaluated classification performance with 10-fold cross-validation.

Table 3. Classification results of unigram and bigram
threshold   unigram   bigram
0.5         82.59%    72.93%
0.6         83.14%    73.52%
0.7         83.47%    75.52%
0.8         83.29%    77.47%
0.9         81.82%    79.27%

The results of this experiment are also shown in Figure 2, from which we can see that unigrams are better than bigrams for building the feature set. Moreover, we investigated whether combining unigrams and bigrams makes a better feature set. We combined the best set of unigrams (threshold = 0.7)
and the best set of bigrams (threshold = 0.9) into one feature set and got 85.03% accuracy. Although using bigrams without unigrams makes the system's performance decline, bigrams can enrich the set of unigram features.

Secondly, we learned the POS patterns of the two types described in Section 3.2. The predefined templates were applied to the training data to extract all patterns, again using thresholds in the range [0.5; 0.6; 0.7; 0.8; 0.9] to generate the sets of acceptable patterns. The patterns in each set were applied to the training data to extract phrases, adjectives, and verbs as features for subjectivity classification. We also used LIBLINEAR in Weka to evaluate each set of patterns. The results are shown in Table 4 (patterns from templates of type 1) and Table 5 (patterns from templates of type 2).

Figure 2. Classification results of unigram and bigram (%)

Table 4. Classification results of learned patterns of type 1
threshold   tag-tag[+1]   tag-tag[-1]   tag-tag[+2]   tag-tag[-1] & tag[+1]   tag-tag[-2]
0.5         82.81%        82.82%        83.29%        82.66%                  82.62%
0.6         81.16%        79.54%        81.41%        81.46%                  81.67%
0.7         75.47%        79.41%        80.56%        78.27%                  80.99%
0.8         67.92%        50.03%        76.19%        76.17%                  76.81%
0.9         50.03%        50.03%        71.32%        72.74%                  68.95%

In Table 4, the template tag-tag[+2] with threshold 0.5 is the best template. It yields 72 extracted patterns; some of them, with their extracted phrases, are illustrated in Table 6. In Table 5, the template word-tag[-1] with threshold 0.9 is the best case; it yields 2175 patterns, some of which are illustrated with their extracted phrases in Table 7. Note: A (adjective); V (verb); R (adverb); N (noun).

Testing the learned patterns: We investigated the quality of the feature sets built from n-grams, from the words, or from the words and phrases extracted by the learned patterns. The results are shown in Table 8, from which we observe:

• Based on the results in rows 1-5, we can compare the quality of the features from POS patterns with those from N-grams. Compared with unigrams or
bigrams or the combination of unigrams and bigrams, the features of POS patterns of both types are better, but not significantly.

Table 5. Classification results of learned patterns of type 2
threshold   word-tag[+1]   word-tag[-1]   word-tag[+2]   word-tag[-1] & tag[+1]   word-tag[-2]
0.5         82.95%         82.87%         82.69%         82.89%                   82.57%
0.6         82.77%         83.06%         82.91%         82.76%                   82.86%
0.7         82.66%         83.02%         82.91%         82.89%                   82.69%
0.8         82.82%         83.21%         82.97%         82.77%                   82.82%
0.9         82.42%         83.37%         83.06%         82.96%                   82.71%

Table 6. Learned patterns of type 1
V-R-A: nhìn bóng_bẩy; chụp tốt; nhìn lạ
V-A-A: ốp cực; nhìn đẹp thiệt; xài tốt
A-R-A: thực_sự ấn tượng; đen menly; tiện_lợi lại sang_trọng

Table 7. Learned patterns of type 2
R yếu: yếu; không yếu; yếu
R hay: hay; hay; không hay
V xấu: trông xấu; thiết_kế xấu; nhìn xấu
V yếu_ớt: nhìn yếu_ớt; thấy yếu_ớt
N khỏe: cấu_hình khỏe; sóng khỏe; máy khỏe

Table 8. Classification results on the testing data
#  feature set                               # of features   SVM      Naive Bayes
1  unigram                                   3894            82.29%   79.28%
2  bigram                                    3210            64.25%   59.24%
3  unigram + bigram                          7104            82.54%   79.67%
4  words of patterns (type 1)                2972            82.56%   77.46%
5  words of patterns (type 2)                1655            82.68%   68.27%
6  words and phrases of patterns (type 1)    4472            82.56%   77.46%
7  words and phrases of patterns (type 2)    3062            82.68%   68.27%
8  words (type 1) + unigram + bigram         9847            83.47%   78.22%
9  words (type 2) + unigram + bigram         8421            84.03%   76.72%

• The classification results in rows 4 and 5 show that the quality of the patterns of type 1 and type 2 is similar; in practice, we can use either type to extract the patterns.
• The results in rows 6 and 7 show that the phrases extracted from these patterns are not good features, since they do not improve classification performance.
• Rows 8 and 9 contain the results of combining unigrams, bigrams, and features from POS patterns. We can see that the learned patterns help to enrich the feature set: the patterns of type 1 provide about 2700 additional features and the patterns of type 2 provide 1317 additional features. The new features from these patterns together with N-grams help to increase the
accuracy of subjectivity classification (about 1.5% higher in comparison with using only the patterns or only N-grams). However, the results are not very high, for two reasons. Firstly, the feature set suffers from errors of spelling, word segmentation, and POS tagging; spelling errors in particular can introduce noise into the feature set. Secondly, the quality and variety of the training data also affect the performance of the system.

5 Conclusion and Future Work

This paper focuses on the subjectivity classification problem, for which we have proposed a new statistical method to enrich the feature set based on POS patterns. In our work, the patterns are built automatically from labeled training data, which makes them more flexible and adaptive to domain. We used the learned patterns to extract word collocations around two types of words, adjectives and verbs. Support Vector Machines (SVM) and Naive Bayes (NB) were applied to determine whether a given comment belongs to the subjective or objective class. In our experiments, by combining unigrams, bigrams, and words extracted from POS patterns, we obtained 84.03% accuracy with SVM as the best case on the Vietnamese technical product review data. In future work, we will extend the feature set by using other POS tags, and we can also exploit more templates to generate POS patterns.

Acknowledgment. This paper is supported by the project QGTĐ.12.21 funded by Vietnam National University, Hanoi.

References

1. Bo Pang, Lillian Lee, Shivakumar Vaithyanathan: Thumbs up?
Sentiment classification using machine learning techniques. In Proceedings of EMNLP 2002, pages 79-86.
2. Chong Long, Jie Zhang, Xiaoyan Zhu: A review selection approach for accurate feature rating estimation. In Proceedings of the 23rd International Conference on Computational Linguistics, 2010, pages 766-774.
3. Ellen Riloff, Janyce Wiebe, Theresa Wilson: Learning subjective nouns using extraction pattern bootstrapping. In Proceedings of the Seventh CoNLL Conference held at HLT-NAACL 2003, pages 25-32.
4. Ellen Riloff, Siddharth Patwardhan, Janyce Wiebe: Feature subsumption for opinion analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2006, pages 440-448.
5. Huong Nguyen Thi Xuan, Anh Cuong Le, Le Minh Nguyen: Linguistic features for subjectivity classification. In International Conference on Asian Language Processing (IALP), 2012, pages 17-20.
6. Janyce Wiebe: Learning subjective adjectives from corpora. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, 2000, pages 735-740.
7. Janyce Wiebe, Theresa Wilson, Rebecca Bruce, Matthew Bell, Melanie Martin: Learning subjective language. Computational Linguistics, Volume 30, Issue 3, September 2004, pages 277-308.
8. Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly Voll, Manfred Stede: Lexicon-based methods for sentiment analysis. Computational Linguistics, Volume 37, Issue 2, 2011, pages 267-307.
9. M. Sokolova, G. Lapalme: Opinion classification with non-affective adjectives and adverbs. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2009).
10. Peter D. Turney: Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pages 417-424.
11. Sara Rosenthal, Kathy McKeown: Columbia NLP: sentiment detection of subjective
phrases in social media. In Conference on Lexical and Computational Semantics, 2013.
12. Samuel Brody, Nicholas Diakopoulos: Cooooooooooooooollllllllllllll!!!!!!!!!!!!!!: using word lengthening to detect sentiment in microblogs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2011, pages 562-570.
13. A. Pak, Patrick Paroubek: Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, European Language Resources Association (ELRA), 2010.
14. Michael Gamon: Sentiment classification on customer feedback data: noisy data, large feature vectors, and the role of linguistic analysis. In Proceedings of COLING-04, the 20th International Conference on Computational Linguistics, 2004, pages 841-847.
15. L. Barbosa, J. Feng: Robust sentiment detection on Twitter from biased and noisy data. In Proceedings of COLING '10, the 23rd International Conference on Computational Linguistics, 2010, pages 36-44.
16. A. Agarwal, B. Xie, I. Vovsha, O. Rambow, R. Passonneau: Sentiment analysis of Twitter data. In Proceedings of LSM '11, the Workshop on Languages in Social Media, 2011, pages 30-38.
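The two core steps of the learning procedure in Section 3 (applying type-1 templates to POS-tagged, labeled comments to collect candidate patterns, then keeping only the acceptable patterns whose subjective-to-total frequency ratio exceeds a threshold) can be summarized in a short, self-contained sketch. This is a hypothetical reconstruction on toy data: the template subset, function names, and example comments are ours, not the authors' implementation, which used Weka/LIBLINEAR.

```python
# Illustrative sketch of the pattern-learning steps in Section 3.
# Toy data and names are assumptions; the paper's system used Weka/LIBLINEAR.
from collections import Counter

# A few type-1 templates from Table 1, written as offsets around an
# anchor tag that must be an adjective (A) or verb (V).
TEMPLATES = {
    "tag-tag[+1]": (0, 1),
    "tag-tag[-1]": (-1, 0),
    "tag-tag[+2]": (0, 1, 2),
}

def extract_patterns(tags):
    """Collect POS patterns anchored at each adjective (A) or verb (V)."""
    found = set()
    for i, tag in enumerate(tags):
        if tag not in ("A", "V"):
            continue
        for offsets in TEMPLATES.values():
            # Apply the template only if it fits inside the sentence.
            if all(0 <= i + d < len(tags) for d in offsets):
                found.add("-".join(tags[i + d] for d in offsets))
    return found

def acceptable_patterns(labeled_comments, threshold=0.5):
    """Keep patterns with P(subj|p) / (P(subj|p) + P(obj|p)) > threshold."""
    subj, obj = Counter(), Counter()
    for tags, label in labeled_comments:
        counter = subj if label == "subjective" else obj
        counter.update(extract_patterns(tags))
    keep = set()
    for p in set(subj) | set(obj):
        if subj[p] / (subj[p] + obj[p]) > threshold:
            keep.add(p)
    return keep

# Toy labeled data: the POS tags of "Con Nokia này nhìn rất đẹp"
# (subjective) and of a short factual comment (objective).
data = [
    (["Nu", "Np", "P", "V", "R", "A"], "subjective"),
    (["N", "V", "N"], "objective"),
]
print(sorted(acceptable_patterns(data, threshold=0.5)))
# → ['P-V', 'R-A', 'V-R', 'V-R-A']
```

Raising `threshold` toward 1.0 narrows the accepted set, mirroring the first filter of the evaluation process; the adjectives and verbs inside the matched phrases would then serve as classification features.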
