
Transductive Support Vector Machines for Cross-lingual Sentiment Classification

Nguyen Thi Thuy Linh
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi

Supervised by Professor Ha Quang Thuy

A thesis submitted in fulfillment of the requirements for the degree of Master of Computer Science

December, 2009

ORIGINALITY STATEMENT

"I hereby declare that this submission is my own work and, to the best of my knowledge, it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at the University of Engineering and Technology or any other educational institution, except where due acknowledgement is made in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged."

Signed

Abstract

Sentiment classification has received much attention and has many useful applications in business and intelligence. This thesis investigates the sentiment classification problem using machine learning techniques. Because Vietnamese sentiment corpora are scarce while many English sentiment corpora are available on the Web, we combine an English corpus as training data with a number of unlabeled Vietnamese reviews in a semi-supervised model. Machine translation eliminates the language gap between the training set and the test set in our model. Moreover, we examine several types of features to obtain the best performance. The results show that the semi-supervised classifier that leverages the cross-lingual corpus is considerably better than a classifier trained without it. In terms of features, we find that using only the unigram model yields the best performance.

Acknowledgements

I am grateful to my advisor, Associate Professor Ha Quang Thuy, who has guided and encouraged me since I was an undergraduate student. I have learned much about machine learning and natural language processing from him, and I appreciate his guidance and assistance. I thank Assistant Professor Nguyen Le Minh at JAIST (Japan Advanced Institute of Science and Technology) for his valuable comments and helpful suggestions since I started the research for this thesis. I am also thankful to the members of the Smart Integrated Systems Laboratory (SIS Lab) and the Information Systems Department at the University of Engineering and Technology, all of whom have always supported me in work and study. I thank the members of the Natural Language Processing Laboratory at JAIST for their collaboration during the time I was an exchange student there. I would like to dedicate this thesis to my wonderful family, from whom I have learnt many things about life, including the process of scientific thought.

Hanoi, December 2009
Nguyen Thi Thuy Linh

Table of Contents

1 Introduction
  1.1 Introduction
  1.2 What might be involved?
  1.3 Our approach
  1.4 Related works
    1.4.1 Sentiment classification
      1.4.1.1 Sentiment classification tasks
      1.4.1.2 Sentiment classification features
      1.4.1.3 Sentiment classification techniques
      1.4.1.4 Sentiment classification domains
    1.4.2 Cross-domain text classification
2 Background
  2.1 Sentiment Analysis
    2.1.1 Applications
  2.2 Support Vector Machines
  2.3 Semi-supervised techniques
    2.3.1 Generate maximum-likelihood models
    2.3.2 Co-training and bootstrapping
    2.3.3 Transductive SVM
3 The semi-supervised model for cross-lingual approach
  3.1 The semi-supervised model
  3.2 Review Translation
  3.3 Features
    3.3.1 Words Segmentation
    3.3.2 Part of Speech Tagging
    3.3.3 N-gram model
4 Experiments
  4.1 Experimental set up
  4.2 Data sets
  4.3 Evaluation metric
  4.4 Features
  4.5 Results
    4.5.1 Effect of cross-lingual corpus
    4.5.2 Effect of extraction features
      4.5.2.1 Using stopword list
      4.5.2.2 Segmentation and Part of speech tagging
      4.5.2.3 Bigram
    4.5.3 Effect of features size
5 Conclusion and Future Works
Appendix A
Appendix B

List of Figures

1.1 An application of sentiment classification
2.1 Visualization of opinion summary and comparison
2.2 Hyperplanes separate data points
3.1 Semi-supervised model with cross-lingual corpus
4.1 The effects of feature size
4.2 The effects of training size

List of Tables

3.1 An example of Vietnamese Words Segmentation
3.2 An example of Vietnamese Words Segmentation
3.3 An example of Unigrams and Bigrams
4.1 Tools and Applications in Use
4.2 The effect of cross-lingual corpus
4.3 The effect of selection features
A.1 Vietnamese Stopwords List by (Dan, 1987)
B.1 POS List by (VLSP, 2009)
B.2 subPos list by (VLSP, 2009)

Chapter 1

Introduction

1.1 Introduction

"What other people think" has always been an important piece of information for most of us during the decision-making process. Long before the explosion of the World Wide Web, we asked our friends to recommend an auto mechanic, to describe a movie they were planning to watch, or consulted Consumer Reports to decide which television to buy. But now, with the explosion of Web 2.0 platforms (blogs, discussion forums, review sites and various other types of social media), consumers have an unprecedented amount of power with which to share their experiences with a brand and their opinions. This development has made it possible to find out the biases and recommendations of a vast pool of people with whom we have no acquaintance.

On such social websites, users write comments regarding the subject under discussion. Blogs are one example: each entry or posted article is a subject, and friends give their opinions on it, whether they agree or disagree. Another example is a commercial website where products are purchased online: each product is a subject, and consumers leave comments about their experience after acquiring and using the product. Opinions are created in online documents in this way in plenty of instances. However, with very large amounts of such information available on the Internet, it should be organized to make the best use of it. As part of the effort to better exploit this information to support users, researchers have been
actively investigating the problem of automatic sentiment classification. Sentiment classification is a type of text categorization that labels a posted comment as belonging to the positive or the negative class; in some cases it also includes a neutral class. We focus only on the positive and negative classes in this work. In fact, labeling posted comments with consumer sentiment would provide succinct summaries to readers. Sentiment classification has many important applications in business and intelligence (Pang & Lee, 2008), and therefore deserves closer study.

Vietnam is no exception: there are more and more Vietnamese social websites and online commercial products, and they have become increasingly interesting to young people. Facebook (http://www.facebook.com) is a social network that now has about 10 million users. YouTube (http://www.youtube.com) is another famous website supplying clips that users watch and comment on. Nevertheless, Vietnamese data has received little attention so far, so we investigate sentiment classification on Vietnamese data as the work of this thesis.

We consider one application for merchant sites. A popular product may receive hundreds of consumer reviews, which makes it very hard for a potential customer to read them all when deciding whether to buy the product. In order to support customers, product review summarization systems are built. For example, assume that we summarize the reviews of a particular digital camera, Canon 8.1, as in Figure 1.1.

Canon 8.1:
  Aspect: picture quality
    - Positive:
    - Negative:
  Aspect: size
    - Positive:
    - Negative:

Figure 1.1: An application of sentiment classification

Picture quality and size are aspects of the product. A number of tasks make up such a summarization system, and sentiment classification is a crucial one of them.

[...]

Unlabeled Set (Unlabeled Vietnamese Reviews): We downloaded an additional 980 Vietnamese reviews from Vietnamese commercial websites and used them to construct the unlabeled set. In addition, we collected and labeled 20 product reviews (10 positive and 10 negative) from Vietnamese websites; those reviews are used to learn a classifier as a baseline. Note that the training set and the unlabeled set are used in the training phase, while the test set is blind to the training phase.

4.3 Evaluation metric

As a first evaluation measure we simply take the classification accuracy, meaning the percentage of reviews classified correctly. We also compute Precision, Recall and F1 for the identification of the individual classes (positive and negative). The metrics are defined as follows:

    Precision = |relevant documents ∩ retrieved documents| / |retrieved documents|
    Recall    = |relevant documents ∩ retrieved documents| / |relevant documents|
    F1        = 2 × Precision × Recall / (Precision + Recall)

In addition, we calculate the accuracy score; accuracy measures how close a reported quantity is to its actual value.

4.4 Features

Recall the n-gram model described in Chapter 3. In this thesis, we use unigrams and bigrams as features. The feature weight is calculated by the term frequency (TF) weight that is often used in information retrieval. This weight evaluates how important a word (or item) is to a document in a corpus. The importance increases proportionally to the number of times a word appears in the document. TF is defined as follows:

    TF(ti, dj) = (number of occurrences of term ti in document dj) / (total number of occurrences of all terms in document dj)
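To make the feature construction concrete, the following sketch builds unigram and bigram features and weights them with the TF formula above. It is only an illustration under assumptions, not the thesis code: the input is assumed to be an already segmented list of tokens (syllables of complex words joined by "_"), and the toy review at the end is hypothetical.

```python
from collections import Counter

def extract_ngrams(tokens, use_bigrams=False):
    """Unigram (and optionally bigram) features for one review.
    Bigrams join adjacent grams with "_", as described in Section 4.5.2.3."""
    features = list(tokens)
    if use_bigrams:
        features += [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return features

def tf_weights(tokens, use_bigrams=False):
    """TF weight of each feature: its count divided by the total number
    of feature occurrences in the document."""
    counts = Counter(extract_ngrams(tokens, use_bigrams))
    total = sum(counts.values())
    return {feat: n / total for feat, n in counts.items()}

# Hypothetical, already segmented Vietnamese review.
review = ["màn_hình", "đẹp", "pin", "tốt", "màn_hình", "sáng"]
print(tf_weights(review, use_bigrams=True))
```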
4.5 Results

4.5.1 Effect of cross-lingual corpus

In order to test our proposal, we built a classifier that uses only the 20 labeled reviews from commercial Vietnamese websites together with the Unlabeled Set as a baseline method. We then compare the classification performance of the model that makes use of the English labeled data against this baseline. The resulting classification accuracies are shown in lines (1) and (3) of Table 4.2, respectively. As a whole, our approach clearly surpasses the baseline without the English corpus by about 20%: using an available English corpus as supportive knowledge improves the classification performance significantly. Furthermore, our approach also performs well in comparison to the supervised technique that only employs the labeled data to learn the model, shown in line (2) of Table 4.2. Because the number of unlabeled examples is small relative to the number of labeled examples in the training set for semi-supervised learning, the increase in classification performance is unremarkable.

Table 4.2: The effect of cross-lingual corpus

No   Techniques       Training size   # of features   Accuracy   Pre    Re     F1
(1)  Semi-supervised  7536 + 980      20428           0.7125     0.71   0.72   0.71
(2)  Supervised       7536            20023           0.7062     0.71   0.73   0.71
(3)  Semi-supervised  20 + 980        2232            0.5181     0.52   0.49   0.50

In topic-based classification, SVM classifiers using bag-of-unigram features have been reported to achieve accuracies of 90% and above for particular categories (Joachims, 1999; Linh, 2006), and such results are for settings with more than two classes. This provides suggestive evidence that sentiment categorization is more difficult than topic classification, which corresponds to the observation above. Nonetheless, we still wanted to investigate ways to improve our sentiment categorization results; these experiments are reported below. In Table 4.2, we report Precision, Recall and F1 for the positive class.

4.5.2 Effect of extraction features

In order to improve the sentiment classification results, we performed tests based on the standard dataset described above.

4.5.2.1 Using stopword list

In text categorization research (Joachims, 1999; Linh, 2006), stoplists have been used in experiments. In topic-based classification, the important words are related to the topic a document belongs to, and we want to give such words more weight: generally, the more important a word is, the larger its weight. Stopwords, in contrast, appear in almost all documents, so removing them removes words that are meaningless for classification. In this study, we also test the effect of stopwords in documents. The classification results are illustrated in line (4) of Table 4.3. The result is lower than using unigrams alone, which shows a difference between topic-based classification and sentiment classification. We therefore wonder whether the supposedly important words have any effect in sentiment classification.

From the analysis above, we then test the influence of the vector weight. Recall that we represent each document d by a feature-count vector (n1(d), ..., nm(d)). In order to investigate whether reliance on frequency information could account for the higher accuracies of SVMs, we set ni(d) and nj(d) to the same weight. In other words, if feature fi appears three times and feature fj appears once in document d, fi and fj are given the same weight. Interestingly, this is in direct opposition to the observations of (Nigam et al., 2000) with topic classification. We speculate that this indicates a difference between sentiment and topic categorization, perhaps because topic is conveyed mostly by particular content words that tend to be repeated. As can be seen from line (2) of Table 4.3, the performance is not better than using only unigrams with feature frequencies.

Table 4.3: The effect of selection features

No   Features               # of features   Count   Accuracy   Training time (s)
(1)  unigram                20428           freq    0.7125     671.66
(2)  unigram                20428           pres    0.6958     1107
(3)  bigram                 231834          freq    0.7115     1450.44
(4)  remove_stop + unigram  20409           freq    0.6656     757.48
(5)  Seg + unigram          23661           freq    0.6958     523.27
(6)  pos + unigram          34906           freq    0.6771     1807.66
(7)  Subpos + unigram       40164           freq    0.6628     1387.37
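As a small, hypothetical illustration of the comparison just described (feature frequency versus feature presence, lines (1) and (2) of Table 4.3), the sketch below derives both weightings from the same tokenized document; it is not the thesis implementation.

```python
from collections import Counter

def frequency_weights(tokens):
    """Feature counts n_i(d): how often each feature occurs in document d."""
    return dict(Counter(tokens))

def presence_weights(tokens):
    """Every feature that occurs in d gets the same weight, regardless of
    how many times it appears."""
    return {tok: 1 for tok in set(tokens)}

doc = ["đẹp", "đẹp", "đẹp", "pin", "tốt"]   # hypothetical tokens
print(frequency_weights(doc))   # {'đẹp': 3, 'pin': 1, 'tốt': 1}
print(presence_weights(doc))    # all weights equal to 1
```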
4.5.2.2 Segmentation and Part of speech tagging

In line (5), we segment Vietnamese words and take each word as a feature (unigram model). In complex words, the syllables are connected by "_". We apply the word segmentation module belonging to the VLSP project. The results are shown in Table 4.3.

As a next step, we experimented with appending POS tags to every word using the POS tagging module of the VLSP project. The POS tagger assigns each word a subPos label (see Appendix B), so the number of features increases. After observing the data, we found that it is unnecessary to use subPos labels as features; the pos list (see Appendix B) is enough for distinguishing them. A word and its POS tag are formatted as follows: [word]-[Pos]. As can be seen from lines (6) and (7) of Table 4.3, better performance is achieved by using only the pos list, not the subPos list. However, the effect of this POS information seems to be a wash when comparing lines (1) and (6) of Table 4.3. These pieces of evidence again show the difference between topic-based classification and sentiment classification.

4.5.2.3 Bigram

We set up an experiment using the bigram model, in which each feature is a unigram or a bigram. The grams in a bigram are connected by "_". The result is shown in line (3) of Table 4.3. Looking at Table 4.3, the number of features in the bigram experiment is much larger than in the unigram experiment, and it also consumes more time in the training phase. However, the result is not better than the unigram model. For that reason, we did not run the bigram model after word segmentation or POS tagging.

4.5.3 Effect of features size

In the above experiments, we examined the influence of the type of features (unigram, bigram, and unigram combined with NLP techniques). In this section, we further conduct experiments to investigate the influence of the feature size on the classification results. As can be seen from Figure 4.1, the feature size strongly influences the classification accuracy of the methods: the larger the feature set, the better the performance. We chose the features with the highest frequency.

Figure 4.1: The effects of feature size

We also perform an experiment to examine the influence of the training size on the classification results. We run the experiment 10 times and take the average for each training size to examine its effect on the classification accuracy. As can be seen from Figure 4.2, classification accuracy rises as the training size increases.

Figure 4.2: The effects of training size
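The following sketch shows one way the feature-size experiment above could be organised: keep only the k most frequent features, train a classifier, and record test accuracy for increasing k. It is a hypothetical harness, not the thesis code; scikit-learn's LinearSVC is used only as a stand-in (the thesis uses a transductive SVM), the size grid is an assumption, and the tokenized review lists and label arrays are assumed to be supplied by the caller.

```python
from collections import Counter

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

def top_k_vocabulary(token_docs, k):
    """Index of the k features with the highest corpus frequency."""
    counts = Counter(tok for doc in token_docs for tok in doc)
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(k))}

def tf_matrix(token_docs, vocab):
    """Term-frequency vectors restricted to the chosen vocabulary."""
    X = np.zeros((len(token_docs), len(vocab)))
    for row, doc in enumerate(token_docs):
        for tok in doc:
            col = vocab.get(tok)
            if col is not None:
                X[row, col] += 1
        total = X[row].sum()
        if total > 0:
            X[row] /= total
    return X

def feature_size_curve(train_docs, y_train, test_docs, y_test,
                       sizes=(1000, 5000, 10000, 20000)):
    """Accuracy obtained when only the top-k most frequent features are kept."""
    results = {}
    for k in sizes:
        vocab = top_k_vocabulary(train_docs, k)
        clf = LinearSVC()            # stand-in for the transductive SVM
        clf.fit(tf_matrix(train_docs, vocab), y_train)
        predictions = clf.predict(tf_matrix(test_docs, vocab))
        results[k] = accuracy_score(y_test, predictions)
    return results
```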
Chapter 5

Conclusion and Future Works

In this work, we have investigated sentiment classification, which has many applications in business, intelligence and consumer support. The motivation for our work is that a large labeled dataset is often expensive to obtain. We addressed this problem by leveraging a cross-lingual dataset. We showed that our approach of incorporating features derived from labeled English data and unlabeled Vietnamese data into a semi-supervised model can provide substantial improvements. In order to improve the classification accuracy, we also performed experiments with several distinct types of features.

The results produced by the semi-supervised classifier that leverages the cross-lingual corpus are quite good compared with those of the classifier without the cross-lingual corpus. Demonstrating the potential of semi-supervised learning, the semi-supervised classifier also outperforms the purely supervised classifier, although the differences are not very large. On the other hand, we were not able to obtain better accuracy on the sentiment classification problem than the results reported for standard topic-based categorization, despite trying several different types of features. The unigram model with frequency information turns out to be the most effective; in fact, none of the alternative features that we applied produced consistently better performance once unigram frequency information was incorporated.

Semi-supervised learning is an approach that aims at making use of unlabeled data in order to improve classifier performance. Among the pool of semi-supervised algorithms, the Transductive Support Vector Machine is an effective algorithm for text classification, and therefore our approach is based on it. The Transductive Support Vector Machine provides promising results.
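To give a concrete, if simplified, picture of how unlabeled reviews can influence the decision boundary, the sketch below is a self-training approximation written for this summary: an SVM is fitted on the labeled (translated) reviews, the unlabeled Vietnamese reviews the model is most confident about are pseudo-labeled and added to the training set, and the model is refitted. It is not Joachims' (1999) TSVM algorithm and not the thesis implementation; a dedicated TSVM (for example, the transductive mode of SVMlight) optimises over labeled and unlabeled examples jointly rather than using this greedy loop.

```python
import numpy as np
from sklearn.svm import LinearSVC

def self_training_svm(X_labeled, y_labeled, X_unlabeled,
                      rounds=5, per_round=50):
    """Greedy self-training stand-in for a transductive SVM.

    X_labeled, y_labeled : numpy arrays with features and 0/1 labels of the
                           labeled (translated English) reviews.
    X_unlabeled          : numpy array with features of the unlabeled
                           Vietnamese reviews.
    """
    X_l, y_l = X_labeled.copy(), y_labeled.copy()
    remaining = X_unlabeled.copy()
    clf = LinearSVC().fit(X_l, y_l)
    for _ in range(rounds):
        if len(remaining) == 0:
            break
        scores = clf.decision_function(remaining)
        # Pick the unlabeled reviews the current model is most confident about.
        confident = np.argsort(-np.abs(scores))[:per_round]
        pseudo = (scores[confident] > 0).astype(int)   # 1 = positive class
        X_l = np.vstack([X_l, remaining[confident]])
        y_l = np.concatenate([y_l, pseudo])
        remaining = np.delete(remaining, confident, axis=0)
        clf = LinearSVC().fit(X_l, y_l)                # refit with pseudo-labels
    return clf
```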
Even so, the differences discussed above make sentiment classification more difficult than topic-based text classification. How, then, might we improve it? We may develop this work further by combining a sentiment word list with the classifier, assigning scores to sentiment words. Alternatively, we can run another machine translator specialised for Vietnamese and English to obtain a better translation. Investigating sentiment classification, which has high feasibility and applicability, is an important contribution to mining unstructured documents. This work is part of a summarization project for online commercial product reviews, and the results are expected to reveal insights about the approach and to motivate a summarization system that can be effective in practice.

Appendix A

Table A.1: Vietnamese Stopwords List by (Dan, 1987)

cho có có điều không hay lại có không cho có không hay không lại còn cho dù còn dù hay không lẽ vì lẽ cho hay cho hay cũng có điều giá hồ hồ có không lại lẽ nên lẽ

Appendix B

Table B.1: POS List by (VLSP, 2009)

No   idPOS   enPOS
1    N       noun
2    V       verb
3    A       adjective
4    P       pronoun
5    M       numeral
6    D       determiner
7    R       adverb
8    E       preposition
9    C       conjunction
10   I       auxiliary word
11   O       emotivity word
12   Z       component stem
13   X       undetermined

Table B.2: subPos list by (VLSP, 2009)

No   idPOS   idSubPOS   enPOS
1    N       Np         proper noun
2    N       Nt         countable noun
3    N       Ng         collective noun
4    N       Na         abstract noun
5    N       Nc         classifier noun
6    N       Nl         locative noun
7    N       Nu         unit noun
8    V       Vi         intransitive verb
9    V       Vt         transitive verb
10   V       Vs         state verb
11   V       Vm         modal verb
12   A       Ap         property adjective
13   A       Ar         relative adjective
14   A       Ao         onomatopoetic adjective
15   A       Ai         pictographic adjective
16   P       Pp         personal pronoun
17   P       Pd         demonstrative pronoun
18   P       Pq         quality pronoun
19   P       Pi         interrogative pronoun
20   M       Mc         cardinal numeral
21   M       Mo         ordinal numeral
22   D       D          determiner
23   R       R          adverb
24   E       E          preposition
25   C       C          conjunction
26   I       I          auxiliary word
27   O       O          emotivity word
28   Z       Z          component stem
29   X       X          undetermined

Bibliography

Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of ACL.

Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of COLT-98.

Dan, N. D. (1987). Logic of syntax. Hanoi: University and College Publisher.

Efron, M. (2004). Cultural orientation: Classifying subjective documents by cociation analysis. In Proceedings of the AAAI Fall Symposium Series on Style and Meaning in Language, Art, Music and Design.

Gamon, M., Aue, A., Corston-Oliver, S., & Ringger, E. (2005). Pulse: Mining customer opinions from free text. In Advances in Intelligent Data Analysis VI (pp. 121–132).

Hu, M., & Liu, B. (2004a). Mining and summarizing customer reviews. In Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 168–177). New York, NY, USA: ACM Press.

Hu, M., & Liu, B. (2004b). Mining opinion features in customer reviews. In Proceedings of the Nineteenth National Conference on Artificial Intelligence (pp. 755–760). San Jose, USA.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning (ECML).

Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proceedings of ICML.
Linh, N. T. T. (2006). Classification of Vietnamese webpages with independent language.

Mullen, T., & Collier, N. (2004). Sentiment analysis using support vector machines with diverse information sources. In Proceedings of EMNLP.

Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning.

Pang, B., & Lee, L. (2004). A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of ACL.

Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis.

Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of EMNLP.

Tu, N. C., Nguyen, T.-K., Phan, X.-H., Nguyen, L.-M., & Ha, Q.-T. (2006). Vietnamese word segmentation with CRFs and SVMs: An investigation. In Proceedings of the Pacific Asia Conference on Language, Information and Computation (PACLIC).

Turney, P. D. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of ACL.

Turney, P. D., & Littman, M. L. (2002). Unsupervised learning of semantic orientation from a hundred-billion-word corpus.

Vapnik, V. (1998). Statistical Learning Theory. Wiley.

VLSP (2009). http://vlsp.vietlp.org:8080/demo/?page=home

Wan, X. (2008). Using bilingual knowledge and ensemble techniques for unsupervised Chinese sentiment analysis. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (pp. 553–561). Honolulu.

Wan, X. (2009). Co-training for cross-lingual sentiment classification. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP (pp. 235–243). Suntec, Singapore.
