Transductive support vector machines for cross-lingual sentiment classification

Nguyễn Thị Thùy Linh
Trường Đại học Công nghệ (University of Engineering and Technology)
Master's thesis, major: Computer Science; Code: 60 48 01
Supervisor: Assoc. Prof. Dr. Hà Quang Thụy
Year of defense: 2009

Abstract

Sentiment classification has received much attention and has many useful applications in business and intelligence. This thesis investigates the sentiment classification problem using machine learning techniques. Vietnamese sentiment corpora are limited, while many English sentiment corpora are available on the Web. We therefore combine English corpora as training data with a number of unlabeled Vietnamese documents in a semi-supervised model. Machine translation eliminates the language gap between the training set and the test set in our model. Moreover, we examine several types of features to obtain the best performance. The results show that the semi-supervised classifier makes good use of the cross-lingual corpus compared with a classifier trained without it. In terms of features, we find that using only the unigram model gives the best performance.

Keywords: Computer science; Information technology; Data; Language

Table of Contents

Introduction
1.1 Introduction
1.2 What might be involved?
1.3 Our approach
1.4 Related works
1.4.1 Sentiment classification
1.4.1.1 Sentiment classification tasks
1.4.1.2 Sentiment classification features
1.4.1.3 Sentiment classification techniques
1.4.1.4 Sentiment classification domains
1.4.2 Cross-domain text classification
Background
2.1 Sentiment Analysis
2.1.1 Applications
2.2 Support Vector Machines
2.3 Semi-supervised techniques
2.3.1 Generative maximum-likelihood models
2.3.2 Co-training and bootstrapping
2.3.3 Transductive SVM
The semi-supervised model for cross-lingual approach
3.1 The semi-supervised model
3.2 Review Translation
3.3 Features
3.3.1 Word Segmentation
3.3.2 Part of Speech Tagging
3.3.3 N-gram model
Experiments
4.1 Experimental set up
4.2 Data sets
4.3 Evaluation metric
4.4 Features
4.5 Results
4.5.1 Effect of cross-lingual corpus
4.5.2 Effect of extraction features
4.5.2.1 Using stopword list
4.5.2.2 Segmentation and Part of speech tagging
4.5.2.3 Bigram
4.5.3 Effect of features size
Conclusion and Future Works
A
B
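
The abstract and the N-gram model and Bigram entries in the contents refer to unigram and bigram features. As a minimal sketch of what such features look like, and not the thesis's own implementation, the following Python snippet counts unigrams and, optionally, bigrams in a review. The function names `ngrams` and `extract_features`, the lowercasing, and the whitespace tokenization are assumptions made for illustration; Vietnamese text would normally go through a word segmenter first, as the Word Segmentation entry suggests.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(text, use_bigrams=False):
    """Map a review to n-gram counts (hypothetical helper, for illustration).

    Assumption: tokens are separated by whitespace. A real system for
    Vietnamese would segment words before this step.
    """
    tokens = text.lower().split()
    features = Counter(ngrams(tokens, 1))      # unigram counts
    if use_bigrams:
        features.update(ngrams(tokens, 2))     # add bigram counts
    return features


if __name__ == "__main__":
    review = "Great phone great price"
    print(extract_features(review))
    print(extract_features(review, use_bigrams=True))
```

With `use_bigrams=False` each review is represented by word counts only, which corresponds to the unigram setting that the abstract reports as performing best.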