Table of Contents
1.1 Introduction
1.2 What might be involved?
1.3 Our approach
1.4 Related works
1.4.1 Sentiment classification
1.4.1.1 Sentiment classification tasks
1.4.1.2 Sentiment classification features
1.4.1.3 Sentiment classification techniques
1.4.1.4 Sentiment classification domains
1.4.2 Cross-domain text classification
2 Background
2.1 Sentiment Analysis
2.1.1 Applications
2.2 Support Vector Machines
2.3 Semi-supervised techniques
2.3.1 Generate maximum-likelihood models
2.3.2 Co-training and bootstrapping
2.3.3 Transductive SVM
3 The semi-supervised model for cross-lingual approach
3.1 The semi-supervised model
3.2 Review Translation
3.3 Features
3.3.1 Words Segmentation
3.3.2 Part of Speech Tagging
3.3.3 N-gram model
4.1 Experimental set up
4.2 Data sets
4.3 Evaluation metric
4.4 Features
4.5 Results
4.5.1 Effect of cross-lingual corpus
4.5.2 Effect of extraction features
4.5.2.1 Using stopword list
4.5.2.2 Segmentation and Part of speech tagging
4.5.2.3 Bigram
4.5.3 Effect of features size
List of Figures
1.1 An application of sentiment classification
2.1 Visualization of opinion summary and comparison
2.2 Hyperplanes separate data points
3.1 Semi-supervised model with cross-lingual corpus
4.1 The effects of feature size
4.2 The effects of training size
List of Tables
3.1 An example of Vietnamese Words Segmentation
3.2 An example of Vietnamese Words Segmentation
3.3 An example of Unigrams and Bigrams
4.1 Tools and Application in Usage
4.2 The effect of cross-lingual corpus
4.3 The effect of selection features
A.1 Vietnamese Stopwords List by (Dan, 1987)
B.1 POS List by (VLSP, 2009)
B.2 subPos list by (VLSP, 2009)
of experiences and opinions. This development makes it possible to discover the biases and recommendations of a vast pool of people with whom we have no acquaintance.
On such social websites, users comment on the subject under discussion. Blogs are one example: each entry or posted article is a subject, and friends give their opinions on it, whether they agree or disagree. Another example is a commercial website where products are purchased online. Each product is a subject on which consumers leave comments about their experience after acquiring and using it. There are plenty of instances of opinions being created in online documents in this way. However, given the very large amount of such information available on the Internet, it should be organized so that the best use can be made of it. As part of the effort to better exploit this information to support users, researchers have been actively investigating the problem of automatic sentiment classification.
Sentiment classification is a type of text categorization which labels a posted comment as belonging to the positive or the negative class. In some cases it also includes a neutral class. In this work we focus only on the positive and negative classes. In fact, labeling posted comments with the consumers' sentiment provides succinct summaries to readers. Sentiment classification has many important applications in business and intelligence (Pang & Lee, 2008), so the problem deserves investigation.
Vietnam is no exception: there are more and more Vietnamese social websites and online commercial sites, and they attract growing interest from young people. Facebook¹ is a social network that now has about 10 million users. YouTube² is also a famous website supplying clips that users watch and comment on. Nevertheless, sentiment classification of Vietnamese data has received little attention, so we investigate it as the work of this thesis.
We consider one application for merchant sites. A popular product may receive hundreds of consumer reviews. This makes it very hard for potential customers to read them all when deciding whether to buy the product. To support customers, systems that summarize product reviews have been built. For example, assume that we summarize the reviews of a particular digital camera, the Canon 8.1, as in Figure 1.1.
Canon 8.1:
Aspect: picture quality
- Positive: <individual review sentences>
- Negative: <individual review sentences>
Aspect: size
- Positive: <individual review sentences>
- Negative: <individual review sentences>
Figure 1.1: An application of sentiment classification
Picture quality and size are aspects of the product. Such summarization systems involve a number of tasks, among which sentiment classification is a crucial step.
1 http://www.facebook.com
2 http://www.youtube.com
1.2 What might be involved?
As mentioned in the previous section, sentiment classification is a specific kind of text classification in machine learning. It commonly has two classes: positive and negative. Consequently, many machine learning techniques can be applied to sentiment classification. Text categorization is generally topic-based, where each word receives a topic distribution, whereas in sentiment classification consumers express their bias through sentiment words. This difference should be examined and taken into account to obtain better performance.
On the other hand, annotated Vietnamese data is limited, which makes supervised learning challenging. In previous research on Vietnamese text classification, the learning phase employed a training set of approximately 8000 documents (Linh, 2006). Because annotation requires expert work and is expensive and labor-intensive, Vietnamese sentiment classification is even more challenging.
1.3 Our approach
To date, a variety of corpus-based methods have been developed for sentiment classification. These methods usually rely heavily on an annotated corpus for training the sentiment classifier. Sentiment corpora are considered the most valuable resources for the sentiment classification task. However, such resources are very unevenly distributed across languages. Because most previous work studies English sentiment classification, many annotated corpora for English sentiment classification are freely available on the Internet. To address the challenge of the limited Vietnamese corpus, we propose to leverage rich English corpora for Vietnamese sentiment classification. In this thesis, we examine the effects of cross-lingual sentiment classification, which leverages only English training data to learn a classifier without using any Vietnamese resources. To achieve better performance, we employ semi-supervised learning, in which we utilize 960 annotated Vietnamese reviews. We also examine the effect of feature selection in Vietnamese sentiment classification by applying natural language processing techniques. Although we study Vietnamese, this approach can be applied to many other languages.
1.4 Related works
1.4.1 Sentiment classification
1.4.1.1 Sentiment classification tasks
Sentiment categorization can be conducted at the document, sentence or phrase (part of sentence) level. Document-level categorization attempts to classify sentiments in movie reviews, product reviews, news articles, or Web forum posts (Pang et al., 2002)(Hu & Liu, 2004b)(Pang & Lee, 2004). Sentence-level categorization classifies positive or negative sentiments for each sentence (Pang & Lee, 2004)(Mullen & Collier, 2004). Work on phrase-level categorization captures multiple sentiments that may be present within a single sentence. In this study we focus on document-level sentiment categorization.
1.4.1.2 Sentiment classification features
The types of features used in previous sentiment classification work include syntactic, semantic, link-based and stylistic features. Along with semantic features, syntactic properties are the most commonly used feature set for sentiment classification. These include word n-grams (Pang et al., 2002)(Gamon et al., 2005) and part-of-speech tags (Pang et al., 2002).
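As an illustration of word n-gram features, the following minimal sketch extracts unigrams and bigrams from a tokenized sentence. The function name `ngrams` and the sample sentence are assumptions made for this example, not part of the cited work.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Illustrative sentence only; any tokenized review would do.
tokens = "the price is suitable".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
```

Each n-gram can then serve as one dimension of the feature vector given to the classifier.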
Semantic features integrate manual or semi-automatic annotation to add polarity or scores to words and phrases. Turney (Turney, 2002) used a mutual information calculation to automatically compute the semantic orientation (SO) score of each word and phrase, while Hu and Liu (Hu & Liu, 2004b)(Hu & Liu, 2004a) made use of the synonyms and antonyms in WordNet to identify sentiment.
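For orientation, the SO score of Turney (2002) is usually stated in terms of pointwise mutual information (PMI) with two reference words. The formulation below is the commonly quoted one and is given as a sketch; the reference words "excellent" and "poor" come from that formulation, not from this thesis.

```latex
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}
\qquad
\mathrm{SO}(\mathit{phrase}) =
  \mathrm{PMI}(\mathit{phrase}, \text{``excellent''})
  - \mathrm{PMI}(\mathit{phrase}, \text{``poor''})
```

A phrase with a positive SO score co-occurs more strongly with the positive reference word than with the negative one.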
1.4.1.3 Sentiment classification techniques
Previously used techniques for sentiment classification can be classified into three categories: machine learning, link analysis methods, and score-based approaches.
Many studies used machine learning algorithms such as support vector machines (SVM) (Pang et al., 2002)(Wan, 2009)(Efron, 2004) and Naive Bayes (NB) (Pang et al., 2002)(Pang & Lee, 2004). SVM has outperformed other machine learning techniques such as NB and Maximum Entropy (Pang et al., 2002).
Link analysis methods for sentiment classification are grounded on link-based features and metrics. Efron (Efron, 2004) used co-citation analysis for sentiment classification of Website opinions.
Score-based methods are typically used in conjunction with semantic features. These techniques classify review sentiment by summing the positive and negative sentiment features that a review contains (Turney & Littman, 2002).
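The score-based idea can be sketched in a few lines. The lexicon and its scores below are invented for illustration, and the tie-breaking rule (a total of zero counts as positive) is an arbitrary assumption.

```python
# Hypothetical polarity lexicon; the scores are invented for illustration.
LEXICON = {"great": 1.0, "good": 1.0, "interesting": 0.5,
           "terrible": -1.0, "worst": -1.0}


def score_review(text, lexicon=LEXICON):
    """Sum the polarity scores of the known sentiment words in a review."""
    tokens = text.lower().split()
    return sum(lexicon.get(tok.strip(".,!?"), 0.0) for tok in tokens)


def classify(text, lexicon=LEXICON):
    """Label a review positive when its total score is non-negative."""
    return "positive" if score_review(text, lexicon) >= 0 else "negative"
```

For example, `classify("Great camera, good price!")` sums two positive words, while `classify("The worst screen, terrible battery.")` sums two negative ones.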
1.4.1.4 Sentiment classification domains
Sentiment classification has been applied to numerous domains, including reviews and Web discussion groups. Reviews include movie, product and music reviews (Pang et al., 2002)(Hu & Liu, 2004b)(Wan, 2008). Web discussion groups include Web forums, newsgroups and blogs.
In this thesis, we investigate sentiment classification using semantic features in comparison to syntactic features. Because SVM outperforms the other algorithms, we apply a machine learning technique with an SVM classifier. We study product reviews, for which corpora are available on the Internet.
1.4.2 Cross-domain text classification
Cross-domain text classification can be considered a more general task than cross-lingual sentiment classification. In cross-domain text classification, the labeled and unlabeled data originate from different domains. Similarly, in cross-lingual sentiment classification, the labeled data come from one language and the unlabeled data come from another.
In particular, several previous studies focus on the problem of cross-lingual text classification, which can be considered a special case of general cross-domain text classification. A few novel models have been proposed for this problem, for example the information bottleneck approach, multilingual domain models, and the co-training algorithm.
2 Background
2.1 Sentiment Analysis
a little distinguish.
Sentiment analysis comprises several actively researched tasks, of which sentiment classification is a major one. This task treats opinion mining as a text classification problem: it classifies an evaluative text as positive or negative. For example, given a product review, the system determines whether the review expresses a positive or a negative sentiment of the reviewer.
Given a set of evaluative texts D, a sentiment classifier categorizes each document d ∈ D into one of two classes, positive and negative. Positive means that d expresses a positive opinion; negative means that d expresses a negative opinion.
2.1.1 Applications
Opinions are so important that whenever one needs to make a decision, one wants to hear others' opinions. This is true for both individuals and organizations. The technology of opinion mining thus has tremendous scope for practical applications.
Individual consumers: If an individual wants to purchase a product, it is useful to see a summary of the opinions of existing users so that he/she can make an informed decision. This is better than reading a large number of reviews to form a mental picture of the strengths and weaknesses of the product. He/she can also compare the summaries of opinions of competing products, which is even more useful. Figure 2.1 shows an example.
Organizations and businesses: Opinion mining is equally, if not even more, important to businesses and organizations. For example, it is critical for a product manufacturer to know how consumers perceive its products and those of its competitors. This information is useful not only for marketing and product benchmarking but also for product design and product development.
The major application of sentiment classification is to give a quick view of the prevailing opinion on an object so that people can easily see “what others think”. The task is similar to, but different from, classic topic-based text classification, which classifies documents into predefined topic classes, e.g., politics, sport, education, science, etc. In topic-based classification, topic-related words are important. In sentiment classification, however, topic-related words are unimportant. Instead, sentiment words that indicate positive or negative opinions are important, e.g., great, interesting, good, terrible, worst, etc.
2.2 Support Vector Machines
The SVM algorithm was first developed in 1963 by Vapnik and Lerner. However, SVM only started to attract attention in 1995 with the appearance of Vapnik’s book “The Nature of Statistical Learning Theory”. Among the many learning algorithms applied to text classification, SVM has performed successfully. In text classification, suppose some given data points each belong to one of two classes; the classification task is to decide which class a new data point belongs to. For a support vector machine, each data point is viewed as a p-dimensional vector, and the goal becomes finding a (p − 1)-dimensional hyperplane that separates such points. This hyperplane is the classifier, also called a linear classifier. Obviously, there are many hyperplanes that separate the data. However, what we want is the maximum separation between the two classes. We therefore choose the hyperplane so that its distance to the nearest data point on each side is maximized.
Figure 2.1: Visualization of opinion summary and comparison
Figure 2.2: Hyperplanes separate data points
We are given a set of points $D = \{(\vec{x}_i, y_i) \mid \vec{x}_i \in \mathbb{R}^p,\ y_i \in \{-1, 1\}\}_{i=1}^{n}$, where $y_i$ is either 1 or −1, indicating the class to which the point $\vec{x}_i$ belongs. We represent by a vector $\vec{w}$ the hyperplane that not only separates the data vectors in one class from those in the other, but for which the separation, or margin, is as large as possible. Searching for such a hyperplane corresponds to a constrained optimization problem. The solution can be written as
$$\vec{w} = \sum_{j} \alpha_j c_j \vec{x}_j, \quad \alpha_j \ge 0,$$
where $c_j \in \{-1, 1\}$ is the class label of $\vec{x}_j$ (that is, $c_j = y_j$) and the $\alpha_j$ are obtained by solving a dual optimization problem. The $\vec{x}_j$ for which $\alpha_j$ is greater than zero are called support vectors, since they are the only data vectors contributing to $\vec{w}$. Classifying a new instance consists simply of determining which side of the hyperplane defined by $\vec{w}$ it falls on.
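The expansion of the weight vector over the support vectors, and the resulting decision rule, can be sketched as follows. This is a minimal illustration, not a solver: the multipliers are supplied by hand for a two-point toy problem, and the bias term `b` is an assumption, since the text above does not mention one.

```python
def weight_vector(alphas, labels, vectors):
    """Compute w = sum_j alpha_j * c_j * x_j over the training vectors."""
    dim = len(vectors[0])
    w = [0.0] * dim
    for alpha, c, x in zip(alphas, labels, vectors):
        for k in range(dim):
            w[k] += alpha * c * x[k]
    return w


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane w.x + b = 0."""
    activation = sum(wk * xk for wk, xk in zip(w, x)) + b
    return 1 if activation >= 0 else -1


# Toy 2-D problem: (1, 1) has label +1 and (-1, -1) has label -1.
# Both points are support vectors of the maximum-margin separator.
vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]   # dual solution worked out by hand for this toy case
w = weight_vector(alphas, labels, vectors)
b = 0.0                 # assumed bias; zero by symmetry in this toy case
```

With these values both training points lie exactly on the margin, that is, `c * (w.x + b)` equals 1 for each of them.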
The above formulation is the primal form. Writing the classification rule in its unconstrained dual form reveals that the maximum margin hyperplane and there [...]
$$\sum_{j=1}^{n} \alpha_j c_j = 0$$
There are extensions to the linear SVM, namely soft-margin and non-linear classification. We do not describe them in detail in this thesis; see (Vapnik, 1998).
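For reference, the textbook hard-margin dual problem, which ends in the equality constraint above, can be written as follows. This is the standard form, with multipliers $\alpha$, labels $c$ and data vectors $\vec{x}$ as used above; it is assumed, not confirmed, to match the form intended here.

```latex
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j c_i c_j \, \vec{x}_i^{\top} \vec{x}_j
\quad \text{subject to} \quad
\alpha_j \ge 0 \ \ (j = 1, \dots, n),
\qquad \sum_{j=1}^{n} \alpha_j c_j = 0
```

The training vectors enter only through the inner products $\vec{x}_i^{\top} \vec{x}_j$, which is what makes the non-linear (kernel) extension possible.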
2.3 Semi-supervised techniques
2.3.1 Generate maximum-likelihood models
Since the early research on semi-supervised learning, the Expectation Maximization (EM) algorithm has been studied for several Natural Language Processing (NLP) tasks. EM has also been successful in text classification (Nigam et al., 2000). EM is an iterative method which alternates between an expectation (E) step and a maximization (M) step. The goal is to find maximum likelihood estimates of the parameters of a probabilistic model. One problem with this approach and other generative models is that it is difficult to incorporate arbitrary, interdependent features that may be useful for solving the task.
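The alternation between the two steps can be sketched with a small naive Bayes text classifier trained on labeled and unlabeled documents. This is a simplified illustration under stated assumptions: the class names, the Laplace smoothing and the fixed iteration count are choices made for this example, and every token passed to `e_step` must occur in the training vocabulary.

```python
import math
from collections import Counter

CLASSES = ("pos", "neg")


def m_step(docs, weights, vocab):
    """Estimate class priors and smoothed word probabilities from soft labels."""
    prior, cond = {}, {}
    total = sum(sum(w.values()) for w in weights)
    for c in CLASSES:
        prior[c] = sum(w[c] for w in weights) / total
        counts = Counter()
        for doc, w in zip(docs, weights):
            for tok in doc:
                counts[tok] += w[c]
        denom = sum(counts.values()) + len(vocab)   # Laplace smoothing
        cond[c] = {t: (counts[t] + 1.0) / denom for t in vocab}
    return prior, cond


def e_step(doc, prior, cond):
    """Return P(class | doc) under the current naive Bayes parameters."""
    logp = {c: math.log(prior[c]) + sum(math.log(cond[c][t]) for t in doc)
            for c in CLASSES}
    top = max(logp.values())
    expd = {c: math.exp(v - top) for c, v in logp.items()}
    z = sum(expd.values())
    return {c: v / z for c, v in expd.items()}


def semi_supervised_em(labeled, unlabeled, iterations=10):
    """Fit naive Bayes on labeled docs, then refine it with unlabeled docs."""
    labeled_docs = [d for d, _ in labeled]
    docs = labeled_docs + list(unlabeled)
    vocab = {t for d in docs for t in d}
    hard = [{c: 1.0 if c == y else 0.0 for c in CLASSES} for _, y in labeled]
    prior, cond = m_step(labeled_docs, hard, vocab)          # initialisation
    for _ in range(iterations):
        soft = [e_step(d, prior, cond) for d in unlabeled]   # E-step
        prior, cond = m_step(docs, hard + soft, vocab)       # M-step
    return prior, cond


# Made-up toy data: "nice" and "awful" never appear in a labeled document.
labeled = [(["good", "great"], "pos"), (["bad", "poor"], "neg")]
unlabeled = [["good", "nice"], ["bad", "awful"]]
prior, cond = semi_supervised_em(labeled, unlabeled)
```

In this toy run the word "nice" only occurs in an unlabeled document, yet it acquires a positive leaning because it co-occurs with "good".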
2.3.2 Co-training and bootstrapping
A number of semi-supervised approaches are grounded on the co-training framework (Blum & Mitchell, 1998), which assumes that each document in the input domain can be separated into two views that are independent conditioned on the output class. This assumption is an important aspect to take into account when applying the method. In fact, the co-training algorithm is a typical bootstrapping method, which starts with a set of labeled data and incrementally increases the amount of annotated data using some amount of unlabeled data. Co-training has been successfully applied to named-entity classification, statistical parsing, part-of-speech tagging and sentiment classification.
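The incremental growth of the labeled set can be sketched with a single-view bootstrapping (self-training) loop. Note that this is a simplification and not the two-view co-training of Blum & Mitchell: the one-dimensional nearest-centroid learner and all names below are assumptions chosen to keep the example short.

```python
def bootstrap(labeled, unlabeled, train, predict, per_round=1, rounds=10):
    """Grow the labeled set by repeatedly adding the most confident predictions.

    train(labeled) -> model; predict(model, x) -> (label, confidence).
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        model = train(labeled)
        ranked = sorted(pool, key=lambda x: predict(model, x)[1], reverse=True)
        for x in ranked[:per_round]:
            labeled.append((x, predict(model, x)[0]))
            pool.remove(x)
    return train(labeled), labeled


def train_centroids(labeled):
    """Toy learner: one centroid per class over 1-D inputs."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_centroid(model, x):
    """Predict the nearest centroid; confidence is the gap to the farthest one."""
    ranked = sorted(model, key=lambda y: abs(x - model[y]))
    best, other = ranked[0], ranked[-1]
    return best, abs(x - model[other]) - abs(x - model[best])


seed = [(0.0, "neg"), (10.0, "pos")]
model, grown = bootstrap(seed, [1.0, 9.0, 4.0], train_centroids, predict_centroid)
```

The ambiguous point 4.0 is labeled last, after the easier points have already shifted the centroids.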
2.3.3 Transductive SVM
[...] solving the combinatorial optimization problem OP. For a small number of test examples, this problem can be solved simply by trying all possible assignments of $y_1^*, \ldots, y_k^*$ to the two classes. However, when the amount of test data is large, we can only find an approximate solution to the optimization problem OP using a form of local search.
The key idea of the algorithm is that it begins with a labeling of the examples in the set U based on the classification of an inductive SVM. It then improves the solution by switching the labels of test examples that are misclassified. After that, the algorithm retrains the model, taking the labeled data in L and the set U as input. The loop stops after a finite number of iterations, since $C_-^*$ and $C_+^*$ are bounded by $C^*$. In each iteration, the algorithm relabels a pair of misclassified examples, so the number of such pairs gives the number of iterations.
TSVM has been successful for text classification (Joachims, 1998)(Pang et al., 2002), which is why we employ this semi-supervised algorithm.
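The label-switching step of the local search can be sketched as below. The test applied to a pair of test examples (opposite current labels, both slack values positive, and a combined slack above 2) is the condition as we understand it from Joachims (1999); it should be read as an assumption. Retraining after each switch needs an SVM solver and is left out.

```python
def find_switchable_pair(labels, slacks):
    """Return indices (m, l) of two test examples whose labels should be
    swapped, or None when no pair qualifies.

    labels: current +1/-1 labels of the unlabeled examples.
    slacks: their slack values under the current model.
    """
    for m in range(len(labels)):
        for l in range(m + 1, len(labels)):
            if (labels[m] * labels[l] < 0
                    and slacks[m] > 0 and slacks[l] > 0
                    and slacks[m] + slacks[l] > 2):
                return m, l
    return None


def switch_labels(labels, pair):
    """Return a copy of labels with the two selected labels flipped."""
    m, l = pair
    swapped = list(labels)
    swapped[m], swapped[l] = -swapped[m], -swapped[l]
    return swapped


# Made-up values for illustration.
labels = [1, -1, 1]
slacks = [1.5, 0.8, 0.0]
pair = find_switchable_pair(labels, slacks)
new_labels = switch_labels(labels, pair)
```

Because one positive and one negative label are flipped together, the ratio of positive to negative test labels is preserved by each switch.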
3 The semi-supervised model for cross-lingual approach
3.1 The semi-supervised model
The amount of labeled Vietnamese reviews available online is limited, while rich annotated English corpora for sentiment polarity identification have been built and are publicly accessible. Is there any way to leverage the annotated English corpora? The purpose of our approach is to make use of labeled English reviews without any Vietnamese resources. Supposing we have labeled English reviews, there are two straightforward solutions to the problem:
1. We first train an English classifier on the labeled English reviews. Then, we translate a new Vietnamese review into English and use the classifier to label the translated review.
2. We first translate the labeled English reviews into Vietnamese and learn a classifier on the translated labeled reviews. Then, we label a new Vietnamese review with that classifier.
As analyzed in Chapter 2, sentiment classification can be treated as a text classification problem, which can be learned with many machine learning techniques. In machine learning, supervised, semi-supervised and unsupervised learning have all been widely applied in real applications with good performance. Supervised learning requires a completely annotated training set of reviews, which is time-consuming and expensive to produce. Unsupervised learning does not employ any labeled training reviews. Semi-supervised learning employs both labeled and unlabeled reviews in the training phase. Many studies (Blum & Mitchell, 1998)(Joachims, 1999)(Nigam et al., 2000) have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy.
The idea of applying semi-supervised learning was used in (Wan, 2009) for Chinese sentiment classification. Wan (Wan, 2009) employs co-training by considering English features and Chinese features as two independent views. One important aspect of co-training is that it requires two conditionally independent views to work. From observing the data, we found that English features and Vietnamese features are not really independent. Because English is widely used and Vietnamese is written in a Latin-based script, Vietnamese text includes a number of borrowed words. Moreover, because of the limitations of machine translation, some English words have no translation in the target language.
To address the above problem, we propose to use a transductive learning approach that leverages unlabeled Vietnamese reviews to improve classification performance. Transductive learning can make full use of both the English features and the Vietnamese features. The framework of the proposed approach is illustrated in Figure 3.1. It consists of a training phase and a classification phase. In the training phase, the input is the labeled English reviews and the unlabeled Vietnamese reviews. The labeled English reviews are translated into labeled Vietnamese reviews using a machine translation service. The transductive algorithm is then applied to learn a sentiment classifier from both the translated labeled Vietnamese reviews and the unlabeled Vietnamese reviews. In the classification phase, the sentiment classifier is applied to label a review as either positive or negative. For example, the following sentence:
“Màn hình máy tính này dùng được lắm, tôi mua nó được 4 năm nay”
(This computer screen is great, I bought it four years ago) will be classified into the positive class.
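The two phases can be sketched as a pair of small functions. Everything below is hypothetical scaffolding: `toy_translate` stands in for a machine translation service and `toy_fit` for the transductive learner, and the stand-in learner ignores the unlabeled reviews, which a real transductive algorithm would exploit.

```python
def train_cross_lingual(labeled_en, unlabeled_vi, translate, fit_transductive):
    """Training phase: translate labeled English reviews, then learn."""
    labeled_vi = [(translate(text), label) for text, label in labeled_en]
    return fit_transductive(labeled_vi, unlabeled_vi)


def classify_review(model, review_vi):
    """Classification phase: label one Vietnamese review."""
    return model(review_vi)


def toy_translate(text):
    """Stand-in for a machine translation service (word-by-word lookup)."""
    table = {"good": "tốt", "bad": "tệ"}
    return " ".join(table.get(word, word) for word in text.split())


def toy_fit(labeled_vi, unlabeled_vi):
    """Stand-in learner that remembers which words occurred with which label.

    A real transductive learner would also use unlabeled_vi.
    """
    pos = {w for text, y in labeled_vi if y == "positive" for w in text.split()}
    neg = {w for text, y in labeled_vi if y == "negative" for w in text.split()}

    def model(review):
        words = set(review.split())
        return "positive" if len(words & pos) >= len(words & neg) else "negative"

    return model


model = train_cross_lingual(
    [("good", "positive"), ("bad", "negative")],
    ["unlabeled review placeholder"],
    toy_translate,
    toy_fit,
)
```

Passing the translator and the learner as arguments keeps the pipeline independent of any particular translation service or SVM implementation.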
Figure 3.1: Semi-supervised model with cross-lingual corpus
3.2 Review Translation
Translation of English reviews into Vietnamese reviews is the first step of the proposed approach. Manual translation is expensive, time-consuming and labor-intensive, and it is not feasible to manually translate a large number of English product reviews in real applications. Fortunately, machine translation has made considerable progress in the NLP field, though translation quality is still far from satisfactory. Several commercial machine translation services are publicly accessible. In this study, we employ the following machine translation service and a baseline system to overcome the language gap.
Google Translate¹: Google Translate is one of the state-of-the-art commercial machine translation systems in use today. It not only performs effectively but also supports many languages. The service applies statistical learning techniques to build a translation model based on both monolingual text in the target language and aligned text consisting of examples of human translation between the languages. Yahoo Babel Fish, which uses different techniques from Google Translate, was one of the earliest machine translation services. However, Yahoo Babel Fish does not translate between Vietnamese and English.
Here are two running examples of Vietnamese reviews and their English translations. HumanTrans refers to the translation by a human being.
Positive example: “Giá cả rất phù hợp với nhiều đối tượng tiêu dùng”
HumanTrans: The price is suitable for many consumers
GoogleTrans: Price is very suitable for many consumer object
Negative example: “Chỉ phù hợp cho dân lập trình thôi”
HumanTrans: It is only suitable for programmer
GoogleTrans: Only suitable for people programming only
3.3 Features
While Western languages such as English are written with spaces that explicitly mark word boundaries, Vietnamese is written with spaces between syllables, and a word may consist of one or more syllables. Therefore white space is not always a word separator (Tu et al., 2006).
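The difference between splitting on white space and segmenting into words can be illustrated with a greedy longest-match segmenter. This is a deliberately simple stand-in, not the segmentation tool used in this thesis, and the two-entry word list is a hypothetical lexicon built from the example sentence in Section 3.1.

```python
def segment(sentence, lexicon, max_syllables=3):
    """Greedily group whitespace-separated syllables into the longest
    words found in the lexicon; multi-syllable words are joined with '_'."""
    syllables = sentence.split()
    words, i = [], 0
    while i < len(syllables):
        for size in range(min(max_syllables, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + size])
            # A single syllable is always accepted as a fallback.
            if size == 1 or candidate.lower() in lexicon:
                words.append(candidate.replace(" ", "_"))
                i += size
                break
    return words


LEXICON = {"màn hình", "máy tính"}   # tiny hypothetical word list
tokens = segment("Màn hình máy tính này", LEXICON)
```

Splitting the same sentence on white space yields five syllables, whereas the segmenter groups them into three words.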
1 http://www.translate.google.com/?hl=ensl=vitl=en