Transductive support vector machines for cross-lingual sentiment classification


Table of Contents

1.1 Introduction
1.2 What might be involved?
1.3 Our approach
1.4 Related works
1.4.1 Sentiment classification
1.4.1.1 Sentiment classification tasks
1.4.1.2 Sentiment classification features
1.4.1.3 Sentiment classification techniques
1.4.1.4 Sentiment classification domains
1.4.2 Cross-domain text classification
2 Background
2.1 Sentiment Analysis
2.1.1 Applications
2.2 Support Vector Machines
2.3 Semi-supervised techniques
2.3.1 Generate maximum-likelihood models
2.3.2 Co-training and bootstrapping
2.3.3 Transductive SVM
3 The semi-supervised model for cross-lingual approach
3.1 The semi-supervised model
3.2 Review Translation
3.3 Features
3.3.1 Words Segmentation
3.3.2 Part of Speech Tagging
3.3.3 N-gram model


4.1 Experimental set up
4.2 Data sets
4.3 Evaluation metric
4.4 Features
4.5 Results
4.5.1 Effect of cross-lingual corpus
4.5.2 Effect of extraction features
4.5.2.1 Using stopword list
4.5.2.2 Segmentation and Part of speech tagging
4.5.2.3 Bigram
4.5.3 Effect of features size


List of Figures

1.1 An application of sentiment classification
2.1 Visualization of opinion summary and comparison
2.2 Hyperplanes separate data points
3.1 Semi-supervised model with cross-lingual corpus
4.1 The effects of feature size
4.2 The effects of training size


List of Tables

3.1 An example of Vietnamese Words Segmentation
3.2 An example of Vietnamese Words Segmentation
3.3 An example of Unigrams and Bigrams
4.1 Tools and Application in Usage
4.2 The effect of cross-lingual corpus
4.3 The effect of selection features
A.1 Vietnamese Stopwords List by (Dan, 1987)
B.1 POS List by (VLSP, 2009)
B.2 subPos list by (VLSP, 2009)


of experiences and opinions. This development has made it possible to discover the biases and recommendations of a vast pool of people with whom we have no acquaintance.

On such social websites, users post comments on the subject under discussion. Blogs are one example: each entry or posted article is a subject, and friends give their opinion on it, whether they agree or disagree. Another example is a commercial website where products are purchased online. Each product is a subject on which consumers leave comments about their experience after acquiring and using the product. There are many instances of opinions being created in online documents in this way. However, with such a large amount of this information available on the Internet, it should be organized to make the best use of it. As part of the effort to better exploit this information to support users, researchers have been actively investigating the problem of automatic sentiment classification.

Sentiment classification is a type of text categorization that labels a posted comment as belonging to the positive or the negative class. In some cases it also includes a neutral class. In this work we focus only on the positive and negative classes. Labeling posted comments with the consumers' sentiment provides succinct summaries to readers. Sentiment classification has many important applications in business and intelligence (Pang & Lee, 2008), so the problem deserves investigation.

Vietnam is no exception: there are more and more Vietnamese social websites and online commercial sites, and they attract growing interest from young people. Facebook¹ is a social network that now has about 10 million users. Youtube² is also a famous website that supplies clips which users watch and comment on. Nevertheless, this data has not yet received the attention it deserves, so we investigate sentiment classification on Vietnamese data as the work of this thesis.

We consider one application for merchant sites. A popular product may receive hundreds of consumer reviews. This makes it very hard for potential customers to read them all when deciding whether to buy the product. To support customers, review summarization systems are built. For example, assume that we summarize the reviews of a particular digital camera, the Canon 8.1, as in Figure 1.1.

Canon 8.1:

Aspect: picture quality

- Positive: <individual review sentences>

- Negative: <individual review sentences>

Aspect: size

- Positive: <individual review sentences>

- Negative: <individual review sentences>

Figure 1.1: An application of sentiment classification

Picture quality and size are aspects of the product. Such summarization systems involve a number of tasks, among which sentiment classification is a crucial step.
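As a rough illustration of the summary in Figure 1.1, the sketch below groups already classified review sentences by aspect and polarity. The function names and the sample sentences are made up for illustration; the polarity label of each sentence is assumed to come from a sentiment classifier such as the one studied in this thesis.

```python
from collections import defaultdict


def summarize(labeled_sentences):
    """Group (aspect, polarity, sentence) triples by aspect, then by polarity."""
    summary = defaultdict(lambda: {"positive": [], "negative": []})
    for aspect, polarity, sentence in labeled_sentences:
        summary[aspect][polarity].append(sentence)
    return summary


def render(product, summary):
    """Lay the summary out in the style of Figure 1.1."""
    lines = [f"{product}:"]
    for aspect, groups in summary.items():
        lines.append(f"Aspect: {aspect}")
        lines.append("- Positive: " + " | ".join(groups["positive"]))
        lines.append("- Negative: " + " | ".join(groups["negative"]))
    return "\n".join(lines)


# Made-up sentences; the polarity labels stand in for classifier output.
sentences = [
    ("picture quality", "positive", "The photos are sharp."),
    ("picture quality", "negative", "Colors look washed out indoors."),
    ("size", "positive", "It fits in my pocket."),
]
summary = summarize(sentences)
print(render("Canon 8.1", summary))
```

The sketch only organizes the output; deciding the polarity of each sentence is the sentiment classification step itself.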

1 http://www.facebook.com

2 http://www.youtube.com


1.2 What might be involved?

As mentioned in the previous section, sentiment classification is a specific kind of text classification in machine learning. It usually has two classes: positive and negative. Consequently, many machine learning techniques can be used to solve sentiment classification. Text categorization is generally topic-based, where each word receives a topic distribution, whereas in sentiment classification consumers express their bias through sentiment words. This difference should be examined and taken into account to obtain better performance.

On the other hand, annotated Vietnamese data is limited, which is a challenge for supervised learning. In previous research on Vietnamese text classification, the learning phase employed a training set of approximately 8000 documents (Linh, 2006). Because annotation requires expert work and is expensive and labor-intensive, Vietnamese sentiment classification is even more challenging.

1.3 Our approach

To date, a variety of corpus-based methods have been developed for sentiment classification. These methods usually rely heavily on an annotated corpus for training the sentiment classifier. Sentiment corpora are considered the most valuable resources for the sentiment classification task. However, such resources are very imbalanced across languages. Because most previous work studies English sentiment classification, many annotated corpora for English sentiment classification are freely available on the Internet. To address the challenge of the limited Vietnamese corpus, we propose to leverage rich English corpora for Vietnamese sentiment classification. In this thesis, we examine the effects of cross-lingual sentiment classification, which leverages only English training data for learning a classifier without using any Vietnamese resources. To achieve better performance, we employ semi-supervised learning, in which we utilize 960 annotated Vietnamese reviews. We also examine the effect of feature selection in Vietnamese sentiment classification by applying natural language processing techniques. Although we study the Vietnamese domain, this approach can be applied to many other languages.

1.4 Related works

1.4.1 Sentiment classification

1.4.1.1 Sentiment classification tasks

Sentiment categorization can be conducted at the document, sentence or phrase (part of sentence) level. Document-level categorization attempts to classify sentiments in movie reviews, product reviews, news articles, or Web forum posts (Pang et al., 2002)(Hu & Liu, 2004b)(Pang & Lee, 2004). Sentence-level categorization classifies positive or negative sentiments for each sentence (Pang & Lee, 2004)(Mullen & Collier, 2004). Work on phrase-level categorization captures multiple sentiments that may be present within a single sentence. In this study we focus on document-level sentiment categorization.

1.4.1.2 Sentiment classification features

The types of features used in previous sentiment classification work include syntactic, semantic, link-based and stylistic features. Along with semantic features, syntactic properties are the most commonly used feature sets for sentiment classification. These include word n-grams (Pang et al., 2002)(Gamon et al., 2005) and part-of-speech tags (Pang et al., 2002).

Semantic features integrate manual or semi-automatic annotation to add polarity or scores to words and phrases. Turney (2002) used a mutual information calculation to automatically compute the semantic orientation (SO) score for each word and phrase, while Hu and Liu (Hu & Liu, 2004b)(Hu & Liu, 2004a) made use of the synonyms and antonyms in WordNet to identify sentiment.

1.4.1.3 Sentiment classification techniques

Previously used techniques for sentiment classification can be grouped into three categories: machine learning, link analysis methods, and score-based approaches.

Many studies used machine learning algorithms such as support vector machines (SVM) (Pang et al., 2002)(Wan, 2009)(Efron, 2004) and Naive Bayes (NB) (Pang et al., 2002)(Pang & Lee, 2004). SVM has outperformed other machine learning techniques such as NB or Maximum Entropy (Pang et al., 2002).


Link analysis methods for sentiment classification are grounded on link-based features and metrics. Efron (2004) used co-citation analysis for sentiment classification of Website opinions.

Score-based methods are typically used in conjunction with semantic features. These techniques classify the sentiment of a review by the total sum of the positive or negative sentiment features it comprises (Turney & Littman, 2002).
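A minimal sketch of the score-based idea follows, assuming a tiny hand-made lexicon. It is not the method of (Turney & Littman, 2002); it only illustrates summing positive and negative evidence. The word lists, the function names and the rule that a tie counts as positive are all assumptions made for this example.

```python
import re

# Hypothetical toy lexicon; a real system would use a much larger resource.
POSITIVE_WORDS = {"great", "interesting", "good"}
NEGATIVE_WORDS = {"terrible", "worst"}


def sentiment_score(review):
    """Count positive words minus negative words in the review."""
    tokens = re.findall(r"[a-z]+", review.lower())
    positive = sum(token in POSITIVE_WORDS for token in tokens)
    negative = sum(token in NEGATIVE_WORDS for token in tokens)
    return positive - negative


def classify_review(review):
    """Label a review by the sign of its score (ties count as positive)."""
    return "positive" if sentiment_score(review) >= 0 else "negative"


print(classify_review("The screen is great and the price is good."))
print(classify_review("Terrible battery, the worst purchase."))
```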

1.4.1.4 Sentiment classification domains

Sentiment classification has been applied to numerous domains, including reviews, Web discussion groups, etc. Reviews include movie, product and music reviews (Pang et al., 2002)(Hu & Liu, 2004b)(Wan, 2008). Web discussion groups include Web forums, newsgroups and blogs.

In this thesis, we investigate sentiment classification using semantic features in comparison to syntactic features. Because the SVM algorithm outperforms the alternatives, we apply a machine learning technique with an SVM classifier. We study product reviews, for which corpora are available on the Internet.

1.4.2 Cross-domain text classification

Cross-domain text classification can be considered a more general task than cross-lingual sentiment classification. In the case of cross-domain text classification, the labeled and unlabeled data originate from different domains. Similarly, in the case of cross-lingual sentiment classification, the labeled data come from one language and the unlabeled data come from another.

In particular, several previous studies focus on the problem of cross-lingual text classification, which can be considered a special case of general cross-domain text classification. A few novel models have been proposed for this problem, for example the information bottleneck approach, multilingual domain models, and the co-training algorithm.


a little distinguish.

Sentiment analysis comprises several tasks that attract much research interest, of which sentiment classification is one of the major ones. This task treats opinion mining as a text classification problem. It classifies an evaluative text as being positive or negative. For example, given a product review, the system determines whether the review expresses a positive or a negative sentiment of the reviewer.

Given a set of evaluative texts D, a sentiment classifier categorizes each document d ∈ D into one of the two classes, positive and negative. Positive means that d expresses a positive opinion. Negative means that d expresses a negative opinion.


2.1.1 Applications

Opinions are so important that whenever one needs to make a decision, one wants to hear others' opinions. This is true for both individuals and organizations. The technology of opinion mining thus has tremendous scope for practical applications.

Individual consumers: If an individual wants to purchase a product, it is useful to see a summary of the opinions of existing users so that he/she can make an informed decision. This is better than reading a large number of reviews to form a mental picture of the strengths and weaknesses of the product. He/she can also compare the summaries of opinions of competing products, which is even more useful. An example is shown in Figure 2.1.

Organizations and businesses: Opinion mining is equally, if not even more, important to businesses and organizations. For example, it is critical for a product manufacturer to know how consumers perceive its products and those of its competitors. This information is useful not only for marketing and product benchmarking but also for product design and product development.

The major application of sentiment classification is to give a quick view of the prevailing opinion on an object so that people can easily see “what others think”. The task is similar to, but different from, classic topic-based text classification, which classifies documents into predefined topic classes, e.g., politics, sport, education, science, etc. In topic-based classification, topic-related words are important. However, in sentiment classification, topic-related words are unimportant. Instead, sentiment words that indicate positive or negative opinions are important, e.g., great, interesting, good, terrible, worst, etc.

2.2 Support Vector Machines

The SVM algorithm was first developed in 1963 by Vapnik and Lerner. However, SVM only started to attract attention in 1995 with the appearance of Vapnik's book “The Nature of Statistical Learning Theory”. Among the many learning algorithms for text classification, SVM has performed successfully. In text classification, suppose some given data points each belong to one of two classes; the classification task is to decide which class a new data point belongs to. For a support vector machine, each data point is viewed as a p-dimensional vector, and the goal becomes finding a (p − 1)-dimensional hyperplane that can separate such points. This hyperplane is the classifier, also called a linear classifier. Obviously, there are many such hyperplanes separating the data. However, maximum separation between the two classes is what we desire. We therefore choose the hyperplane so that the distance from it to the nearest data point on each side is maximized.

Figure 2.1: Visualization of opinion summary and comparison

Figure 2.2: Hyperplanes separate data points

We are given a set of points $D = \{(\vec{x}_i, y_i) \mid \vec{x}_i \in \mathbb{R}^p,\ y_i \in \{-1, 1\}\}_{i=1}^{n}$, where $y_i$ is either 1 or −1, indicating the class to which the point $\vec{x}_i$ belongs. We represent the hyperplane by a vector $\vec{w}$; it should not only separate the data vectors in one class from those in the other, but also make the separation, or margin, as large as possible. Searching for such a hyperplane corresponds to a constrained optimization problem. The solution can be written as

$$\vec{w} = \sum_{j} \alpha_j c_j \vec{x}_j, \quad \alpha_j \ge 0$$

where $c_j \in \{-1, 1\}$ is the class of $\vec{x}_j$ (written $y_j$ above) and the $\alpha_j$ are obtained by solving a dual optimization problem. Those $\vec{x}_j$ for which $\alpha_j$ is greater than zero are called support vectors, since they are the only data vectors contributing to $\vec{w}$. Classifying a new instance consists simply of determining which side of the hyperplane $\vec{w}$ it falls on.
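The sketch below illustrates how a weight vector of this form is assembled from the support vectors and how a new point is classified by the side of the hyperplane it falls on. The coefficients are chosen by hand for a two-point toy problem rather than obtained from a solver, and the bias term b is an assumption of this example (it does not appear in the formula above; it is zero here because the toy hyperplane passes through the origin).

```python
def weight_vector(alphas, classes, vectors):
    """Compute w = sum_j alpha_j * c_j * x_j over all training vectors."""
    w = [0.0] * len(vectors[0])
    for alpha, c, x in zip(alphas, classes, vectors):
        if alpha > 0:  # only support vectors contribute to w
            for k, x_k in enumerate(x):
                w[k] += alpha * c * x_k
    return w


def classify_point(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane x lies on."""
    activation = sum(w_k * x_k for w_k, x_k in zip(w, x)) + b
    return 1 if activation >= 0 else -1


# Toy data: the third point lies far from the boundary, so its alpha is 0.
vectors = [(1.0, 1.0), (-1.0, -1.0), (3.0, 2.0)]
classes = [1, -1, 1]
alphas = [0.25, 0.25, 0.0]  # hand-picked for this toy problem

w = weight_vector(alphas, classes, vectors)
print(w, classify_point(w, 0.0, (2.0, 3.0)), classify_point(w, 0.0, (-2.0, -0.5)))
```

With these values the two support vectors sit exactly on the margin, since $c_j(\vec{w} \cdot \vec{x}_j) = 1$ for both of them.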

The above formulation is the primal form. Writing the classification rule in its unconstrained dual form reveals that the maximum margin hyperplane and there

$$\sum_{i=1}^{n} \alpha_i c_i = 0$$

There are extensions to the linear SVM, namely soft margin and non-linear classification. We do not describe them in detail in this thesis; see (Vapnik, 1998).

2.3 Semi-supervised techniques

2.3.1 Generate maximum-likelihood models

Since the early research in semi-supervised learning, the Expectation Maximization (EM) algorithm has been studied for some Natural Language Processing (NLP) tasks. EM has also been successful in text classification (Nigram et al., 2000). EM is an iterative method which alternates between performing an expectation (E) step and a maximization (M) step. The goal is to find maximum likelihood estimates of parameters in probabilistic models. One problem with this approach and other generative models is that it is difficult to incorporate arbitrary, interdependent features that may be useful for solving the task.

2.3.2 Co-training and bootstrapping

A number of semi-supervised approaches are grounded on the co-training framework (Blum & Mitchell, 1998), which assumes that each document in the input domain can be separated into two views that are independent conditioned on the output class. This assumption is an important aspect to take into account when we want to apply the method. In fact, the co-training algorithm is a typical bootstrapping method, which starts with a set of labeled data and increases the amount of annotated data using some amount of unlabeled data in an incremental way. Co-training has been successfully applied to named-entity classification, statistical parsing, part-of-speech tagging and sentiment classification.


solving the combinatorial optimization problem OP. For a small number of test examples, this problem can be solved simply by trying all possible assignments of $y_1^*, \ldots, y_k^*$ to the two classes. However, when the amount of test data is large, we can only find an approximate solution to the optimization problem OP using a form of local search. The key idea of the algorithm is that it begins with a labeling of the examples in the set U based on the classification of an inductive SVM. It then improves the solution by switching the labels of test examples that are misclassified. After that, the algorithm retrains the model, taking the labeled data in the L and U sets as input. The loop stops after a finite number of iterations, since $C_-^*$ and $C_+^*$ are bounded by $C^*$. In each iteration, the algorithm relabels two misclassified examples. The number of iterations equals the number of wrongly labeled pairs.

TSVM has been successful for text classification (Joachims, 1998)(Pang et al., 2002). That is the reason we employ this semi-supervised algorithm.
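The label-switching step can be sketched as follows. The switching test used here (one example currently labeled positive and one labeled negative, both with positive slack, and slacks summing to more than 2) is our reading of the criterion in (Joachims, 1999) and is an assumption, since the text above does not state it. The slack values would come from retraining the SVM, which is not shown; the function names are hypothetical.

```python
def find_switch_pair(labels, slacks):
    """Return indices (m, l) of a positive and a negative test example
    whose labels should be switched, or None if no such pair exists."""
    for m, (y_m, xi_m) in enumerate(zip(labels, slacks)):
        for l, (y_l, xi_l) in enumerate(zip(labels, slacks)):
            if y_m > 0 and y_l < 0 and xi_m > 0 and xi_l > 0 and xi_m + xi_l > 2:
                return m, l
    return None


def switch_labels(labels, pair):
    """Flip the labels of the two examples in the pair."""
    m, l = pair
    switched = list(labels)
    switched[m], switched[l] = -switched[m], -switched[l]
    return switched


# Hand-picked slack values standing in for the output of a trained SVM.
labels = [1, -1, 1]
slacks = [1.5, 0.8, 0.0]
pair = find_switch_pair(labels, slacks)
if pair is not None:
    labels = switch_labels(labels, pair)
print(pair, labels)
```

Switching one positive and one negative label at a time keeps the number of test examples assigned to each class unchanged.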


3.1 The semi-supervised model

The amount of labeled Vietnamese reviews available online is limited, while rich annotated English corpora for sentiment polarity identification have been built and are publicly accessible. Is there any way to leverage the annotated English corpus? The purpose of our approach is to make use of the labeled English reviews without any Vietnamese resources. Suppose we have labeled English reviews; there are two straightforward solutions to the problem:

1. We first train an English classifier on the labeled English reviews. Then we use the classifier to label new reviews that have been translated into English.

2. We first learn a classifier on the labeled reviews translated into Vietnamese. Then we label a new Vietnamese review with this classifier.

As analyzed in Chapter 2, sentiment classification can be treated as a text classification problem, which can be learned with many machine learning techniques. In machine learning, supervised learning, semi-supervised learning and unsupervised learning have all been widely applied in real applications with good performance. Supervised learning requires a completely annotated set of training reviews, which is time-consuming and expensive to produce. Training based on unsupervised learning does not employ any labeled training reviews. Semi-supervised learning employs both labeled and unlabeled reviews in the training phase. Many studies (Blum & Mitchell, 1998)(Joachims, 1999)(Nigram et al., 2000) have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy.

The idea of applying semi-supervised learning was used in (Wan, 2009) for Chinese sentiment classification. Wan (2009) employs co-training by considering English features and Chinese features as two independent views. One important aspect of co-training is that two conditionally independent views are required for it to work. From observing the data, we found that English features and Vietnamese features are not really independent. Because English is widely used and Vietnamese is written in a Latin-based script, the Vietnamese language includes a number of borrowed words. Moreover, because of the limitations of machine translation, some English words may have no translation in the target language.

To address the above problem, we propose to use the transductive learning approach, leveraging unlabeled Vietnamese reviews to improve classification performance. Transductive learning can make full use of both the English features and the Vietnamese features. The framework of the proposed approach is illustrated in Figure 3.1. The framework consists of a training phase and a classification phase. In the training phase, the input is the labeled English reviews and the unlabeled Vietnamese reviews. The labeled English reviews are translated into labeled Vietnamese reviews using machine translation services. The transductive algorithm is then applied to learn a sentiment classifier based on both the translated labeled Vietnamese reviews and the unlabeled Vietnamese reviews. In the classification phase, the sentiment classifier is applied to classify a review as either positive or negative. For example, the following sentence:

“Màn hình máy tính này dùng được lắm, tôi mua nó được 4 năm nay”

(This computer screen is great, I bought it four years ago) will be classified into the positive class.


Figure 3.1: Semi-supervised model with cross-lingual corpus


3.2 Review Translation

Translating English reviews into Vietnamese reviews is the first step of the proposed approach. Manual translation is expensive, time-consuming and labor-intensive, and it is not feasible to manually translate a large number of English product reviews in real applications. Fortunately, machine translation has made considerable progress in the NLP field, though translation quality is still far from satisfactory. Several commercial machine translation services are publicly accessible. In this study, we employ the following machine translation service and a baseline system to overcome the language gap.

Google Translate¹: Google Translate is one of the state-of-the-art commercial machine translation systems in use today. Google Translate not only performs effectively but also supports many languages. This service applies statistical learning techniques to build a translation model based on both monolingual text in the target language and aligned text consisting of examples of human translation between the languages. Yahoo Babel Fish, which uses different techniques from Google Translate, was one of the earliest machine translation services. However, Yahoo Babel Fish does not translate between Vietnamese and English.

Here are two running examples of Vietnamese reviews and their English translations. HumanTrans refers to the translation by a human.

Positive example: “Giá cả rất phù hợp với nhiều đối tượng tiêu dùng”

HumanTrans: The price is suitable for many consumers

GoogleTrans: Price is very suitable for many consumer object

Negative example: “Chỉ phù hợp cho dân lập trình thôi”

HumanTrans: It is only suitable for programmer

GoogleTrans: Only suitable for people programming only

3.3 Features

While Western languages such as English are written with spaces that explicitly mark word boundaries, a Vietnamese word may consist of one or more syllables separated by spaces. Therefore white space is not always a word separator (Tu et al., 2006).
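To make the point concrete, the sketch below contrasts plain whitespace splitting with a greedy longest-match segmenter over a toy lexicon. This is only an illustration under stated assumptions: the lexicon, the function name and the underscore convention for joining syllables are hypothetical, and longest matching is not necessarily the segmentation method used in this thesis.

```python
# Hypothetical toy lexicon of multi-syllable Vietnamese words.
LEXICON = {"màn hình", "máy tính"}  # "screen", "computer"


def segment(sentence, lexicon=LEXICON, max_syllables=3):
    """Greedy longest matching: at each position, take the longest run of
    syllables found in the lexicon, falling back to a single syllable."""
    syllables = sentence.split()
    words = []
    i = 0
    while i < len(syllables):
        for n in range(min(max_syllables, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate.lower() in lexicon:
                words.append(candidate.replace(" ", "_"))
                i += n
                break
    return words


text = "Màn hình máy tính này"
print(text.split())   # five syllables, not five words
print(segment(text))  # three words after segmentation
```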

1 http://www.translate.google.com/?hl=ensl=vitl=en
